I would really appreciate some feedback on this. If you have a few minutes to spare and don’t mind sharing your thoughts – then I would like to hear from you.
We are creating a data warehouse to store a good amount of genetics data. The data warehouse will include information about people, genetic variants and whether particular variants lead to increased or decreased risk of a particular disease. The schema is shown below:
Like any good data warehouse, the fact table (factGWAS) is the core table which holds the real data. The dimension tables add additional context and metadata to the fact table, and reduce the amount of redundancy (this is a fairly large data warehouse – petabytes of uncompressed text files).
I need to explain the tricky part in order to ask my question, so please stay with me. Each experiment is based around two conditions: a combination of a controlCondition (where control patients are stratified according to a defined criteria) and a caseCondition (where case patients are identified by a particular mechanism e.g. has been hospitalised for this disease, or is self-reported…).
For each experiment, all of the case patients that match the criteria are selected. A random sample of the controls (which match the criteria) are then included. The test (experiment) is then run, and the results recorded in factGWAS. For each experiment, we repeat this process many many times, and the results (and the iteration number) are recorded in factGWAS.
We have a finite pool of patients from which to draw from. These are recorded in dimPerson. We need to be able to match these people to specific experiments as well as iterations within each experiment. Patients can be involved in more than one experiment, as well as more than one iteration. So we have a classic many-to-many relationship. To resolve the many-to-many relationship, we have introduced the joining tables, dimPeopleCase and dimPeopleControl.
All of this is pretty straight forward – no issues here so far.
However, take a minute to look at the relationships which I have defined between factGWAS and dimPeopleCase / dimPeopleControl. Note that the relationships lead from the fact table to the joining tables.
In a standard relational database, this wouldn’t be an issue. But this is a data warehouse and in so, I have blatantly broken the typical star schema which has a central fact table and which depends on the surrounding dimensions. Instead, here I have two joining tables which depend on the fact table!!! It’s not quite right.
It all boils down to this. Should I reverse these relationships and ensure that the two joining tables are represented in the fact table?
It is purely theoretical. Whichever way I choose, the result (for all practical purposes) is exactly the same.
There is a reason that I chose to put the relationships this way around: the data in factGWAS is the most important and interesting data. It is necessary to be able to identify the people that were part of an experiment (so that we can reproduce the experiment if necessary), but this will be rarely queried, and not of critical importance to the normal functions of the data warehouse. Hence, I chose to break the rules.
Of course, having broken the rules, I am doubting the decision.
If anyone has thoughts, feedback, or anything to add – please chip in 🙂
Time is usually the answer 🙂 Time to reflect and evaluate a design. Turns out there was a quite large problem with the original schema – the relationships above do capture experiment / iteration / people quite well, BUT it also maps people with variants. This is a big issue, and would cause there to be billions of rows. Updated schema below:
Have corrected the relationships so that experiment / iteration / people are correctly captured. Removed the mapping between people / variants, and now have a traditional star schema.