Defining dimensional vs. normalized database design, dimension vs. fact tables

Find out if your existing database is in dimensional style or normalized database design style, learn the benefits and drawbacks of fact-dimension and normalized database schemas and read a comparison of dimension tables vs. fact tables.

Faced with an existing database implementation, is there a way to determine if it is in fact-dimension style or a normalized style? Is it shades of grey, or is there a unique features of one or the other? What is the difference between a dimension table and a fact table?

There is a great deal of overlap between these two questions, so we'll answer them together.

Let's start with normalized data. This is how we usually structure data for transactional systems such as an order entry system, an HR system and so on - in other words, traditional, operational databases.

In essence normalized data is held in a very simple structure. The data is stored in tables. Each table has a primary key and should contain data relating to one entity – so a normalized customer table contains only data about customers. We need to make logical connections between the entities (for example, this customer placed these orders). To do this we use a foreign key in the orders table to point to the primary key in the customer table.

To get a little more technical, there are three levels of normalization – first, second and third normal form and we can test a table to see if it meets these criteria. (To get even more precise, there are more that three levels but mostly we are only interested in the first three).

One advantage of normalized data is that there is very little data redundancy – each piece of data is stored once and once only.

Now in a dimensional design (fact-dimensional style) all of these rules go out of the window. The fact table holds the measures (usually numerical) that we want to analyze and the dimension tables contain data/information about the ways in which we want to analyze the data.

The dimension tables in particular are often highly de-normalized. For example, a customer table might store the zip code of the customer, their town and state. If you have fifty customers in Walla Walla, then the customer dimension table will store the fact that Walla Walla in is Washington a total of fifty times. So there is often massive data duplication. Take this Time dimension table for example (sorry, it's in UK date format, but the sense remains the same):

 

 Call Date  CalYear  Quarter  CalMonth  CalDay
01/08/2008 2008 Q3 8 1
02/08/2008 2008 Q3 8 2
03/08/2008 2008 Q3 8 3
04/08/2008 2008 Q3 8 4
05/08/2008 2008 Q3 8 5
06/08/2008 2008 Q3 8 6
07/08/2008 2008 Q3 8 7
08/08/2008 2008 Q3 8 8
09/08/2008 2008 Q3 8 9
10/08/2008 2008 Q3 8 10
11/08/2008 2008 Q3 8 11
12/08/2008 2008 Q3 8 12
13/08/2008 2008 Q3 8 13
14/08/2008 2008 Q3 8 14
15/08/2008 2008 Q3 8 15
16/08/2008 2008 Q3 8 16
17/08/2008 2008 Q3 8 17
18/08/2008 2008 Q3 8 18
19/08/2008 2008 Q3 8 19
20/08/2008 2008 Q3 8 20
21/08/2008 2008 Q3 8 21
22/08/2008 2008 Q3 8 22
23/08/2008 2008 Q3 8 23
24/08/2008 2008 Q3 8 24
25/08/2008 2008 Q3 8 25
26/08/2008 2008 Q3 8 26
27/08/2008 2008 Q3 8 28
28/08/2008 2008 Q3 8 29
29/08/2008 2008 Q3 8 30
30/08/2008 2008 Q3 8 31
01/09/2008 2008 Q3 8 1

It stores the information that calendar month eight (August) is in quarter three a total of 31 times for every year that is held in the table. But that is fine because these structures are not built to manage transactions they are built to help us to analyze the data as easily as possible.

Dimension tables have a column containing unique values (in this case CalDate). The data in this column is used to link the dimension table to the fact table (see below).

Each of the remaining columns is there because the users have said that they want to analyze by that attribute. So, in this case, the users wanted to analyze by Year, Quarter, Month and Day. Dimension tables effectively spell out the analysis that the users want to perform.

Fact tables, on the other hand, contain the numerical measures that the users want to analyze. So (a good question at this point is "Why aren't they called measure tables?") And the answer to that is that fact tables contain more than just measures, they contain measures placed in some kind of context.

Think about it this way - a measure is simply a numerical value. A fact is that same value with context applied to it. We could also say that a fact is a measure anchored in multidimensional space, which sounds wonderfully exotic but is in essence perfectly true.

So, if I tell you we sold three items; that is simply a measure. If I tell you we sold three copies of a CD called "Escape to Wapping" on the 23/08/2008 to Lilly Murray, you now have some context around that measure; so what you now have is a fact.

A fact table might look something like this:

 

 Item No  Customer ID  Employee Code  Order Date  Dispatch Date  Quantity
28 5851 19 23/10/2008 29/10/2008 2
40 5851 19 23/10/2008 29/10/2008 4
130 5851 19 23/10/2008 29/10/2008 4
68 4703 16 23/10/2008 23/10/2008 2
73 4703 16 23/10/2008 23/10/2008 2
31 358 162 23/10/2008 10/11/2008 2
41 358 162 23/10/2008 10/11/2008 2
57 358 162 23/10/2008 10/11/2008 3
122 358 162 23/10/2008 10/11/2008 2

We can hook both OrderDate and DispatchDate back to the time dimension so that we can analyze these facts over time. So, for example, if we wanted to know how many items had been sold during August 2008 we could scan down the time dimension table looking for the value eight in Cal month and the value 2008 in Cal year. That would return 31 rows with 31 unique date values. We could then query the OrderDate column in the fact table, identify all the appropriate rows, and sum the values in the Quantity column.

This may sound a little long-winded. You might argue that we could simply look for the dates in the fact table and find the appropriate ones. There are two reasons why we tend not to do this.

The first is the fact tables can hold quite literally billions of rows; which makes them slow to query. The second is that we would potentially have to perform a very large number of date function calls in order to do this. Of course, the intelligent use of indexing would mean we would not have to perform a full table scan on the fact table and billions of function calls; nevertheless it would be a relatively slow operation.

You'll notice that another way of thinking about the date dimension table is that it is a materialized store of the results of data function calls. This table is also a tiny compared with the fact table -- 10 years worth of data results in a table of only approximately 3650 rows.

And, going back to illustrating how the fact table works, if the Customer dimension table looks like this:

 

 Customer ID  Title  First Name   Last Name  Gender
352 Ms Mary Johnston F
353 Ms Lily Dora F
354 Ms Moira McLeod F
355 Mrs Margaret  Winterbottom F
356 Mr James Gall M
357 Mrs May Keith F
358 Mrs Lily Murray F
359 Mr Jason Fergusson M
360 Miss Maureen Manson F
361 Ms Anges Gary F
362 Mrs Helen Healy F
363 Ms Charlotte Morris F
364 Mr Edward Knight M
365 Mrs Davina Turnbull F

You can see that it was, indeed, Lily Murray who bought item 31. And if the Product dimension table is like this:

 

 Item Num  Classification  CD ROM Title
19 Travel The Canal
20 Other Mother smiles
21 Hobby Neckless
22 Hobby Collecting lorgnettes
23 Cars  Bentley, the man and his cars
24 Travel Aqueuos Amsterdam
25 Other Welcome to the party
26 Travel Seven go placid in Bangor
27 Hobby Ecclesiastical embroidery
28 Work Are you following me?
29 Other A chip on the shoulder
30 Other Basso profundo
31 Travel Escape to Wapping
32 Travel Ross' bush
33 Work Shipping for profit
34 Other An orchard in May
35 Cooking Eating well
36 Computing Database vol. 3
37 Hobby Dentistry for laypersons

You can see that she did indeed buy the CD called "Escape to Wapping".

As to why anyone would want to buy such a CD, sadly history does not relate.

 

Dig Deeper on Database management