Answer

Defining dimensional vs. normalized database design, dimension vs. fact tables

Find out if your existing database is in dimensional style or normalized database design style, learn the benefits and drawbacks of fact-dimension and normalized database schemas and read a comparison of dimension tables vs. fact tables.

Mark Whitehorn, University of Dundee

Published: 14 Apr 2009

Faced with an existing database implementation, is there a way to determine if it is in fact-dimension style or a normalized style? Is it shades of grey, or is there a unique features of one or the other? What is the difference between a dimension table and a fact table?

There is a great deal of overlap between these two questions, so we'll answer them together.

Let's start with normalized data. This is how we usually structure data for transactional systems such as an order entry system, an HR system and so on - in other words, traditional, operational databases.

In essence normalized data is held in a very simple structure. The data is stored in tables. Each table has a primary key and should contain data relating to one entity – so a normalized customer table contains only data about customers. We need to make logical connections between the entities (for example, this customer placed these orders). To do this we use a foreign key in the orders table to point to the primary key in the customer table.

To get a little more technical, there are three levels of normalization – first, second and third normal form and we can test a table to see if it meets these criteria. (To get even more precise, there are more that three levels but mostly we are only interested in the first three).

One advantage of normalized data is that there is very little data redundancy – each piece of data is stored once and once only.

Now in a dimensional design (fact-dimensional style) all of these rules go out of the window. The fact table holds the measures (usually numerical) that we want to analyze and the dimension tables contain data/information about the ways in which we want to analyze the data.

The dimension tables in particular are often highly de-normalized. For example, a customer table might store the zip code of the customer, their town and state. If you have fifty customers in Walla Walla, then the customer dimension table will store the fact that Walla Walla in is Washington a total of fifty times. So there is often massive data duplication. Take this Time dimension table for example (sorry, it's in UK date format, but the sense remains the same):

Call Date	CalYear	Quarter	CalMonth	CalDay
01/08/2008	2008	Q3	8	1
02/08/2008	2008	Q3	8	2
03/08/2008	2008	Q3	8	3
04/08/2008	2008	Q3	8	4
05/08/2008	2008	Q3	8	5
06/08/2008	2008	Q3	8	6
07/08/2008	2008	Q3	8	7
08/08/2008	2008	Q3	8	8
09/08/2008	2008	Q3	8	9
10/08/2008	2008	Q3	8	10
11/08/2008	2008	Q3	8	11
12/08/2008	2008	Q3	8	12
13/08/2008	2008	Q3	8	13
14/08/2008	2008	Q3	8	14
15/08/2008	2008	Q3	8	15
16/08/2008	2008	Q3	8	16
17/08/2008	2008	Q3	8	17
18/08/2008	2008	Q3	8	18
19/08/2008	2008	Q3	8	19
20/08/2008	2008	Q3	8	20
21/08/2008	2008	Q3	8	21
22/08/2008	2008	Q3	8	22
23/08/2008	2008	Q3	8	23
24/08/2008	2008	Q3	8	24
25/08/2008	2008	Q3	8	25
26/08/2008	2008	Q3	8	26
27/08/2008	2008	Q3	8	28
28/08/2008	2008	Q3	8	29
29/08/2008	2008	Q3	8	30
30/08/2008	2008	Q3	8	31
01/09/2008	2008	Q3	8	1

It stores the information that calendar month eight (August) is in quarter three a total of 31 times for every year that is held in the table. But that is fine because these structures are not built to manage transactions they are built to help us to analyze the data as easily as possible.

Dimension tables have a column containing unique values (in this case CalDate). The data in this column is used to link the dimension table to the fact table (see below).

Each of the remaining columns is there because the users have said that they want to analyze by that attribute. So, in this case, the users wanted to analyze by Year, Quarter, Month and Day. Dimension tables effectively spell out the analysis that the users want to perform.

Fact tables, on the other hand, contain the numerical measures that the users want to analyze. So (a good question at this point is "Why aren't they called measure tables?") And the answer to that is that fact tables contain more than just measures, they contain measures placed in some kind of context.

Think about it this way - a measure is simply a numerical value. A fact is that same value with context applied to it. We could also say that a fact is a measure anchored in multidimensional space, which sounds wonderfully exotic but is in essence perfectly true.

So, if I tell you we sold three items; that is simply a measure. If I tell you we sold three copies of a CD called "Escape to Wapping" on the 23/08/2008 to Lilly Murray, you now have some context around that measure; so what you now have is a fact.

A fact table might look something like this:

Item No	Customer ID	Employee Code	Order Date	Dispatch Date	Quantity
28	5851	19	23/10/2008	29/10/2008	2
40	5851	19	23/10/2008	29/10/2008	4
130	5851	19	23/10/2008	29/10/2008	4
68	4703	16	23/10/2008	23/10/2008	2
73	4703	16	23/10/2008	23/10/2008	2
31	358	162	23/10/2008	10/11/2008	2
41	358	162	23/10/2008	10/11/2008	2
57	358	162	23/10/2008	10/11/2008	3
122	358	162	23/10/2008	10/11/2008	2

We can hook both OrderDate and DispatchDate back to the time dimension so that we can analyze these facts over time. So, for example, if we wanted to know how many items had been sold during August 2008 we could scan down the time dimension table looking for the value eight in Cal month and the value 2008 in Cal year. That would return 31 rows with 31 unique date values. We could then query the OrderDate column in the fact table, identify all the appropriate rows, and sum the values in the Quantity column.

This may sound a little long-winded. You might argue that we could simply look for the dates in the fact table and find the appropriate ones. There are two reasons why we tend not to do this.

The first is the fact tables can hold quite literally billions of rows; which makes them slow to query. The second is that we would potentially have to perform a very large number of date function calls in order to do this. Of course, the intelligent use of indexing would mean we would not have to perform a full table scan on the fact table and billions of function calls; nevertheless it would be a relatively slow operation.

You'll notice that another way of thinking about the date dimension table is that it is a materialized store of the results of data function calls. This table is also a tiny compared with the fact table -- 10 years worth of data results in a table of only approximately 3650 rows.

And, going back to illustrating how the fact table works, if the Customer dimension table looks like this:

Customer ID	Title	First Name	Last Name	Gender
352	Ms	Mary	Johnston	F
353	Ms	Lily	Dora	F
354	Ms	Moira	McLeod	F
355	Mrs	Margaret	Winterbottom	F
356	Mr	James	Gall	M
357	Mrs	May	Keith	F
358	Mrs	Lily	Murray	F
359	Mr	Jason	Fergusson	M
360	Miss	Maureen	Manson	F
361	Ms	Anges	Gary	F
362	Mrs	Helen	Healy	F
363	Ms	Charlotte	Morris	F
364	Mr	Edward	Knight	M
365	Mrs	Davina	Turnbull	F

You can see that it was, indeed, Lily Murray who bought item 31. And if the Product dimension table is like this:

Item Num	Classification	CD ROM Title
19	Travel	The Canal
20	Other	Mother smiles
21	Hobby	Neckless
22	Hobby	Collecting lorgnettes
23	Cars	Bentley, the man and his cars
24	Travel	Aqueuos Amsterdam
25	Other	Welcome to the party
26	Travel	Seven go placid in Bangor
27	Hobby	Ecclesiastical embroidery
28	Work	Are you following me?
29	Other	A chip on the shoulder
30	Other	Basso profundo
31	Travel	Escape to Wapping
32	Travel	Ross' bush
33	Work	Shipping for profit
34	Other	An orchard in May
35	Cooking	Eating well
36	Computing	Database vol. 3
37	Hobby	Dentistry for laypersons

You can see that she did indeed buy the CD called "Escape to Wapping".

As to why anyone would want to buy such a CD, sadly history does not relate.

Defining dimensional vs. normalized database design, dimension vs. fact tables

Find out if your existing database is in dimensional style or normalized database design style, learn the benefits and drawbacks of fact-dimension and normalized database schemas and read a comparison of dimension tables vs. fact tables.

Dig Deeper on Database management

SQLite3 Basics: Create tables and insert records

star schema

fact table

dimension

Related Q&A from Mark Whitehorn

DBMS keys: 8 types of keys defined

Do predictive analytics projects require data scientists?

Key issues to consider when building a data warehouse