Feature

What is a data rollup?

Learn what a data rollup is and read about data rollup macro examples and case studies in this data mining tutorial from the book Data Mining: Know it All.

Published: 12 May 2009

Data Mining: Know it All

In this excerpt from Data Mining: Know it All, learn what data rollup is, discover how it relates to data mining projects and read examples of how to complete the data rollup process.

2.4 Data Rollup

The simplest definition of data rollup is that we convert categories to variables. Let us consider an illustrative example.

Table 2.1 shows some records from the transaction table of a bank where deposits are denoted by positive amounts and withdrawals are shown as negative amounts. We further assume that we are building the mining view as a customer view. Because the first requirement is to have one, and only one, row per customer, we create a new view such that each unique customer ID appears in one and only one row. To roll up the multiple records on the customer level, we create a set of new variables to represent the combination of the account type and the month of the transaction. This is illustrated in Figure 2.1. The result of the rollup is shown in Table 2.2.

Table 2.1 A Sample of Banking Transactions

Customer ID	Date	Amount	Account type
1100-55555	11Jun2003	114.56	Savings
1100-55555	21Jun2003	− 56.78	Checking
1100-55555	07Jul2003	359.31	Savings
1100-55555	19Jul2003	89.56	Checking
1100-55555	03Aug2003	1000.00	Savings
1100-55555	17Aug2003	− 1200.00	Checking
1100-88888	14June2003	122.51	Savings
1100-88888	27June2003	42.07	Checking
1100-88888	09July2003	− 146.30	Savings
1100-88888	09July2003	− 1254.48	Checking
1100-88888	10Aug2003	400.00	Savings
1100-88888	11Aug2003	500.00	Checking

Table 2.1 shows that we managed to aggregate the values of the transactions in the different accounts and months into new variables. The only issue is what to do when we have more than one transaction per account per month. In this case, which is the more realistic one, we have to summarize the data in some form. For example, we can calculate the sum of the transactions values, or their average, or even create a new set of variables giving the count of such transactions for each month – account type combination.

It is obvious that this process will lead to the generation of possibly hundreds, if not thousands, of variables in any data-rich business applications. Dealing with such a large number of fields could present a challenge for the data preparation and data mining software tools. It is therefore required that we keep the number of these new fields to a minimum while keeping as much information about the nature of the data as possible. Unfortunately, there is no magic recipe to achieve this balance. However, a closer look at the preceding data reveals that the key to controlling the number of new variables is to decide on the level of granularity required to perform the rollup. For example, is it necessary to roll up the transactions of each month, or is it enough to roll up the data per quarter? Similarly, in our simplified case, we had only two categories for the account type, but typically, there would be many more categories. Then comes the question of which categories we can group together, or even ignore, to reduce the number of new variables.

Figure 2.1 Data rollup
Data rollup example

Table 2.2 Result of Rolling up the Data of Table 2.1

Cust. ID	C-6	C-7	C-8	S-6	S-7	S-8
1100-55555	− 56.78	89.56	− 1200.00	114.56	359.30	1000.00
1100-88888	42.07	− 1254.00	500.00	122.51	− 146.30	400.00

In the end, even with careful selection of the categories and resolution of combining the different categories to form new variables, we usually end up with a relatively large number of variables, which most implementations of data mining algorithms cannot handle adequately. However, we should not worry too much about this problem for the moment because data reduction is a basic step in our planned approach. In later chapters, we will investigate techniques to reduce the number of variables.

Copyright info

Printed with permission from Morgan Kaufmann, a division of Elsevier. Copyright 2009. Data Mining: Know It All by Chakrabarti et all. For more information about this title and other similar books, please visit www.elsevierdirect.com.

In the last example demonstrating the rollup process, we performed the rollup on the level of two variables: the account type and the transaction month. This is usually called multilevel rollup. On the other hand, if we had had only one type of account, say only savings, then we could have performed a simpler rollup using only the transaction month as the summation variable. This type of rollup is called simple rollup. In fact, multilevel rollup is only an aggregation of several simple rollups on the row level, which is the customer ID in our example. Therefore, data preparation procedures, in either SAS or SQL, can use this property to simplify the implementation by performing several simple rollups for each combination of the summarization variables and combining them. This is the approach we will adopt in developing our macro to demonstrate the rollup of our sample dataset.

More on data mining:

Download a PDF of this chapter for free: "Data Acquisition and Integration"
Read other excerpts from data management books in the Chapter Download Library.

Now let us describe how to perform the rollup operation using SAS. We will do this using our simple example first and then generalize the code using macros to facilitate its use with other datasets. We stress again that in writing the code we preferred to keep the code simple and readable at the occasional price of efficiency of execution, and the use of memory resources. You are welcome to modify the code to make it more efficient or general as required.

We use Table 2.1 to create the dataset as follows:

Data Transaction; Informat CustID $10.; Informat TransDate date9.; format TransDate Date9.;


`input CustID $`	`TransDate`	`Amount`	`AccountType$; Cards;`
`55555`	`11Jun2003`	`114.56`	`Savings`
`55555`	`12Jun2003`	`119.56`	`Savings`
`55555`	`21Jun2003`	`− 56.78`	`Checking`
`55555`	`07Jul2003`	`359.31`	`Savings`
`55555`	`19Jul2003`	`89.56`	`Checking`
`55555`	`03Aug2003`	`1000.00`	`Savings`
`66666`	`22Feb2003`	`549.56`	`Checking`
`77777`	`03Dec2003`	`645.21`	`Savings`
`55555`	`17Aug2003`	`− 1200.00`	`Checking`
`88888`	`14Jun2003`	`122.51`	`Savings`
`88888`	`27Jun2003`	`2.07`	`Checking`
`88888`	`09Jul2003`	`− 146.30`	`Savings`
`88888`	`09Jul2003`	`− 1254.48`	`Checking`
`88888`	`10Aug2003`	`400.00`	`Savings`
`88888`	`11Aug2003`	`500.00`	`Checking`

; run;

The next step is to create the month field using the SAS Month function:

data Trans; set Transaction; Month = month(TransDate); run;

Then we accumulate the transactions into a new field to represent the balance in each account:

proc sort data = Trans; by CustID month AccountType; run; * Create cumulative balances for each of the accounts * / data Trans2; retain Balance 0; set Trans; by CustID month AccountType; if first.AccountType then Balance = 0; Balance = Balance + Amount; if last.AccountType then output; drop amount; run;

Finally, we use PROC TRANSPOSE to roll up the data in each account type and merge the two resulting datasets into the final file:

/ * Prepare for the transpose * / proc sort data = trans2; by CustID accounttype; run;

proc transpose data = Trans2 out = rolled C prefi x = C ; by CustID accounttype; ID month ; var balance ; where AccountType = ' Checking ' ; run;

proc transpose data = Trans2 out = rolled S prefi x = S ; by CustID accounttype; ID month ; var balance ; where AccountType = ' Savings ' ; run;

data Rollup; merge Rolled S Rolled C; by CustID; drop AccountType Name ; run;

To pack this procedure in a general macro using the combination of two variables, one for transaction categories and one for time, we simply replace the Month variable with a TimeVar, the customer ID with IDVar, and the AccountType with TypeVar. We also specify the number of characters to be used from the category variable to prefix the time values. Finally, we replace the two repeated TRANSPOSE code segments with a %do loop that iterates over the categories of the TypeVar (which requires extracting these categories and counting them). The following steps detail the resulting macro.

Figure 2.3 Parameters of TBRollup() Macro

Header Parameter	TBRollup (TDS, IDVar, TimeVar, TypeVar, Nchars, Value, RDS) Header Parameter Description
`TDS`	`Input transaction dataset`
`IDVar`	`ID variable`
`TimeVar`	`Time variable`
`TypeVar`	`Quantity being rolled up`
`Nchars`	`Number of characters to be used in rollup`
`Value`	`Values to be accumulated`
`RDS`	`The output rolled up dataset`

Step 1
Sort the transaction file using the ID, Time, and Type variables:

proc sort data = & TDS; by & IDVar & TimeVar & TypeVar; run;

Step 2
Accumulate the values over time to a temporary _Tot variable in the temporary table Temp1 (see Table 2.3 ). Then sort Temp1 using the ID and the Type variables: data Temp1; retain _TOT 0; set & TDS; by & IDVar & TimeVar & TypeVar; if fi rst. & TypeVar then _TOT_ = 0; _TOT = _TOT & Value; if last. & TypeVar then output; drop & Value; run; proc sort data= Temp1; by & IDVar & TypeVar; run;

Step 3
proc freq data = Temp1 noprint; tables & TypeVar /out = Types ; run; data null ; set Types nobs = Ncount; if & typeVar ne " then call symput( ' Cat ' | | left( n ), & TypeVar); if N = Ncount then call symput( ' N ' , Ncount); run;

Step 4
Loop over these N categories and generate their rollup part: %do i = 1 %to & N; proc transpose data = Temp1 out = R & i prefi x = %substr( & & Cat & i, 1, & Nchars) ; by & IDVar & TypeVar; ID & TimeVar ; var TOT ; where & TypeVar = " & & Cat & i " ; run; %end;

Step 5
Finally, assemble the parts using the ID variable:

data & RDS; merge %do i = 1 %to & N; R & i %end ; ; by & IDVar; drop & TypeVar Name ; run;

Step 6
Clean the workspace and finish the macro:

proc datasets library = work nodetails; delete Temp1 Types %do i=1 %to & N; R & I %end; ; run; quit; %mend;

We can now call this macro to roll up the previous example Transaction dataset using the following code:

data Trans; set Transaction; Month = month(TransDate); drop transdate; run;


%let IDVar	= CustID;	/ * The row ID variable * /
%let TimeVar	= Month;	/ * The time variable * /
%let TypeVar	= AccountType;	/ * The Type variable * /
%let Value	= Amount;	/ * The time measurement variable * /
%let Nchars	= 1;	/ * Number of letters in Prefix * /
%let TDS	= Trans;	/ * The value variable * /
%let RDS	= Rollup;	/ * the rollup file * /
= colspan="3"> %TBRollup( & TDS, & IDVar, & TimeVar, & TypeVar, & Nchars, & Value, & RDS);

The result of this call is shown in Table 2.4.

Table 2.4 Result of Rollup Macro

CustID	C6	C7	C8	C12	S6	S7	S8	S12
5555	− 56.78	89.56	− 1200	--	234.12	359.31	1000	---
6666	--	--	--	549.56	--	--	--	--
7777	--	--	--	--	--	--	--	645.21
8888	42.07	− 1254.48	500	--	122.51	− 146.3	400	--

What is a data rollup?

Learn what a data rollup is and read about data rollup macro examples and case studies in this data mining tutorial from the book Data Mining: Know it All.

2.4 Data Rollup

Table of contents:

Copyright info

More on data mining:

Dig Deeper on Data science and analytics

Patch Tuesday

Reveal Windows file server permissions with PowerShell's help

Researchers Develop Risk Model to Predict Brain Injury, Stroke in Neonates

Windows 2000

2.4 Data Rollup

Table of contents:

Copyright info

More on data mining:

Related Resources

Dig Deeper on Data science and analytics

Patch Tuesday

Reveal Windows file server permissions with PowerShell's help

Researchers Develop Risk Model to Predict Brain Injury, Stroke in Neonates

Windows 2000