Simple data mining examples and datasets

See data mining examples, including examples of data mining algorithms and simple datasets, that will help you learn how data mining works and how companies can make data-related decisions based on set rules.

Data Mining: Know It All

This excerpt from Data Mining: Know It All includes examples that show how data mining algorithms and datasets work. Learn how companies can make data-related decisions based on set rules.

1.2 Simple Examples: The Weather Problem and Others

We use a lot of examples in this book, which seems particularly appropriate considering that the book is all about learning from examples! There are several standard datasets that we will come back to repeatedly. Different datasets tend to expose new issues and challenges, and it is interesting and instructive to have in mind a variety of problems when considering learning methods. In fact, the need to work with different datasets is so important that a corpus containing around 100 example problems has been gathered together so that different algorithms can be tested and compared on the same set of problems.

The illustrations used here are all unrealistically simple. Serious application of data mining involves thousands, hundreds of thousands, or even millions of individual cases. But when explaining what algorithms do and how they work, we need simple examples that capture the essence of the problem but are small enough to be comprehensible in every detail. The illustrations we will be working with are intended to be "academic" in the sense that they will help us to understand what is going on. Some actual fielded applications of learning techniques are discussed in Section 1.3, and many more are covered in the books mentioned in the Further Reading section at the end of the chapter.

Copyright info

Printed with permission from Morgan Kaufmann, a division of Elsevier. Copyright 2009. Data Mining: Know It All by Chakrabarti, et al. For more information about this title and other similar books, please visit www.elsevierdirect.com.

Another problem with actual real-life datasets is that they are often proprietary. No corporation is going to share its customer and product choice database with you so that you can understand the details of its data mining application and how it works. Corporate data is a valuable asset, one whose value has increased enormously with the development of data mining techniques such as those described in this book. Yet we are concerned here with understanding how the methods used for data mining work and understanding the details of these methods so that we can trace their operation on actual data. That is why our illustrations are simple ones. But they are not simplistic: they exhibit the features of real datasets.

1.2.1 The Weather Problem

The weather problem is a tiny dataset that we will use repeatedly to illustrate machine learning methods. Entirely fictitious, it supposedly concerns the conditions that are suitable for playing some unspecified game. In general, instances in a dataset are characterized by the values of features, or attributes, that measure different aspects of the instance. In this case there are four attributes: outlook, temperature, humidity, and windy. The outcome is whether or not to play.

In its simplest form, shown in Table 1.2, all four attributes have values that are symbolic categories rather than numbers. Outlook can be sunny, overcast, or rainy; temperature can be hot, mild, or cool; humidity can be high or normal; and windy can be true or false. This creates 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of input examples.

A set of rules learned from this information -- not necessarily a very good one -- might look as follows:

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

These rules are meant to be interpreted in order: the first one; then, if it doesn't apply, the second; and so on.

A set of rules intended to be interpreted in sequence is called a decision list. Interpreted as a decision list, the rules correctly classify all of the examples in the table, whereas taken individually, out of context, some of the rules are incorrect. For example, the rule if humidity = normal, then play = yes gets one of the examples wrong (check which one). The meaning of a set of rules depends on how it is interpreted -- not surprisingly!
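To make the difference between the two readings concrete, here is a minimal Python sketch that applies the five rules in order to the 14 rows of Table 1.2 and then tests the fourth rule on its own. The tuple layout and the names `WEATHER` and `classify` are our own choices for illustration, not part of the original text.

```python
# Table 1.2 as (outlook, temperature, humidity, windy, play) tuples.
WEATHER = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def classify(outlook, temperature, humidity, windy):
    """Decision list: the first rule whose conditions hold decides the class."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above


# Rows the ordered list gets wrong.
list_errors = [row for row in WEATHER if classify(*row[:4]) != row[4]]

# Rows the fourth rule gets wrong when read in isolation,
# that is, without the three rules that precede it.
rule4_errors = [row for row in WEATHER
                if row[2] == "normal" and row[4] != "yes"]

print("decision list errors:", list_errors)
print("rule 4 alone errors: ", rule4_errors)
```

The early `return` statements are what encode the ordering: a later rule is only ever consulted for examples that every earlier rule has passed over.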

In the slightly more complex form shown in Table 1.3, two of the attributes -- temperature and humidity -- have numeric values. This means that any learning method must create inequalities involving these attributes rather than simple equality tests, as in the former case. This is called a numeric-attribute problem -- in this case, a mixed-attribute problem because not all attributes are numeric.

Now the first rule given earlier might take the following form:

If outlook = sunny and humidity > 83 then play = no

A slightly more complex process is required to come up with rules that involve numeric tests.
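As a rough illustration of what that extra work involves, the Python sketch below scans candidate humidity thresholds for the rule "if outlook = sunny and humidity > t then play = no" on the data of Table 1.3. The brute-force scan is our own simplification, not the procedure a real learning method uses, and the names `WEATHER_NUMERIC` and `rule_errors` are hypothetical.

```python
# Table 1.3 as (outlook, temperature, humidity, windy, play) tuples.
WEATHER_NUMERIC = [
    ("sunny", 85, 85, False, "no"),
    ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"),
    ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),
    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),
    ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),
    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),
    ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rainy", 71, 91, True, "no"),
]


def rule_errors(threshold):
    """Evaluate 'outlook = sunny and humidity > threshold => play = no'.

    Returns (number of examples the rule covers, number it gets wrong).
    """
    covered = [r for r in WEATHER_NUMERIC
               if r[0] == "sunny" and r[2] > threshold]
    wrong = sum(1 for r in covered if r[4] != "no")
    return len(covered), wrong


# Assumption: we simply try every humidity value seen in the data, plus 83.
for t in sorted({r[2] for r in WEATHER_NUMERIC} | {83}):
    covered, wrong = rule_errors(t)
    print(f"humidity > {t}: covers {covered}, wrong {wrong}")
```

With equality tests there is one candidate condition per attribute value; with inequalities, every possible split point is a candidate, which is why numeric attributes enlarge the search.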

Table 1.2 The Weather Data

Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No

The rules we have seen so far are classification rules: they predict the classification of the example in terms of whether or not to play. It is equally possible to disregard the classification and just look for any rules that strongly associate different attribute values. These are called association rules. Many association rules can be derived from the weather data in Table 1.2. Some good ones are as follows:

If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny
   and humidity = high

Table 1.3 Weather Data with Some Numeric Attributes

Outlook Temperature Humidity Windy Play
Sunny 85 85 False No
Sunny 80 90 True No
Overcast 83 86 False Yes
Rainy 70 96 False Yes
Rainy 68 80 False Yes
Rainy 65 70 True No
Overcast 64 65 True Yes
Sunny 72 95 False No
Sunny 69 70 False Yes
Rainy 75 80 False Yes
Sunny 75 70 True Yes
Overcast 72 90 True Yes
Overcast 81 75 False Yes
Rainy 71 91 True No

All these rules are 100 percent correct on the given data; they make no false predictions. The first two apply to four examples in the dataset, the third to three examples, and the fourth to two examples. There are many other rules: in fact, nearly 60 association rules can be found that apply to two or more examples of the weather data and are completely correct on this data. If you look for rules that are less than 100 percent correct, then you will find many more. There are so many because unlike classification rules, association rules can "predict" any of the attributes, not just a specified class, and can even predict more than one thing. For example, the fourth rule predicts both that outlook will be sunny and that humidity will be high.
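The coverage and accuracy figures for the four association rules quoted from Table 1.2 can be checked mechanically. The following Python sketch is one way to do it; the record layout and the helper names `matches` and `evaluate` are assumptions made for this illustration.

```python
FIELDS = ("outlook", "temperature", "humidity", "windy", "play")

# Table 1.2, one example per line.
TABLE_1_2 = """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
"""
RECORDS = [dict(zip(FIELDS, line.split()))
           for line in TABLE_1_2.strip().splitlines()]

# Each rule is (antecedent, consequent); either side may name any attribute.
RULES = [
    ({"temperature": "cool"}, {"humidity": "normal"}),
    ({"humidity": "normal", "windy": "false"}, {"play": "yes"}),
    ({"outlook": "sunny", "play": "no"}, {"humidity": "high"}),
    ({"windy": "false", "play": "no"},
     {"outlook": "sunny", "humidity": "high"}),
]


def matches(record, conditions):
    """True if the record has every attribute value listed in conditions."""
    return all(record[key] == value for key, value in conditions.items())


def evaluate(antecedent, consequent):
    """Return (examples the rule applies to, examples it predicts correctly)."""
    covered = [r for r in RECORDS if matches(r, antecedent)]
    correct = sum(1 for r in covered if matches(r, consequent))
    return len(covered), correct


RESULTS = [evaluate(a, c) for a, c in RULES]
for (antecedent, consequent), (covered, correct) in zip(RULES, RESULTS):
    print(f"{antecedent} => {consequent}: covers {covered}, correct {correct}")
```

Because the consequent is just another set of attribute-value pairs, the same `evaluate` function handles the fourth rule, which predicts two attributes at once.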

1.2.2 Contact Lenses: An Idealized Problem

The contact lens data introduced earlier tells you the kind of contact lens to prescribe, given certain information about a patient. Note that this example is intended for illustration only: it grossly oversimplifies the problem and should certainly not be used for diagnostic purposes!

The first column of Table 1.1 gives the age of the patient. In case you're wondering, presbyopia is a form of longsightedness that accompanies the onset of middle age. The second gives the spectacle prescription: myope means shortsighted and hypermetrope means longsighted. The third shows whether the patient is astigmatic, and the fourth relates to the rate of tear production, which is important in this context because tears lubricate contact lenses. The final column shows which kind of lenses to prescribe: hard, soft, or none. All possible combinations of the attribute values are represented in the table.

A sample set of rules learned from this information is shown in Figure 1.1. This is a large set of rules, but they do correctly classify all the examples. These rules are complete and deterministic: they give a unique prescription for every conceivable example. Generally, this is not the case. Sometimes there are situations in which no rule applies; other times more than one rule may apply, resulting in conflicting recommendations. Sometimes probabilities or weights may be associated with the rules themselves to indicate that some are more important, or more reliable, than others.

You might be wondering whether there is a smaller rule set that performs as well. If so, would you be better off using the smaller rule set and, if so, why? These are exactly the kinds of questions that will occupy us in this book. Because the examples form a complete set for the problem space, the rules do no more than summarize all the information that is given, expressing it in a different and more concise way. Even though it involves no generalization, this is often a useful thing to do! People frequently use machine learning techniques to gain insight into the structure of their data rather than to make predictions for new cases. In fact, a prominent and successful line of research in machine learning began as an attempt to compress a huge database of possible chess endgames and their outcomes into a data structure of reasonable size. The data structure chosen for this enterprise was not a set of rules, but a decision tree.

Figure 1.1. Rules for the contact lenses data

If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and
   tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and
   tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and
   astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and
   tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and
   tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and
   tear production rate = normal then recommendation = hard
If age = pre-presbyopic and
   spectacle prescription = hypermetrope and astigmatic = yes
   then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope
   and astigmatic = yes then recommendation = none
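The claim that the rules of Figure 1.1 are complete and deterministic can be tested by brute force. The Python sketch below enumerates every combination of the four attributes and collects the recommendations of all rules that fire. The attribute values are taken from the rules and the surrounding description; the short key names (`age`, `prescription`, `astigmatic`, `tear`) are our own shorthand.

```python
from itertools import product

AGES = ("young", "pre-presbyopic", "presbyopic")
PRESCRIPTIONS = ("myope", "hypermetrope")
ASTIGMATIC = ("no", "yes")
TEAR_RATES = ("reduced", "normal")

# (conditions, recommendation) pairs transcribed from Figure 1.1.
RULES = [
    ({"tear": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"},
     "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear": "normal"},
     "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope",
      "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope",
      "astigmatic": "yes"}, "none"),
]


def recommendations(patient):
    """Set of distinct recommendations from every rule that fires."""
    return {rec for conditions, rec in RULES
            if all(patient[k] == v for k, v in conditions.items())}


KEYS = ("age", "prescription", "astigmatic", "tear")
patients = [dict(zip(KEYS, combo))
            for combo in product(AGES, PRESCRIPTIONS, ASTIGMATIC, TEAR_RATES)]
outcomes = [recommendations(p) for p in patients]

uncovered = sum(1 for o in outcomes if len(o) == 0)    # no rule applies
conflicting = sum(1 for o in outcomes if len(o) > 1)   # rules disagree
print(len(patients), "combinations;", uncovered, "uncovered;",
      conflicting, "conflicting")
```

Note that several rules may fire for the same patient. The check only requires that they agree, so the rule set is deterministic even though it is read as an unordered set rather than as a decision list.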

Figure 1.2. Decision tree for the contact lenses data

Figure 1.2 presents a structural description for the contact lens data in the form of a decision tree, which for many purposes is a more concise and perspicuous representation of the rules and has the advantage that it can be visualized more easily. (However, this decision tree -- in contrast to the rule set given in Figure 1.1 -- classifies two examples incorrectly.) The tree calls first for a test on tear production rate, and the first two branches correspond to the two possible outcomes. If tear production rate is reduced (the left branch), the outcome is none. If it is normal (the right branch), a second test is made, this time on astigmatism. Eventually, whatever the outcome of the tests, a leaf of the tree is reached that dictates the contact lens recommendation for that case.

1.2.3 Irises: A Classic Numeric Dataset

The iris dataset, which dates back to seminal work by the eminent statistician R. A. Fisher in the mid-1930s and is arguably the most famous dataset used in data mining, contains 50 examples each of three types of plant: Iris setosa, Iris versicolor, and Iris virginica. It is excerpted in Table 1.4. There are four attributes: sepal length, sepal width, petal length, and petal width (all measured in centimeters). Unlike previous datasets, all attributes have numeric values.

Table 1.4 The Iris Data

No.  Sepal Length (cm)  Sepal Width (cm)  Petal Length (cm)  Petal Width (cm)  Type
1    5.1                3.5               1.4                0.2               Iris setosa
2    4.9                3.0               1.4                0.2               Iris setosa
3    4.7                3.2               1.3                0.2               Iris setosa
4    4.6                3.1               1.5                0.2               Iris setosa
5    5.0                3.6               1.4                0.2               Iris setosa
...
51   7.0                3.2               4.7                1.4               Iris versicolor
52   6.4                3.2               4.5                1.5               Iris versicolor
53   6.9                3.1               4.9                1.5               Iris versicolor
54   5.5                2.3               4.0                1.3               Iris versicolor
55   6.5                2.8               4.6                1.5               Iris versicolor
...
101  6.3                3.3               6.0                2.5               Iris virginica
102  5.8                2.7               5.1                1.9               Iris virginica
103  7.1                3.0               5.9                2.1               Iris virginica
104  6.3                2.9               5.6                1.8               Iris virginica
105  6.5                3.0               5.8                2.2               Iris virginica
...

The following set of rules might be learned from this dataset:

If petal length < 2.45 then Iris setosa
If sepal width < 2.10 then Iris versicolor
If sepal width < 2.45 and petal length < 4.55 then Iris versicolor
If sepal width < 2.95 and petal width < 1.35 then Iris versicolor
If petal length ≥ 2.45 and petal length < 4.45 then Iris versicolor
If sepal length ≥ 5.85 and petal length < 4.75 then Iris versicolor
If sepal width < 2.55 and petal length < 4.95 and
   petal width < 1.55 then Iris versicolor
If petal length ≥2.45 and petal length < 4.95 and
petal width < 1.55 then Iris versicolor
If sepal length ≥ 6.55 and petal length < 5.05 then Iris versicolor
If sepal width < 2.75 and petal width < 1.65 and
   sepal length < 6.05 then Iris versicolor
If sepal length ≥5.85 and sepal length < 5.95 and
   petal length < 4.85 then Iris versicolor
If petal length ≥ 5.15 then Iris virginica
If petal width ≥1.85 then Iris virginica
If petal width ≥ 1.75 and sepal width < 3.05 then Iris virginica
If petal length ≥ 4.95 and petal width < 1.55 then Iris virginica

These rules are very cumbersome; more compact rules can be expressed that convey the same information.
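To see the rules at work, the Python sketch below applies them to the 15 rows excerpted in Table 1.4. It assumes the rules are tried in the order listed, which the text does not state, and it says nothing about how they fare on the remaining 135 examples. The function name `classify` and the tuple layout are our own.

```python
# Rows excerpted in Table 1.4:
# (sepal length, sepal width, petal length, petal width, type), all in cm.
IRIS_SAMPLE = [
    (5.1, 3.5, 1.4, 0.2, "Iris setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris setosa"),
    (4.7, 3.2, 1.3, 0.2, "Iris setosa"),
    (4.6, 3.1, 1.5, 0.2, "Iris setosa"),
    (5.0, 3.6, 1.4, 0.2, "Iris setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris versicolor"),
    (6.9, 3.1, 4.9, 1.5, "Iris versicolor"),
    (5.5, 2.3, 4.0, 1.3, "Iris versicolor"),
    (6.5, 2.8, 4.6, 1.5, "Iris versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris virginica"),
    (5.8, 2.7, 5.1, 1.9, "Iris virginica"),
    (7.1, 3.0, 5.9, 2.1, "Iris virginica"),
    (6.3, 2.9, 5.6, 1.8, "Iris virginica"),
    (6.5, 3.0, 5.8, 2.2, "Iris virginica"),
]


def classify(sl, sw, pl, pw):
    """Apply the rule set in listed order (an assumption about its reading)."""
    if pl < 2.45:
        return "Iris setosa"
    versicolor_rules = (
        sw < 2.10,
        sw < 2.45 and pl < 4.55,
        sw < 2.95 and pw < 1.35,
        2.45 <= pl < 4.45,
        sl >= 5.85 and pl < 4.75,
        sw < 2.55 and pl < 4.95 and pw < 1.55,
        2.45 <= pl < 4.95 and pw < 1.55,
        sl >= 6.55 and pl < 5.05,
        sw < 2.75 and pw < 1.65 and sl < 6.05,
        5.85 <= sl < 5.95 and pl < 4.85,
    )
    if any(versicolor_rules):
        return "Iris versicolor"
    virginica_rules = (
        pl >= 5.15,
        pw >= 1.85,
        pw >= 1.75 and sw < 3.05,
        pl >= 4.95 and pw < 1.55,
    )
    if any(virginica_rules):
        return "Iris virginica"
    return None  # no rule applies


mistakes = [row for row in IRIS_SAMPLE if classify(*row[:4]) != row[4]]
print(len(IRIS_SAMPLE), "rows checked,", len(mistakes), "misclassified")
```

Because all rules for one class are listed together, trying them in order is the same as asking whether any setosa rule fires, then any versicolor rule, then any virginica rule.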

1.2.4 CPU Performance: Introducing Numeric Prediction

Although the iris dataset involves numeric attributes, the outcome -- the type of iris -- is a category, not a numeric value. Table 1.5 shows some data for which the outcome and the attributes are numeric. It concerns the relative performance of computer processing power on the basis of a number of relevant attributes; each row represents 1 of 209 different computer configurations.

The classic way of dealing with continuous prediction is to write the outcome as a linear sum of the attribute values with appropriate weights, for example:

PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX + 0.6410 CACH
      - 0.2700 CHMIN + 1.480 CHMAX

Table 1.5 The CPU Performance Data

No.  Cycle Time (ns)  Min Main Memory (KB)  Max Main Memory (KB)  Cache (KB)  Min Channels  Max Channels  Performance
     MYCT             MMIN                  MMAX                  CACH        CHMIN         CHMAX         PRP
1    125              256                   6000                  256         16            128           198
2    29               8000                  32000                 32          8             32            269
3    29               8000                  32000                 32          8             32            220
4    29               8000                  32000                 32          8             32            172
5    29               8000                  16000                 32          8             16            132
...
207  126              2000                  8000                  0           2             14            52
208  480              512                   8000                  32          0             0             67
209  480              1000                  4000                  0           0             0             45

(The abbreviated variable names are given in the second row of the table.) This is called a regression equation, and the process of determining the weights is called regression, a well-known procedure in statistics. However, the basic regression method is incapable of discovering nonlinear relationships (although variants do exist).
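As a small worked example, the Python sketch below evaluates the regression equation, read as PRP = -55.9 + 0.0489 MYCT + ... + 1.480 CHMAX, on the rows shown in Table 1.5 and prints the prediction next to the recorded performance. The function name `predict_prp` and the tuple layout are our own; the weights are copied from the equation, not refitted.

```python
# Intercept and weights copied from the regression equation in the text.
INTERCEPT = -55.9
WEIGHTS = {
    "MYCT": 0.0489,
    "MMIN": 0.0153,
    "MMAX": 0.0056,
    "CACH": 0.6410,
    "CHMIN": -0.2700,
    "CHMAX": 1.480,
}

# Rows shown in Table 1.5: (No., MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX, PRP).
ROWS = [
    (1, 125, 256, 6000, 256, 16, 128, 198),
    (2, 29, 8000, 32000, 32, 8, 32, 269),
    (3, 29, 8000, 32000, 32, 8, 32, 220),
    (4, 29, 8000, 32000, 32, 8, 32, 172),
    (5, 29, 8000, 16000, 32, 8, 16, 132),
    (207, 126, 2000, 8000, 0, 2, 14, 52),
    (208, 480, 512, 8000, 32, 0, 0, 67),
    (209, 480, 1000, 4000, 0, 0, 0, 45),
]


def predict_prp(myct, mmin, mmax, cach, chmin, chmax):
    """Linear sum of the attribute values with the weights given above."""
    values = {"MYCT": myct, "MMIN": mmin, "MMAX": mmax,
              "CACH": cach, "CHMIN": chmin, "CHMAX": chmax}
    return INTERCEPT + sum(WEIGHTS[name] * values[name] for name in WEIGHTS)


for number, *attributes, actual in ROWS:
    predicted = predict_prp(*attributes)
    print(f"row {number}: predicted {predicted:.1f}, actual {actual}")
```

Rows 2 to 4 have identical attribute values but different recorded performance, so no function of these six attributes, linear or otherwise, can reproduce all three outcomes exactly.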

In the iris and central processing unit (CPU) performance data, all the attributes have numeric values. Practical situations frequently present a mixture of numeric and nonnumeric attributes.

1.2.5 Labor Negotiations: A More Realistic Example

The labor negotiations dataset in Table 1.6 summarizes the outcome of Canadian contract negotiations in 1987 and 1988. It includes all collective agreements reached in the business and personal services sector for organizations with at least 500 members (teachers, nurses, university staff, police, etc.). Each case concerns one contract, and the outcome is whether the contract is deemed acceptable or unacceptable. The acceptable contracts are ones in which agreements were accepted by both labor and management. The unacceptable ones are either known offers that fell through because one party would not accept them or acceptable contracts that had been significantly perturbed to the extent that, in the view of experts, they would not have been accepted.

Table 1.6 The Labor Negotiations Data

Attribute                       Type                         1     2     3     ...  40
Duration                        Years                        1     2     3          2
Wage increase first year        Percentage                   2%    4%    4.3%       4.5
Wage increase second year       Percentage                   ?     5%    4.4%       4.0
Wage increase third year        Percentage                   ?     ?     ?          ?
Cost of living adjustment       [none, tcf, tc]              None  TCF   ?          None
Working hours per week          Hours                        28    35    38         40
Pension                         [none, ret-allw, empl-cntr]  None  ?     ?          ?
Standby pay                     Percentage                   ?     13%   ?          ?
Shift-work supplement           Percentage                   ?     5%    4%         4%
Education allowance             [yes, no]                    Yes   ?     ?          ?
Statutory holidays              Days                         11    15    12         12
Vacation                        [below-avg, avg, gen]        Avg   Gen   Gen        Avg
Long-term disability insurance  [yes, no]                    No    ?     ?          Yes
Dental plan contribution        [none, half, full]           None  ?     Full       Full
Bereavement assistance          [yes, no]                    No    ?     ?          Yes
Health plan contribution        [none, half, full]           None  ?     Full       Half
Acceptability of contract       [good, bad]                  Bad   Good  Good       Good

There are 40 examples in the dataset (plus another 17 that are normally reserved for test purposes). Unlike the other tables here, Table 1.6 presents the examples as columns rather than as rows; otherwise, it would have to be stretched over several pages. Many of the values are unknown or missing, as indicated by question marks.

This is a much more realistic dataset than the others we have seen. It contains many missing values, and it seems unlikely that an exact classification can be obtained.

Figure 1.3 shows two decision trees that represent the dataset. Figure 1.3(a) is simple and approximate: it doesn't represent the data exactly. For example, it will predict bad for some contracts that are actually marked good. But it does make intuitive sense: a contract is bad (for the employee!) if the wage increase in the first year is too small (less than 2.5 percent). If the first-year wage increase is larger than this, it is good if there are lots of statutory holidays (more than 10 days). Even if there are fewer statutory holidays, it is good if the first-year wage increase is large enough (more than 4 percent).
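The three tests just described for Figure 1.3(a) translate directly into nested conditions. The Python sketch below follows the wording of the paragraph and tries the result on the four contracts shown in Table 1.6. How to treat values exactly on a threshold, and what to do when a value is missing, are not specified here, so the choices made in the code are assumptions; the function name `classify_contract` is hypothetical.

```python
def classify_contract(first_year_wage_increase, statutory_holidays):
    """Simple tree from the description of Figure 1.3(a).

    Assumption: a value exactly on a threshold falls on the 'not greater'
    side for the holiday and 4 percent tests, and counts as 'not too small'
    for the 2.5 percent test. Missing values are not handled.
    """
    if first_year_wage_increase < 2.5:
        return "bad"       # first-year wage increase too small
    if statutory_holidays > 10:
        return "good"      # lots of statutory holidays
    if first_year_wage_increase > 4:
        return "good"      # few holidays, but a large enough increase
    return "bad"


# The four contracts shown in Table 1.6:
# (example number, first-year wage increase in percent,
#  statutory holidays in days, recorded acceptability).
CONTRACTS = [
    (1, 2.0, 11, "bad"),
    (2, 4.0, 15, "good"),
    (3, 4.3, 12, "good"),
    (40, 4.5, 12, "good"),
]

predictions = [(number, classify_contract(wage, holidays), actual)
               for number, wage, holidays, actual in CONTRACTS]
for number, predicted, actual in predictions:
    print(f"contract {number}: predicted {predicted}, recorded {actual}")
```

Only two of the sixteen attributes are consulted, which is what makes this tree easy to read and also why it cannot fit every training example.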

Figure 1.3(b) is a more complex decision tree that represents the same dataset. In fact, this is a more accurate representation of the actual dataset that was used to create the tree. But it is not necessarily a more accurate representation of the underlying concept of good versus bad contracts. Look down the left branch. It doesn't seem to make sense intuitively that, if the working hours exceed 36, a contract is bad if there is no health-plan contribution or a full health-plan contribution but is good if there is a half health-plan contribution. It is certainly reasonable that the health-plan contribution plays a role in the decision but not if half is good and both full and none are bad. It seems likely that this is an artifact of the particular values used to create the decision tree rather than a genuine feature of the good versus bad distinction.

The tree in Figure 1.3(b) is more accurate on the data that was used to train the classifier but will probably perform less well on an independent set of test data. It is "overfitted" to the training data -- it follows it too slavishly. The tree in Figure 1.3(a) is obtained from the one in Figure 1.3(b) by a process of pruning.

1.2.6 Soybean Classification: A Classic Machine Learning Success

An often-quoted early success story in the application of machine learning to practical problems is the identification of rules for diagnosing soybean diseases. The data is taken from questionnaires describing plant diseases. There are about 680 examples, each representing a diseased plant. Plants were measured on 35 attributes, each one having a small set of possible values. Examples are labeled with the diagnosis of an expert in plant biology: there are 19 disease categories altogether -- horrible-sounding diseases, such as diaporthe stem canker, rhizoctonia root rot, and bacterial blight, to mention just a few.

Table 1.7 gives the attributes, the number of different values that each can have, and a sample record for one particular plant. The attributes are placed into different categories just to make them easier to read.

Here are two example rules, learned from this data:

If   [leaf condition is normal and
      stem condition is abnormal and
      stem cankers is below soil line and
      canker lesion color is brown]
then diagnosis is rhizoctonia root rot
If   [leaf malformation is absent and
      stem condition is abnormal and
      stem cankers is below soil line and
      canker lesion color is brown]
then diagnosis is rhizoctonia root rot

These rules nicely illustrate the potential role of prior knowledge -- often called domain knowledge -- in machine learning, because the only difference between the two descriptions is leaf condition is normal versus leaf malformation is absent. In this domain, if the leaf condition is normal, then leaf malformation is necessarily absent, so one of these conditions happens to be a special case of the other. Thus, if the first rule is true, the second is necessarily true as well. The only time the second rule comes into play is when leaf malformation is absent but leaf condition is not normal -- that is, when something other than malformation is wrong with the leaf. This is certainly not apparent from a casual reading of the rules.
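The relationship between the two rules can be spelled out by enumeration. The Python sketch below fixes the three stem conditions that both rules share, varies the two leaf attributes, and discards the one combination that domain knowledge rules out. The dictionary keys and the helper `allowed_by_domain` are our own illustration and do not come from the original study.

```python
from itertools import product

# Stem conditions common to both rules, set so that they are satisfied.
SHARED = {
    "stem_condition": "abnormal",
    "stem_cankers": "below soil line",
    "canker_lesion_color": "brown",
}


def stem_part_holds(plant):
    return all(plant[key] == value for key, value in SHARED.items())


def rule1(plant):
    return plant["leaf_condition"] == "normal" and stem_part_holds(plant)


def rule2(plant):
    return plant["leaf_malformation"] == "absent" and stem_part_holds(plant)


def allowed_by_domain(plant):
    """Domain knowledge: a normal leaf cannot be malformed."""
    return not (plant["leaf_condition"] == "normal"
                and plant["leaf_malformation"] == "present")


plants = [dict(SHARED, leaf_condition=condition, leaf_malformation=malformation)
          for condition, malformation in product(("normal", "abnormal"),
                                                 ("absent", "present"))]
valid = [p for p in plants if allowed_by_domain(p)]

# Whenever rule 1 fires on a valid plant, rule 2 fires as well.
rule1_implies_rule2 = all(rule2(p) for p in valid if rule1(p))

# Valid plants on which only the second rule fires.
only_rule2 = [(p["leaf_condition"], p["leaf_malformation"])
              for p in valid if rule2(p) and not rule1(p)]

print(rule1_implies_rule2, only_rule2)
```

Without the `allowed_by_domain` filter nothing in the rules themselves reveals the overlap, which is the point of the paragraph above: the redundancy is visible only to someone who knows how the two leaf attributes are related.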

Table 1.7 The Soybean Data

  Attribute Number of Values Sample Value
Environment Time of occurrence 7 July
  Precipitation 3 Above normal
  Temperature 3 Normal
  Cropping history 4 Same as last year
  Hail damage 2 Yes
  Damaged area 4 Scattered
  Severity 3 Severe
  Plant height 2 Normal
  Plant growth 2 Abnormal
  Seed treatment 3 Fungicide
  Germination 3 Less than 80%
Seed Condition 2 Normal
  Mold growth 2 Absent
  Discoloration 2 Absent
  Size 2 Normal
  Shriveling 2 Absent
Fruit Condition of fruit pods 3 Normal
  Fruit spots 5 ---
Leaf Condition 2 Abnormal
  Leaf spot size 3 ---
  Yellow leaf spot halo 3 Absent
  Leaf spot margins 3 ---
  Shredding 2 Absent
  Leaf malformation 2 Absent
  Leaf mildew growth 3 Absent
Stem Condition 2 Abnormal
  Stem lodging 2 Yes
  Stem cankers 4 Above soil line
  Canker lesion color 3 ---
  Fruiting bodies on stems 2 Present
  External decay of stem 3 Firm and dry
  Mycelium on stem 2 Absent
  Internal discoloration 3 None
  Sclerotia 2 Absent
Root Condition 3 Normal
Diagnosis   19 Diaporthe stem canker

Research on this problem in the late 1970s found that these diagnostic rules could be generated by a machine learning algorithm, along with rules for every other disease category, from about 300 training examples. The examples were carefully selected from the corpus of cases as being quite different from one another -- "far apart" in the example space. At the same time, the plant pathologist who had produced the diagnoses was interviewed, and his expertise was translated into diagnostic rules. Surprisingly, the computer-generated rules outperformed the expert's rules on the remaining test examples. They gave the correct disease top ranking 97.5 percent of the time compared with only 72 percent for the expert-derived rules. Furthermore, not only did the learning algorithm find rules that outperformed those of the expert collaborator, but the same expert was so impressed that he allegedly adopted the discovered rules in place of his own!
