data set
What is a data set?
A data set, sometimes spelled dataset, is a collection of related data that's usually organized in a standardized format. Data sets are used for analytics, business intelligence, artificial intelligence (AI) model training and a variety of other use cases. Data sets can vary significantly in both size and type of data. For example, a data set might contain information about tree species, ocean temperatures, regional sales totals, fruit prices, lottery winners, diseases or just about any other type of data.
Although formats differ from one data set to another, their underlying organization can often be conceptualized as columns and rows, such as those found in spreadsheets or database tables. Each column represents a variable that describes the data, and each row represents a record that contains a related set of variable values. A single value within a data set is sometimes referred to as a datum or data point.
Many data sets are freely available online. They can be used to develop and test applications, train AI models, perform analytics or carry out other projects. For example, the figure below shows the air quality data set from Data.gov, which offers a wide range of free data sets. The air quality data set contains air quality surveillance data for New York City.
In the figure, the air quality data set is displayed in a Microsoft Excel spreadsheet. However, the data originated as a comma-separated values (CSV) file downloaded from Data.gov. The data set includes columns such as Unique ID, Geo Place Name and Time Period, which are three of the data set's variables.
The data set also includes rows for each air quality measurement, specific to a place and time. That is, each row is a record of a specific air quality measurement. The record is made up of a set of related values, with each value corresponding to a column, i.e., variable. For example, the value in the Start_Date column for the first record is 12/1/2010.
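The column-and-row structure can be sketched with Python's standard csv module. This is a minimal illustration, not the way Data.gov processes the file. The header names are an assumption borrowed from the element names in the XML version of the data set, so the real CSV header row may be spelled differently. The single record is the air quality record with the unique ID 172653.

```python
import csv
import io

# Header names are assumed (taken from the XML version's element names);
# the header row in the actual download may differ.
CSV_TEXT = (
    "unique_id,indicator_id,name,measure,measure_info,geo_type_name,"
    "geo_join_id,geo_place_name,time_period,start_date,data_value\n"
    "172653,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,203,"
    "Bedford Stuyvesant – Crown Heights,Annual Average 2011,12/01/2010,25.3\n"
)


def read_records(text):
    """Return each row as a dict that maps a column (variable) to its value."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_records(CSV_TEXT)
columns = list(records[0].keys())

# Each column is a variable; each row is one record.
print(len(columns), "variables;", len(records), "record(s)")
print(records[0]["geo_place_name"], "|", records[0]["start_date"])
```

Note that csv.DictReader returns every value as a string. Converting values such as data_value to numbers is left to the caller.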
Data set vs. database
The term data set is sometimes confused with the term database, but the two have different meanings. A database is used to store and manage data. It is part of a larger management platform that includes features for securing, accessing, updating and in other ways working with and protecting data. A data set is simply a file or other structure that contains the data values in a specific format. A database might contain the data from one or more data sets, but the two are not the same.
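The distinction can be illustrated by loading a data set into a small database. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name, column names and the choice of three variables are assumptions made for illustration; the one row is taken from the air quality data set.

```python
import sqlite3

# The data set: plain values in a fixed layout (unique_id, name, data_value).
DATA_SET = [("172653", "Nitrogen dioxide (NO2)", 25.3)]


def load_into_database(rows):
    """Copy data set rows into an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    # Table and column names are illustrative assumptions.
    conn.execute(
        "CREATE TABLE air_quality ("
        "unique_id TEXT PRIMARY KEY, name TEXT NOT NULL, data_value REAL)"
    )
    conn.executemany("INSERT INTO air_quality VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load_into_database(DATA_SET)

# The database adds management features on top of the data, such as querying.
row = conn.execute(
    "SELECT name, data_value FROM air_quality WHERE unique_id = ?", ("172653",)
).fetchone()
print(row)

# It also protects the data: the primary key rejects a duplicate record.
try:
    conn.execute("INSERT INTO air_quality VALUES (?, ?, ?)", DATA_SET[0])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print("duplicate rejected:", duplicate_rejected)
```

The list of tuples plays the role of the data set, while the SQLite connection stands in for the database that stores and manages it.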
Data set formats
Data sets are available in a variety of formats, such as JavaScript Object Notation (JSON) and Extensible Markup Language (XML). Such formats provide a standardized structure for sharing data across multiple platforms and applications. The data itself is usually written in plain text, so it can be easily filtered, updated and in other ways transformed to meet specific requirements.
Some data sets are available in more than one format. For example, the air quality data set shown above can be downloaded from Data.gov as a CSV, JSON, XML or Resource Description Framework (RDF) file. When a data set is available in multiple formats, the expectation is that each file contains the same set of records, with each record formatted according to the applicable standard.
A good way to demonstrate how this works is to look at the same air quality record in each of the four formats. For instance, one of the records has a unique ID value of 172653, which distinguishes that record from all other records. The following four samples show the record in each format:
CSV record:
172653,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,203,Bedford Stuyvesant – Crown Heights,Annual Average 2011,12/01/2010,25.3
JSON record:
[ "row-frzi_7bar_4cbg", "00000000-0000-0000-AF08-C339B5581012", 0, 1698955938, null, 1698955938, null, "{ }", "172653", "375", "Nitrogen dioxide (NO2)", "Mean", "ppb", "UHF34", "203", "Bedford Stuyvesant – Crown Heights", "Annual Average 2011", "2010-12-01T00:00:00", "25.30", null ]
XML record:
<row _id="row-frzi_7bar_4cbg" _uuid="00000000-0000-0000-AF08-C339B5581012" _position="0" _address="https://data.cityofnewyork.us/resource/c3uy-2p5r/172653"><unique_id>172653</unique_id><indicator_id>375</indicator_id><name>Nitrogen dioxide (NO2)</name><measure>Mean</measure><measure_info>ppb</measure_info><geo_type_name>UHF34</geo_type_name><geo_join_id>203</geo_join_id><geo_place_name>Bedford Stuyvesant – Crown Heights</geo_place_name><time_period>Annual Average 2011</time_period><start_date>2010-12-01T00:00:00</start_date><data_value>25.30</data_value></row>
RDF record:
<rdf:Description rdf:about="https://data.cityofnewyork.us/resource/c3uy-2p5r/172653">
  <socrata:rowID>row-frzi_7bar_4cbg</socrata:rowID>
  <rdfs:member rdf:resource="https://data.cityofnewyork.us/resource/c3uy-2p5r"/>
  <ds:unique_id>172653</ds:unique_id>
  <ds:indicator_id>375</ds:indicator_id>
  <ds:name>Nitrogen dioxide (NO2)</ds:name>
  <ds:measure>Mean</ds:measure>
  <ds:measure_info>ppb</ds:measure_info>
  <ds:geo_type_name>UHF34</ds:geo_type_name>
  <ds:geo_join_id>203</ds:geo_join_id>
  <ds:geo_place_name>Bedford Stuyvesant – Crown Heights</ds:geo_place_name>
  <ds:time_period>Annual Average 2011</ds:time_period>
  <ds:start_date>2010-12-01T00:00:00</ds:start_date>
  <ds:data_value>25.30</ds:data_value>
</rdf:Description>
Each format provides the same core information but does so in a way different from the others. When a data set is available in multiple formats, data scientists and other users can choose whichever format best meets their needs for a specific project or environment. Because the formats are standardized, users can load the data into a system that supports the format, making it relatively simple to view, modify and manipulate data from multiple sources.
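One way to check that two formats carry the same core information is to parse each record and compare the values after normalizing them. The sketch below does this for the CSV and XML versions of record 172653 using only the Python standard library. It assumes the CSV columns appear in the same order as the XML elements, and it omits the XML row's attributes for brevity. The normalize helper is a hypothetical name introduced for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET
from datetime import datetime

CSV_RECORD = (
    "172653,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,203,"
    "Bedford Stuyvesant – Crown Heights,Annual Average 2011,12/01/2010,25.3"
)

# Same record as XML; the row's attributes are left out for brevity.
XML_RECORD = (
    "<row><unique_id>172653</unique_id><indicator_id>375</indicator_id>"
    "<name>Nitrogen dioxide (NO2)</name><measure>Mean</measure>"
    "<measure_info>ppb</measure_info><geo_type_name>UHF34</geo_type_name>"
    "<geo_join_id>203</geo_join_id>"
    "<geo_place_name>Bedford Stuyvesant – Crown Heights</geo_place_name>"
    "<time_period>Annual Average 2011</time_period>"
    "<start_date>2010-12-01T00:00:00</start_date>"
    "<data_value>25.30</data_value></row>"
)


def from_xml(text):
    """Map each child element's tag to its text."""
    return {child.tag: child.text for child in ET.fromstring(text)}


def from_csv(text, fields):
    """Pair the CSV values with field names (assumed to be in the same order)."""
    values = next(csv.reader(io.StringIO(text)))
    return dict(zip(fields, values))


def normalize(record):
    """Convert the date and measurement so format-specific spellings compare equal."""
    out = dict(record)
    raw_date = out["start_date"]
    fmt = "%Y-%m-%dT%H:%M:%S" if "T" in raw_date else "%m/%d/%Y"
    out["start_date"] = datetime.strptime(raw_date, fmt).date()
    out["data_value"] = float(out["data_value"])
    return out


xml_record = from_xml(XML_RECORD)
csv_record = from_csv(CSV_RECORD, list(xml_record))
same = normalize(xml_record) == normalize(csv_record)
print("records match:", same)
```

Without the normalization step, the comparison would fail because the two formats spell the date and the measurement differently (12/01/2010 vs. 2010-12-01T00:00:00, and 25.3 vs. 25.30), even though they describe the same values.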
Types of data sets
Data sets can be categorized in different ways. One common approach, which is often used in statistics, is to break them down into the following categories:
- Numerical. All the values within the data set are numerical. Numerical data sets are used for a variety of analytics, ranging from customer sales to weather station readings. This type of data set is also called quantitative.
- Bivariate. The data set contains two variables that express a relationship between the data. For example, a data set might include a temperature variable and a time variable. Together the variables provide insight into how temperature fluctuations are related to the time of day.
- Multivariate. This type of data set contains three or more variables that are somehow related. For example, a data set might include variables that describe a product's color, size, weight and other characteristics. Multivariate data sets often define complex relationships between the data.
- Categorical. A categorical data set divides the data into distinct groups based on the specific qualities of people or objects. There are two types of categorical data: dichotomous and polytomous. Dichotomous data contains only two values, such as true and false. Polytomous data can contain more than two values, although still a limited number, such as hair colors or shirt sizes.
- Correlation. This type of data set contains variables whose values tend to change together. For instance, the variables in a data set related to ice cream sales might show a correlation between the outside temperature and the amount of sales. Correlations can be positive (variables move in the same direction), negative (variables move in opposite directions) or zero (variables show no linear relationship).
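The positive, negative and zero cases described for correlation data sets can be sketched with the Pearson correlation coefficient, one common measure of linear correlation. The article does not name a specific measure, so this choice is an assumption. All values below are hypothetical and were made up so that the three cases come out cleanly.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, invented for illustration only.
temperature = [1, 2, 3, 4, 5]
ice_cream_sales = [2, 4, 6, 8, 10]  # rises as temperature rises
heating_costs = [10, 8, 6, 4, 2]    # falls as temperature rises
unrelated = [1, -1, 0, -1, 1]       # no linear relationship with temperature

print(pearson(temperature, ice_cream_sales))  # positive correlation
print(pearson(temperature, heating_costs))    # negative correlation
print(pearson(temperature, unrelated))        # zero correlation
```

The coefficient ranges from -1 to 1. A value near zero indicates only that there is no linear relationship; the variables could still be related in some other way.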
The term data set originated with IBM, where its meaning was similar to that of file. In an IBM mainframe operating system, a data set is a named group of records that contains individual data units formatted in an IBM-prescribed way and accessed by a specific access method based on the data set format. Format types include sequential, relative sequential, indexed sequential and partitioned. Access methods include the Virtual Storage Access Method (VSAM) and the Indexed Sequential Access Method (ISAM).
A data set is also an older and now deprecated term for a modem.
Working with numerical data
Numerical data within a data set is often characterized by specific measures that are used in statistics and analytics to describe the properties of a statistical distribution. Such a distribution reflects the set of possible values within the target data. The most common measures include the following:
- Mean. The mean is the average of all the values in the data set, determined by adding the values together and then dividing by the total number of values.
- Median. This is the data set's middle value, based on the values being sorted in ascending or descending order. If the data set contains an even number of values, the median value is determined by finding the mean of the two middle numbers.
- Mode. Mode is the value that occurs most often. A data set can contain multiple modes if each set of repeated values occurs at the same frequency, such as a data set that includes three instances of 5 and three instances of 6. If there is only one instance of each value, the data set is said to include no modes.
- Range. The difference between the minimum value and maximum value in the data set is the range.
- Minimum. This represents the lowest value in the data set.
- Maximum. This is the highest value in the data set.
- Sum. The total of all values in the data set is the sum.
- Count. Count represents the number of values in the data set.
To better understand how these measures work, consider the following numerical data set:
{2,4,4,6,8,10,13,14,16,18,20,22}
This is a very small numerical data set that contains 12 values, with only one value repeated. All of the values are integers. When the measures are applied to the data, they return the following properties:
- Mean = 11.417 (rounded to three decimal places).
- Median = 11.5.
- Mode = 4 (two instances).
- Range = 20.
- Minimum = 2.
- Maximum = 22.
- Sum = 137.
- Count = 12.
If the data set had contained another pair of duplicate numbers, such as two instances of 10, there would have been two modes: 4 and 10. However, if there had been three instances of 4 and only two instances of 10, 4 would have been the only mode.
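The measures for the sample data set can be reproduced with a short Python sketch built on the standard statistics and collections modules. The describe and modes helpers are hypothetical names introduced for this example. The modes helper follows the convention described above: it returns every value tied for the highest frequency, or an empty list when no value repeats.

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """Return all values tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


def describe(values):
    """Compute the common measures for a non-empty list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": modes(values),
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": len(values),
    }


DATA = [2, 4, 4, 6, 8, 10, 13, 14, 16, 18, 20, 22]
summary = describe(DATA)

for measure, value in summary.items():
    print(f"{measure}: {value}")

# Variations on the mode, matching the cases discussed above.
print(modes(DATA + [10]))     # a second pair of duplicates: two modes
print(modes(DATA + [4, 10]))  # three 4s and two 10s: 4 is the only mode
```

The sketch assumes a non-empty list; an empty one would raise an error in max and mean.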