
data gravity

What is data gravity?

Data gravity is the ability of a body of data to attract applications, services and other data. The force of gravity, in this context, can be thought of as the way software, services and business logic are drawn to data relative to its mass, or the amount of data. The larger the amount of data, the more applications, services and other data will be attracted to it and drawn into its repository. Data lakes and data warehouses are two prime examples of repositories with strong data gravity.
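
Purely to illustrate the physics metaphor the term borrows -- not McCrory's patent-pending formula -- the pull can be pictured with Newton's law of gravitation, relabeling the masses as the size of the data set and of the workload, and the distance as the network separation between them:

    F \;=\; G\,\frac{m_{\text{data}}\,m_{\text{app}}}{d^{2}}

Read this way, a larger body of data and a shorter network distance between data and application mean a stronger attraction.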

Data gravity has both an upside and a downside. The upside is that data sets with high gravity attract more data, and modern analytics have the greatest utility when an abundance of data is available -- hence the term big data. Moreover, very large data sets tend to be useful across a broader range of applications. On the other hand, the larger a data set grows, the more difficult and expensive it becomes to maintain.

Why is data gravity important?

Data gravity is important for several reasons. An intentional and well-planned growth in the gravity of data sets can greatly boost their utility and value. It can also have the downstream effect of increasing the accuracy and applicability of the analyses the data might yield.

It's also important to monitor the gravity of growing bodies of data to curb negative effects and ensure that the data doesn't become too unwieldy to maintain.

In practical terms, moving data farther and more frequently degrades workload performance, so it makes sense for data to be amassed in one place and for associated applications and services to be located nearby. This is one reason why internet of things (IoT) applications must be hosted as close as possible to where the data they use is generated and stored. Increasing data gravity, then, is a matter of configuring and storing data in a way that optimizes its utility and accessibility.
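
As a rough, hypothetical sketch of why that proximity matters (the data size, bandwidth and latency figures below are illustrative assumptions, not benchmarks), compare the time to move a large data set to a distant application with the time to move a small query result from an application hosted next to the data:

    # Hypothetical figures, chosen only to show the difference in scale.
    def transfer_time_seconds(size_gb: float, bandwidth_gbps: float, rtt_ms: float) -> float:
        """Approximate wire time: payload divided by throughput, plus one round trip."""
        payload_bits = size_gb * 8 * 1e9               # gigabytes -> bits
        return payload_bits / (bandwidth_gbps * 1e9) + rtt_ms / 1000.0

    # Moving a 200 TB data set over an assumed 10 Gbps link with 40 ms round-trip time...
    dataset_hours = transfer_time_seconds(200_000, 10, 40) / 3600
    # ...versus returning a 2 GB result from an application co-located with the data.
    result_seconds = transfer_time_seconds(2, 10, 1)

    print(f"Move the data set: ~{dataset_hours:.0f} hours")     # roughly 44 hours
    print(f"Move the result:   ~{result_seconds:.1f} seconds")  # roughly 1.6 seconds

The specific numbers are placeholders; the point is that the cost of moving the data dwarfs the cost of moving the computation or its results, which is the pull data gravity describes.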

Hyperconvergence is often used to illustrate the concept of data gravity. In a hyperconverged infrastructure, compute, networking and virtualization resources are tightly integrated with data storage within a commodity hardware box. The greater the amount of data, and the more other data that can be connected to it, the more value the data has for analytics.

Developers and managers of high-volume cloud applications and IoT systems are among the IT professionals who maintain a keen awareness of data gravity and actively cultivate data sources with configurations that optimize it. Data sources optimized for high gravity strike a balance between maximum utility and the diminishing returns of burdensome maintenance.

Implications of data gravity

Data gravity can be a friend or an enemy. If it isn't carefully monitored and planned for, it can easily become the latter. The two biggest issues tend to be increased latency and diminished portability.

Very large data sets generally need to be close to the applications using them, particularly in on-premises deployments and scenarios using complex workflows. When applications are farther away from the data centers hosting the data they need, latency increases and performance suffers.

For this reason, cloud providers are often the right choice for hosting data sets that are likely to achieve high gravity. Data hosted in data lakes, for instance, scales more easily as it grows, reducing the complications that can arise with rapid growth. Cloud data, in general, can be managed effectively to balance throughput and workload, though this can become expensive.

The larger a data set becomes, the more difficult it can be to move if that becomes necessary. Cloud storage egress fees are often high, and the more data an organization stores, the more expensive it is to move, to the point where migrating between platforms can become uneconomical. Data gravity, then, must be considered when choosing a host environment for the data. It's wise to have migration plans in place, even if no migration is expected anytime soon, and those plans should reflect the data set's eventual size rather than its current volume.
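
A hedged back-of-the-envelope example (the per-gigabyte egress rate below is a placeholder, not any provider's published price) shows how the cost of leaving a platform grows linearly with the size of the data set:

    # Hypothetical egress-cost estimate; the rate is an assumed placeholder, not a quoted price.
    EGRESS_RATE_PER_GB = 0.09  # assumed USD per GB transferred out

    def egress_cost_usd(size_tb: float, rate_per_gb: float = EGRESS_RATE_PER_GB) -> float:
        return size_tb * 1_000 * rate_per_gb  # terabytes -> gigabytes, then apply the rate

    for size_tb in (10, 100, 1_000):
        print(f"{size_tb:>5} TB -> ~${egress_cost_usd(size_tb):,.0f} to move off-platform")
    # Output: 10 TB -> ~$900, 100 TB -> ~$9,000, 1,000 TB -> ~$90,000

The larger the data set grows in place, the harder the same migration becomes to justify, which is why plans are best drawn up early.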

There's also the problem of application dependencies: every application that accesses the data set would have to change how it connects in the event of a migration, and the more applications there are, the more access adaptations are required.

Artificial intelligence (AI) and IoT applications also present data gravity challenges. Forrester points out that new sources and applications -- including machine learning, AI, edge devices or IoT -- risk creating their own data gravity, especially if organizations fail to plan for data growth.

The growth of data at the enterprise edge poses a challenge for locating services and applications unless firms can filter or analyze the data in situ, or possibly in transit. Centralizing that data is likely to be expensive, and wasteful if much of it isn't needed.
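
A minimal sketch of that in-situ approach, assuming a simple batch of numeric sensor readings (the threshold, field names and values are hypothetical): an edge process reduces each raw batch to a compact summary, and only the summary is sent to the central platform.

    # Hypothetical edge-side filtering: keep raw readings local, forward only a summary.
    from statistics import mean

    ALERT_THRESHOLD = 85.0  # assumed alert level for this illustration

    def summarize(readings):
        """Reduce a batch of raw readings to a small summary plus any out-of-range values."""
        return {
            "count": len(readings),
            "mean": round(mean(readings), 2),
            "max": max(readings),
            "alerts": [r for r in readings if r > ALERT_THRESHOLD],
        }

    raw_batch = [71.2, 73.5, 70.9, 88.1, 72.4]  # raw data never leaves the edge site
    print(summarize(raw_batch))                  # only this summary is centralized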

How to manage data gravity

Managing the gravity of big data sets is often challenging but can be well worth the effort. Keeping applications and data sets close together -- for example, co-located on premises -- is one approach. Cloud data deployments are another often wise choice, given the capacity of most cloud services to scale easily and fine-tune performance.

Other steps can be taken to manage data gravity well. Well-defined data management standards and policies are a positive step, ensuring the proper use of the data sets involved and regulating their access effectively. Good data management also boosts data integrity, a particular concern when the data is used to generate analytics.

Strong data governance also enhances data gravity management, ensuring meaningful accountability and responsibility for the data.

Well-planned and well-executed data integration is another data gravity management advantage. When disparate data sets can be effectively integrated into a single data source, both access and maintenance are simplified, and potential errors are reduced. Infrastructure supporting data lakes can often provide opportunities for such integration.
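
A minimal sketch of that kind of consolidation, assuming two flat-file sources that share a customer_id column (the file names and columns are hypothetical), using the pandas library:

    # Hypothetical consolidation of two disparate sources into one integrated table.
    import pandas as pd

    orders = pd.read_csv("orders.csv")        # assumed columns: order_id, customer_id, total
    customers = pd.read_csv("customers.csv")  # assumed columns: customer_id, region

    # Join on the shared key so applications can query a single integrated data set.
    integrated = orders.merge(customers, on="customer_id", how="left")
    integrated.to_csv("integrated_orders.csv", index=False)  # one consolidated source to maintain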

The history of data gravity

IT expert Dave McCrory coined the term data gravity in 2010 as an analogy for the physical way that objects with more mass naturally attract objects with less mass.

According to McCrory, data gravity is moving to the cloud. As more and more internal and external business data is moved to the cloud or generated there, data analytics tools are also increasingly cloud-based. His explanation of the term differentiates between naturally occurring data gravity and similar changes created through external forces such as legislation, throttling and manipulative pricing, which McCrory refers to as artificial data gravity.

In 2020, McCrory released the "Data Gravity Index," a report that measures, quantifies and predicts the intensity of data gravity for the Forbes Global 2000 enterprises across 53 metros and 23 industries. The report includes a patent-pending formula for data gravity and a methodology based on thousands of attributes of Global 2000 companies' presence in each location, along with variables for each location including the following:

  • Gross domestic product (GDP).
  • Population.
  • Number of employees.
  • Technographic data.
  • IT spend.
  • Average bandwidth and latency.
  • Data flows.

The latest report, "Data Gravity Index 2.0," continues the examination of global data proliferation as correlated with GDP across more than 190 countries and 500 metros.
