Tips to plan storage elements of artificial intelligence
Author and data engineer Chinmay Arankalle explores how organizations can manage storage for AI. Management and storage options include data lakes and high-performance storage.
Organizations create more data than ever before. Newer, faster technologies and storage systems have risen to the challenge, but the storage elements of artificial intelligence can be complex.
AI data often requires high-performance, scalable storage and long retention periods. Organizations must find cost-effective storage systems to protect, manage and analyze large amounts of data. To ensure short- and long-term success, it's crucial for organizations to assess their storage and data management needs throughout an AI project.
Chinmay Arankalle, author and data engineer, has spent years working on big data systems. In his recent book, The Artificial Intelligence Infrastructure Workshop, Arankalle and his co-authors discuss the complexities of AI and how organizations can navigate them. The book discusses data center architecture for AI workloads, including machine learning and large data sets.
In this Q&A, Energy Exemplar Senior Data Engineer Arankalle discusses some of the factors organizations should consider in their AI storage plans. These factors include cost management, priority setting and scalability.
Editor's note: This transcript has been edited for length and clarity.
What do you think are some of the common storage and compute challenges we're facing in this big data era?
Chinmay Arankalle: The data field is growing at a high speed. One of the main challenges we see in front of us [is] the various kinds of data we can come across. Currently, we divide data into three categories: structured, semistructured and unstructured data. We follow ELT [extract, load, transform] in these types of high-volume data stores.
The end goal is not fixed, usually. Maybe now, the data we have has some use, [but] after 10 years, the data might have some different use altogether. Depending on the usage, we decide what format the data should be stored in. For example, if you want to query the data, the obvious choice would be a columnar data format, like Parquet, which supports ad hoc querying. It's supported by different parallel processing frameworks, like Apache Spark. But, in the future, new use cases might come up. As time progresses, what we need from the data could change [for new AI models]. And that will give rise to new data formats.
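To illustrate why a columnar layout suits ad hoc queries, here is a minimal sketch in plain Python. It is not from the interview and does not use Parquet or Spark; the records and field names are made up for illustration. It only shows the idea that a query reads the columns it needs and skips the rest.

```python
# Hypothetical sample records; field names are illustrative only.
rows = [
    {"customer_id": 1, "region": "EU", "amount": 120.0},
    {"customer_id": 2, "region": "US", "amount": 75.5},
    {"customer_id": 3, "region": "EU", "amount": 30.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list per field (column-oriented)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An ad hoc aggregate touches only the "region" and "amount" columns;
# the "customer_id" column is never read.
eu_total = sum(
    amount
    for region, amount in zip(columns["region"], columns["amount"])
    if region == "EU"
)
print(eu_total)
```

A real columnar format adds compression, typed encodings and per-column statistics on top of this layout, which this sketch leaves out.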
The second challenge ahead of us is how we can utilize the older formats along with the newer ones. Since we can't suddenly get rid of the old data, there should be some harmony between older data and the new data.
The third challenge is mostly about availability of the data. For example, if we have stored the data in a partition format, and we would really like to have subsecond latency for the particular snapshot of the data, it becomes quite difficult if the partitioning strategy is not appropriate.
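As a rough illustration of the partitioning point, the sketch below groups records by snapshot date so that reading one snapshot opens a single partition instead of scanning everything. The partition key, field names and in-memory dictionary are assumptions made for this example, not a description of any particular storage system.

```python
from collections import defaultdict
from datetime import date


def partition_key(record):
    """Assumed strategy: one partition per snapshot date."""
    return record["snapshot_date"].isoformat()


def write_partitioned(records):
    """Group records into partitions keyed by snapshot date."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_key(record)].append(record)
    return partitions


def read_snapshot(partitions, snapshot_date):
    """Return one snapshot by looking up only the matching partition."""
    return partitions.get(snapshot_date.isoformat(), [])


# Hypothetical records for illustration.
records = [
    {"snapshot_date": date(2021, 1, 1), "customer_id": 1},
    {"snapshot_date": date(2021, 1, 1), "customer_id": 2},
    {"snapshot_date": date(2021, 1, 2), "customer_id": 1},
]
partitions = write_partitioned(records)
snapshot = read_snapshot(partitions, date(2021, 1, 2))
print(len(snapshot))
```

If the data were partitioned by a different key, such as customer ID, the same snapshot read would have to touch every partition, which is the latency problem described above.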
Similarly, the fourth challenge in front of us is using retention features. Storage nodes have some cost behind them. We need to make some segregation and push down the required data to archives. The main part here is the retention of the data. It's very, very difficult to keep track of stored data and delete a particular customer's data. That is where retention comes into the picture.
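One way to picture the retention idea is a small rule that decides, by age, whether data stays on fast storage, moves to an archive or is deleted, plus an index that records where each customer's data lives. The thresholds, paths and names below are assumptions chosen for illustration only.

```python
from datetime import date

# Assumed policy values, for illustration only.
HOT_DAYS = 90
RETAIN_DAYS = 365 * 7


def retention_action(created, today):
    """Decide what to do with a data set based on its age in days."""
    age = (today - created).days
    if age > RETAIN_DAYS:
        return "delete"
    if age > HOT_DAYS:
        return "archive"
    return "keep_hot"


def locations_for_customer(index, customer_id):
    """Look up every stored location that holds a customer's data."""
    return index.get(customer_id, [])


today = date(2024, 1, 1)
print(retention_action(date(2023, 12, 1), today))
print(retention_action(date(2023, 1, 1), today))
print(retention_action(date(2010, 1, 1), today))

# Hypothetical index mapping customer IDs to storage paths.
customer_index = {42: ["hot/2023-12/part-0", "archive/2021/part-7"]}
print(locations_for_customer(customer_index, 42))
```

Keeping such an index up to date at write time is what makes a later request to delete one customer's data tractable.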
The last piece is quality of the data. Usually, in these types of storage, the quality is undermined when we load the data. We load the data as is -- this is the practice I have seen, which should be avoided.
What's the significance of data lakes and data lakehouses compared to data warehouses in AI?
Arankalle: Data warehousing is a 25- or 30-year-old concept where we basically store data in one place and it is cross-referenced everywhere. If we must update that piece of data, then we update it only once, in that single place; everywhere else just cross-references it.
But, soon, the data warehouse concept began to fall short for new latency requirements when data grew and new data formats came into existence. To cope with that, we needed to build a system where we don't need to validate the records, but directly store the data. And, after loading the data, we start background processes to read the required data, parse it, process it, validate it and then store it.
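The load-first, validate-later flow described here can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: the zone names, the JSON input and the validation rule are all hypothetical, and a real system would run the second step as a background job.

```python
import json

raw_zone = []  # landing area: records stored exactly as received


def load(raw_line):
    """Store the incoming record as is, with no validation on the write path."""
    raw_zone.append(raw_line)


def is_valid(record):
    """Assumed rule: customer_id must be an int and amount a number."""
    return (
        isinstance(record, dict)
        and isinstance(record.get("customer_id"), int)
        and isinstance(record.get("amount"), (int, float))
    )


def transform(raw_lines):
    """Later step: read, parse and validate, then split good from bad records."""
    curated, rejected = [], []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        if is_valid(record):
            curated.append(record)
        else:
            rejected.append(line)
    return curated, rejected


# Hypothetical input: one good record, one bad type, one malformed line.
load('{"customer_id": 1, "amount": 9.5}')
load('{"customer_id": "x", "amount": 2}')
load("not json")

curated, rejected = transform(raw_zone)
print(len(curated), len(rejected))
```

Keeping a rejected list, rather than dropping bad records silently, is one way to stop quality problems from going unnoticed when data is loaded as is.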
That was exactly the opposite of a data warehouse, where we validate everything before we store it. This was done because the write speed of the new applications had increased a lot. If we use a data warehouse for a high number of users to write their data, then it would be really difficult because a data warehouse, like relational databases, puts a lock on the tables for writing and removes it after writing is done. And, if we use that locking mechanism to store millions of records at the same time, then we are basically waiting forever.
To avoid that, we use a nonshared [shared-nothing] architecture, with no locks and append-only writes, to basically load the data. A data lake is a kind of boon in disguise because it gives us high-speed writes. High-speed read was also possible due to MongoDB or certain NoSQL databases where you can just query with an ID and that ID will fetch all the records for a customer. But, when it comes to updating the data or modifying the data, the distributed architecture was a nightmare; you'd have to find where the data is used, and you must update it everywhere.
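A minimal sketch of the append-only idea follows. It is a single-process illustration, so it does not demonstrate concurrency or the absence of locks; the class and method names are hypothetical. It shows that writes never modify existing entries and that an "update" is just a newer entry for the same key.

```python
class AppendOnlyLog:
    """Toy append-only store: entries are added, never changed in place."""

    def __init__(self):
        self._entries = []

    def append(self, key, value):
        """Add a new entry and return its offset in the log."""
        self._entries.append((key, value))
        return len(self._entries) - 1

    def latest(self, key):
        """Return the most recently appended value for a key, or None."""
        for entry_key, value in reversed(self._entries):
            if entry_key == key:
                return value
        return None

    def history(self, key):
        """Return every value ever appended for a key, oldest first."""
        return [value for entry_key, value in self._entries if entry_key == key]


log = AppendOnlyLog()
log.append("customer-1", {"city": "Pune"})
log.append("customer-2", {"city": "Oslo"})
log.append("customer-1", {"city": "Adelaide"})  # an "update" is a new entry

print(log.latest("customer-1"))
print(log.history("customer-1"))
```

The cost shows up on reads and updates: the old entry is still stored, so every reader has to know to look for the newest version, which hints at why modifying data in this kind of architecture is painful.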
To bridge this gap, data lakehouses came into the picture, where you could have the best of both worlds. You could have consistency, just like data warehouses, but without any shared architecture, so that we still have data availability and high write speeds.
Do you think technologies like NVMe-oF and flash will be adopted further to accommodate this extra need for speed and availability when it comes to AI?
Arankalle: Absolutely. We do have systems that validate and store data, and we do have production-grade storage systems. And we do have frameworks that can process petabytes of data.
But the main thing is: How do you prioritize your latencies? And which data would you like to show to your customers? Which of them are on top priority? Which fall on a lower priority?
It's fine if some data is taking time to load because that's a part of the process that goes behind that particular request. In the future, it will be all about priorities of the data.
Who is taking the lead to enable organizations to manage AI storage?
Arankalle: There are a lot of vendors in the space, but there are a few of them I would like to name. The first one is Databricks. They're doing an amazing job in providing serverless compute at cheap prices.
The first thing which I would like to point out in Databricks is their data lake, which we can directly use for a small price. We could easily apply best storage practices directly to our data.
On the downside, I would say Databricks doesn't have a super powerful metadata system. If you would like to build a proper data management system, then Databricks would be a single part of the whole ecosystem -- like storing metadata and then pointing to the particular data and displaying it. Databricks has yet to cover these pieces.
AWS has been in the compute space for ages, since they started Amazon Elastic MapReduce and on-demand computing. They have come up with a lot of different tools. For example, AWS Glue not only gives us ways to catalog data, but also lets us build our data pipelines, and we can use S3 for cold storage and Redshift as our hot storage.
The issue with having all tools from one vendor is vendor lock-in. To avoid that, usually, I've seen companies go with a hybrid model, where they use Databricks or Azure Data Factory for compute. Through Azure Data Factory, they can trigger Databricks notebooks, and they use AWS for storage, like S3 and Redshift.
These are major competitors in the compute and storage space, but I believe, from a storage point of view, AWS is the clear winner. And, from an ad hoc reading point of view, Google BigQuery is amazing with the way they have built that abstraction layer on top of storage.