Anomalo brings data quality platform to Snowflake
Elliot Shmukler, co-founder and CEO of Anomalo, details how he was inspired to build his data startup after a critical data failure while he was working at Instacart.
After experiencing a number of what he said were data reliability and trust problems while working at online food delivery provider Instacart, Elliot Shmukler -- with Jeremy Stanley -- decided to found his own startup in 2018 to improve the state of data.
Anomalo has grown over the last few years and on Jan. 5 revealed a partnership with cloud data platform provider Snowflake that integrates Anomalo's data quality technology with Snowflake's platform.
Based in Palo Alto, Calif., Anomalo raised $33 million in a Series A round of funding in October 2021 for its data quality platform.
The market for data quality is competitive, with multiple vendors including Talend and Informatica, as well as the open source Great Expectations technology led by Superconductive.
In this Q&A, Shmukler, co-founder and CEO of Anomalo, details his view on data quality and how it fits into modern data architectures.
Why did you decide that it was a good idea to create a data quality company?
Elliot Shmukler: My co-founder, Jeremy Stanley, and I have both been executives at high-growth companies and we had to solve the problem of data quality over and over again at every company where we worked.
Most recently, we were at Instacart, and data quality issues would bring us to our knees every once in a while. I came into the office one morning and everyone on my team was going crazy, because the volume of orders to Costco, which was one of the biggest retailers on Instacart, had dropped by half. Everyone was asking what was going on, and it turned out it was a data quality issue.
Instacart gets inventory feeds from grocery stores across North America, for what's in the store and on the shelf, so that we can then make those items available for delivery. That morning we got an inventory feed from Costco that was missing a bunch of items, including all meat. If only we were watching the quality of that feed in some way and noticing when that feed came in that it had missing data, we would have prevented what was effectively a pretty massive outage for us and a drop in revenue. So that's the kind of thing that inspired us.
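Shmukler's point is that even a basic completeness check on the incoming feed would have surfaced the gap before it hit revenue. A minimal sketch of that kind of check, with hypothetical column names ("category" and "item_id") and an arbitrary drop threshold, might look like this:

```python
# Illustrative sketch (not Instacart's actual pipeline): compare today's
# inventory feed against yesterday's and flag categories whose item count
# collapsed. Column names and the 50% threshold are assumptions.
import pandas as pd

def check_feed_completeness(today: pd.DataFrame, yesterday: pd.DataFrame,
                            drop_threshold: float = 0.5) -> list[str]:
    """Return categories whose distinct item count fell by more than drop_threshold."""
    today_counts = today.groupby("category")["item_id"].nunique()
    prior_counts = yesterday.groupby("category")["item_id"].nunique()
    alerts = []
    for category, prior in prior_counts.items():
        current = today_counts.get(category, 0)
        if prior > 0 and (prior - current) / prior > drop_threshold:
            alerts.append(f"{category}: {prior} -> {current} items")
    return alerts
```

A feed in which an entire category such as meat is missing would show up immediately in the returned alerts.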
At the core of what Anomalo does, we actually ingest your data sets and build machine learning models that essentially encapsulate the structure of that data set. So when new data arrives, the machine learning model will be able to identify whether the new data doesn't look like the old data and whether it deviates in significant ways. That's how we're able to detect issues for our customers.
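A rough sketch of the general approach Shmukler describes -- not Anomalo's actual implementation -- is to fit a model on historical rows so that it captures their structure, then flag incoming rows that deviate from it. The use of scikit-learn's IsolationForest and the restriction to numeric columns are simplifying assumptions:

```python
# Minimal sketch: learn the structure of historical data, then flag new
# rows that look unlike it. Real data would also need handling for nulls,
# categorical columns and seasonality.
import pandas as pd
from sklearn.ensemble import IsolationForest

def detect_deviations(history: pd.DataFrame, new_batch: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of new_batch that deviate from the historical data."""
    features = history.select_dtypes("number").columns
    model = IsolationForest(random_state=0).fit(history[features])
    flags = model.predict(new_batch[features])  # -1 marks outliers
    return new_batch[flags == -1]
```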
How do you define data quality?
Shmukler: Data freshness is a critical piece of data quality. Stale data is not quality data. So that's in our definition of data quality.
The second piece of our definition is really about making sure the data is consistent with historical patterns. If you got new data today and that new data is inconsistent with the general patterns in the data set's history, that's likely an issue with the data.
You could have missing pieces of data or null values in a particular column, or it could be other things, like a shift in the correlation between columns in a data set, which could indicate an issue.
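The checks Shmukler lists -- freshness, null rates and shifts in column correlations -- can be sketched as simple comparisons against history. The thresholds and the "loaded_at" timestamp column below are illustrative assumptions, not Anomalo's actual rules:

```python
# Hedged sketch of the three kinds of checks described above.
import pandas as pd

def run_quality_checks(new_batch: pd.DataFrame, history: pd.DataFrame,
                       timestamp_col: str = "loaded_at") -> list[str]:
    issues = []

    # Freshness: stale data is not quality data.
    latest = pd.to_datetime(new_batch[timestamp_col]).max()
    if pd.Timestamp.now() - latest > pd.Timedelta(days=1):
        issues.append(f"stale data: newest record is {latest}")

    # Null values: compare today's null rate per column with history.
    for col in new_batch.columns:
        new_nulls = new_batch[col].isna().mean()
        old_nulls = history[col].isna().mean()
        if new_nulls > old_nulls + 0.10:
            issues.append(f"{col}: null rate jumped {old_nulls:.0%} -> {new_nulls:.0%}")

    # Correlation between columns: a large shift can indicate an issue.
    numeric = new_batch.select_dtypes("number").columns
    drift = (new_batch[numeric].corr() - history[numeric].corr()).abs()
    if (drift > 0.3).any().any():
        issues.append("column correlations shifted materially from history")

    return issues
```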
To be sure, there is a distinction between data observability for a data pipeline and what we think of as data quality. Data pipeline observability mostly encompasses understanding how data is moving from point A to point B. We're actually looking inside the data, at the values in the rows, for changes in those values that indicate a data quality issue.
Looking back at the Instacart Costco incident, that data got ingested and processed through the data pipeline. The problem was that inside the values in the data rows, there was clearly something wrong. At Anomalo, we are looking deeper into the data itself -- that's data quality.