Convert unstructured data to structured data with machine learning
With access to powerful computing resources and advances in machine learning, unstructured data is becoming easier and cheaper for businesses to turn into usable sources of insight.
Billions of people shop online. They use social media. They stream movies and send texts and pictures to the other side of the world. Each second, a huge amount of data is created and collected. Still, businesses have a data problem: specifically, the problem of turning unstructured data into structured data.
By a large margin, most of the data that organizations collect is unstructured -- data that doesn't conform easily to an existing data model the way structured or even semi-structured data can. For many organizations, unstructured data is, more or less, useless.
Imagine needing a new wardrobe. You order some shirts and pants online, but when the boxes arrive, you only see one pair of pants and one shirt. The rest of the boxes are filled with lumps of wool, cotton, some thread and a couple of disassociated buttons.
Technically, those are all the materials that would have made up your clothes, but in that moment, they're not usable. Actually making something out of them would take a lot of time and likely a good deal of money for tools and training. That's the issue with unstructured data: in its raw form, there's no good way to use it or to get useful insights out of it.
Combing through machine learning, unstructured data
Getting use out of such data "requires significant time and money investment," said Nav Kesher, head of data sciences for the Facebook Marketplace Experience.
Some 80% of all digital data is unstructured, Kesher said during a keynote at the AI Summit in San Francisco. But while businesses have, in the past, ignored or forgotten about such data, that is slowly starting to change.
Compute power has become cheap, said Kesher, enabling organizations to more easily and cost-effectively power the algorithms needed to turn unstructured data into structured data. Those algorithms, too, have become more advanced, with more focus and funding going to AI and machine learning tools and technologies.
"Unstructured data is worthless without machine learning," Kesher said.
Machine learning models, after some training, can be used to automatically and quickly move through, label and categorize unstructured data. It's not a seamless process, and it is still certainly expensive and time-consuming, but changing unstructured data to structured data is easier now than ever before.
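As a rough illustration of what "label and categorize" can look like in practice, the sketch below trains a tiny naive Bayes text classifier and uses it to turn free-form messages into structured rows. It is a minimal sketch, not anything Kesher presented: the class name, the `structure` helper, the two categories and the training sentences are all made up for illustration, and a real project would need far more training data and a proper evaluation step.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesLabeler:
    """Tiny multinomial naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # training documents per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()                  # every word seen in training

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        words = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus the smoothed log likelihood of each word.
            score = math.log(doc_count / total_docs)
            denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled examples; real training sets are much larger.
TRAINING = [
    ("my order arrived late and the box was damaged", "shipping"),
    ("the package never arrived at my address", "shipping"),
    ("i was charged twice on my credit card", "billing"),
    ("please refund the extra charge on my invoice", "billing"),
]


def structure(messages, model):
    """Turn raw strings into rows with an id, the text and a predicted category."""
    return [
        {"id": i, "text": message, "category": model.predict(message)}
        for i, message in enumerate(messages, start=1)
    ]


model = NaiveBayesLabeler()
model.train(TRAINING)
rows = structure(["the box arrived damaged", "refund my card charge"], model)
```

Each row now has fixed fields that a database or spreadsheet can filter and count, which is the practical difference between the raw messages and their structured form.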
For businesses looking to finally make use of their unused data with the help of machine learning tools, now might be the time to invest. Getting started, at least at the business level, can be as deceptively simple as setting a business goal.
Step by step
Organizations setting out to tackle their unstructured data problem should begin by setting a business goal -- something that can be said in 10 words or fewer and can connect business goals to analysis goals, according to Kesher. The goal should answer questions like, "Do I need classification or do I need clustering?" The answer will ultimately set the course of the processes, Kesher said.
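The difference matters because classification needs examples that someone has already labeled, while clustering groups similar items without any labels. The sketch below shows the second case with a deliberately simple greedy grouping based on word overlap (Jaccard similarity). It stands in for a real clustering algorithm; the function names, the 0.3 threshold and the sample reviews are assumptions chosen for illustration.

```python
import re


def token_set(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of words two sets have in common, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_similar(texts, threshold=0.3):
    """Greedy single-pass grouping.

    Each text joins the first group whose first member is similar enough;
    otherwise it starts a new group. Returns lists of indices into texts.
    """
    groups = []
    for index, text in enumerate(texts):
        tokens = token_set(text)
        for group in groups:
            if jaccard(tokens, token_set(texts[group[0]])) >= threshold:
                group.append(index)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([index])
    return groups


# Hypothetical unlabeled customer reviews.
REVIEWS = [
    "delivery was late again",
    "late delivery again this week",
    "app crashes on login",
    "the app crashes at login screen",
]

groups = group_similar(REVIEWS)
```

No one told the code that the reviews were about delivery or about the app; the groups emerge from the text itself, and a person still has to decide what each group means.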
With the goal in mind, data sources should be evaluated. Move fast and be smart, Kesher said, and pick out data that is specific and relevant to the goal. Ruthlessly prioritize what will eventually go from unstructured data to structured data.
Admins should also evaluate analytic methods, log analytics tools and data storehouse platforms, according to Kesher. Keep your goals in mind when comparing different systems and vendors.
The next step is data cleaning, the process of identifying and fixing errors in the data, such as typos or formatting issues. It can be a lot of work. Look for broad classes of errors, then create and apply a machine learning model to correct those errors automatically. The whole experience can be frustrating, Kesher said, but it "feels good when your model runs."
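To make the idea of correcting a broad class of errors concrete, here is a minimal sketch that fixes stray whitespace, inconsistent capitalization and near-miss spellings in a single field. Note that it uses fuzzy string matching from Python's standard `difflib` module rather than a trained machine learning model, so it is a simpler stand-in for the approach described above. The list of valid city names, the `clean_value` helper and the 0.8 cutoff are assumptions made up for the example.

```python
import difflib
import re

# Hypothetical list of valid values for one field.
KNOWN_CITIES = ["San Francisco", "New York", "Seattle"]


def clean_value(raw, known_values, cutoff=0.8):
    """Normalize whitespace, then snap near-miss spellings to a known value.

    Values that are not close to anything in known_values are returned
    with only the whitespace fix applied, so nothing is silently lost.
    """
    collapsed = re.sub(r"\s+", " ", raw).strip()
    # Compare in lowercase, but return the canonical spelling.
    lookup = {value.lower(): value for value in known_values}
    matches = difflib.get_close_matches(
        collapsed.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else collapsed


raw_cities = ["  san  fransisco ", "Seatle", "NEW YORK", "Boston"]
cleaned = [clean_value(city, KNOWN_CITIES) for city in raw_cities]
```

The cutoff is the main design choice: set too low, it will "correct" values that were legitimately different; set too high, it will miss real typos. Unmatched values such as "Boston" pass through unchanged so they can be reviewed by hand.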
Model and visualize
Now that you're well on your way to changing unstructured data to structured data, the next step is data modeling. Relationships in the data are identified and marked during what can be a lengthy process, but it is an important one, as those relationships contain the keys to accurately using the data later on.
Data modeling is "very case-based. You have to figure out for yourselves the accuracy you need," Kesher said.
The final step in turning unstructured data into structured data is data visualization, a step that might not seem important, but is essential, according to Kesher.
"I think if you're not able to present your analysis with good visualizations and good stories, it will be very, very hard for you to convince your execs to take action on the analysis," he said.
There are numerous graphs and charts to use to visualize the data, so an evaluation here is important, Kesher said. Ultimately, he said, "data science isn't just about building models." It's about taking raw information and making it mean something to someone. It's about "being simple and making other people understand." At its roots, "it is art," he said.