Data science platforms boost automation, collaboration
Data scientists can choose from a growing list of commercial and open source platforms that ease data access, analytics, model building and management in a collaborative way.
Data science platforms can help teams of data scientists collaborate on advanced analytics problems, pull in data from disparate sources, and choose from a variety of analytics and machine learning tools to produce analytical models at scale and make useful predictions from them. By bringing big data and advanced analytics techniques together, these platforms can play a critical role in helping companies get a handle on vast amounts of data and speed up the modeling process.
Data science platforms should support automation, collaboration and ongoing management of the models by analytics teams in cooperation with data engineers and application developers. Yet "most of the platforms emerging stop at model development and deployment," said Doug Henschen, analyst at Constellation Research. "Monitoring and ongoing lifecycle management is usually a separate matter, and support for IT and app developer roles varies."
The overall data science platform market is expected to grow to more than $101 billion by 2021 at a compound annual growth rate (CAGR) of 39%, according to MarketsandMarkets. Cognitive and AI software platforms are cited by IDC as two of the fastest growing big data and analytics technology categories at a projected 36.5% CAGR over the next few years.
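As a rough illustration of what a compound annual growth rate implies, the sketch below compounds a 39% rate over a span of years. The five-year horizon and the function name are assumptions chosen for illustration; the forecast's actual base year and starting value are not stated here.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Return how many times larger a value becomes after compounding
    at the given annual rate for the given number of years."""
    return (1.0 + cagr) ** years


# Hypothetical five-year horizon; the forecast period is an assumption.
HORIZON_YEARS = 5
multiple = growth_multiple(0.39, HORIZON_YEARS)
print(f"At 39% CAGR, a market grows about {multiple:.1f}x over {HORIZON_YEARS} years")
```

Changing `HORIZON_YEARS` shows how sensitive a headline market figure is to the length of the forecast window.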
Among workbench-style platforms, IBM's Watson Studio, Cloudera Data Science Workbench, Domino Data Lab Inc. and Oracle's DataScience.com technology are major commercial players for collaboration and workflow applications. And there are many other types of data science platforms for various applications -- for example, automated machine learning platforms from Google, DataRobot Inc., H2O.ai and SAS Institute Inc. are used to develop and train models for analytics projects.
Open source opens more opportunities
With the rise of open source languages and frameworks, many data science platforms now allow data scientists to use the open source languages and tools of their choice. Popular languages include Python, R and Scala, Henschen noted, and popular libraries include TensorFlow, PyTorch, Keras and scikit-learn.
"Options that support open source development are now seeing the fastest growth," he explained. "Both SAS and IBM support the use of open source for development, but models are typically converted for runtime operation on their commercial software -- a route that many organizations are now trying to avoid."
Some companies find commercial platforms to be too restrictive, partly due to their relative newness. "These platforms all require deep platform commitments that quickly become vendor lock-in," said Ken Seier, national practice lead at Insight, a system integration and technical services consultancy. "So far, our customers have consistently chosen an open source approach."
Then there are companies that straddle the line between open source and commercial. Anaconda Inc. offers an open source, on-premises tool to develop machine learning models on smaller data sets, said Peter Wang, the company's co-founder and CTO. By contrast, he said, the commercial Anaconda Enterprise platform allows data scientists "to train models on the full data set at scale."
IDC expects data worldwide to grow from last year's 33 zettabytes to 175 zettabytes -- or 175 trillion gigabytes -- by 2025, when nearly half of all data stored is expected to reside in public cloud environments. Data science platforms can help simplify and speed up connections to data stored in Hadoop and other big data systems or conventional data stores. "[Data scientists] are freed from having to work with IT and data engineers at every turn to get access to data," Henschen said.
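The scale of those figures can be checked with a short calculation. This sketch converts zettabytes to gigabytes using decimal (SI) units and derives the annual growth rate implied by a rise from 33 to 175 zettabytes. The seven-year span is an assumption (it treats "last year" as 2018), and the function names are illustrative.

```python
# Decimal (SI) units: 1 ZB = 10^21 bytes, 1 GB = 10^9 bytes.
GIGABYTES_PER_ZETTABYTE = 10**21 // 10**9


def zettabytes_to_gigabytes(zettabytes: int) -> int:
    """Convert a whole number of zettabytes to gigabytes."""
    return zettabytes * GIGABYTES_PER_ZETTABYTE


def implied_cagr(start: float, end: float, years: int) -> float:
    """Annual growth rate that turns `start` into `end` over `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


# Assumption: "last year" is 2018, so the forecast spans seven years to 2025.
YEARS_ASSUMED = 7

print(f"175 ZB = {zettabytes_to_gigabytes(175):,} GB")
print(f"Implied annual growth: {implied_cagr(33, 175, YEARS_ASSUMED):.1%}")
```

Under that assumption, the forecast works out to data volumes growing by roughly a quarter or more each year.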
Some data science platforms also allow for integration with third-party data sources and data preparation tools. "The whole idea is to advance from the old data-sampling and data-movement requirements," Henschen explained, and companies can scale up their use of analytics, accelerate the iterative process and expand model development.
Platforms for data scientists and their wannabes
As common sense would have it, data science platforms are primarily built for data scientists. "Some vendors, including DataRobot and SAS, bring a degree of abstraction and automation to bear to make things more accessible to non-data scientists," Henschen acknowledged. "But at a minimum, users would still be data-savvy and analytics-savvy power users."
SAS is working to make its platform easier to use for a wider variety of people. "The platform offers a choice of interfaces from drag-and-drop for business users to programming interfaces for data scientists," said Saurabh Gupta, director of advanced analytics and artificial intelligence product management at SAS. That allows a company to put together a project team with people from a variety of backgrounds, including data scientists, business analysts, developers and executives, he added.
For cloud-based data science platforms, data privacy and security remain concerns, making on-premises deployments a viable alternative, especially in industries like banking and health sciences. Domino Data Lab provides cloud-based and on-premises versions of its workbench platform, said Jon Rooney, the company's vice president of marketing. In addition, data scientists can use open source tools such as R and Python as well as commercial tools from SAS and DataRobot within the platform.
SAS and IBM have historically had the largest customer bases and revenue in advanced analytics, according to Henschen. "That's still the case," he said, "thanks to existing investments and license and maintenance revenues." In addition, users of software reviews site G2 Crowd rated IBM Watson Studio, which is used to manage data science workflows, as the highest-rated platform overall.
A Gartner Magic Quadrant report released in January reviewed 17 commercially licensed data science and machine learning platforms. The leaders, which combine the highest ability to execute with the most complete technology vision, include RapidMiner, Tibco Software, Knime and SAS. Gartner sees Alteryx and Dataiku as challengers that it ranked highly on their ability to execute, while DataRobot, Google, H2O.ai, IBM, Microsoft and two other vendors were classified as visionaries.
The availability of cutting-edge platforms for automated machine learning like Google Cloud AutoML, DataRobot and H2O Driverless AI doesn't mean data scientists will be out of work, Gartner analyst Svetlana Sicular conjectured. "They can work on the higher-value problems," she said. "There's such a huge interest in machine learning, and there is so much work for data scientists."