How to find the best machine learning frameworks for you
There's no shortage of open source deep learning tools today, and evaluating them can be a challenge. But there are some primary considerations to keep in mind.
Several machine learning frameworks have emerged to streamline the development and deployment of AI applications. These frameworks help abstract away the grunt work of testing and configuring AI workloads for experimentation, optimization and production.
However, developers need to make some hard choices when it comes to picking the right framework. Some may want to focus on ease of use when training a new AI algorithm, while others may prioritize parameter optimization and production deployment. Different frameworks have different strengths and weaknesses in these diverse areas.
Popular machine learning frameworks include TensorFlow, MXNet, scikit-learn, Keras and PyTorch. These are commonly used by data scientists to train algorithms for various use cases, including prediction, image recognition and recommendation.
The data scientists who lead these initiatives may want to choose machine learning frameworks that make it easy to build algorithms. But this is only one small, albeit important, part of the overall AI development pipeline. Enterprises often spend more time on ancillary efforts, like preparing data, moving algorithms into production, configuring machine learning parameters and troubleshooting discrepancies between research and production models.
Start with the right questions
According to Mike Gualtieri, vice president and principal analyst covering AI at Forrester Research, there are three high-level considerations for guiding AI developers and IT teams to select the best machine learning framework for their needs.
- Will the framework be used for deep learning or classical machine learning algorithms?
- What is the preferred programming language for developing AI models?
- What kind of hardware, software and cloud services will be used for scaling?
The programming languages Python and R are popular in machine learning, although other languages, like C, Java and Scala, can be used as well. Most of today's machine learning applications are written in Python rather than R, because R was designed by statisticians and is not the most elegant programming language, Gualtieri said. Python is a more modern language, and it is much cleaner and easier to use.
"Python is increasingly becoming the language of choice," Gualtieri said.
Caffe and TensorFlow are also popular choices among Python coders developing machine learning models.
Deep or classic learning
Although deep learning algorithms have been getting a good deal of attention lately, classic machine learning algorithms can outperform them in many applications. Machine learning frameworks tend to be better at one or the other, although some frameworks do support both to some extent.
Deep learning frameworks specifically have support for coding neural networks, and TensorFlow is the most well-known. Other good machine learning framework choices for deep learning include MXNet and Caffe. These frameworks support the ability to write algorithms for image labeling and advanced natural language processing, among other applications.
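To make that concrete, the following is a minimal sketch of neural network code in TensorFlow, written against its bundled Keras API. The synthetic data and layer sizes are illustrative assumptions, not a real image-labeling pipeline.

```python
import numpy as np
import tensorflow as tf

# Stand-in data: 100 random 28x28 grayscale "images" with 10 class labels.
images = np.random.rand(100, 28, 28).astype("float32")
labels = np.random.randint(0, 10, size=100)

# A small fully connected classifier; real image labeling would use
# convolutional layers and far more data.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # unroll pixels into a vector
    tf.keras.layers.Dense(64, activation="relu"),     # one hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),  # one probability per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(images, labels, epochs=2, batch_size=16)
```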
Deep learning frameworks can be trained to interpret unstructured data, such as images, audio and free-form text, while classic machine learning frameworks generally require structured data. When choosing a machine learning framework, it's important to understand what type of data you have and what type of applications you want to build.
"The reason why people talk about deep learning and unstructured data is that it is the only form of machine learning that can do that today," said Gualtieri.
Classical machine learning algorithms are good for various kinds of optimization and statistical analysis. The most popular classic machine learning framework is scikit-learn, followed by the Comprehensive R Archive Network, also known as CRAN, a repository of packages for the R programming language that includes a wide library of machine learning functions. Scikit-learn is a good choice for writing in Python, but CRAN may be better for applications written in R.
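For comparison, here is a minimal scikit-learn sketch of a classic, non-deep model trained on structured, tabular data; the bundled iris dataset and the random forest are illustrative choices rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Structured, tabular data: four numeric features per flower, three classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```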
Other popular packages include Apache Spark MLlib and H2O.ai, which offers a set of open source machine learning algorithms that perform well.
Small experimentation versus production-ready heavy lifting
In the early stages of development, AI engineers are likely to experiment on small data sets to find what does and doesn't work. However, when you want to run a model in production against an entire data set, it makes sense to look at a framework that supports a distributed architecture, like Apache Spark's MLlib or H2O.
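Below is a hedged sketch of what that scale-out step can look like with Spark MLlib; the Parquet path and column names are hypothetical stand-ins for a real distributed data set.

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distributed-training").getOrCreate()

# In production, this would be a large data set on distributed storage.
df = spark.read.parquet("s3://example-bucket/training-data/")  # hypothetical path

# MLlib expects the input features packed into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train_df = assembler.transform(df)

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_df)  # training work is distributed across the cluster
```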
"In heavy lifting, scalability is a bigger issue," Gualtieri said.
There are similar scenarios in deep learning. For example, AI developers might want to do image labeling. They can download something like TensorFlow and run it on their desktops to train algorithms and experiment with different models. Once they have sorted out their chosen approach, they might want to do real work against a larger data set.
In that case, they may want to choose a different framework that works well with enterprise hardware, like an Nvidia GeForce GTX box, or with a cloud offering, like a GPU service on AWS or Google's TensorFlow cloud service.
Scaling training and deployment
During the training phase of developing AI algorithms, scalability is all about how much data can be analyzed and the speed at which it can be analyzed. This performance can be improved with distributed algorithms and distributed processing.
In the deployment phase of AI projects, scalability has more to do with the number of concurrent users or applications that can hit the model at once.
Asaf Somekh, founder and CEO at Iguazio Ltd., an AI services company, said this is one of the central tensions in deep learning and machine learning.
"The issue with many AI projects is that the training environment is very different than the production environment, and often data scientists are working with their own set of tools that are completely different than the ones that are used in production."
This dissonance often requires enterprises to take models that were developed in one environment -- for example, a Python-based machine learning framework -- and run them in a distributed environment with strict requirements for performance and high availability. When choosing a framework, it's important to consider whether it supports this kind of scaling.
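One minimal bridge between the two environments is serializing the trained model so a separate serving system can load it. The sketch below uses scikit-learn with joblib purely as an illustration; the file name is arbitrary, and production setups typically add model versioning and a dedicated serving layer.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train in the data science environment...
X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression().fit(X, y)
joblib.dump(model, "model.joblib")

# ...then load the same artifact in the production environment.
production_model = joblib.load("model.joblib")
print(production_model.predict(X[:5]))
```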
Consider parameter optimization
Another key consideration when choosing a machine learning framework is parameter optimization. Every algorithm takes a different approach to analyzing training data and applying what it learns to new examples.
These parameters act as knobs and dials, so to speak, adjusting factors such as the weighting of different variables and the degree to which outliers are considered. When choosing a machine learning framework, it's important to consider whether you want this tuning to be automated or manual.
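Some frameworks automate this search. scikit-learn's GridSearchCV, for example, tries every combination in a user-supplied grid and keeps the best one; the model and grid values in this sketch are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The "knobs and dials": each combination below is trained and cross-validated.
param_grid = {
    "C": [0.1, 1, 10],         # regularization strength
    "gamma": ["scale", 0.01],  # kernel coefficient
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
```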
"The more knobs and dials that need to be turned, the more difficult it is to find the right combination," Gualtieri said.
Framework alternatives abound
One final question enterprises need to ask is whether a machine learning framework is even the best approach for a particular class of problems. Gualtieri breaks the field of machine learning development tools into three categories: notebook-based, multi-modal and automated.
Notebook-based development -- using a tool like the Python-based Jupyter -- provides intricate control over all aspects of machine learning model customization. Machine learning frameworks build on this approach to reduce the grunt work involved in making those customizations.
A multi-modal approach is essentially a low-code way of combining data science and purpose-built tools like Salesforce Einstein to enable developers to extend a core AI model to a particular use case.
An automated approach uses a tool to automatically try out a variety of possible algorithms for a given set of input data until it identifies the best candidates for a particular use case. These tools include offerings like Google AutoML, DataRobot and H2O.ai.
"[An automated approach] is attractive because the skill level of the data scientist does not have to be as high," Gualtieri said.
In some cases, a statistician can drive this process rather than a data scientist. But Gualtieri does not expect these tools to replace data scientists driving machine learning frameworks, just as low-code tools have not replaced Java programmers.
Frameworks are better suited to traditional data scientists than automated exploration tools are, giving data scientists greater control over functionality through flexible coding environments like Python or R and Jupyter notebooks.
Open source machine learning provides rich community
Though vendors are developing many proprietary AI technologies, open source frameworks will continue to dominate the field, benefitting from the intellectual contributions of experts from around the world, said Chad Meley, vice president of marketing at Teradata.
Of these deep learning frameworks, the ones backed by digital giants represent the best options for enterprises, he said. These frameworks have the benefit of scale, as well as sponsors with vested interests in improving and supporting them.
The major cloud providers all have their preferred frameworks. Depending on an organization's IT strategy, CIOs may want either to align with a particular cloud vendor or to emphasize portability across multiple clouds and on-premises deployments, Meley said. Gaps in support will remain for open source software, however, as cloud vendors fight to grow their businesses by offering support only within their own clouds.
TensorFlow has the highest mind share. After that, there are different perceptions of momentum and community around Caffe, Keras, MXNet, CNTK and PyTorch. Theano is waning, Meley said.
"We recommend that our clients use an API like Keras that runs on top of leading machine learning frameworks, providing a more intuitive level of abstraction," said Meley.
Iguazio uses various frameworks, such as Spark, TensorFlow and H2O. Spark, for example, has a very comprehensive ecosystem. In addition to Spark MLlib, it provides capabilities for SQL, graph processing and stream processing, which open up a broad set of use cases for different kinds of teams at Iguazio.
Another benefit of Spark is its ease of use and large user base, which ensures that the technology steadily improves and will be around for the foreseeable future, said Somekh.