Deep learning projects: Cloud-based AI or dedicated hardware?
Are deep learning projects part of your AI agenda this year? Here's how to evaluate the tradeoffs between using cloud-based AI infrastructure versus dedicated hardware.
Chip and system vendors are developing -- and rapidly innovating -- new AI processors designed for deep learning projects that use neural networks, the computing systems designed to approximate how human brains work.
At the same time, many cloud vendors have also been introducing these processing capabilities via dedicated GPUs and field programmable gate arrays (FPGAs), the integrated circuits designed to be customized after manufacturing. Google, which has stated that AI is strategic across all its businesses, is offering dedicated AI services built on its custom Tensor Processing Unit (TPU), the company's application-specific integrated circuit developed specifically for neural network deep learning projects.
"Cloud providers are betting that, over time, all companies will use deep learning and want to get a head start," said Sid J. Reddy, chief scientist at Conversica, which develops AI software for marketing and sales.
As CIOs begin mapping out their AI strategies -- in particular, their need and ability to do deep learning projects -- they must consider a variety of tradeoffs between using faster, more efficient private AI infrastructure, the operational efficiencies of the cloud, and their anticipated AI development lifecycle.
In general, private AI infrastructure is cost-effective for companies doing multiple, highly customized AI projects. If those companies are using data from applications running in the cloud, however, the cost of moving data into an on-premises AI system could offset the value of having dedicated hardware, making cloud-based AI cheaper. But, for many deep learning projects in this incredibly fast-moving field, the economics could quickly change. Here's a breakdown.
Take small steps
Private AI infrastructure requires a large investment in fixed costs and ongoing maintenance costs. Because of the capital expense related to building and maintaining private AI infrastructure, cloud-based AI services -- even when they cost in aggregate more than private infrastructure -- can be the smart economic choice as enterprises flesh out their AI strategy before making a bigger commitment.
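To make the tradeoff concrete, a back-of-the-envelope comparison helps. The sketch below uses illustrative figures -- the prices, data volumes and retraining cadence are assumptions, not vendor quotes -- to show how egress charges and hardware amortization pull the math in different directions.

```python
# Back-of-the-envelope comparison of cloud vs. on-premises training costs.
# All figures are illustrative assumptions, not actual vendor rates.

TRAINING_DATA_TB = 50          # data living in cloud apps (assumed)
EGRESS_COST_PER_GB = 0.09      # assumed cloud egress rate, $/GB
CLOUD_GPU_HOURLY = 3.00        # assumed on-demand GPU instance rate, $/hr
TRAINING_HOURS_PER_RUN = 200   # assumed length of one training run
RUNS_PER_YEAR = 12             # assumed retraining cadence
ONPREM_SYSTEM_COST = 150_000   # assumed dedicated hardware, amortized over 3 yr

# Option 1: train in the cloud, next to the data -- no egress charges.
cloud_annual = CLOUD_GPU_HOURLY * TRAINING_HOURS_PER_RUN * RUNS_PER_YEAR

# Option 2: train on premises -- pay to pull the data out each year.
egress_annual = TRAINING_DATA_TB * 1024 * EGRESS_COST_PER_GB
onprem_annual = ONPREM_SYSTEM_COST / 3 + egress_annual

print(f"Cloud training:   ${cloud_annual:,.0f}/yr")
print(f"On-prem training: ${onprem_annual:,.0f}/yr (incl. ${egress_annual:,.0f} egress)")
```

Under these particular assumptions, the cloud wins easily; raise the utilization -- more projects sharing the same dedicated hardware -- or shrink the data-movement bill, and the private system starts to pay for itself.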
For small companies, fears about the high price of this new AI infrastructure shouldn't be a reason not to try deep learning projects, Reddy said. As deep learning becomes accepted as state-of-the-art for a wide range of tasks, he believes more AI algorithms will transition to it, because deep learning promises to reduce some of the overhead in preparing data and optimizing new AI models.
Enterprises and small companies alike also need to determine whether they have enough data to train the models for their deep learning projects without "overfitting" -- producing a model that fits its training data so closely that it fails to make accurate predictions on new data. Reddy said this is easier for a startup like Conversica, which has data from hundreds of millions of conversations to work with. "It might not be the case with other startups that have limited aggregated data to begin with," he said.
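One common way to gauge whether a data set is big enough is to train on progressively larger subsets and watch the gap between training and validation accuracy. The sketch below is illustrative only -- scikit-learn and a stand-in random data set are assumed -- but the pattern it prints is the telltale one: a large gap that refuses to shrink as samples are added signals overfitting.

```python
# Minimal overfitting check: train on growing subsets of the data and
# compare training accuracy against cross-validated accuracy.
# scikit-learn and a random stand-in data set are assumed for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X = np.random.rand(2000, 20)            # stand-in features
y = np.random.randint(0, 2, 2000)       # stand-in binary labels

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:>5} samples: train={tr:.2f}  validation={va:.2f}  gap={tr - va:.2f}")
# A large, persistent gap suggests overfitting; a gap that shrinks as the
# sample count grows suggests more data would help.
```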
Going beyond the basics
Some cloud providers, like Microsoft with its Cognitive Services in Azure, use FPGA chips under the hood to accelerate specific AI services. This approach hides the complexity of the FPGA from the customer while passing along some of the cost savings the chips provide on the back end. AWS has taken a different approach, becoming the first provider to let enterprises access FPGAs directly for some applications -- and enterprises are starting to experiment with them.
For example, Understory, a weather forecasting service, has started moving some of its heavier machine learning algorithms into the cloud using AWS' new FPGA service to help with the analysis.
"Given our expansion of stations and our plan for growth, we will need to become smarter about the types of processors and metal we run our analyses and algorithms on," said Eric Hewitt, vice president of technology at Understory. "We would not push this type of power to our edge computing layer, but for real-time algorithms running on a network of data, it's feasible that we would use them."
Private AI, good for specialized needs
Some IT executives believe significant cost savings and performance improvements can be reaped by customizing AI-related hardware.
"I use a private infrastructure because my very specific needs are sold at a premium in the cloud," said Rix Ryskamp, CEO of UseAIble, an AI algorithm vendor. "If I had more general needs (typically, not machine learning), I would use cloud-only solutions for simplicity."
CIOs also need to think about the different stages of the AI development lifecycle when deciding how to architect their deep learning projects. In the early research and development stages, enterprises analyze large data sets to optimize a production-ready set of AI models -- the most compute-intensive part of the lifecycle. Once trained, those models require far less processing power to run in production. For that reason, Ryskamp recommended companies use private infrastructure for R&D.
The cloud, on the other hand, is often a better fit for production apps as long as requirements -- like intensive processing power -- do not make cost a problem.
"CIOs who already prefer the cloud should use it so long as their AI/[machine learning] workloads do not require so much custom hardware that cloud vendors cannot be competitive," Ryskamp said.
Energy efficiency, a red herring in deep learning projects?
"In general, the economics of doing large-scale deep learning projects in the public cloud are not favorable," said Robert Lee, chief architect with FlashBlade at Pure Storage, a data storage provider.
On the flip side, Lee agreed that training is most cost-effective where the data is collected or already resides. So, if an enterprise is drawing on a large pool of SaaS data or using a cloud-based data lake, he said, it does make more sense to implement the deep learning project in the cloud.
Indeed, the economic calculus of on-premises versus cloud-based AI infrastructure will also vary according to a company's resources and timetable. And the greater power efficiency private infrastructure gains from FPGAs and new AI chips is only one part of the attraction, Lee argued.
"The bigger Opex lever is in making data science teams more productive by optimizing and streamlining the process of data collection, curation, transformation and training," he argued.
Data science teams often spend tremendous time and effort on the extract, transform and load (ETL)-like phases of deep learning projects -- and absorb the delays those phases create -- rather than on running the AI algorithms themselves.
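Those phases are mundane but unavoidable. The sketch below walks through them for a hypothetical data set -- pandas and the file names are assumed for illustration -- to show how much of a project's pipeline runs before any model sees the data.

```python
# Sketch of the ETL-like phases that often dominate a deep learning project.
# pandas and the hypothetical file/column names are assumed for illustration.
import pandas as pd

# Extract: pull raw data from wherever it was collected.
raw = pd.read_csv("sensor_readings.csv")          # hypothetical source file

# Curate: drop malformed rows and obvious outliers.
clean = raw.dropna()
clean = clean[clean["reading"].between(-50, 150)]

# Transform: normalize features into model-ready form and split off a set.
clean["reading"] = (clean["reading"] - clean["reading"].mean()) / clean["reading"].std()
train_set = clean.sample(frac=0.8, random_state=42)

train_set.to_csv("training_data.csv", index=False)
# Only now does the (comparatively short) training job start.
```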
Continuous learning blurs the choice between cloud-based AI and private infrastructure
The other consideration is that as AI systems mature and evolve, continuous or active learning will become more important. Initial approaches to AI have centered around training models to do prediction/classification, then deploying them into production to analyze data as it's generated.
"We are starting to realize that in most use-cases, we are never actually done training and that there's no clear break between learning and practicing," Lee said.
In the long run, CIOs will need to see that AI models in deep learning projects are much like humans who continuously learn. A good model is like an undergraduate with an engineering degree: trained in the basic concepts and able to think about engineering problems, but still building expertise over time, with experience, on the job. Implementing these kinds of learning loops will blur distinctions such as doing the R&D component on private infrastructure versus in cloud-based AI infrastructure.
"Just like their human counterparts, AI systems need to continuously learn -- they need to be fed a constant pipeline of data collection/inference/evaluation/retraining wherever possible," Lee said.