Google's Cloud TPU v2, v3 Pods accelerate ML training
For the first time, Google Cloud TPU Pods are available in public beta, enabling machine learning developers and engineers to train and deploy models more quickly.
Google's Cloud TPU v2 Pods and Cloud TPU v3 Pods -- cloud-run supercomputers designed to dramatically reduce the time needed to train and deploy machine learning models -- are now publicly available in beta.
Previously, the two products -- each made up of multiple Tensor Processing Unit (TPU) devices, hardware from Google designed specifically for machine learning -- were available only in a private alpha. A Cloud TPU Pod is vastly larger and more powerful than a single Cloud TPU device.
Each device contains four TPU chips and eight cores. The Cloud TPU v2 Pod connects 64 of these devices, for a total of 256 TPU chips and 512 cores.
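For readers curious what targeting a pod slice looks like in practice, below is a minimal sketch of the general pattern in TensorFlow 2.x. The pod slice name 'my-pod-slice' is hypothetical, and the exact API has shifted across TensorFlow releases, so treat this as illustrative rather than canonical.

```python
import tensorflow as tf

# Hypothetical pod slice name; in practice, this comes from the
# Google Cloud console or `gcloud compute tpus list`.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-pod-slice')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# TPUStrategy replicates training across every core in the slice --
# e.g., 32 cores on a 16-chip v2 slice, 512 on a full v2 Pod.
strategy = tf.distribute.TPUStrategy(resolver)
print('Cores in this slice:', strategy.num_replicas_in_sync)

# Models built inside the strategy scope are placed on the TPU cores.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```

From the developer's perspective, scaling from a small slice to a larger one is mostly a matter of pointing the resolver at a bigger slice; the distribution strategy handles the replication.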
Google introduced the new products in a May 7 blog post coinciding with the first day of the Google I/O 2019 conference, held in Mountain View, Calif.
First of the tech giants
While Amazon and Facebook have been working on AI chips of their own -- albeit ones focused more on inference -- Google is the first of the tech giants to make such a processor publicly available, said Peter Rutten, research director at IDC.
Many startups are working on AI-centric chips as well, Rutten said, but few have gone to market yet.
"Apart from what the other vendors are planning, the Google TPU Pods appear to be extremely powerful," Rutten said.
In general, the benefits of application-specific integrated circuits (ASICs), such as Google's TPUs, are their speed, low energy use and low unit cost, Rutten explained. However, ASICs are fixed and cannot be adjusted as AI algorithms change.
"If Google keeps delivering new versions at the speed that they have been until now, though, that may not be a problem," Rutten said. "Bottom line: Data scientists are getting a lot of performance for AI model training from this offering."
Google Cloud TPUs represent a faster and more cost-effective way to handle large machine learning workloads, Pete Voss, a Google spokesman for cloud AI and machine learning, said in a phone interview.
"Developers can iterate in minutes and train large production models in hours instead of days," Voss said.
A variety of customers, including eBay, Recursion Pharmaceuticals, Lyft and Two Sigma, use Cloud TPU products, Voss said.
In a recent case study, Recursion Pharmaceuticals dramatically lowered its training time for a model that iteratively tests the viability of synthesized molecules to treat rare illnesses, Voss said. It took the company more than 24 hours to train the model on an on-premises cluster, versus 15 minutes on a Cloud TPU Pod -- roughly a hundredfold speedup, he said.
The price of machine learning power
The Cloud TPU v2 and v3 Pods perform largely the same functions, providing users with shorter routes to insights, higher model training accuracy and the ability to retrain models frequently, according to Google. Cloud TPU v3, the more expensive option, features upgraded hardware and delivers faster results.
On their own, individual Cloud TPU v2 and v3 devices rent for single-digit dollar amounts per hour. The Cloud TPU v2 Pod, meanwhile, ranges from $24 an hour for a 16-chip pod slice to a little under $400 an hour for the entire Pod, with two in-between options.
The Cloud TPU v3 Pod starts at $32 an hour for a 16-chip pod slice. Users can also commit to one-year or three-year rentals of any pod slice configuration, at prices reaching hundreds of thousands of dollars.
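Using the hourly rates above, a back-of-the-envelope cost comparison is straightforward. The short sketch below uses an illustrative 10-hour run length -- an assumed figure, not one from Google -- to show how on-demand costs scale with slice size and training time.

```python
# On-demand hourly rates cited in this article (USD per hour).
RATES = {
    'v2 pod slice, 16 chips': 24.00,
    'v2 pod, full (256 chips)': 400.00,  # "a little under $400"; rounded here
    'v3 pod slice, 16 chips': 32.00,
}

def run_cost(rate_per_hour: float, hours: float) -> float:
    """On-demand cost of a single training run."""
    return rate_per_hour * hours

# Illustrative 10-hour training run (assumed duration, not from Google).
for config, rate in RATES.items():
    print(f'{config}: ${run_cost(rate, 10):,.2f} for a 10-hour run')
```

The tradeoff the pricing encodes is time versus money: a bigger slice costs more per hour but can cut wall-clock training time enough, as in the Recursion example above, to make iteration practical.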
Rutten said the accelerated server market, while currently dominated by GPUs, will start to see more and more AI ASICs over time.
Yet, he said, GPUs shouldn't be ruled out.
"There are a lot of benefits with GPUs: their flexibility, programmability and, most of all, the software stack -- think CUDA -- and ecosystem around them," he said. "So, don't just look at the benchmarking results when comparing AI processors."