Nvidia GPU Cloud marries data science and containers
With its GPU Cloud, Nvidia offers a way to deploy data science workloads that span on-premises systems and the public cloud -- with containers playing a big role.
GPUs, which can accelerate all manner of machine learning and deep learning algorithms, are fueling the next generation of data science and AI workloads.
There are a few reasons for the connection between GPUs and data science. For example, the image-rendering algorithms that GPUs have traditionally run to simulate 3D environments involve the same sorts of mathematical operations as many data science algorithms. What's more, the tensors -- multidimensional arrays -- common in machine learning and deep learning calculations are well-suited to parallel execution across GPU cores. The result has been an explosion of data science and machine learning programming frameworks that ease the development of sophisticated algorithms and are optimized for GPU execution.
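To make the parallelism point concrete, here is a minimal sketch in PyTorch, one of the frameworks covered below. The same matrix multiplication runs on the CPU and, when a CUDA-capable GPU is present, on the GPU, where the arithmetic spreads across thousands of cores:

```python
import torch

# Two large random matrices; a matrix multiply of this size involves
# billions of multiply-accumulate operations that parallelize well.
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

# Runs on the CPU.
c_cpu = a @ b

# If an Nvidia GPU and CUDA are available, the identical operation can
# be dispatched to the GPU instead.
if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    c_gpu = a_gpu @ b_gpu  # executed on the GPU
    print(torch.allclose(c_cpu, c_gpu.cpu(), atol=1e-3))
```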
However, for newcomers to data science and GPU computing -- especially those who are unfamiliar with system internals, dependencies and configuration parameters -- it can be complicated to build, configure and integrate machine learning frameworks, such as Caffe2, PyTorch or TensorFlow. It's a challenge, in general, to find the right set of data science tools and integrate them into a cohesive ecosystem.
Now, offerings such as Nvidia GPU Cloud aim to ease this challenge and make GPU-based data science more accessible -- with a little help from containers.
An intro to Nvidia GPU Cloud
Nvidia GPU Cloud is a library of containerized, GPU-optimized and integrated packages that contain data science and deep learning development frameworks and are suitable for cloud deployment. Each container image, which is built to work on both single- and multi-GPU systems, includes all necessary dependencies.
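As a rough sketch of what pulling and running one of these images looks like, the following snippet uses the Docker SDK for Python against Nvidia's container registry, nvcr.io. The image tag is illustrative, and the example assumes the nvidia-docker 2 runtime is installed; actual repository names and authentication depend on your NGC account.

```python
import docker

client = docker.from_env()

# Example repository and tag; actual NGC image names vary.
REPO, TAG = "nvcr.io/nvidia/tensorflow", "18.04-py3"

client.images.pull(REPO, tag=TAG)

# Run a one-off command in the container. The "nvidia" runtime,
# provided by nvidia-docker 2, exposes the host's GPUs to the
# container; newer Docker releases use device requests instead.
output = client.containers.run(
    f"{REPO}:{TAG}",
    command="nvidia-smi",
    runtime="nvidia",
    remove=True,
)
print(output.decode())
```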
Nvidia GPU Cloud currently includes the following frameworks:
- CUDA Toolkit: A native development environment for CUDA, Nvidia's GPU application framework. The toolkit includes a C/C++ compiler, libraries, debugger and optimization tools.
- DIGITS: Nvidia's Deep Learning GPU Training System, which includes a dashboard, real-time monitoring and deep learning network visualization.
- NVCaffe and Caffe2: Deep learning development frameworks with C++ and Python interfaces that support a variety of deep learning model types.
- Microsoft Cognitive Toolkit: An open source toolkit that also supports C++ and Python but includes a higher-level model description language -- BrainScript -- and supports the Open Neural Network Exchange (ONNX).
- MXNet: An Apache open source project that includes a higher-level library of neural network building blocks.
- PyTorch: A Python package that supports GPU-optimized tensor calculations and dynamic neural networks (see the sketch after this list).
- TensorFlow: An open source library for tensor calculation, developed by Google.
- Theano: A Python library for defining, optimizing and evaluating mathematical expressions that involve multidimensional arrays.
- Torch: A scientific computing framework built on the fast LuaJIT scripting language that supports array processing, linear algebra and neural networks, with an extensive library of contributed packages.
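For a taste of the dynamic neural networks PyTorch supports, here is a minimal, hypothetical sketch. Because the computation graph is built as ordinary Python executes, the network's structure can change from one forward pass to the next:

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    """A toy network whose depth varies on every forward pass."""

    def __init__(self):
        super().__init__()
        self.input_layer = nn.Linear(10, 32)
        self.hidden = nn.Linear(32, 32)
        self.output_layer = nn.Linear(32, 1)

    def forward(self, x):
        h = torch.relu(self.input_layer(x))
        # The graph is built on the fly, so plain Python control flow
        # can change the architecture at run time.
        for _ in range(torch.randint(1, 4, (1,)).item()):
            h = torch.relu(self.hidden(h))
        return self.output_layer(h)

model = DynamicNet()
if torch.cuda.is_available():  # move the model to a GPU if present
    model = model.cuda()

x = torch.randn(8, 10, device=next(model.parameters()).device)
print(model(x).shape)  # torch.Size([8, 1])
```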
Build a hybrid data science platform
These data science container packages work on any system with an Nvidia GPU, whether it's an on-premises server with GPU cards, a dedicated GPU development system, a data science workstation or a GPU-capable cloud instance.
The three major public cloud providers currently offer the following GPU compute instances:
- AWS: P3 instances with one, four or eight Nvidia Tesla V100 GPUs, and P2 instances with one, eight or 16 Nvidia Tesla K80 GPUs.
- Microsoft Azure: NCv3 with one, two or four V100 GPUs; NCv2 with one, two or four P100 GPUs; NC with one, two or four K80 GPUs; and ND with one, two or four P40 GPUs.
- Google Cloud: Compute Engine instances with one, two, four or eight V100 GPUs; one, two or four P100 GPUs; and one, two, four or eight K80 GPUs.
Because Nvidia GPU Cloud images are standard Docker container images, they can run on any system, local or remote, with a container runtime. Container portability facilitates hybrid data science platforms, where researchers develop, train and test models on local systems and then deploy production models to cloud GPU instances.
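A minimal sketch of that workflow, assuming PyTorch: train and save a model on a local machine, then load the same weights inside a cloud container on whatever device is available.

```python
import torch

# Train locally (possibly CPU-only), then save the learned weights.
model = torch.nn.Linear(100, 10)
torch.save(model.state_dict(), "model.pt")

# In the cloud container, load the same weights onto whatever device
# is available; map_location handles CPU-to-GPU portability.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
deployed = torch.nn.Linear(100, 10)
deployed.load_state_dict(torch.load("model.pt", map_location=device))
deployed.to(device)

print(f"Model deployed on {device}")
```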
More data science options with containers
Data science applications often train on massive data sets and are then deployed in parallel to analyze data streams in real time, which makes them an especially good fit for containerization: models can run across a cluster of machines. Additionally, enterprises can configure container orchestrators, such as Kubernetes, to automatically scale workloads up and down to match demand. Kubernetes can also schedule workloads on mixed clusters that contain both GPU and non-GPU nodes and direct data science workloads to the nodes with GPUs. Some cloud services, such as Google Kubernetes Engine, also support this feature.
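To sketch how that scheduling works: Nvidia's Kubernetes device plugin exposes an nvidia.com/gpu resource, and a pod that requests it lands only on GPU nodes. The example below uses the official Kubernetes Python client; the image name is illustrative.

```python
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig

# Requesting the nvidia.com/gpu resource tells the scheduler to place
# this pod only on nodes that actually have GPUs.
container = client.V1Container(
    name="trainer",
    image="nvcr.io/nvidia/tensorflow:18.04-py3",  # example image
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"}
    ),
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-training-job"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```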
Docker Hub also includes container images that complement data science frameworks, many with the necessary GPU drivers and libraries built in. Unlike Nvidia GPU Cloud images, however, these images are not curated by experts, so their quality is unknown. The frameworks they package are also unlikely to carry GPU-specific customizations and optimizations, and even if they do, those are unlikely to match the standards Nvidia sets.