How to run ML workloads with Apache Spark on Kubernetes

IT staff looking for an easier way to run ML workloads are increasingly turning to Apache Spark. Follow these simple steps to set up a Spark cluster on Kubernetes.

Pairing Spark with Kubernetes can yield a wide range of benefits. To get Spark up and running on Kubernetes, IT teams need only a handful of easy-to-learn commands.

Spark doesn't have to run on Kubernetes. But in many use cases, pairing the two can simplify Spark deployment while running machine learning (ML) workloads efficiently in a distributed environment.

What is Apache Spark?

Apache Spark is an open source data processing platform designed for ML workloads. Spark's main features include the following:

  • The ability to process large volumes of data quickly, especially when the data is stored in memory.
  • Support for real-time processing of data streams.
  • Highly customizable data processing workflows.
  • Multiple deployment models, which means that Spark can run on top of a Hadoop cluster if desired or operate on its own.

Thanks to these features, especially its fast data processing capabilities, Spark has become the de facto open source tool for powering ML workloads that require large-scale data processing.

The benefits of running Spark on Kubernetes

Kubernetes is not required to run Spark. But choosing to run Spark on top of Kubernetes can provide several advantages:

  • The ability to move Spark applications easily between different Kubernetes clusters, which is a benefit if you don't want to be locked into a particular infrastructure platform.
  • Support for segmenting Spark applications from each other while still housing them all within a single Kubernetes cluster.
  • A unified approach to application deployment and management, since you can manage everything through Kubernetes.
  • The ability to use Kubernetes ResourceQuotas to manage the resources allocated to Spark, as sketched in the example below.

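For instance, a ResourceQuota applied to the namespace that hosts Spark caps the total CPU, memory and pod count its applications can claim. The manifest below is a minimal sketch; the namespace name spark-jobs and the specific limits are placeholder values, not part of the deployment steps later in this article.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-quota
  namespace: spark-jobs
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "20"
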
Apache Spark gained native support for Kubernetes starting with Spark 2.3. Native support means that you can deploy and manage Spark applications just like any other Kubernetes application by using container images and pods. You don't need any special tools or extensions to make Spark compatible with Kubernetes.
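For example, with native support, spark-submit can launch an application directly against the Kubernetes API server, which then schedules the driver and executors as pods. The command below is a sketch based on the upstream Spark documentation: the API server address is a placeholder, and the SparkPi class and example jar path reflect the Spark 3.3.0 distribution inside the apache/spark image, so adjust them for your cluster and version.

spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=apache/spark:v3.3.0 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar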

Steps for deploying Spark on Kubernetes

To deploy Spark on Kubernetes, start by creating a Deployment for a Spark Master.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: spark-master
spec:
  replicas: 1
  selector:
    matchLabels:
      component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
        - name: spark-master
          image: apache/spark:v3.3.0
          command: ["/spark-master"]
          ports:
            - containerPort: 6000
            - containerPort: 8080

Save this file as "spark-master.yml."

Next, create a Service.

kind: Service
apiVersion: v1
metadata:
  name: spark-master
spec:
  ports:
    - name: ui
      port: 8080
      targetPort: 8080
    - name: spark
      port: 6000
      targetPort: 6000
  selector:
    component: spark-master

Save your Service configuration as "spark-master-service.yml."

Then, create the Deployment and Service in your Kubernetes cluster by running the following commands.

kubectl create -f ./kubernetes/spark-master.yml
kubectl create -f ./kubernetes/spark-master-service.yml

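Before moving on, it can help to confirm that the master pod started cleanly. The following commands rely on the label and Deployment name defined in the manifests above.

kubectl get pods -l component=spark-master
kubectl logs deployment/spark-master
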
At this point, you have a Spark master running. Now you can create a Spark worker to complete your cluster.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: worker
spec:
  replicas: 1
  selector:
    matchLabels:
      component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: worker
          image: apache/spark:v3.3.0
          command: ["/worker"]
          ports:
            - containerPort: 8081

Save this file as "spark-worker.yml" and deploy it with the following.

kubectl create -f ./kubernetes/spark-worker.yml

Now a basic Spark cluster is up and running on Kubernetes. You can begin submitting Spark workloads to your master container.
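
To confirm that the worker registered with the master, check its pod and, if you like, forward the master's web UI to your workstation. The label, Service name and port below match the manifests created earlier; once the port forward is running, the UI is available at http://localhost:8080.

kubectl get pods -l component=spark-worker
kubectl port-forward svc/spark-master 8080:8080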

To move toward a production environment, you could deploy additional workers to scale up the cluster, as shown below. You may also want to open the cluster to external traffic by setting up a Service that exposes a public IP address or by creating an ingress rule. Refer to the Spark documentation for additional details.
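
For example, because the workers run as an ordinary Deployment, scaling them requires no new manifests; the Deployment name below matches the worker manifest above.

kubectl scale deployment worker --replicas=3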
