Running big data systems in containers has become a feasible option. Hadoop, Spark and other big data platforms can be deployed in clusters of Docker containers managed by orchestration frameworks like Kubernetes. Even mainstream databases are embracing the approach: SQL Server 2019's big data clusters feature lets users combine Spark, Hadoop and the relational database engine in Kubernetes-managed containers.
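To make that concrete, here's a minimal sketch of what running Spark on Kubernetes-managed containers can look like, using Spark's native Kubernetes scheduler support. The API server URL, container image and executor count below are placeholder assumptions, not values from any particular deployment.

```python
# Minimal sketch: launching a Spark job on Kubernetes-managed containers.
# The master URL, container image and resource settings are placeholders --
# substitute values for your own cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-on-k8s-demo")
    # Point Spark at the Kubernetes API server instead of a YARN or
    # standalone master; Spark then launches its executors as pods.
    .master("k8s://https://kubernetes.example.com:6443")
    # Container image used for the executor pods (hypothetical tag).
    .config("spark.kubernetes.container.image", "example.com/spark:latest")
    .config("spark.executor.instances", "2")
    .getOrCreate()
)

# Trivial job to confirm the executor pods come up and do work.
print(spark.range(1_000_000).selectExpr("sum(id)").collect())
spark.stop()
```

In production, the same configuration properties are more often passed to spark-submit or managed declaratively through a Kubernetes operator rather than set inline in application code.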
But the available technologies are still maturing, and they're being used primarily by early adopters willing to do some development work themselves. JD.com Inc. is a case in point: the Beijing-based online retailer has built a large Kubernetes-based container architecture that runs AI and big data analytics applications developed with various technologies, including Spark, Flink, Storm and TensorFlow.
Applications can now be "automatically packaged into images and deployed onto containers in near real time," Haifeng Liu, JD.com's vice president of engineering and chief architect, said in a Q&A posted on the Cloud Native Computing Foundation's website in August 2018. But, he added, the company had to customize Kubernetes to fix performance issues -- work that included adding new features, removing unnecessary ones and optimizing the technology's scheduler.
This handbook explores the use of big data containers and offers advice on how to deploy and manage them. First, we take a more in-depth look at potential applications and the hurdles that users face. Next, we detail a list of to-do items for containerizing big data systems. We close with a consultant's view of how big data, microservices and containers fit together.