Tips to prevent machine learning scalability problems
Addressing ML scalability challenges involves selecting the right models, planning resource usage and managing network connectivity to support expanding applications and data load.
Adopting a new technology always comes with risks, especially when users have limited technical and practical experience with it.
Machine learning certainly fits this description -- in particular, emerging concepts like large language models (LLMs) and generative AI, which are built on ML foundations. As any technology evolves, its applications broaden, and this broadening introduces scalability challenges. Can early approaches withstand success and expanded use?
ML scalability factors to consider
Ensuring scalability in ML strategies hinges on accommodating additional load -- in other words, changes in usage. The major sources of additional load are as follows:
- More users or increased user activity; for example, more prompts and queries to an LLM per user.
- A new ML model that requires greater processing power or other resources.
- An increase in data volume, stemming either from more processes contributing data or more data collected and analyzed per process.
- More rules to apply in analysis.
- Reduced time allotted to complete an ML analysis and create a response.
- Repeating or redoing the ML model's learning or training process.
The role of model complexity
Model complexity is the fundamental question in ML scalability. ML resources and resource scaling practices for simpler models will more closely mirror traditional data center or cloud techniques.
As ML models approach the complexity of LLMs, hosting becomes more specialized, requiring different planning and management tools and practices. Don't pick a highly complex model thinking it will improve scalability; it could increase difficulty instead.
Self-hosting vs. cloud hosting
Scalability management for ML differs significantly for users who host models in house versus those who use cloud resources. For cloud users, the primary concern is cost management, as resources can be dynamically adapted to meet demand.
However, concerns about data security, data sovereignty and cost are driving enterprises toward self-hosting; currently, enterprises generally host most ML on their own servers. For self-hosted deployments, rapid changes in application load cut both ways: sudden increases can exhaust available hosting and connection capacity, while sudden decreases leave paid-for resources idle.
For most enterprises, the risk of overbuilding ML capability lies in the training phase. The main ongoing challenge is managing expanded load that stresses the deployed ML configuration.
How to ensure enterprise ML scalability
The first step in preventing ML scalability issues is proper planning. It's important to look ahead a reasonable amount of time -- for most enterprises, around three years -- to estimate ML usage and select appropriate models, hosting resources and network connectivity. This foresight is vital for both self-hosted and cloud-hosted ML, as it helps predict technology needs and costs.
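To make that kind of estimate concrete, the sketch below compounds assumed annual growth in users, per-user activity and data volume into a three-year load multiplier. All of the growth rates are hypothetical placeholders; substitute the organization's own projections.

```python
# Minimal three-year load projection. All growth rates are hypothetical
# placeholders -- substitute your own estimates.
YEARS = 3
user_growth = 0.30      # assumed 30% more users per year
activity_growth = 0.15  # assumed 15% more queries per user per year
data_growth = 0.25      # assumed 25% more data per process per year

query_multiplier = ((1 + user_growth) * (1 + activity_growth)) ** YEARS
data_multiplier = (1 + data_growth) ** YEARS

print(f"Year-{YEARS} query load: {query_multiplier:.1f}x today's")
print(f"Year-{YEARS} data volume: {data_multiplier:.1f}x today's")
```

Even rough multipliers like these help frame whether existing hosting and network capacity can absorb the projected load or whether the plan must include expansion.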
Next, consider the required response time for ML answers. Is the ML model involved in real-time analysis or in planning and analytics tasks? For real-time analysis, it's critical to manage response time by ensuring adequate resources are available at any expected level of usage or data complexity. For planning and analytics tasks, trading resources for a somewhat longer response time might be perfectly acceptable, reducing the need for extensive scaling.
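For the real-time case, Little's law offers a quick sizing check: the number of requests in flight equals the arrival rate multiplied by the response time. A minimal sketch, assuming hypothetical traffic figures and a per-server concurrency limit taken from load testing:

```python
import math

# Little's law sizing check: concurrency = arrival rate x response time.
# All figures are hypothetical; substitute measured values.
arrival_rate = 50.0         # requests per second at expected peak
target_latency_s = 0.5      # required response time for real-time use
per_server_concurrency = 8  # requests one server sustains at that latency

in_flight = arrival_rate * target_latency_s
servers_needed = math.ceil(in_flight / per_server_concurrency)
print(f"{in_flight:.0f} requests in flight -> {servers_needed} servers")
```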
Model selection
It's advisable to select the simplest model that will serve current and future needs. Simple models, such as regression analysis models and small-scale neural networks, are easier to scale. While complex LLM-level models offer greater analytical capabilities, simpler models enhance scalability without requiring extensive planning and simulation. They are also cheaper to run in the cloud and reduce the risk of unexpected cost overruns with changing usage patterns.
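One way to apply this advice is to benchmark a simple model against a more complex one before committing. The sketch below, using synthetic data purely for illustration, compares a plain linear regression with a small neural network; if their scores are close, the simpler model is usually the easier one to scale.

```python
# Baseline a simple model against a small neural network on the same task.
# Synthetic data for illustration only; use real data and business metrics.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25)

simple = LinearRegression().fit(X_tr, y_tr)
small_nn = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500).fit(X_tr, y_tr)

print(f"Linear regression R^2: {simple.score(X_te, y_te):.3f}")
print(f"Small neural net R^2:  {small_nn.score(X_te, y_te):.3f}")
```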
Some simple ML models might not require GPUs or can use cheaper, simpler GPUs. In such cases, ML applications can share server resources in the data center to manage variable loads or use less expensive cloud server resources, provided that proper priority control is in place.
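At its simplest, priority control can mean running batch ML work at a lower operating-system priority so interactive applications on shared servers keep their headroom. A minimal Unix-only sketch, in which the training function is a hypothetical stand-in:

```python
import os

def run_batch_training():
    """Hypothetical stand-in for a long-running training or scoring job."""
    ...

# os.nice() raises this process's niceness (Unix only), lowering its
# scheduling priority so interactive workloads are served first.
os.nice(10)
run_batch_training()
```

In a clustered deployment, the same idea usually lives in the scheduler -- priority classes or queues -- rather than in per-process settings.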
ML cluster management
The next step is to plan an ML cluster. ML involves more than just GPUs and servers; it requires a cluster with an input network for user connections, an internal cluster network for GPU servers, internal databases, connections to external databases for training and modeling, and management and operational services. Some high-usage scenarios with multiple ML applications might call for multiple ML clusters, with the ability to shift servers among those clusters as needed.
Don't focus too much on the CPU or GPU when it comes to performance. Memory bandwidth, cache size, bus speed and interface direct memory access speed are equally important for performance and scalability, particularly with many servers in the cluster.
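A roofline-style estimate shows why: attainable throughput is the lesser of peak compute and memory bandwidth multiplied by the workload's arithmetic intensity. The hardware figures below are hypothetical placeholders.

```python
# Roofline-style check: is the workload compute- or memory-bound?
# Hardware figures are hypothetical placeholders.
peak_tflops = 100.0  # accelerator peak compute, in TFLOPS
mem_bw_tb_s = 2.0    # memory bandwidth, in TB/s
intensity = 20.0     # workload arithmetic intensity, in FLOPs per byte

attainable = min(peak_tflops, mem_bw_tb_s * intensity)
bound = "compute" if attainable == peak_tflops else "memory bandwidth"
print(f"Attainable throughput: {attainable:.0f} TFLOPS, limited by {bound}")
```

If memory bandwidth turns out to be the binding constraint, adding faster GPUs scales nothing.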
Scaling the cluster might involve drawing on server resources from the company's data center resource pool, provided that these servers are properly equipped with GPUs -- or, model permitting, CPUs. In the cloud, configuration control is limited, but teams can define their ML cloud cluster in terms of the range of GPUs that can be committed and the storage of models and data.
After planning a cluster model, review the types of hosting to be used within it. Generally, bare metal is the most efficient way to host ML. If virtualization is needed to separate applications, VMs usually offer better performance than containers, but containers facilitate resource allocation across a larger pool of ML applications.
Network connectivity planning
Next, plan network connectivity using the fastest interfaces and highest device capacities practicable. Any increase in ML activity will disproportionately increase network load, particularly in the cluster interconnect role. Enterprises often use Ethernet, with features for explicit congestion control and prioritization, to connect GPU servers, as it's generally less expensive and easier to manage than InfiniBand.
In general, only two-thirds of interfaces on any switch should be occupied in initial deployment. This leaves space for GPU server expansion and avoids the need to replace switches if load increases -- a step especially relevant for ML self-hosting.
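The two-thirds guideline translates directly into a port budget. A minimal sketch, assuming hypothetical server and switch port counts:

```python
import math

# Switch port budget under the two-thirds initial-occupancy guideline.
# Server and port counts are hypothetical placeholders.
gpu_servers = 48       # servers to connect at initial deployment
links_per_server = 2   # cluster-interconnect links per server
ports_per_switch = 64  # ports on each cluster switch

usable_ports = math.floor(ports_per_switch * 2 / 3)  # keep a third free
switches = math.ceil(gpu_servers * links_per_server / usable_ports)
spare = switches * ports_per_switch - gpu_servers * links_per_server
print(f"{switches} switches needed, {spare} ports left for expansion")
```

This ignores uplink and topology overhead -- leaf-spine designs reserve additional ports -- so treat the result as a lower bound.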
Staging and caching training data
It's helpful to stage or cache training and model data in local cluster storage rather than accessing it from its normal database location. This approach reduces cross-loading between ML data usage and other enterprise applications.
Most ML applications require only a small portion of the overall database content for training, so staging only the necessary data also reduces the network load from scanning unneeded data. Many enterprises also adopt an AI/ML-optimized database, such as a vector database. Cloud ML users should consider staging data to the cloud to avoid access and traffic charges.
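A minimal sketch of the staging pattern: pull only the rows and columns the training job needs from the main database and land them in local cluster storage as Parquet. The connection string, table, columns and path are all hypothetical.

```python
import pandas as pd
from sqlalchemy import create_engine

# Stage only the training-relevant slice of a warehouse table into local
# cluster storage. All names here are hypothetical placeholders.
engine = create_engine("postgresql://user:pass@dbhost/warehouse")

query = """
    SELECT customer_id, features, label
    FROM transactions
    WHERE event_date >= '2024-01-01'  -- only the window training needs
"""
df = pd.read_sql(query, engine)

# Local Parquet keeps training reads off the enterprise network.
df.to_parquet("/cluster/staging/training_set.parquet", index=False)
```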
Training infrastructure considerations
For training, consider using a cloud-hosted resource, such as a public AI service, public cloud GPU hosting or an emerging GPU-as-a-service provider. While this means entrusting data security to the provider, training often uses historical information, which poses a smaller business risk.
Using a bare-metal GPU resource for training can be faster and more secure. If opting for a hosted training resource, ensure that the ML cluster is built and test the model in both that cluster and the GPU host service before starting full-scale training. This includes verifying compatibility by testing a small-scale training result in the cluster.
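One way to verify that compatibility is to run an identical short training job in each environment and compare the loss trajectories; material divergence points to library or hardware differences worth resolving before full-scale training. A minimal, dependency-light sketch with a fixed seed:

```python
import numpy as np

# Smoke test: run the same tiny training job in each environment and
# compare the recorded losses. Pure NumPy keeps it dependency-light.
def smoke_train(seed: int = 0, steps: int = 100) -> list[float]:
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(256, 10))
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=256)
    w = np.zeros(10)
    losses = []
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of (MSE / 2)
        w -= 0.05 * grad                   # plain gradient descent step
        losses.append(float(np.mean((X @ w - y) ** 2)))
    return losses

# Run in both the cluster and the hosted GPU service, then diff the output;
# large gaps suggest environment discrepancies to fix before scaling up.
print(smoke_train()[-1])
```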
Tom Nolle is founder and principal analyst at Andover Intel, a consulting and analysis firm that looks at evolving technologies and applications first from the perspective of the buyer and the buyer's needs. By background, Nolle is a programmer, software architect, and manager of software and network products, and he has provided consulting services and technology analysis for decades.