How to train an LLM on your own data

Retraining or fine-tuning an LLM on organization-specific data offers many benefits. Learn how to start enhancing your LLM's performance for specialized business use cases.

General-purpose large language models are convenient because businesses can use them without any special setup or customization. However, to get the most out of LLMs in business settings, organizations can customize these models by training them on the enterprise's own data.

Customized LLMs excel at organization-specific tasks that generic LLMs, such as those that power OpenAI's ChatGPT or Google's Gemini, might not handle as effectively. Training an LLM to meet specific business needs can result in an array of benefits. For example, a retrained LLM can generate responses that are tailored to specific products or workflows.

To decide whether to train an LLM on organization-specific data, start by exploring the different types of LLMs and the benefits of fine-tuning one on a custom data set. Next, walk through the steps required to get started: identifying data sources, cleaning and formatting data, customizing model parameters, retraining the model, and finally testing the model in production.

Generic vs. retrained LLMs

LLMs can be divided into two categories:

  • Generic LLMs. Designed to support a wide array of use cases, these LLMs are typically trained on broad sets of data. For the biggest LLMs, such as those built by OpenAI and Google, this can include virtually the entire expanse of information available on the internet.
  • Retrained or fine-tuned LLMs. These LLMs are trained, at least in part, on custom, purpose-built data sets. In a business context, this might include documentation or emails specific to a particular corporation.

Training an LLM using custom data doesn't mean the LLM is trained exclusively on that custom data. In many cases, the optimal approach is to take a model that has been pretrained on a larger, more generic data set and perform some additional training using custom data.

That approach, known as fine-tuning, is distinct from retraining the entire model from scratch using entirely new data. But complete retraining could be desirable in cases where the original data does not align at all with the use cases the business aims to support.

Benefits of training an LLM on custom data

Why might someone want to retrain or fine-tune an LLM instead of using a generic one that is readily available? The most common reason is that retrained or fine-tuned LLMs can outperform their more generic counterparts on business-specific use cases.

For instance, an organization looking to deploy a chatbot that can help customers troubleshoot problems with the company's product will need an LLM with extensive training on how the product works. Even if a generic LLM has some familiarity with the product -- for example, through training data that includes product mentions in public data sources -- it's not likely to have been trained exhaustively on all data sources relevant to the product. The company that owns that product, however, is likely to have internal product documentation that the generic LLM did not train on.

Without all the right data, a generic LLM doesn't have the complete context necessary to generate the best responses about the product when engaging with customers. When developers at large AI labs train generic models, they prioritize parameters that will drive the best model behavior across a wide range of scenarios and conversation types. While this is useful for consumer-facing products, it means that the model won't be customized for the specific types of conversations a business chatbot will have.

Organizations can address these limitations by retraining or fine-tuning the LLM using information about their products and services. In addition, during custom training, the organization's AI team can adjust parameters like weights to steer the model toward the types of output that are most relevant for the custom use cases it needs to support.

Training LLMs on custom data: A step-by-step guide

The following steps outline how to train an LLM on custom data, along with some of the tools available to assist at each stage.

1. Identify data sources

First, choose relevant data sources for model retraining. The goal should be to find data that meets the following criteria:

  • Sufficient in volume to enable effective retraining. Exactly how much custom data is needed will vary depending on factors like the complexity of the use case and the pretrained model's existing awareness of the relevant information. But in general, expect to need thousands of data records at minimum. In some cases, custom LLM training might require hundreds of thousands or millions of new records.
  • Relevant to the custom use cases the LLM will support. Only use data that focuses directly on the target use case; extraneous data will confuse the model.
  • Relatively high in quality. The data doesn't need to be perfect because some data quality deficiencies can be addressed through cleaning, but it should be decent. For instance, a model might not be able to effectively interpret emails that are rife with typos.
  • Available in a mode that the LLM supports. While some models are multimodal, meaning they can accept multiple types of data, others can only train on a specific data type, such as text or images.

2. Clean data

Before retraining the model, clean up the data by mitigating data quality deficiencies. This includes handling issues such as the following:

  • Corrupt data. Remove corrupted or unreadable records from the training data set.
  • Duplicate data. Reduce duplicate copies of the same data to a single copy prior to retraining.
  • Incomplete data. Either remove incomplete records or, when feasible, complete them by adding the missing information.

The data used for retraining doesn't need to be perfect, since LLMs can typically tolerate some data quality problems. But the higher in quality the data is, the better the model is likely to perform. Open source tools like OpenRefine can assist in cleaning data, and a variety of proprietary data quality and cleaning tools are available as well.
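
As a simple illustration, the following Python sketch uses the pandas library to apply those cleanup steps to a hypothetical CSV file of support emails. The file name and column names are placeholders for an organization's own data:

    import pandas as pd

    df = pd.read_csv("support_emails.csv")  # hypothetical source file

    # Reduce duplicate copies of the same record to a single copy.
    df = df.drop_duplicates()

    # Drop incomplete records that lack a question or an answer.
    df = df.dropna(subset=["question", "answer"])

    # Filter out records that appear corrupt -- here, entries too
    # short to contain a real exchange.
    df = df[df["question"].str.len() > 10]
    df = df[df["answer"].str.len() > 10]

    df.to_csv("support_emails_clean.csv", index=False)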

3. Format data

Depending on the types of data used in model retraining, it might be necessary to format the data in a specific way. While some LLMs can be retrained using data that is inconsistently structured, such as emails or Microsoft Word documents, models are often better able to recognize relevant patterns and input-output relationships if the training data is structured in a specific way.

For example, to help a model understand how a customer support team responds to customer requests, an organization could format training data by extracting information from email exchanges between the support team and customers. Each customer question would be structured as the input, with the support team's response as the corresponding output. The data could then be stored in a file or set of files using a standardized format, such as JSON.
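
As an example, a short Python script along the following lines could handle that conversion into JSON Lines, a common format for fine-tuning data. It assumes the cleaned CSV file from the previous step, and the field names are illustrative; check the target model's documentation for its expected schema:

    import csv
    import json

    # Convert cleaned question-and-answer pairs into a JSON Lines
    # file, with one input-output record per line.
    with open("support_emails_clean.csv", newline="", encoding="utf-8") as src, \
            open("train.jsonl", "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            record = {"input": row["question"], "output": row["answer"]}
            dst.write(json.dumps(record) + "\n")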

Formatting data is often the most complicated step in the process of training an LLM on custom data, because there are currently few tools available to automate the process. One way to streamline this work is to use an existing generative AI tool, such as ChatGPT, to inspect the source data and reformat it based on specified guidelines. But even then, some manual tweaking and cleanup will probably be necessary, and it might be helpful to write custom scripts to expedite the process of restructuring data.

4. Customize parameters

Consider customizing the parameters used during model training. Parameters control how models interpret data, including how they identify relationships between multiple data records. By tweaking parameters, organizations can guide the model toward behaving in certain ways.

During model retraining, teams commonly customize model weights: variables that indicate the strength of a relationship between two types of data within a training data set. For example, imagine an organization is retraining a model to provide customer support about a product, and the product documentation includes specialized technical jargon. The organization's AI team could adjust model weights to connect technical terminology used by the support team with the everyday language customers typically use. Without this adjustment, a generic model might not understand the special jargon related to the product and thus might fail to recognize these relationships.

Exactly which parameters to customize, and the best way to customize them, varies between models. In general, however, parameter customization involves changing values in a configuration file -- which means that actually applying the changes is not very difficult. Rather, determining which custom parameter values to configure is usually what's challenging. Methods like low-rank adaptation (LoRA) can help by reducing the number of parameters teams need to change as part of the fine-tuning process.
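
As a rough sketch, the following example uses Hugging Face's peft library to attach LoRA adapters to a small open source model. The base model and hyperparameter values are illustrative starting points, not recommendations for every situation:

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # example base model

    lora_config = LoraConfig(
        r=8,                        # rank of the low-rank update matrices
        lora_alpha=16,              # scaling factor applied to the updates
        target_modules=["c_attn"],  # which layers get adapters; model-specific
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()  # shows only a small fraction is trainable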

5. Retrain the model

With all the prep work complete, it's time to perform the model retraining. In many respects, this is the easiest part of the overall process.

Again, the technical process for model retraining will vary depending on the base model. But in most cases, it boils down to running code that ingests the custom data set into the model and retrains the model based on the parameters set in the previous step.
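
The details differ by framework, but with the Hugging Face transformers and datasets libraries, a compact fine-tuning run might look something like the following sketch. It assumes the train.jsonl file produced in step 3; the base model and hyperparameters are illustrative, and the LoRA-wrapped model from step 4 could be substituted for the plain base model:

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model = AutoModelForCausalLM.from_pretrained("gpt2")  # or a LoRA-wrapped model
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

    dataset = load_dataset("json", data_files="train.jsonl", split="train")

    def tokenize(example):
        # Join each input-output pair into a single training sequence.
        text = example["input"] + "\n" + example["output"]
        return tokenizer(text, truncation=True, max_length=512)

    tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="custom-model", num_train_epochs=3),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    trainer.save_model("custom-model")
    tokenizer.save_pretrained("custom-model")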

The time required for training can vary widely depending on the amount of custom data in the training set and the hardware used for retraining. The process could take anywhere from under an hour for very small data sets to weeks for something more intensive.

6. Test the customized model

The final step is to test the retrained model by deploying it and experimenting with the output it generates. The complexity of AI training makes it virtually impossible to guarantee that the model will always work as expected, no matter how carefully the AI team selected and prepared the retraining data.
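
A simple first check is to load the retrained model and inspect its responses to representative questions, as in the following sketch. It assumes the custom-model directory saved in the previous step, and the prompt is a made-up example:

    from transformers import pipeline

    # Load the retrained model saved in step 5 and generate a response.
    generator = pipeline("text-generation", model="custom-model")
    prompt = "How do I reset the device to factory settings?"
    print(generator(prompt, max_new_tokens=100)[0]["generated_text"])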

If the retrained model doesn't behave with the required level of accuracy or consistency, one option is to retrain it again using different data or parameters. Getting the best possible custom model is often a matter of trial and error.

Chris Tozzi is a freelance writer, research adviser, and professor of IT and society who has previously worked as a journalist and Linux systems administrator.
