Small language models emerge for domain-specific use cases
Because LLMs suffer from accuracy and security problems, some organizations are developing generative AI systems trained with their own company data to address specific use cases.
While many are using large language models to write content and improve search results, enterprises are developing domain-specific models trained on their own data to address specific business problems.
Generative AI, the category that includes large language models (LLMs), traces its roots to the 1960s.
However, it wasn't until OpenAI launched ChatGPT in November 2022 -- which represented significant improvement in LLMs' capabilities -- that they became advanced enough to potentially help people become more efficient in both everyday life and in their work.
Since then, generative AI has been the dominant trend in both analytics and data management, with hordes of vendors unveiling plans to develop tools incorporating generative AI.
Due to their extensive vocabularies, LLMs have the potential to enable natural language processing with freeform rather than business-specific language, which could widen the use of analytics. In addition, because they can translate text to code, they have the potential to make data engineers more efficient as they build and manage data pipelines.
But LLMs sometimes suffer "hallucinations" -- including inaccurate and misleading responses -- and they're subject to security risks. They're also trained on public data so have no understanding of the many nuances of a given organization's operations.
Enterprises, therefore, are realizing there may be a better way to use generative AI in their business: training their own language models. Using their own data, the models are designed to address problems specific to an organization's industry, such as finance, healthcare or supply chain management.
Kevin Petrie, an analyst at Eckerson Group, calls them small language models or domain-specific language models.
Recently, he discussed the rising interest in small, domain-specific language models, including how they differ from LLMs, what types of organizations are developing them and how they can be applied. In addition, he spoke about how long it will take before such models are ready for use and what barriers -- particularly data quality -- organizations need to overcome to get them into production.
Editor's note: The following was edited for length and clarity.
What is a large language model?
Kevin Petrie: Large language model is a category of generative AI that generates text, or potentially other types of content, based on natural language prompts. It's based on a neural network that studies the relationships of concepts and texts.
The process by which you train a large language model is that you tokenize unstructured text, meaning you convert specific words, punctuation or characters to numbers. Then you study a lot of text to understand how those different words relate to each other in context. It boils down to a massive number cruncher that predicts the next word or phrase or sentence in a string of words or phrases or sentences based on what came before and based on what it knows about how those words and phrases and sentences relate to each other from its training data.
Examples of large language models are the ones getting the headlines right now -- ChatGPT from OpenAI, Bard from Google, Bloom from Hugging Face and others. They're trained on lots of text and have billions of parameters, which are essentially values that help describe the interrelationships of words.
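To make the tokenize-then-predict process Petrie describes more concrete, here is a minimal sketch in Python. It is not how ChatGPT or any production LLM works internally: it swaps the neural network for a simple bigram count table, and the corpus, function names and whitespace tokenizer are all illustrative assumptions. It only shows the two steps named above, converting words to numbers and predicting the next word from what came before.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase tokens (a toy stand-in for a real tokenizer)."""
    return text.lower().split()


def build_vocab(tokens):
    """Map each distinct token to an integer ID, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def train_bigram(ids):
    """Count how often each token ID follows each other token ID."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(ids, ids[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev_id):
    """Return the most frequent follower of prev_id, or None if it was never seen."""
    followers = counts.get(prev_id)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training text, invented for this example.
corpus = (
    "the pipeline loads data . "
    "the pipeline checks data quality . "
    "the pipeline loads tables ."
)

tokens = tokenize(corpus)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]  # words converted to numbers
counts = train_bigram(ids)
id_to_token = {i: t for t, i in vocab.items()}

next_id = predict_next(counts, vocab["pipeline"])
print(id_to_token[next_id])  # prints "loads"
```

In this toy corpus, "loads" follows "pipeline" twice and "checks" follows it once, so the model predicts "loads." A real LLM replaces the count table with billions of learned parameters and looks at far more context than the single previous word.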
What is a small language model? How is it different from an LLM?
Petrie: A small language model applies the same methodology but goes a step further to tackle domain-specific data and domain-specific use cases, often using internal enterprise data.
Large language models and small language models are two ends of a spectrum. People are realizing that to solve hard problems and gain productivity benefits without all the attendant risks related to data quality and hallucinations, they need to get more domain specific. They need to fine-tune the training of their models to handle deep domain data, which often exists within enterprise firewalls.
If small language models are trained largely with an organization's own data, how are they different from other AI models?
Petrie: A small language model could have some starter code that was trained in LLM context but is getting fine-tuned to look at domain-specific data. Also, it's a different type of architecture from some other AI because it's based on a neural network and it's studying tokenized text. A lot of existing AI models that enterprises have used to date are more focused on machine learning model use cases, such as cluster analysis, linear regressions and anomaly detection. That's different than studying text, understanding how texts relate to each other and generating text.
What do small language models enable that LLMs do not?
Petrie: They enable companies to improve their productivity and their creativity while incurring fewer risks in terms of data quality, hallucinations, handling of intellectual property, privacy and bias. If you're training on domain-specific data, there are fewer gaps in the knowledge that the language model must work with. Hallucinations often arise when language models try to fill in gaps -- they surmise and make things up.
You can have companies building their own language models to engage more proactively with customers. In the data management sphere, data pipeline vendors, such as Informatica and others, are starting to develop small language models -- alongside large language models -- to help data engineers build pipelines, document their environments, test data quality and so forth.
What's a real-world example of a small language model in production?
Petrie: A startup named Illumex was training its own language models before ChatGPT made it cool with its debut in November 2022. They use it to create descriptions of different assets they are presenting to users in their data catalog. Now they're also using ChatGPT to enrich those descriptions of assets in their data catalog.
Could organizations build their own small language models before the latest generation of generative AI, or do they need to pull some capabilities only now available to develop a small language model?
Petrie: What ChatGPT and Bard and so forth did was demonstrate the power, breadth and speed of these outputs. They, along with a whole suite of open source communities, also provided code that can be fine-tuned on their domain-specific data. As a result, a lot of companies are building their own language models.
We did a survey at Eckerson Group that showed about 30% of companies say they're building their own language models. We'll have to see what success they have and how much data quality interferes with that success over the next year or two. But there's certainly a lot of interest in building them.
The signal of that interest is that Databricks was willing to pay $1.3 billion for a startup called MosaicML that helps companies build and train these language models.
You mentioned that 30% of the companies you surveyed are developing their own language models. Are they already realizing benefits from those models?
Petrie: It's going to take some time to produce high-quality language models that are ready for production and that companies feel carry an acceptable level of risk and offer enough upside. But the effort is underway.
That 30% does include some data vendors that are building their own language models. Data-savvy software companies are more likely to be early adopters than mainstream Fortune 2000 companies.
Is the ability to develop domain-specific language models something that is exclusively the domain of data vendors and large organizations or can mid-sized enterprises do so as well?
Petrie: Initially, there will be two categories of early adopters. One is data-savvy or AI/ML-savvy software companies. The other is large organizations that have the extensive resources to put into a venture like this.
What is the outlook for small language models -- will we eventually hear more about domain-specific models than LLMs?
Petrie: Small language models will become more widespread and generate more long-term productivity gains. It could be as simple as companies getting more scientific about how they feed prompts -- rich, detailed, domain-specific prompts -- into a public LLM. That's a domain-specific, small language model approach because it's getting deep into enterprise data.
I think companies will get more long-term productivity gains from the small language models and developing domain-specific applications of language models.
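The simpler route Petrie mentions, feeding rich, domain-specific prompts into a public LLM, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the documents are invented, the word-overlap scoring is a naive placeholder for whatever retrieval method an organization would actually use, and `build_prompt` is a hypothetical helper rather than part of any vendor's API. The sketch stops at assembling the prompt text and does not call a model.

```python
import re


def words(text):
    """Return the set of lowercase alphanumeric words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(question, document):
    """Count the words a question and a document share (naive relevance)."""
    return len(words(question) & words(document))


def build_prompt(question, documents, top_k=2):
    """Pick the top_k most relevant documents and place them ahead of the question."""
    ranked = sorted(documents, key=lambda d: score(question, d), reverse=True)
    context = "\n".join(f"- {d}" for d in ranked[:top_k])
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Hypothetical internal documents, invented for this example.
docs = [
    "Invoices over 10000 dollars require approval from the finance director.",
    "The cafeteria is open from 8 to 3 on weekdays.",
    "Approval requests for invoices are filed in the procurement portal.",
]

prompt = build_prompt("Who must give approval for large invoices?", docs)
print(prompt)
```

The resulting prompt carries the two invoice-related documents and leaves out the unrelated one, which is the sense in which a prompt can get "deep into enterprise data" without retraining the model. Because the quality of the answer depends on the quality of the documents supplied, the data quality concerns Petrie raises apply here as well.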
Will developing language models be a smooth process, or are there barriers that need to be overcome before organizations can easily build domain-specific models?
Petrie: We're in an inflated hype cycle right now. As we know, enthusiasm tends to overstate the initial benefits and impacts of a new technology as well as understate the long-term benefits and impact. That will happen here.
The wall that a lot of companies will hit is a wall that we've been dealing with for decades, which is data quality. You'll see companies renew their investments in data quality, data observability, master data management, labeling and metadata management to ensure they have a handle on governed training inputs and prompts for these language models.
That must be a precursor to longer-term productivity gains.
How long will it take to ensure enough data quality for language models to be moved into production?
Petrie: I don't know. Chief data officers are struggling with old problems like master data management and data quality. Data teams are struggling to stay on top of these things because data sources are proliferating and data volume is proliferating. I think that within a couple of years, we'll see companies that have created viable fenced-off areas where they have good, clean data for language models. It won't be widespread across the organization, but there will be pockets.
Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.