The importance and limitations of open source AI models
Despite rising interest in more transparent, accessible AI, a scarcity of public training data and compute infrastructure presents significant hurdles for open source AI projects.
Because AI models differ in some key respects from other types of software, applying open source concepts and practices to AI models isn't always a straightforward task. It's one thing to create open source code; it's another to actually train and deploy an AI model.
Some of the most popular AI engines, such as the GPT model series underlying OpenAI's ChatGPT, have recently come under fire for being proprietary technology that operates as a black box. Amid these debates and the costs associated with proprietary AI, there is growing interest in open source AI models, which promise greater transparency and flexibility.
What is an open source AI model?
Anyone can freely access, modify and share the source code for open source software. In contrast, the source code for closed-source software -- also known as proprietary software -- is private, and only the original creators can alter and distribute it.
Similarly, an open source AI model is one whose source code is publicly available, meaning that anyone can download, view and modify the raw code that powers the model's algorithms. This accessibility ensures a level of transparency and customizability that closed-source models lack.
Many of today's best-known generative AI tools, including ChatGPT and Midjourney, are closed source, but there is also a growing open source generative AI ecosystem. Popular open source large language models available today include Meta's Llama models and models from French startup Mistral AI.
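As a concrete illustration of that openness, anyone can download the weights of these models and run them locally. Below is a minimal sketch, assuming the Hugging Face transformers library is installed and using Mistral's publicly released Mistral-7B checkpoint as the example; any other open checkpoint follows the same pattern.

```python
# Minimal sketch: download and run an openly released model locally.
# Assumes the Hugging Face transformers library and enough memory to hold
# a 7B-parameter model; the model ID is Mistral's public release on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Open source AI models let developers"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```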
Challenges in developing open source AI models
The source code that powers an AI model is similar in many ways to the source code for any other algorithm or application. AI models can be built in any programming language, using the same development processes as for standard applications. In this sense, creating an open source AI model is straightforward.
Where matters get tricky for open source AI developers, however, is the model training stage. Training AI models requires access to vast amounts of data, and few open source AI companies or projects release that data publicly.
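To make that distinction concrete, here is a deliberately tiny, hypothetical model definition in PyTorch. Publishing code like this is easy; what it cannot convey is the enormous data set, and the compute, needed to turn its randomly initialized weights into something useful.

```python
# Illustrative only: a toy language model architecture in PyTorch.
# The "open source" part -- the code -- is simple to share. The weights are
# random until the model is trained, and that training data is what open
# source AI projects rarely release.
import torch.nn as nn

class TinyLanguageModel(nn.Module):
    def __init__(self, vocab_size=32000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):
        hidden, _ = self.rnn(self.embed(token_ids))
        return self.head(hidden)  # next-token logits

model = TinyLanguageModel()  # untrained: the architecture without the data
```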
For example, in response to a question on a Hugging Face forum about training data for the Mistral 7B LLM, Mistral AI stated, "Unfortunately we're unable to share details about the training and the datasets (extracted from the open Web) due to the highly competitive nature of the field." In other words, the company releases its models' code, but not the data used to train those models.
This lack of publicly available training data for open source AI models creates two main challenges:
- Developers seeking to modify and retrain a model might struggle to do so. Although changing the model's source code is relatively straightforward, lack of access to the training data prevents developers from retraining their altered model on the original data set.
- Lack of transparency about a model's training data can obscure the model's operations, even when the model's source code is viewable. Because training data plays a major role in the inferences a model generates, open sourcing a model's code without also sharing the training data only provides partial transparency into the model's inner workings.
These challenges are unique to open source AI, because other types of software don't require training. The need for training, coupled with the difficulty of finding publicly available open data sets, is a significant technical hurdle that sets open source AI development apart from other areas of software.
The role of fine-tuning
Many open source models can be fine-tuned, which can alter how they operate. Fine-tuning adjusts the model's weights through additional training on a smaller, task-specific data set, rather than retraining the model from scratch. Although it offers some opportunity for customization, fine-tuning is different from modifying the core code of the model itself or having complete control over the training data.
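One widely used form of this is parameter-efficient fine-tuning, such as LoRA, which trains a small set of adapter weights on top of a frozen base model. The sketch below assumes the Hugging Face transformers and peft libraries; the model ID and hyperparameters are illustrative, not a recipe from any particular vendor.

```python
# Rough sketch of parameter-efficient fine-tuning (LoRA) on an open checkpoint.
# Assumes the transformers and peft libraries; all values are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights

# From here, the wrapped model trains with a standard Trainer loop on a
# task-specific data set; the base weights stay frozen and only the small
# adapters are updated, which is why full retraining isn't required.
```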
Where to get training data for open source AI
The fact that open source AI companies and projects rarely release their data sets doesn't mean that developers seeking to build, or build upon, open source AI models are totally out of luck. Those looking to develop or enhance their own AI models can acquire their own training data and retrain the model as needed.
One approach is simply to use the vast volumes of content available on the web for training purposes. This could lead to copyright challenges, however, as evidenced by recent lawsuits from content owners such as The New York Times alleging that AI developers trained on their data without permission. Plus, scraping and preparing extensive web data for training purposes is tremendously challenging; there is a massive amount of data to deduplicate, for example.
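To give a rough sense of what even that one preprocessing step involves, the sketch below performs exact-match deduplication by hashing lightly normalized text. Real web-scale pipelines go much further -- near-duplicate detection with techniques such as MinHash, language identification and quality filtering -- which is part of why preparing web data is so demanding.

```python
import hashlib

def dedupe_exact(docs):
    """Drop exact duplicates by hashing lightly normalized text.

    Illustrative only: real web-scale pipelines also need near-duplicate
    detection (e.g., MinHash/LSH) and quality filtering.
    """
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["An example page.", "An   example page.", "A different page."]
print(dedupe_exact(corpus))  # the whitespace-only variant is dropped
```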
Another strategy is to use open data sets, such as those listed in this GitHub reference, which are licensed for training purposes. The drawback is that this data might not be sufficient in volume or scope for training, depending on the model's intended purpose.
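Many of these openly licensed corpora can be pulled directly with the Hugging Face datasets library; the snippet below uses the WikiText-103 corpus purely as an illustration of the workflow.

```python
# Sketch of loading an openly licensed corpus for training or fine-tuning.
# WikiText-103 is just an example; substitute whichever licensed data set
# fits the model's intended purpose.
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
print(dataset)                    # row count and column names
print(dataset[10]["text"][:200])  # peek at one record
```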
The compute issue
Even when open source AI developers can obtain training data for their models, the actual training process requires enormous amounts of compute. Those resources cost money, and money is something most open source projects have little of, if any.
Some open source AI companies, including Mistral, have managed to address this problem by raising funding, which they use in part to pay for model training. But this approach might not be feasible for smaller-scale open source projects, or for companies that want to use open source AI but can't justify investing large sums of cash in model training.
The future of open source AI models
In light of these challenges, developers and organizations that want to use open source AI will likely need to settle for pretrained models from companies such as Meta and Mistral for the time being. The inability to retrain those models restricts their transparency and flexibility, presenting limitations not typically encountered with other types of open source software.
This could change if organizations with deep pockets invest in creating open source data sets tailored for AI training, while also providing the compute resources necessary to perform that training. But for now, the challenges associated with model training mean that access to open source AI models doesn't typically confer all the advantages associated with other types of open source software.
Chris Tozzi is a freelance writer, research adviser, and professor of IT and society who has previously worked as a journalist and Linux systems administrator.