New Databricks open source LLM targets custom development

The data platform vendor's new language model was designed to provide open source users with AI development capabilities similar to those provided by closed source models.

Databricks on Wednesday launched DBRX, a new open source large language model designed to match the performance of closed source models and help users develop customized AI models.

DBRX is not the first open source LLM. For example, Llama 2 from Meta, Bloom from the BigScience project and Mixtral from Mistral AI are popular open source LLMs available to any developer.

Historically, however, open source LLMs have not performed as well as closed source LLMs, such as GPT-3.5 and Google Gemini, in benchmark testing. Perhaps most critically, they aren't as good at understanding language and delivering accurate responses to queries. In addition, they aren't as fast and efficient.

Databricks designed its new LLM to alter that paradigm.

DBRX was built to meet or exceed closed source LLMs in benchmark testing to provide users with similar capabilities. Independent benchmark testing on DBRX is not yet available, but the LLM compares favorably to GPT-3.5 from OpenAI and Gemini from Google, according to Databricks.

That mix of open source and high performance is important, according to Donald Farmer, founder and principal of TreeHive Strategy.

"DBRX is significant for sure," he said. "Performance is a problem for many LLM use cases … so this should enable customizable GenAI apps without vendor tie-in, trained on your data and creating your IP, all with excellent performance."

In fact, Farmer called developing DBRX the most significant thing Databricks has done so far to enable generative AI development.

David Menninger, an analyst at ISG's Ventana Research, noted that as new LLMs are developed, they inevitably surpass the performance of older ones. DBRX continues that trend.

"We are in the midst of LLM leapfrog right now," he said. "As new LLMs are released, they will offer incremental improvements on previously available models."

Whether DBRX is the most significant thing Databricks has developed so far to enable generative AI development remains to be seen, Menninger continued.

"The measure of impact for an open-source project is adoption," he said. We'll have to wait and see how widely it is adopted, but it will certainly be welcomed by Databricks customers."

Based in San Francisco, Databricks is a data platform vendor that helped pioneer the data lakehouse. Over the past year, the vendor has made AI -- including generative AI -- a focal point of its product development and acquisition strategies, putting together a suite of tools aimed at enabling users to develop and train AI applications in a secure environment.

The launch of DBRX comes less than a week after Databricks acquired Lilac AI to add new generative AI development capabilities and 13 days after the vendor partnered with Mistral AI to provide users with more model choice when creating AI applications.

In addition, Databricks acquired MosaicML in June 2023 to add generative AI development capabilities. It subsequently launched Mosaic AI, an environment for the creation of AI models and applications. Later in 2023, the vendor introduced new LLM and GPU optimization capabilities and unveiled a set of tools, including retrieval-augmented generation and vector search, aimed at enabling customers to train AI models.

Customization is key

Part of the power of generative AI is that LLMs enable anyone to ask questions of data in natural language and receive responses in natural language.

LLMs do not, however, have any knowledge of an individual business.

They are trained on public data, which makes them useful for asking questions about historical events. In addition, they are trained to generate language and images based on prompts, making them able to write prose on their own or create illustrations.

But they are not trained on proprietary data. They cannot, for example, project a retail outlet's sales over the next month because they have no historical sales data from that retail outlet.

To make generative AI effective for business purposes, businesses need to either create their own smaller language models or retrain LLMs with proprietary data.

As a result, many enterprises have begun customizing models to meet their domain-specific needs.
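The mechanics of that customization vary by model and toolchain, but the general pattern is straightforward. The sketch below is purely illustrative and is not a Databricks-endorsed workflow: the base model name, the toy "proprietary" records and the use of a LoRA adapter via the Hugging Face transformers and peft libraries are all assumptions chosen to show one common way a business might adapt an open source LLM to its own data.

```python
# Minimal sketch: adapting an open source LLM to proprietary data with a small
# LoRA adapter. Model name, records and hyperparameters are placeholders only.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder open source base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Toy stand-in for proprietary text, e.g. internal sales notes or support logs.
records = [{"text": "Store 112 sold 240 units of SKU A-17 in week 9."},
           {"text": "Store 415 sold 180 units of SKU A-17 in week 9."}]
dataset = Dataset.from_list(records).map(
    lambda r: tokenizer(r["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

# Attach small trainable LoRA matrices instead of updating all base weights.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("out/adapter")  # proprietary knowledge stays in-house
```

The appeal of this kind of adapter-based approach is that the proprietary data and the resulting weights never have to leave the enterprise, which is exactly the control argument open source proponents make.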

Databricks developed DBRX to enable that customization, just as other LLMs enable customization, according to the vendor. The company also, however, developed its new LLM to be open source, perform better than existing open source LLMs and perform as well as existing closed source LLMs.

That combination of open source and high performance sets it apart, according to Stephen Catanzano, an analyst at TechTarget's Enterprise Strategy Group.

"They have introduced a new LLM as an open source project to use your data and build models faster [than with existing open source LLMs]," he said. "Others like Llama 2 do this now, but … the claim is that this is faster."

DBRX was developed using Mosaic AI and trained on NVIDIA DGX Cloud, according to Databricks. In addition, Databricks built DBRX with a mixture-of-experts architecture based on the open source MegaBlocks project to improve the LLM's efficiency.
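The article doesn't detail how DBRX implements that architecture, but the mixture-of-experts idea itself is simple: a router sends each token to a small subset of specialist feed-forward "experts," so only a fraction of the model's parameters is active per token. The toy layer below illustrates only that routing concept; every dimension is an arbitrary placeholder, and it makes no claim about how DBRX or MegaBlocks actually work.

```python
# Toy mixture-of-experts layer: a router picks the top-k experts per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                  # x: (tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)  # keep only top-k experts
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)    # renormalize their weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 64)    # 16 token embeddings
print(TinyMoE()(tokens).shape)  # torch.Size([16, 64])
```

Because each token activates only a couple of experts, a model can hold many more total parameters than it uses on any single forward pass, which is where the efficiency claim comes from.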

In benchmark testing data provided by Databricks, DBRX slightly outperformed the open source models Llama 2, Mixtral and Grok-1 in language understanding and substantially outperformed them in programming and math.

Similarly, DBRX slightly outperformed GPT-3.5 in language understanding and more substantially outperformed it in programming and math. Databricks did not provide similar comparisons with Gemini and other closed source LLMs.

Openness

Databricks has its roots in the open source community and open source development remains important to the vendor, according to Joel Minnick, Databricks' vice president of product marketing.

The vendor's lakehouse platform was built on the open source Apache Spark framework, and over the past 11 years, Databricks has continually shared its development work with the open source community. For example, in 2023 the vendor collaborated with the open source community to create Delta Lake 3.0, an open source storage format that aims to help users unify their data across numerous systems.

Similarly, Databricks' partnership with Mistral AI enables users to access Mistral AI's open source LLMs but does not include Mistral Large, the Paris-based AI vendor's closed source LLM.

The open source nature of Databricks' new LLM follows in that vein.

"We're a team of academics at heart, so being [part of] AI innovation and making a lasting impact on the open source community is very important to us," Minnick said. "Our choice to make DBRX open source is rooted in our ultimate goal to democratize the efficient and cost-effective training of custom LLMs."

The advantages of open source compared with closed source include security, control and cost savings.

When using open source software, all the code is transparent. That enables enterprises to understand exactly how proprietary data is integrated and gives them complete control over interactions between their systems and open source systems.

Closed source software does not make its code transparent. Therefore, integrations between an enterprise's systems and closed source systems require sacrificing control and putting faith in the effectiveness of a vendor's security measures.

Those advantages are reflected in a survey of a small sampling of AI executives published March 21 by Andreessen Horowitz, which found that nearly 60% of respondents would prefer to use open source LLMs once open source models match the performance of closed source models.

Magnanimity, however, is likely not Databricks' only motivation for making its new LLM open source, according to Catanzano.

By attracting users to DBRX, the vendor can potentially use the new LLM as an entry point for use of its broader platform.

"Databricks has the goal of driving workloads to their lakehouses, where they make money," Catanzano said. "I don't think they want to be in the LLM business. But by making LLMs and other tech in this space open source and then providing support, it will … drive more use of their solutions."

Next steps

Before Databricks developed DBRX, it launched Dolly in March 2023.

Dolly, like DBRX, is an LLM. However, Dolly lacks the broad capabilities and performance of DBRX, according to Minnick. Databricks developed Dolly to demonstrate that LLMs could be customized with proprietary data to meet business needs but did not emphasize the speed and accuracy capabilities that were built into DBRX.

"Dolly was a proof of concept built on top of an existing base model," Minnick said. "As a proof of concept, Dolly lacked the broad capabilities and performance to be truly viable for most production use cases. In contrast, DBRX is an original LLM built from the ground up on the latest research."


Following the launch of DBRX and Databricks' recent partnership with Mistral AI, Menninger said that Databricks would be wise to focus on LLM operations.

MLOps has been a focus for vendors such as Databricks, whose platforms enable traditional machine learning development. Now with LLM development gaining momentum, there should be a similar emphasis on LLM operations.

"There are still a lot of details to work through when operationalizing GenAI," Menninger said. "As an industry we're further along with MLOps because we've had more time and more experience. We need to apply the same level of discipline to LLMOps."

Farmer, meanwhile, noted that Databricks needs to continue improving the capabilities of its open source LLM.

"In the future, Databricks needs to continue advancing the performance, efficiency and accessibility of open source LLMs," Farmer said. "They can't now fall behind if they want to maintain competitiveness."

Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.
