Devs get first look at next Google Gemini model
Google Gemini 2.0 Flash makes its first appearance in Google developer tools, boasting updated multimodal features and fresh training for agentic workflows.
Users of Google developer tools will get the first glimpse of the cloud provider's next large language model, Gemini 2.0, as an experimental Flash version appears this month in Google AI Studio and the Gemini API.
Gemini 2.0 Flash will have three important new capabilities: additional output modalities, including audio and images; native tool usage, meaning the model understands which tools are beneficial to use when and how to use them in a multi-stage workflow; and a new multimodal bidirectional API that can respond immediately with audio or text output to audio or video prompts.
An experimental version of Gemini 2.0 Flash will be available in Google AI Studio and via the Gemini API next week. If Gemini 2.0 follows a similar roadmap to Gemini 1.0 and 1.5, a higher-scale Pro version will eventually follow. The new Gemini model already outperforms Gemini 1.5 Flash and Pro on standard benchmarks for AI accuracy, at twice the speed, according to Google officials.
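For developers who want to kick the tires, a first call could look roughly like the sketch below. It assumes Google's google-genai Python SDK and the experimental model identifier gemini-2.0-flash-exp, neither of which is specified in this article.

```python
# Minimal sketch, assuming the google-genai Python SDK ("pip install google-genai")
# and the experimental model ID "gemini-2.0-flash-exp" -- treat both as placeholders.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # key generated in Google AI Studio

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Summarize the trade-offs between a Flash-class and a Pro-class model.",
)
print(response.text)
```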
One user of Google's Vertex AI Studio, which will get access to Gemini 2.0 Flash via the Gemini API, said he's interested in testing out the new features, especially since they're becoming available in the manageably sized Flash version.
"What stands out is the emphasis on their Flash model, which is efficient and fast," said David Strauss, CTO at WebOps service provider Pantheon. "Most industry announcements focus on frontier models, which are great for showing the limits of AI capability but are inefficient to run at scale."
Google officials declined to disclose pricing for the new model. If Gemini 2.0 follows Gemini 1.5's development, Gemini 2.0 Flash would eventually become available as part of Google's free AI offerings, Strauss said.
Audio and image output from the new Gemini model, whose predecessors supported only text responses, will allow developers to create new AI-driven application interfaces, such as voice-enabled assistants with visual aids. The model will also be able to generate a mix of text and images in response to voice commands. These outputs will be "steerable," meaning developers can build on and refine outputs using conversational natural language.
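In practice, mixed text-and-image responses would surface to the developer as an ordered list of parts to render in sequence. The sketch below is a rough illustration only: the google-genai SDK, the gemini-2.0-flash-exp model ID and the response_modalities option are assumptions not drawn from this article, and image output may be gated while the model is experimental.

```python
# Hedged sketch of requesting interleaved text-and-image output.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Write a short pancake recipe with an illustrative image for each step.",
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# The reply comes back as ordered parts, each either text or inline image data,
# which an application can render interleaved.
images_saved = 0
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    elif part.inline_data:
        images_saved += 1
        with open(f"step_{images_saved}.png", "wb") as f:
            f.write(part.inline_data.data)
```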
Google product managers demonstrated the new Gemini model during a press briefing this week, conducting short audio and video conversations with the model to prompt various audio and text responses, including the text of a recipe with embedded AI-generated images.
Multimodal support within a single large language model (LLM) is still relatively rare, said Andy Thurai, an analyst at Constellation Research.
"The same model can understand text, code, audio, etc., and output different modalities based on need" with Gemini 2.0 Flash, he said. "Most other offerings do model switching based on need. While that is not a big deal, as a model gateway can … route requests to an appropriate model, for enterprises that strictly use one vetted model, this can be useful."
'Natively agentic'
In addition to Google AI Studio and the Gemini API, the experimental version of Gemini 2.0 Flash will be supported immediately in non-mobile versions of Google's Gemini AI assistant. Later on, the new Gemini model will become available in various beta-stage Google agents due in 2025, including the Project Astra multimodal agent, Project Mariner research prototype and Colab data science agent.
Developers might be most interested in another early agent project that will soon support the new Gemini model: Jules, an asynchronous coding agent.
"Gemini 2.0 Flash has been natively trained to see code as the language with which you chain together multiple tools and take complex actions, and that makes it overall more powerful," for code-driven agentic workflows, said Shrestha Basu Mallick, group product manager for Gemini API and AI Studio at Google, during the briefing.
Gemini 2.0's native tool usage feature makes it "natively agentic," according to product managers during the press briefing. This refers to agentic AI, in which multiple AI-driven agents take action on multi-stage workflows. Agentic AI is widely considered the next big trend in generative AI applications, and other vendors, including AWS, Microsoft, Atlassian and Salesforce, have rolled out support for AI agents over the last six months.
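To make the native tool usage idea concrete, the sketch below shows a single tool exposed to the model through function calling. It assumes the google-genai Python SDK and the gemini-2.0-flash-exp model ID, and the order_status function is a hypothetical stand-in; none of these details comes from the article.

```python
# Hedged sketch of tool use via function calling.
from google import genai
from google.genai import types

def order_status(order_id: str) -> dict:
    """Look up the shipping status of an order (stand-in for a real backend call)."""
    return {"order_id": order_id, "status": "shipped", "eta_days": 2}

client = genai.Client(api_key="YOUR_API_KEY")

# Passing a plain Python function as a tool lets the model decide when to call it;
# the SDK runs the call and feeds the result back before the final answer.
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Where is order A-1234 and when will it arrive?",
    config=types.GenerateContentConfig(tools=[order_status]),
)
print(response.text)
```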
However, not all AI agents are created equal, according to one industry analyst. Gemini 2.0's training in tool use driven by code prompts stands out from other approaches that merely string together multiple chatbots, for example, said Chirag Dekate, an analyst at Gartner.
"There is a lot of agent-washing in the industry today," Dekate said. "Gemini now raises the bar on frontier models that enable native multimodality, extremely large context and multistage workflow capabilities."
Still, as AI workflows grow more complex, it will take time for enterprises to trust them. In addition to ongoing concerns about the accuracy and quality of LLM responses, the interactions between AI agents create a new cybersecurity attack surface that isn't well understood yet.
"I would trust an agentic system that formulates prompts into proposed, structured actions, subject to review and approval," Strauss said. "My standard for professional use continues to be a test: 'Would I trust an intern to do this?' If the guardrails around the work would be sufficient to trust an intern, then I would trust an AI with the same responsibilities."
Beth Pariseau, senior news writer for TechTarget Editorial, is an award-winning veteran of IT journalism covering DevOps. Have a tip? Email her or reach out @PariseauTT.