Researchers bust ChatGPT guardrails, question gen AI safety
Researchers find they can trick AI chatbots including OpenAI's ChatGPT, Anthropic's Claude and Google's Bard into providing disinformation, hate speech and other harmful content.
Researchers easily bypassed the filters meant to keep chatbots' generative AI models from spewing toxic content, adding to enterprises' concerns that ChatGPT and other large language models aren't safe to use.
Researchers from Carnegie Mellon University, the Center for AI Safety in San Francisco and the Bosch Center for AI found they could trick the LLMs underpinning chatbots into producing a nearly unlimited amount of disinformation, hate speech and other harmful content.
Examples of the harmful content the chatbots provided included how to build a bomb, steal someone's identity or defraud a charity.
The experimenters automated the process, proving it is possible to launch a virtually unlimited number of attacks against LLMs.
The technique applies to publicly available chatbots such as OpenAI's ChatGPT, Anthropic Claude and Google Bard, according to the researchers. For enterprises, the discovery reemphasized the threat of generative AI-powered chatbots.
"[Organizations] don't worry much about whether employees can find out how to make a bomb. But when they read this, it just emphasizes that you can't trust the output and that the guardrails can always be broken," Gartner analyst Avivah Litan said.
The researchers appended specific sequences of characters to chatbot queries, causing the systems to provide answers their guardrails should have blocked. The testers used the neural network weights of open source LLMs to choose the character strings most likely to succeed.
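The following Python sketch illustrates the general shape of such an automated search, not the researchers' published method: it greedily swaps single tokens in an appended suffix and keeps any swap that makes an open source model more likely to begin a forbidden answer. The model name, prompt, target string and random-swap heuristic are assumptions for illustration; the actual attack uses gradient information from the model weights to propose candidate swaps far more efficiently.

```python
# Illustrative sketch only -- a simplified, random-search stand-in for the
# gradient-guided suffix attack described in the research. Model, prompt,
# target and search budget are assumptions, not the authors' code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small open source causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Placeholder for a query the model should refuse."  # hypothetical query
target = "Sure, here is how"                                  # affirmative prefix the attack tries to elicit
suffix_ids = tok.encode(" ! ! ! ! ! !", add_special_tokens=False)  # start from a neutral suffix

def target_loss(suffix_ids):
    """Cross-entropy of the target continuation given prompt + adversarial suffix."""
    prompt_ids = tok.encode(prompt, add_special_tokens=False)
    target_ids = tok.encode(target, add_special_tokens=False)
    input_ids = torch.tensor([prompt_ids + suffix_ids + target_ids])
    labels = input_ids.clone()
    labels[:, : len(prompt_ids) + len(suffix_ids)] = -100  # score only the target tokens
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

best = target_loss(suffix_ids)
for step in range(50):  # tiny budget for illustration; the real search is far larger
    pos = torch.randint(len(suffix_ids), (1,)).item()
    cand = list(suffix_ids)
    cand[pos] = torch.randint(tok.vocab_size, (1,)).item()  # try a single-token swap
    loss = target_loss(cand)
    if loss < best:  # keep swaps that make the target answer more likely
        best, suffix_ids = loss, cand

print("adversarial suffix:", tok.decode(suffix_ids))
```

Because the suffix is optimized against openly available weights and then simply appended to a prompt, the same string can be pasted into other chatbots, which is how the researchers tested transfer to closed models.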
While other testers have used similar techniques, the latest research proved malicious actors could conduct automated attacks against the major LLM chatbot providers as well as open source models, including LLaMA-2-Chat, Pythia and Falcon.
"We find that the strings transfer to many closed-source, publicly available chatbots like ChatGPT, Bard, and Claude," the researchers wrote in their report. "This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion."
Tom Nolle, founder of and analyst at consulting firm Andover Intel, said he has spoken with many enterprises that are holding back on generative AI until vendors provide adequate data protection.
"I know of several dozen users who have looked at generative AI for operations, and none of them believed it was ready yet," Nolle said.
The researchers are uncertain whether the LLM providers can fully patch the flaw. Analogous adversarial attacks have been used against computer vision systems for the past 10 years. Computer vision uses AI to extract information from digital images and video.
"It is possible that the very nature of deep learning models makes such threats inevitable," the researchers wrote. "Thus, we believe that these considerations should be taken into account as we increase usage and reliance on such AI models."
The testers provided their research results to Anthropic, Google and OpenAI.
Enterprises are wary of generative AI
Enterprises have grown less confident in the safety of generative AI chatbots despite the attractiveness of their fast, human-like responses to natural language questions, Litan said. Many organizations are figuring out how to safely use the technology with their data while protecting intellectual property.
"If you share data with an unknown LLM algorithm, then, how do you know that the algorithm is not stealing your data?" Litan said.
Factions within companies are often split on using generative AI applications, even from incumbent vendors like Microsoft, which recently introduced AI-supported Copilot. Businesspeople within a company want to go full steam ahead with the new technologies, while IT security pros want time to deploy guardrails.
Companies are investigating content engineering techniques to restrict chatbot responses to their own data, Litan said. Microsoft Azure, for example, has an API that lets companies control the responses generated from their data, and vendors including AWS, Cohere and Salesforce have also provided customers with data controls.
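One common pattern behind such controls, sketched below in Python, is to ground the model's answer in documents the company supplies and instruct it to refuse anything outside them. The function name, prompt wording and sample document are assumptions for illustration, not any vendor's actual API.

```python
# Illustrative sketch only: restricting a chatbot's answers to enterprise-supplied
# documents. Names and prompt text are hypothetical, not a specific vendor's API.
def build_grounded_prompt(question: str, documents: list[str]) -> str:
    """Assemble a prompt that tells the model to answer only from the supplied text."""
    context = "\n\n".join(documents)
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Example usage with a placeholder company document
docs = ["Policy: refunds are issued within 30 days of purchase."]
print(build_grounded_prompt("What is the refund window?", docs))
```

The prompt produced this way is then sent to whichever model the company uses, keeping the model's responses anchored to approved internal content rather than its open-ended training data.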
Antone Gonsalves is networking news director for TechTarget Editorial. He has deep and wide experience in tech journalism. Since the mid-1990s, he has worked for UBM's InformationWeek, TechWeb and Computer Reseller News. He has also written for Ziff Davis' PC Week, IDG's CSOonline and IBTMedia's CruxialCIO and rounded all of that out by covering startups for Bloomberg News. He started his journalism career at United Press International, working as a reporter and editor in California, Texas, Kansas and Florida. Have a news tip? Please drop him an email.