3 key generative AI data privacy and security concerns

Those charged with protecting and ensuring the privacy of user data are facing new challenges in the age of generative AI.

Mike Pedrick

Published: 08 Nov 2024

Even as generative AI captures society's interest, its implications remain very much in flux. Professionals, casual technology users, students and hundreds of other constituencies today use GenAI tools ranging from ChatGPT to Microsoft Copilot. Use cases run the gamut, from creating AI art to summarizing lengthy works.

The technology is proliferating at an alarming pace -- particularly for information security and privacy professionals whose focus is on data governance. Many such practitioners still hold GenAI at arm's length.

GenAI learns from data and has a voracious appetite. AI developers, backers and users are often all too eager to forklift heaping helpings of data into large language models (LLMs) to get unique and profound results from the platform.

Despite the benefits, this exposes three major generative AI data privacy and security concerns.

1. Who owns the data?

In the European Union, a core principle of GDPR is that data subjects retain rights over their personal data, regardless of who holds it. In the United States, however, despite a spate of state-level regulations modeled after GDPR, ownership remains a gray area. Possession of data is not the same as ownership, and while GenAI users are able to upload data into a model, that data may or may not belong to them. Such indiscretions with third-party data could expose the LLM provider to liability.

This is a new arena of litigation that remains to be explored, but hiding in the shadows is a mountain of prior intellectual property cases that might inform precedent. Major players in the tech space, including Slack, Reddit and LinkedIn, have all met significant resistance from consumers who faced the prospect of having their data used to train the companies' respective AI models.

2. What data can be derived from LLM output?

GenAI ostensibly lacks guile or duplicity; its purpose is to be helpful. Yet, given the right prompting, the data generated by a GenAI model can potentially be weaponized. Any information that has been submitted to an LLM could later surface in its output, which makes many people nervous about having their sensitive or critical information become a part of the model.

Data tokenization, anonymization and pseudonymization can effectively mitigate these risks, but they could also compromise the quality of the data used by the model. GenAI advocates stress that the accuracy and legitimacy of data, regardless of classification, are paramount. Without them, they say, current AI models aren't as effective as they could be.
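As a rough illustration of pseudonymization, the sketch below replaces email addresses in a prompt with keyed, repeatable tokens before the text is sent to an LLM. It is a minimal example under stated assumptions, not a complete privacy control: the function names, the token format and the hard-coded key are hypothetical, and the regular expression catches only common email formats.

```python
import hashlib
import hmac
import re

# Hypothetical key for illustration only; a real deployment would load
# this from a secrets manager rather than hard-coding it.
SECRET_KEY = b"replace-with-a-real-secret"

# Simplified pattern that matches common email addresses, not every valid one.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize_email(match: re.Match) -> str:
    """Swap one matched email address for a keyed, repeatable token."""
    normalized = match.group(0).lower().encode("utf-8")
    digest = hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()
    # The same address always yields the same token, so records can still
    # be correlated without exposing the address itself.
    return f"<email:{digest[:12]}>"


def scrub_prompt(text: str) -> str:
    """Return the text with every detected email address pseudonymized."""
    return EMAIL_RE.sub(pseudonymize_email, text)


prompt = "Summarize the complaint from jane.doe@example.com about billing."
print(scrub_prompt(prompt))
```

Because the token is derived from the address with a secret key, the model never sees the original value, yet repeated mentions of the same person remain linked. This also shows the trade-off described above: whatever the model could have inferred from the real address is no longer available to it.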

3. Can the output be trusted?

An interesting term has gained popularity with GenAI: hallucination. A hallucination is the all-too-frequent occurrence where a GenAI model makes up an answer that is completely wrong. Whether this is the result of poor training or good training with bad data -- "bad data" being an entire subcategory that sparks questions of intent -- GenAI is still early enough in its lifecycle that mistakes happen. Depending on the use case, the consequences of a hallucination can range from a minor inconvenience to a much more dangerous result.

Where GenAI gets its power

GenAI gets its power from information. But those who manage that information -- among them information security, consumer privacy and data governance practitioners -- must answer important questions that range from understanding who owns the data used to train LLMs to determining how the data is used by the model and who can extract it.

The generative AI data privacy and security stakes are high, and there is no meaningful opportunity to put the genie back in the bottle once intellectual property transgressions have occurred.

We are on the threshold of a bold, new world, and as has been seen throughout history, no such leap forward comes without some bumps.

Mike Pedrick is a vCISO and consultant, advisor, mentor and trainer. He has been on both sides of the IT, IS and GRC consulting/client table for more than 20 years.

