AI’s Data Appetite Is Huge. That’s a Problem for Privacy Laws

July 24, 2024, 9:03 AM UTC

Generative AI’s voracious consumption of data is starting to run up against strict rules protecting individuals’ rights to data privacy in Europe and around the world.

Meta met swift scrutiny after it revised Facebook's European privacy policy in May. By June, privacy advocates had launched complaints in 11 countries against the company's plans, and the Data Protection Commission in Ireland, home of Meta's EU headquarters, opened an "engagement" with the company. The backlash prompted the social media giant to mothball its plans.

And earlier this month, authorities in Brazil blocked Meta from collecting data for AI training under a privacy law that resembles the EU’s. More scrutiny of AI’s privacy implications may even be on the horizon in the US, where California is contemplating stricter regulation of AI training.

In the EU and several countries with robust privacy laws, companies collecting personal data generally need to ask for individual consent first—or have a really good reason to collect that information.

Here’s what that means for generative AI companies: If there is personal information in the training data sets, then AI platforms need consent from all the individuals whose data is collected. Alternatively, companies can get around consent by relying on a principle called legitimate interest, a balancing test that weighs a company’s interests against the rights of individuals.

Now, regulators will have to decide whether to accept that argument from tech firms training large language models on troves of data.

“This is an AI issue, not a Facebook issue,” said Maria Villegas Bravo, a law fellow at the Electronic Privacy Information Center in Washington, D.C.

“You need a lawful basis for collecting data,” she added. “That’s what’s at issue now.”

Obtaining Consent

The EU’s sweeping privacy law, the General Data Protection Regulation, sets strict guidelines for how companies must collect, process, store, and transfer individuals’ information. Companies may process personal data only if they have consent or another lawful basis, such as the need to fulfill a contractual or legal obligation.

GDPR requires that individuals be asked for their consent in plain language, give it freely, and be able to withdraw it at any time. And unlike most US state privacy laws, consent under the GDPR is opt-in. That means not collecting the data is the default, compared with the US, where data is usually collected unless the user checks a box to object.

“Obtaining a consent in a GDPR-valid form is very, very complicated,” said Charles-Albert Helleputte, head of the EU Data Privacy, Cybersecurity & Digital Assets Practice at Squire Patton Boggs, based in Brussels and Paris.

Companies also need to be transparent about the purpose for processing the data when they ask for consent, Helleputte said. Simply disclosing that data will be used to train an AI model, for example, “probably won’t work, or at least you are at risk of that consent not being a valid one,” he added.

In a June 14 blog post, Meta said it had sent more than 2 billion in-app notifications and emails to European users since May 22 explaining its approach, and offered users the chance to submit an objection form that would stop their data from being used in model training.

If a form is submitted before training begins, “that person’s data won’t be used to train those models, either in the current training round or in the future,” Meta said in its blog post.

Noyb—European Center for Digital Rights, one of the EU’s leading privacy advocacy groups, filed complaints against Meta in 11 EU countries in June. The group called the company’s plans the “opposite” of GDPR compliance, and flagged the collection of roughly 400 million EU users’ data for “undefined” purposes.

Once data is in a company’s training set, users can’t get it removed, noyb said, violating the GDPR principle known as “the right to be forgotten.” Data subjects are also deprived of their right to opt in and are instead merely given the option to object, the complaints said.

‘Legitimate Interest’

Another lawful avenue to process personal data under GDPR is legitimate interest, which allows a company to use personal data from its users if it has a good enough reason to do so. If it’s going to rely on legitimate interest, the company must apply a proportionality test weighing the rights and freedoms of data subjects against the interests of the business.

In the blog post, Meta suggested that it’s no longer certain legitimate interest will cover AI training.

Meta said it needed to train on Europeans’ public posts, otherwise “models and the AI features they power won’t accurately understand important regional languages, cultures or trending topics on social media.”

Without including that “local information” in its model, Meta said it won’t launch its generative AI-powered assistant Meta AI in Europe, for now. In a statement to Bloomberg Law, Meta said it will also release its next generation of models over the coming months, but “not in the EU due to the unpredictable nature of the European regulatory environment.”

The legitimate interest provision has been tested numerous times by Big Tech, but how it will be applied to AI providers’ use of personal data for model training is unclear.

“When it comes to AI and when it comes to the idea that it’s being used to give a flavor to a certain product in a certain region, I cannot say that the legitimate interest has failed because that’s not true,” said Sara Susnjar, Paris-based partner at Winston & Strawn. “But it’s hard to advance the argument that that is a legitimate interest in itself.”

In a May report, the European Data Protection Board said that “adequate safeguards” to diminish the impact on data subjects could help tilt the balance in favor of companies collecting data. Those safeguards would include ensuring that certain data categories are not collected or that certain sources, “such as social media profiles,” are excluded from collection. The burden of proof for showing that these measures are effective lies with the company collecting data, the EDPB said in the non-binding guidance.

The legitimate interest basis has already failed to cover Meta’s practices around data and advertising at the EU’s highest court, Max Schrems, the Austrian privacy advocate and founder of noyb, said in a statement. “Yet the company is trying to use the same arguments for the training of undefined ‘AI technology.’”

Ultimately, all data collection for generative AI training likely violates GDPR in some way, Schrems said in an email. “There may be a ‘preliminary view’ by the regulators to allow the ingestion of (accidental) personal data under the ‘legitimate interest’ legal basis (so without consent),” he added. “But this can be questioned.”

Generative AI also raises problems under other GDPR principles, like giving individuals access to their information and allowing them to correct mistakes in the data, he added.

What’s Next?

Meta isn’t the first AI provider to bump up against the EU’s privacy law. Italy’s privacy authority is investigating OpenAI’s ChatGPT and text-to-video platform Sora for GDPR compliance. Officials are looking at the data Sora is trained on, especially whether sensitive data categories—such as information about someone’s genetics, health, or religious beliefs—are included.

Earlier this year, the Italian agency said its initial investigation into OpenAI’s chatbot found “breaches” of the GDPR.

And noyb has also filed a complaint against OpenAI under GDPR’s “right to rectification.” According to that complaint, the company said it couldn’t correct mistaken information about an individual in a ChatGPT answer.

In Brazil this month, privacy officials cited issues that echo those being discussed in the EU in pausing Meta’s AI training: inadequate grounds for processing personal data, lack of clear disclosure, limits on the rights of consumers, and the processing of children’s data.

In the US, the Federal Trade Commission is also eyeing data scraping as a competition issue, arguing that the companies building AI tools have so much access to data that it could raise antitrust concerns. Still, most litigation so far over scraped data in training sets remains in the intellectual property space.

And in the absence of a federal privacy law in the US, states—including California—are starting to look more closely at the privacy issues associated with AI.

Eventually, privacy law and generative AI will have a “reckoning,” Susnjar said. “The two will need to exist the same way laws and technology have existed since the beginning of time.”

To contact the reporters on this story: Isabel Gottlieb in New York at igottlieb@bloombergindustry.com; Cassandre Coyer in Washington at ccoyer@bloombergindustry.com

To contact the editors responsible for this story: Kartikay Mehrotra at kmehrotra@bloombergindustry.com; Gregory Henderson at ghenderson@bloombergindustry.com
