From Studio Ghibli to Reddit: Who’s Fighting AI Privacy Concerns?

April 8, 2025

AI isn’t just creating; it’s collecting. 

Everything we’ve ever posted, painted, written, or said is up for grabs. As a result, the debate around AI privacy concerns is heating up, with severe backlash against the tech using people’s creative work without permission. 

From indie artists to global newsrooms, creators across industries are discovering that their work has been scraped and fed into AI systems, often without consent (think AI-generated Studio Ghibli images flooding the internet).

In some cases, the bots quote artists and creators; in others, they mimic them. The consequence is a wave of lawsuits, licensing battles, and digital defenses. 

The message is clear: people want more control over how AI uses their data, identity, and creativity.

The AI privacy concern: why the pushback?

Behind every large language model (LLM) or AI image generator is a massive, often opaque dataset. These models are trained on books, blogs, artwork, forum threads, song lyrics, and even voices, usually scraped without notice or consent. 

The conversation has shifted from philosophical musings to a concrete battle over who owns and controls the internet’s vast archive of knowledge, culture, and creativity.

Do AI systems deserve unrestricted access without permission? Until recently, training AI on publicly available data was treated like fair game. But that assumption is starting to collapse under legal, ethical, and economic pressure.

Here’s what’s driving the shift:

  • Economic survival: When AI tools repackage your content, it can eat into your audience, traffic, and revenue model.
  • Legal uncertainty: Courts are considering whether training AI on copyrighted content qualifies as “fair use,” but no broad legal consensus has emerged. Many companies act preemptively — striking licensing deals or changing data practices as legal risks grow.
  • Ethical clarity: Creators and brands are drawing boundaries on principle; just because content is public doesn’t mean it’s free to use.
  • Future precedent: Today’s decisions could shape licensing models, platform policies, and how AI companies engage with data owners long-term.

The scale is so large that even non-personal data becomes sensitive. What feels like open data often contains elements of personal identity, creative ownership, or emotional labor, especially when aggregated or mimicked.

Some companies are reacting to specific harm, like revenue loss or content mimicry. Others are taking a stand to protect creative ownership and set new norms.

14 real-world AI privacy concerns from creators, publishers, and platforms

| Entity | AI privacy concern | Type of pushback | Summary |
|---|---|---|---|
| Studio Ghibli | Style mimicry and visual IP used by AI generators | Public condemnation | Fans and creatives condemned Ghibli-style AI imagery; the studio itself has not issued a formal statement or pursued legal action. |
| Reddit | Data scraping of user-generated content | API restriction | Reddit restricted API access and signed a licensing deal with Google to control how AI companies access and use its data. |
| Stack Overflow | Unlicensed reuse of community answers | Legal threat + API monetization | Stack Overflow issued legal warnings and began charging AI companies for access to its data following unauthorized use. |
| Getty Images | Use of copyrighted photos in training datasets | Lawsuit + licensed dataset | Getty Images sued Stability AI for using millions of its photos without permission and launched a licensed dataset for ethical AI training. |
| YouTube creators | AI-generated impersonations using creator voices | Takedowns + platform advocacy | YouTube creators issued takedown requests and called for better platform policies after AI tools mimicked their voices without consent. |
| Medium | Use of blog content in AI tools | AI crawler block | Medium quietly blocked AI bots from scraping its blog content by updating its robots.txt file. |
| Tumblr | AI scraping of user-created content | AI crawler block | Tumblr blocked AI bots from accessing its site to protect user-generated content from being scraped for training. |
| News publishers | Unauthorized scraping of journalism by AI bots | Technical restrictions | Major newsrooms like CNN, Reuters, and The Washington Post updated their robots.txt files to block OpenAI’s GPTBot and other AI scrapers, rejecting unlicensed use of their content for model training. |
| Anthropic | Use of copyrighted books to train language models | Lawsuit | Authors filed a class-action lawsuit accusing Anthropic of using pirated copies of their books to train Claude without permission or compensation. |
| Clearview AI | Unauthorized scraping of biometric facial data | Class-action settlement | Clearview settled a class-action suit over facial recognition scraping, accepting court-supervised restrictions on private-sector use but no financial payouts. |
| Cohere | Scraping and training on copyrighted journalism | Lawsuit | Condé Nast, Vox, and The Atlantic sued Cohere for scraping thousands of articles without permission to train its AI models, bypassing attribution and licensing. |
| Common Crawl | Large-scale data scraping without consent | Public criticism + site blocks | Several publishers blocked Common Crawl’s web scraper and criticized the use of its datasets in AI training without consent. |
| OpenAI | Lack of rollback or control over scraped content | Community + publisher backlash | OpenAI faced backlash for unclear opt-out policies and continued use of data scraped before opt-out tools were introduced. |
| Stability AI | Mass scraping of unlicensed data across the web | Multiple lawsuits | Several artists have sued Stability AI for unauthorized use of copyrighted or sensitive content in training data. |

Top 3 risks of letting AI scrape your content

  • Loss of IP control: Once AI tools ingest your content, it can be reused, remixed, or monetized without attribution. This undermines your ownership and creative rights.
  • Brand dilution and misinformation: AI-generated outputs can echo your content without context or accuracy, risking brand misrepresentation or factual distortions tied to your name.
  • Regulatory and legal exposure: If user data or copyrighted material is unintentionally exposed through AI scraping, your business could face compliance violations under laws like GDPR or CCPA.

Drawing the line: who’s saying no to AI?

Many creators, studios, and companies have stepped forward, signaling clearly that their content is off-limits to AI training.

1. Studio Ghibli doesn’t want its magic fed to the machines

  • Industry: Film/Animation
  • AI privacy concern: Unauthorized use of animation style in AI-generated art
  • Response: Public rejection of AI tools
  • Status: No formal statement or legal action from the studio; opposition has come through Miyazaki’s past remarks and fan backlash.

Studio Ghibli hasn’t officially weighed in, but the internet made the issue loud and clear. After Ghibli-style AI art began spreading online, much of it created with models trained on the studio’s iconic frames and palettes, fans and creatives pushed back, calling the mimicry exploitative.

Footage from a 2016 documentary captured founder Hayao Miyazaki’s reaction to AI-generated animation: “I can’t watch this stuff and find it interesting. Whoever creates this stuff has no idea what pain is whatsoever. I am utterly disgusted.”

In other interviews, Ghibli executives emphasized that animation should remain a human craft, defined by intention, emotion, and cultural storytelling — not algorithmic mimicry. It wasn’t a lawsuit, but the message was firm: their work is not raw material for machine learning.

While the studio hasn’t taken legal action or made a public statement about AI, the growing resistance around its visual legacy reflects something deeper: art made with memory and meaning doesn’t translate cleanly into machine learning. Not everything beautiful wants to be automated.

2. Reddit locks the gates and puts a price on the keys

  • Industry: Social media/forums
  • AI privacy concern: Commercial AI use of user-generated content
  • Response: API restrictions and licensing stance
  • Status: API access is restricted, and the company is under FTC review for its data licensing deals.

After years of AI companies quietly training models on Reddit’s massive archive of user discussions, the platform drew a line. It announced sweeping changes to its application programming interface (API), introducing steep fees for high-volume data access, primarily aimed at AI developers.

CEO Steve Huffman framed the change as a matter of fairness: Reddit’s conversations are valuable, and companies shouldn’t be allowed to extract insights without compensation. After the shift, Reddit reportedly signed a $60 million per year licensing deal with Google, formalizing access on its own terms.

The shift reflects a broader trend: public platforms treat their data like inventory, not just traffic.

3. Stack Overflow cuts off free answers from feeding the bots

  • Industry: Developer communities
  • AI privacy concern: Use of crowdsourced answers in AI training
  • Response: Policy change and legal action
  • Status: Now charges AI companies for access and has signed a licensing deal with Google.

Stack Overflow, a G2 customer, changed its API policies and now charges AI developers for access to its community-generated programming knowledge. The platform, long regarded as a free knowledge base for developers, found itself unwillingly contributing to the AI boom. 

As tools like ChatGPT and GitHub Copilot began to surface answers that resembled Stack Overflow posts, the company responded with new policies blocking unlicensed data use.

Stack Overflow has restricted and monetized API access and partnered with OpenAI in 2024 to license its data for responsible AI use. It has also introduced a Responsible AI policy, allowing ChatGPT to pull from trusted developer responses while giving proper credit and context.

The issue wasn’t just unauthorized use — it was a breakdown of the trust that fuels open communities. Developers who answered questions to help each other weren’t signing up to train commercial tools that might eventually replace them.

This tension between open knowledge and commercial use is now at the heart of many AI privacy concerns.

4. Getty Images sues Stability AI: you can’t remix watermarks

  • Industry: Visual media/stock photography
  • AI privacy concern: Copyrighted images used in AI training
  • Response: Lawsuit against Stability AI
  • Status: The UK court has allowed the lawsuit to move forward.

Getty Images took legal action against Stability AI, accusing it of copying and using over 12 million copyrighted images, including many with visible watermarks, to train its image generation model, Stable Diffusion.

The lawsuit highlighted a core problem in generative AI: models trained on unlicensed content can reproduce styles, subjects, and ownership marks. Getty didn’t stop at litigation; it partnered with NVIDIA to launch a licensed, opt-in dataset for responsible AI training.

The lawsuit isn’t just about lost revenue. If successful, it could set a precedent for how visual IP is treated in machine learning.

5. YouTube creators say, "That’s not me, but it sounds like me."

  • Industry: Video content/influencers
  • AI privacy concern: Voice cloning and script mimicry from AI models
  • Response: Takedowns, disclosures, and community backlash
  • Status: Creators continue filing takedowns and calling for stronger AI impersonation policies.

YouTube creators began sounding the alarm after discovering AI-generated videos that used cloned versions of their voices, sometimes promoting scams, sometimes parodying them with eerily accurate tone and delivery. 

In some cases, AI models had been trained on hours of content without permission, using public-facing videos as voice datasets.

The creators responded with takedown requests and warning videos, pushing for stronger platform policies and clearer consent mechanisms. While YouTube now requires disclosures for AI-generated political content, broader guardrails for impersonation remain inconsistent.

For influencers who built their brands on personal voice and authenticity, hijacking that voice without consent isn’t just a copyright issue but a breach of trust with their audiences.

6. Medium draws a line on AI’s reading list

  • Industry: Publishing platform
  • AI privacy concern: Use of blog content in AI training datasets
  • Response: Updated robots.txt to block AI scrapers
  • Status: Silently updated robots.txt to block AI crawlers from accessing blog content.

Medium responded to increasing concerns from its writers, many of whom suspected their essays and personal reflections were showing up in generative AI outputs. Without fanfare, Medium updated its robots.txt file to block AI crawlers, including OpenAI’s GPTBot.

While it didn’t launch a PR campaign, the platform’s move reflects a growing trend of content platforms protecting their contributors by default. It’s a soft but significant stance: writers shouldn’t have to worry about their most vulnerable stories becoming raw material for the next chatbot’s training run.

7. Tumblr users get protection from AI bots

  • Industry: Blogging/creative content
  • AI privacy concern: Use of user-generated posts and artwork in AI training
  • Response: Implemented AI crawler opt-outs
  • Status: Added technical blocks to keep AI crawlers away from user-generated content.

Tumblr has long been a home for fandoms, indie artists, and niche bloggers. As generative AI tools began to mine internet culture for tone and aesthetics, Tumblr’s user base raised concerns that their posts were being harvested for training without their knowledge.

The company updated its robots.txt file to block crawlers linked to AI projects, including GPTBot. There was no press release or platform-wide announcement; it was just a technical update that showed Tumblr was listening.

It may not have stopped every model already trained on old data, but the message was clear: the site’s creative archive isn’t up for the taking.

8. News publishers block GPTBot in a quiet but coordinated revolt

  • Industry: News media
  • AI privacy concern: Unauthorized data scraping by AI companies
  • Response: Technical blocks and policy shifts across major outlets
  • Status: Most major U.S. outlets now block AI bots via robots.txt

Some of the world’s most trusted newsrooms quietly pulled the plug on OpenAI’s GPTBot and other AI web crawlers without a single press release. From The Washington Post to CNN and Reuters, major outlets added a few decisive lines of code to their robots.txt files, effectively telling AI companies: “You can’t train on this.”

It wasn’t about server strain or traffic. It was about control over the stories, the sources, and the trust that makes journalism work. The quiet revolt spread quickly: by early 2024, nearly 80% of top U.S. publishers had blocked OpenAI’s data collection tools.

This wasn’t just a protest. It was a hard stop — served cold, in plaintext. When AI companies treat journalism like free training material, publishers increasingly treat their sites like gated archives. Adding friction might be the only way to protect the original in a world of auto-summarized headlines and AI-generated copycats.

You've been served: AI companies facing legal action

Some AI companies have landed in hot water, facing cases that challenge how they collect, use, and handle data.

9. Anthropic sued for feeding pirated books to Claude

  • Industry: Artificial intelligence
  • AI privacy concern: Use of copyrighted books in AI training
  • Response: Lawsuit filed by authors; Anthropic moved to dismiss
  • Status: The case is ongoing, with Anthropic moving for summary judgment

A group of authors, including Andrea Bartz and Charles Graeber, say their books were used without consent to train Claude, Anthropic’s large language model. They didn’t opt in or get paid, and now they’re suing.

The lawsuit alleges that Anthropic fed copyrighted novels into its training pipeline, turning full-length books into raw material for a chatbot. The authors argue that this isn’t innovation — it’s appropriation. Their words weren’t just referenced; they were ingested, abstracted, and potentially regurgitated without credit.

Anthropic, for its part, claims fair use. The company says its AI transforms the content to create something new. But the writers pushing back say the transformation isn’t the point — the lack of consent is.

As this case heads to court, it tests whether creators get a say before their work becomes machine fodder. For many authors, the answer needs to be yes.

10. Clearview AI’s selfie scraping ends in court control

  • Industry: Facial recognition technology
  • AI privacy concern: Scraping billions of facial images without consent
  • Response: Class-action lawsuit and court settlement
  • Status: Settlement approved March 2025.

Your face isn’t free training data.

A group of U.S. plaintiffs sued Clearview AI after discovering the company had scraped billions of publicly available photos, including selfies, school pictures, and social media posts, to build a massive facial recognition database. The catch? No one gave permission.

The class-action lawsuit alleged that Clearview violated biometric privacy laws by harvesting identities without consent or compensation. In March 2025, a federal judge approved a unique settlement: instead of monetary damages, Clearview agreed to stop selling access to most private entities and implement guardrails under court supervision.

While the settlement didn’t write checks, it did set a precedent. The case marks one of the first large-scale wins for people who never opted into AI training but had their faces taken anyway.

11. Cohere sued for turning journalism into training fodder

  • Industry: AI/LLM
  • AI privacy concern: Scraping and training on journalism without licenses
  • Response: Lawsuit filed February 2025 by major publishers
  • Status: Proceedings ongoing

A squad of publishers, including Condé Nast, The Atlantic, and Vox Media, sued Cohere for quietly scraping thousands of their articles to train its LLMs. The problem? These weren’t open blog posts. They were paywalled, licensed, and built on decades of editorial infrastructure.

The lawsuit says Cohere not only ingested the content but now enables AI tools to summarize or remix it without attribution, payment, or even a click back to the source. For journalism that’s already battling AI-generated noise, this felt like a line crossed.

The gloves are off: publishers aren’t just protecting revenue — they’re protecting the chain of credit behind every byline.

12. Common Crawl’s open dataset gets shut out by publishers

  • Industry: Data repository/web scraping
  • AI privacy concern: Datasets used in AI training without the consent of site owners
  • Response: Growing criticism and site blocks
  • Status: Blocked by multiple publishers for enabling AI scraping without consent

Common Crawl is a nonprofit that’s quietly shaped the modern AI boom. Its petabyte-scale web archive powers training datasets for OpenAI, Meta, Stability AI, and countless others. But that broad scraping comes with baggage: many sites in the dataset never consented, and some are paywalled, copyrighted, or personal in nature.

Publishers have started fighting back. Sites like Medium, Quora, and The New York Times have blocked Common Crawl’s crawler, CCBot, and others are now auditing datasets to see whether their content was included.

What was once a data scientist’s dream has become a flashpoint for ethical AI development. The age of “just crawl it and see what happens” may be coming to an end.

13. OpenAI’s opt-out sparks backlash: consent doesn’t come later

  • Industry: AI development
  • AI privacy concern: Confusing or ineffective opt-out mechanisms
  • Response: Backlash from publishers and web admins
  • Status: Opt-out is available but criticized for not addressing past scraped content.

OpenAI introduced a way for websites to block GPTBot, its data crawler, through a robots.txt file. But for many site owners and content creators, the damage had already been done: their content was scraped before the opt-out existed, and there’s no explicit rollback of past training data.

Some publishers called the move “too little, too late,” while others criticized the lack of transparency around whether their data was still being used in retrained models.

The backlash made one thing clear: in AI, consent after the fact doesn’t feel like consent at all.

14. Stability AI faces heat for building on scraped creativity

  • Industry: AI model development
  • AI privacy concern: Use of unlicensed internet data in training
  • Response: Multiple lawsuits and public criticism
  • Status: Facing ongoing lawsuits from artists and media companies over training data use.

Getty Images wasn’t alone. Stability AI’s strategy of training powerful models like Stable Diffusion on openly available web data has drawn sharp criticism from artists, platforms, and copyright holders. The company claims it operates under fair use, though lawsuits from illustrators and developers allege otherwise.

Many argue that Stability AI benefited from scraping creative work without consent, only to build tools that now compete directly with the original creators. Others point to the lack of transparency around what content was used and how.

A company built on the ideals of open access now finds itself at the center of one of the most urgent questions in AI: can you build tools on top of the internet without asking permission?

Technical barriers: how companies are blocking AI scraping

Some aren’t waiting for the courts; they’re already building technical walls. As AI crawlers scour the web for training data, more platforms are deploying code-based defenses to control who gets access and how.

Here’s how companies are locking the gates:

Robots.txt + user-agent blocking

A robots.txt file is a behind-the-scenes directive that tells crawlers which parts of a site they may access. Platforms like Medium, Tumblr, and CNN have updated these files to block AI bots (e.g., GPTBot) from accessing their content.

Example:

User-agent: GPTBot
Disallow: /

These two lines tell a compliant crawler like GPTBot to stay away from the entire site. The catch: robots.txt is honor-system, so it only stops bots that choose to respect it.
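
In practice, sites that opt out usually block several AI crawlers at once. Here’s a minimal sketch; GPTBot (OpenAI), CCBot (Common Crawl), and Google-Extended (Google’s AI-training opt-out token) are real user agents, but which ones you block is a policy choice, not a given:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Because robots.txt relies on voluntary compliance, some sites enforce the rule at the server instead. A hypothetical nginx rule, placed inside a server block, might look like this:

if ($http_user_agent ~* "(GPTBot|CCBot|ClaudeBot)") {
    return 403;
}

The return 403 response refuses the request outright, whether or not the bot ever reads robots.txt.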

API restrictions

Sites like Reddit and Stack Overflow began charging for API access, especially when usage spikes came from AI companies. This has throttled large-scale data extraction and made it easier to enforce licensing terms.

Licensing language changes

Some companies, including Stack Overflow and news publishers, are rewriting their terms of service to ban AI training unless a license is granted explicitly. These updates act as legal guardrails, even before litigation begins.

Opt-out metadata and HTTP headers

Tools like DeviantArt’s “NoAI” tag and opt-out metadata allow creators to flag their content as off-limits. While not always honored, these flags are gaining traction as standard signals in the AI ethics playbook.
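
As an illustration, the “noai” convention that DeviantArt popularized can be expressed in page markup or in a response header. A minimal sketch of both forms:

<meta name="robots" content="noai, noimageai">

X-Robots-Tag: noai

Like robots.txt, these are requests rather than enforcement, but they put a site’s intent on the record.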

How to audit your website for AI data exposure

Want to know if your content is vulnerable? Start here:

  • Check access logs: Are there AI crawlers like GPTBot, CCBot, or ClaudeBot in your traffic? (A quick scan script follows this list.)
  • Review your robots.txt file: Is it blocking known AI scrapers?
  • Scan your content metadata: Do you have NoAI tags or opt-out headers?
  • Inspect your API: Who’s using it, and are they scraping at scale?
  • Consider a license audit: Is your usage policy updated for the AI era?
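
For the first check, a short script goes a long way. Below is a minimal Python sketch that tallies access-log requests from known AI crawlers; the log path (access.log) and the bot list are assumptions to adapt for your own setup:

import re
from collections import Counter

# Illustrative subset of known AI crawler user agents; extend as needed.
AI_BOTS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended", "anthropic-ai"]
BOT_PATTERN = re.compile("|".join(re.escape(bot) for bot in AI_BOTS))

def count_ai_bot_hits(log_path: str) -> Counter:
    """Count access-log lines whose user-agent string names an AI crawler."""
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = BOT_PATTERN.search(line)
            if match:
                hits[match.group(0)] += 1
    return hits

if __name__ == "__main__":
    for bot, count in count_ai_bot_hits("access.log").most_common():
        print(f"{bot}: {count} requests")

Steady GPTBot or CCBot traffic paired with a robots.txt that never mentions them is the clearest sign your content is being collected without an opt-out in place.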

404: permission not found

What started as a quiet concern among artists and journalists has become a global push for AI accountability. The question isn’t whether AI can learn from the internet but whether it should learn without asking.

Some are taking the legal route. Others are rewriting contracts, updating headers, or blocking bots outright. 

Either way, the message is the same: creators want a say in how their work trains future machines. And they’re not waiting for permission to say no.

The real question is: can we build AI that doesn’t bulldoze over fundamental rights? Read about the ethics of AI to learn more.

