by Washija Kazim / April 8, 2025
AI isn’t just creating; it’s collecting.
Everything we’ve ever posted, painted, written, or said is up for grabs. As a result, the debate around AI privacy concerns is heating up, with severe backlash against the tech using people’s creative work without permission.
Generative AI contributes to privacy concerns by replicating personal data, enabling identity spoofing, and leaking sensitive training information. AI models trained on public or scraped data may unintentionally memorize and reproduce private details. This raises risks of data misuse, non-consensual content generation, and regulatory violations.
From indie artists to global newsrooms, creators across industries are discovering that their work has been scraped and fed into AI systems, often without consent (think AI-generated Studio Ghibli images flooding the internet).
In some cases, the bots quote artists and creators; in others, they mimic them. The consequence is a wave of lawsuits, licensing battles, and digital defenses.
The message is clear: people want more control over how AI uses their data, identity, and creativity.
Behind every large language model (LLM) or AI image generator is a massive, often opaque dataset. These models are trained on books, blogs, artwork, forum threads, song lyrics, and even voices, usually scraped without notice or consent.
The conversation has shifted from philosophical musings to a concrete battle over who owns and controls the internet’s vast store of knowledge, culture, and creativity.
Do AI systems deserve unrestricted access without permission? Until recently, training AI on publicly available data was treated like fair game. But that assumption is starting to collapse under legal, ethical, and economic pressure.
Here’s what’s driving the shift:
The scale is so large that even non-personal data becomes sensitive. What feels like open data often contains elements of personal identity, creative ownership, or emotional labor, especially when aggregated or mimicked.
Some companies are reacting to specific harm, like revenue loss or content mimicry. Others are taking a stand to protect creative ownership and set new norms.
Entity | AI privacy concern | Type of pushback | Summary
Studio Ghibli | Style mimicry and visual IP used by AI generators | Public condemnation | Studio Ghibli’s leadership has publicly criticized AI-generated animation, and fans have condemned imitations of its style, but the studio has not pursued legal action.
Reddit | Data scraping of user-generated content | API restriction | Reddit restricted API access and signed a licensing deal with Google to control how AI companies access and use its data.
Stack Overflow | Unlicensed reuse of community answers | Legal threat + API monetization | Stack Overflow issued legal warnings and began charging AI companies for data access following unauthorized use.
Getty Images | Use of copyrighted photos in training datasets | Lawsuit + licensed dataset | Getty Images sued Stability AI for using millions of its photos without permission and launched a licensed dataset for ethical AI training.
YouTube creators | AI-generated impersonations using creator voices | Takedowns + platform advocacy | YouTube creators issued takedown requests and called for better platform policies after AI tools mimicked their voices without consent.
Medium | Use of blog content in AI tools | AI crawler block | Medium quietly blocked AI bots from scraping its blog content by updating its robots.txt file.
Tumblr | AI scraping of user-created content | AI crawler block | Tumblr blocked AI bots from accessing its site to protect user-generated content from being scraped for training.
News publishers | Unauthorized scraping of journalism by AI bots | Technical restrictions | Major newsrooms like CNN, Reuters, and The Washington Post updated their robots.txt files to block OpenAI’s GPTBot and other AI scrapers, rejecting unlicensed use of their content for model training.
Anthropic | Use of copyrighted books to train language models | Lawsuit | Authors filed a class-action lawsuit accusing Anthropic of using pirated versions of their books to train Claude without permission or compensation.
Clearview AI | Unauthorized scraping of biometric facial data | Class-action lawsuit settlement | Clearview faced a class-action suit over facial recognition scraping; it settled with restrictions on private use and court oversight but no financial payouts.
Cohere | Scraping and training on copyrighted journalism | Lawsuit | Condé Nast, Vox, and The Atlantic sued Cohere for scraping thousands of articles without permission to train its AI models, bypassing attribution and licensing.
Common Crawl | Large-scale data scraping without consent | Public criticism + site blocks | Several publishers and sites blocked Common Crawl’s web scraper and criticized the use of its datasets for AI training without consent.
OpenAI | Lack of rollback or control over scraped content | Community + publisher backlash | OpenAI faced backlash for unclear opt-out policies and continued use of data scraped before opt-out tools were introduced.
Stability AI | Mass scraping of unlicensed data across the web | Multiple lawsuits | Several artists have sued Stability AI for unauthorized use of copyrighted or sensitive content in training data.
Many creators, studios, and companies have stepped forward to make clear that their content is off-limits for AI training, drawing firm boundaries.
Studio Ghibli hasn’t issued a formal statement about AI training, but the internet made the issue loud and clear. After Ghibli-style AI art began spreading online, much of it created with models trained on the studio’s iconic frames and palettes, fans and creatives pushed back, calling the mimicry exploitative.
Footage from a 2016 documentary featuring founder Hayao Miyazaki showed his stance on AI-generated 3D animation: “I can’t watch this stuff and find it interesting. Whoever creates this stuff has no idea what pain is whatsoever. I am utterly disgusted.”
In other interviews, Ghibli executives emphasized that animation should remain a human craft, defined by intention, emotion, and cultural storytelling — not algorithmic mimicry. It wasn’t a lawsuit, but the message was firm: their work is not raw material for machine learning.
While the studio hasn’t taken legal action, the growing resistance around its visual legacy reflects something deeper: art made with memory and meaning doesn’t translate cleanly into machine learning. Not everything beautiful wants to be automated.
After years of AI companies quietly training models on Reddit’s massive archive of user discussions, the platform drew a line. It announced sweeping changes to its application programming interface (API), introducing steep fees for high-volume data access, primarily aimed at AI developers.
CEO Steve Huffman framed the change as a matter of fairness: Reddit’s conversations are valuable, and companies shouldn’t be allowed to extract insights without compensation. After the shift, Reddit reportedly signed a $60 million per year licensing deal with Google, formalizing access on its own terms.
The shift reflects a broader trend: public platforms treat their data like inventory, not just traffic.
Stack Overflow, a G2 customer, changed its API policies and now charges AI developers for access to its community-generated programming knowledge. The platform, long regarded as a free knowledge base for developers, found itself unwillingly contributing to the AI boom.
As tools like ChatGPT and GitHub Copilot began to surface answers that resembled Stack Overflow posts, the company responded with new policies blocking unlicensed data use.
Stack Overflow has restricted and monetized API access and partnered with OpenAI in 2024 to license its data for responsible AI use. It has also introduced a Responsible AI policy, allowing ChatGPT to pull from trusted developer responses while giving proper credit and context.
The issue wasn’t just unauthorized use — it was a breakdown of the trust that fuels open communities. Developers who answered questions to help each other weren’t signing up to train commercial tools that might eventually replace them.
This tension between open knowledge and commercial use is now at the heart of many AI privacy concerns.
Getty Images took legal action against Stability AI, accusing it of copying and using over 12 million copyrighted images, including many with visible watermarks, to train its image generation model, Stable Diffusion.
The lawsuit highlighted a core problem in generative AI: models trained on unlicensed content can reproduce styles, subjects, and ownership marks. Getty didn’t stop at litigation; it partnered with NVIDIA to launch a licensed, opt-in dataset for responsible AI training.
The lawsuit isn’t just about lost revenue. If successful, it could set a precedent for how visual IP is treated in machine learning.
YouTube creators began sounding the alarm after discovering AI-generated videos that used cloned versions of their voices, sometimes promoting scams, sometimes parodying them with eerily accurate tone and delivery.
In some cases, AI models had been trained on hours of content without permission, using public-facing videos as voice datasets.
The creators responded with takedown requests and warning videos, pushing for stronger platform policies and clearer consent mechanisms. While YouTube now requires disclosures for AI-generated political content, broader guardrails for impersonation remain inconsistent.
For influencers who built their brands on personal voice and authenticity, having that voice hijacked without consent isn’t just a copyright issue; it’s a breach of trust with their audiences.
Medium responded to increasing concerns from its writers, many of whom suspected their essays and personal reflections were showing up in generative AI outputs. Without fanfare, Medium updated its robots.txt file to block AI crawlers, including OpenAI’s GPTBot.
While it didn’t launch a PR campaign, the platform’s move reflects a growing trend: content platforms protecting their contributors by default. It’s a soft but significant stance: writers shouldn’t have to worry about their most vulnerable stories becoming raw material for the next chatbot’s training run.
Tumblr has long been a home for fandoms, indie artists, and niche bloggers. As generative AI tools began to mine internet culture for tone and aesthetics, Tumblr’s user base raised concerns that their posts were being harvested for training without their knowledge.
The company updated its robots.txt file to block crawlers linked to AI projects, including GPTBot. There was no press release or platform-wide announcement; it was just a technical update that showed Tumblr was listening.
It may not have stopped every model already trained on old data, but the message was clear: the site’s creative archive isn’t up for the taking.
Some of the world’s most trusted newsrooms quietly pulled the plug on OpenAI’s GPTBot and other AI web crawlers without a single press release. From The Washington Post to CNN and Reuters, major outlets added a few decisive lines of code to their robots.txt files, effectively telling AI companies: “You can’t train on this.”
It wasn’t about server strain or traffic. It was about control over the stories, the sources, and the trust that makes journalism work. The quiet revolt spread quickly: by early 2024, nearly 80% of top U.S. publishers had blocked OpenAI’s data collection tools.
This wasn’t just a protest. It was a hard stop — served cold, in plaintext. When AI companies treat journalism like free training material, publishers increasingly treat their sites like gated archives. Adding friction might be the only way to protect the original in a world of auto-summarized headlines and AI-generated copycats.
Some AI companies have landed in hot water, facing cases that question their AI’s approach to privacy and data handling.
A group of authors, including Andrea Bartz and Charles Graeber, say their books were used without consent to train Claude, Anthropic’s large language model. They didn’t opt in or get paid, and now they’re suing.
The lawsuit alleges that Anthropic fed copyrighted novels into its training pipeline, turning full-length books into raw material for a chatbot. The authors argue that this isn’t innovation — it’s appropriation. Their words weren’t just referenced; they were ingested, abstracted, and potentially regurgitated without credit.
Anthropic, for its part, claims fair use. The company says its AI transforms the content to create something new. But the writers pushing back say the transformation isn’t the point — the lack of consent is.
As this case heads to court, it tests whether creators get a say before their work becomes machine fodder. For many authors, the answer needs to be yes.
Your face isn’t free training data.
A group of U.S. plaintiffs sued Clearview AI after discovering the company had scraped billions of publicly available photos, including selfies, school pictures, and social media posts, to build a massive facial recognition database. The catch? No one gave permission.
The class-action lawsuit alleged that Clearview violated biometric privacy laws by harvesting identities without consent or compensation. In March 2025, a federal judge approved a unique settlement: instead of monetary damages, Clearview agreed to stop selling access to most private entities and implement guardrails under court supervision.
While the settlement didn’t write checks, it did set a precedent. The case marks one of the first large-scale wins for people who never opted into AI training but had their faces taken anyway.
A squad of publishers, including Condé Nast, The Atlantic, and Vox Media, sued Cohere for quietly scraping thousands of their articles to train its LLMs. The problem? These weren’t open blog posts. They were paywalled, licensed, and built on decades of editorial infrastructure.
The lawsuit says Cohere not only ingested the content but now enables AI tools to summarize or remix it without attribution, payment, or even a click back to the source. For journalism that’s already battling AI-generated noise, this felt like a line crossed.
The gloves are off: publishers aren’t just protecting revenue — they’re protecting the chain of credit behind every byline.
Common Crawl is a nonprofit that’s quietly shaped the modern AI boom. Its petabyte-scale web archive powers training datasets for OpenAI, Meta, Stability AI, and countless others. But that broad scraping comes with baggage: many sites in the dataset never consented, and some are paywalled, copyrighted, or personal in nature.
Publishers have started fighting back. Sites like Medium, Quora, and The New York Times have blocked Common Crawl’s user agent, and others are now auditing the datasets to see whether their content was included.
What was once a data scientist’s dream has become a flashpoint for ethical AI development. The age of “just crawl it and see what happens” may be coming to an end.
OpenAI introduced a way for websites to block GPTBot, its data crawler, through a robots.txt file. But for many site owners and content creators, the damage had already been done: their content was scraped before the opt-out existed, and there’s no explicit rollback of past training data.
Some publishers called the move “too little, too late,” while others criticized the lack of transparency around whether their data was still being used in retrained models.
The backlash made one thing clear: in AI, consent after the fact doesn’t feel like consent at all.
Getty Images wasn’t alone. Stability AI’s strategy of training powerful models like Stable Diffusion on openly available web data has drawn sharp criticism from artists, platforms, and copyright holders. The company claims it operates under fair use, though lawsuits from illustrators and developers allege otherwise.
Many argue that Stability AI benefited from scraping creative work without consent, only to build tools that can now compete directly with the original creators. Others point to the lack of transparency around what content was used and how.
A company built on the ideals of open access now finds itself at the center of one of the most urgent questions in AI: can you build tools on top of the internet without asking permission?
Some aren’t waiting for the courts; they’re already building technical walls. As AI crawlers scour the web for training data, more platforms deploy code-based defenses to control who gets access and how.
Here’s how companies are locking the gates:
A robots.txt file is a plain-text file at a site’s root that tells crawlers which parts of the site they may access. Platforms like Medium, Tumblr, and CNN have updated these files to block AI bots (e.g., GPTBot) from accessing their content.
Example:
User-agent: GPTBot
Disallow: /
These two lines can stop an AI bot cold.
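If you want to verify whether a site has put up this kind of wall, Python’s standard library can read and interpret a robots.txt file. Below is a minimal sketch; the example.com address is a placeholder for whichever site you want to inspect, and GPTBot and CCBot are the user agents that OpenAI’s and Common Crawl’s crawlers identify themselves with.

from urllib import robotparser

# Fetch and parse the site's robots.txt (placeholder URL).
parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# Check whether specific crawlers are allowed to fetch the homepage.
for bot in ("GPTBot", "CCBot", "Googlebot"):
    status = "allowed" if parser.can_fetch(bot, "https://example.com/") else "blocked"
    print(f"{bot}: {status}")

Keep in mind that robots.txt is advisory: it stops crawlers that choose to honor it, which well-known bots generally do, but it is not a technical barrier on its own.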
Sites like Reddit and Stack Overflow began charging for API access, especially when usage spikes came from AI companies. This has throttled large-scale data extraction and made it easier to enforce licensing terms.
Some companies, including Stack Overflow and news publishers, are rewriting their terms of service to ban AI training unless a license is granted explicitly. These updates act as legal guardrails, even before litigation begins.
Tools like DeviantArt’s “NoAI” tag and opt-out metadata allow creators to flag their content as off-limits. While not always respected, these flags are gaining traction as standard signals in the AI ethics playbook.
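On the web, this kind of opt-out is usually expressed as a page-level meta directive. The directive names aren’t yet a formal standard and vary by platform, but DeviantArt’s implementation, for example, adds a tag along these lines:

<meta name="robots" content="noai, noimageai">

Like robots.txt, it only works if the crawler on the other end chooses to honor it.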
Want to know if your content is vulnerable? Start by checking whether your robots.txt file, API terms, and content licenses say anything at all about AI crawlers and training use.
What started as a quiet concern among artists and journalists has become a global push for AI accountability. The question isn’t whether AI can learn from the internet but whether it should learn without asking.
Some are taking the legal route. Others are rewriting contracts, updating headers, or blocking bots outright.
Either way, the message is the same: creators want a say in how their work trains future machines. And they’re not waiting for permission to say no.
The real question is: can we build AI that doesn’t bulldoze over fundamental rights? Read about the ethics of AI to learn more.
Washija Kazim is a Sr. Content Marketing Specialist at G2 focused on creating actionable SaaS content for IT management and infrastructure needs. With a professional degree in business administration, she specializes in subjects like business logic, impact analysis, data lifecycle management, and cryptocurrency. In her spare time, she can be found buried nose-deep in a book, lost in her favorite cinematic world, or planning her next trip to the mountains.