Wikipedia’s parent foundation is cracking down on AI crawlers and banning AI-generated content from its encyclopaedia. It is simultaneously formalising paid partnerships with tech giants including Amazon, Meta, and Microsoft. The shifts raise wider questions about who pays for—and who profits from—the internet’s most-used knowledge base.

Wikipedia remains one of the most valuable datasets for training large language models. It is free, multilingual, and curated by a global community of volunteers. But that openness is increasingly under strain. According to Wikimedia, automated traffic has surged, with billions of bot requests hitting its servers every day, while human readership declines.

“We are experiencing a massive increase in expensive bot traffic, while the human traffic is down,” Dimitar Zagorski, policy director at Wikimedia Europe, told EU Perspectives. “Bot traffic is more expensive, because it is unforeseeable, can’t be cached locally, and thus needs to be served globally. Human traffic is cheaper on average, because it can more easily be cached locally.”

Restricting AI inside the platform

In response, the Foundation has tightened its policies on automated access, introducing rate limits and blocking abusive crawlers. Around a quarter of automated traffic is now throttled for violating these rules.

At the same time, Wikipedia’s editorial community is drawing firmer boundaries. AI-generated content is largely banned from the encyclopaedia, with limited exceptions for translation and minor edits.


Zagorski stressed that human access must remain unrestricted and all content freely accessible and freely licensed. “Some organisations are refusing to use the paid API or to limit the requests per hour. We thus need to put limits as to how many times per hour a crawler might access our servers. The limit is quite high, so it really should bother only the very large language models,” he said.
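The per-hour cap Zagorski describes can be pictured as a simple fixed-window counter: each crawler gets a budget of requests per hour, and anything beyond it is refused. The sketch below is purely illustrative (the class name, limits, and in-memory design are assumptions for clarity), not a description of Wikimedia's actual infrastructure.

```python
import time
from collections import defaultdict

class HourlyRateLimiter:
    """Fixed-window rate limiter: at most `limit` requests per client per hour.

    Illustrative sketch only -- Wikimedia's real throttling is not public
    in this form, and production systems would use a shared store, not a dict.
    """

    def __init__(self, limit, window_seconds=3600):
        self.limit = limit
        self.window = window_seconds
        # (client_id, window index) -> number of requests seen in that window
        self.counts = defaultdict(int)

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        key = (client_id, int(now // self.window))
        if self.counts[key] >= self.limit:
            return False  # over the hourly cap: throttle this request
        self.counts[key] += 1
        return True
```

Because the limit is per client and per window, ordinary human readers never come close to it, while a crawler hammering the servers thousands of times an hour is cut off until the next window begins, which matches the "quite high" threshold Zagorski describes.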

From scraping to structured access

Alongside these restrictions, Wikimedia is reshaping how large-scale users access its data. Through its Wikimedia Enterprise service, the Foundation offers paid, high-volume API access designed for companies that rely heavily on its content.

The model is already attracting major players. Companies including Amazon, Meta, Microsoft, Google, and newer AI firms such as Mistral AI and Perplexity AI have joined a growing roster of partners using Wikimedia’s enterprise APIs, alongside Ecosia, Nomic AI, Pleias, ProRata AI, and Reef Media. Notably, however, some major AI developers, such as Anthropic, are absent from these arrangements.

AI as creator and competitor

Wikipedia’s move is part of a broader backlash from publishers and platforms. Major news organisations, including The New York Times, BBC, CNN, Reuters, and The Washington Post, have introduced technical blocks against AI crawlers. Some have gone further and prohibited the use of AI-generated content in their own journalism.

While publishers are restricting AI, the technology is simultaneously expanding its role in cultural production, raising new copyright concerns. Large volumes of AI-generated books are appearing, in some cases using the names of real authors without consent. In music, AI-generated tracks mimicking artists like Drake have gone viral before being removed following copyright complaints. In film, works such as Sunspring, whose screenplay was written by an AI, raise broader questions about quality and authorship.

Europe moves toward regulation

In Brussels, policymakers are beginning to respond—in March, the European Parliament adopted recommendations aimed at protecting copyrighted content in the age of generative AI. Lawmakers are calling for greater transparency in training data, fair remuneration for rightsholders, and mechanisms allowing creators to exclude their work from AI systems.


“We need clear rules for the use of copyright-protected content for AI training,” said Axel Voss (EPP/DEU), the Parliament’s rapporteur on the file. “Legal certainty would let AI developers know which content can be used and how licences can be obtained. On the other hand, rightsholders would be protected against unauthorised use of their content and receive remuneration.”

Crucially, MEPs argue that EU copyright law should apply to all AI systems operating within the bloc, regardless of where they were trained. They also warn that AI-driven aggregation risks undermining the news sector by diverting traffic and revenue away from publishers. A future licensing market for training data is now on the table—one that could fundamentally alter the terms on which AI companies access the content that powers them.