Pay Up, AI Bot: That’s the Message From a Key Company in How the Internet Works

G.F.A.L.O.E.

1 year ago

AI companies might find it harder to access the entire web to train their large language models after the internet infrastructure provider Cloudflare said this week it would block AI data crawlers by default.

It’s the latest front to open in an ongoing fight between the creators of content and the AI developers who use that content to train generative AI models. In court, authors and content creators are suing major AI companies for compensation, saying copyrighted content was used without permission. (Disclosure: Ziff Davis, CNET’s parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

While content providers are seeking compensation for information that was used to train models in the past, Cloudflare’s move marks a new defensive measure against future efforts to train models.

But it isn’t just about blocking crawlers: Cloudflare says it wants to create a marketplace where AI companies can pay to crawl and scrape a site, meaning the provider of that information gets paid, and the AI developer gets permission.
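To make the block-by-default, pay-to-crawl idea concrete, here is a minimal Python sketch of the decision a site might make for each incoming request. It is an illustration only, not Cloudflare's implementation: the crawler names, the `has_paid` flag and the `decide` function are all hypothetical. It uses HTTP's standard 402 Payment Required status code as one plausible way to tell a crawler that access has a price.

```python
from http import HTTPStatus

# Hypothetical crawler names, assumed for illustration only.
AI_CRAWLERS = {"ExampleAIBot", "AnotherAIBot"}


def decide(user_agent: str, has_paid: bool) -> HTTPStatus:
    """Return the status a site might send for a single request."""
    if user_agent not in AI_CRAWLERS:
        # Ordinary visitors and search crawlers are served as usual.
        return HTTPStatus.OK
    if has_paid:
        # The AI crawler has an arrangement with the site owner.
        return HTTPStatus.OK
    # Blocked by default: the crawler is told that payment is required.
    return HTTPStatus.PAYMENT_REQUIRED


if __name__ == "__main__":
    print(decide("ExampleAIBot", has_paid=False))
    print(decide("ExampleAIBot", has_paid=True))
```

A real system would also have to verify that a bot is who it claims to be, since a user-agent string is easy to fake.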

“That content is the fuel that powers AI engines, and so it’s only fair that content creators are compensated directly for it,” Cloudflare CEO Matthew Prince said in a blog post.

Why websites want to block AI crawlers

Crawlers, the bots that visit a website and copy its information, are a vital component of the connected internet. They’re how search engines like Google know what’s on different websites, and how they can serve you the latest information from places like CNET.

AI crawlers pose distinct challenges for websites. For one, they can be aggressive, generating unsustainable levels of traffic for smaller sites. They also offer little reward for their scraping: If Google crawls a site for search engine results, it will likely send traffic back to that site by including it in search results. Being crawled for training data might mean no additional traffic, or even less traffic if people stop visiting the site and rely solely on the AI model.

That’s why executives from major websites like Pinterest, Reddit and several major publishing companies (including Ziff Davis, which owns CNET) cheered Cloudflare’s news in statements.

“The whole ecosystem of creators, platforms, web users and crawlers will be better when crawling is more transparent and controlled, and Cloudflare’s efforts are a step in the right direction for everyone,” Reddit CEO Steve Huffman said in a statement.

Asked about Cloudflare’s announcement, OpenAI said ChatGPT is intended to help connect its users to content on the web, much as search engines do, and that it has integrated search into its chat functions. The company also said it relies on a different mechanism from the one Cloudflare has proposed: robots.txt, a file that lets publishers indicate how AI crawlers should behave. OpenAI said the robots.txt approach already works and that Cloudflare’s changes are unnecessary.
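For readers who haven’t seen one, a robots.txt file is a short list of rules, grouped by crawler name, that a well-behaved bot reads before fetching pages. The Python sketch below uses the standard library’s `urllib.robotparser` to show how a crawler could check those rules. The rules themselves and the `can_crawl` helper are assumptions made for illustration, not any publisher’s actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, assumed for illustration:
# one AI crawler is turned away, everyone else is allowed in.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Report whether the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"
    print("GPTBot allowed:", can_crawl("GPTBot", page))
    print("Googlebot allowed:", can_crawl("Googlebot", page))
```

Note that robots.txt is advisory. It works only if the crawler chooses to read the file and follow it, which is the gap a network-level block is meant to close.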

The training data tug-of-war

AI models require a ton of data to train. That’s how they’re able to provide detailed answers to questions and do a decent (if imperfect) job of providing a wide range of information. These models are fed incredible amounts of information and make connections between words and concepts based on what they see in that training data.

The issue is how developers have gotten that data. There are now dozens of lawsuits between content creators and AI companies. Two saw major rulings just last week.

In one case, a federal judge ruled that Anthropic followed the law when it used copyright-protected books to train its model Claude, under a legal doctrine called fair use. At the same time, the judge said the company’s creation of a permanent library of the books was not protected by fair use, and ordered a new trial on those piracy allegations.

In a separate case, a judge ruled in favor of Meta in a dispute between the company and a group of 13 authors. But Judge Vince Chhabria said the ruling doesn’t mean future cases against Meta or other AI companies will go the same way. In essence, he wrote, “these plaintiffs made the wrong arguments and failed to develop a record in support of the right one.”

The idea of charging crawlers to visit a site isn’t entirely new. Other companies, like Tollbit, offer services that allow website owners to charge AI companies for crawling. Will Allen, head of AI control, privacy and media products at Cloudflare, said the environment around this technology is still developing. “We think it’s very early for a content marketplace to form, and we are just starting to experiment here,” he told CNET. “We’re excited to see many different models flourish.”

CNET’s Imad Khan contributed to this report.