Several AI companies said to be ignoring robots.txt exclusion, scraping content without permission: report
AI does not care about licensing
Several AI companies are circumventing the Robots Exclusion Protocol (robots.txt) to scrape content from websites without permission, Reuters reports, citing TollBit, a content licensing startup. The issue has led to disputes between AI firms and publishers, with Forbes accusing Perplexity of plagiarizing its content.
TollBit's letter to publishers, obtained by Reuters, reveals that many AI agents are ignoring the robots.txt standard, which is used to block parts of a site from being crawled. The company’s analytics indicate a pattern of widespread non-compliance, as various AIs use data for training without authorization. AI search startup Perplexity, in particular, has been accused by Forbes of using its investigative stories in AI-generated summaries without proper attribution or permission. Perplexity did not comment on these allegations.
The robots.txt protocol, created in the mid-1990s, was intended to prevent web crawlers from overloading websites. Although it has no legal enforcement, it has traditionally been widely respected — until now, it seems. Publishers are trying to use this protocol to block unauthorized content usage by AI systems, which scrape content to train algorithms and generate summaries.
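Because compliance is voluntary, the check happens entirely on the crawler's side. The following is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The bot names, rules, and URLs are hypothetical and chosen only for illustration; they are not taken from any publisher or AI company named in the report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named AI crawler is blocked from the whole
# site, and every other crawler is blocked only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    # The blocked bot is refused everywhere.
    print(may_fetch(rules, "ExampleAIBot/1.0", "https://example.com/news/story"))  # False
    # Other bots may read public pages but not /private/.
    print(may_fetch(rules, "OtherBot/2.0", "https://example.com/news/story"))      # True
    print(may_fetch(rules, "OtherBot/2.0", "https://example.com/private/draft"))   # False
```

Nothing in this flow stops a crawler from skipping the check and requesting the page anyway, which is why the protocol depends on good faith.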
"What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites," TollBit wrote, according to Reuters. "The more publisher logs we ingest, the more this pattern emerges."
Some publishers, like the New York Times, have taken legal action against AI companies for copyright infringement. Others have opted to negotiate licensing deals. This ongoing debate highlights the conflicting views on the value and legality of using content to train generative AI, as many AI developers argue that accessing content without charge does not violate any laws — unless, of course, it's paid content.
The issue has gained prominence as AI-generated news summaries become more common. Google's AI product, which creates summaries in response to search queries, has worsened publisher concerns. To prevent their content from being used by Google’s AI, publishers have been blocking it using robots.txt, but this removes their content from search results and impacts their online visibility. Meanwhile, if AI crawlers ignore robots.txt anyway, what is the point of content owners using it, when the only result is lost online visibility?
TollBit also has a horse in this AI and editorial content race, positioning itself as an intermediary between AI companies and publishers that helps to establish licensing agreements for content usage. The startup tracks AI traffic to publisher websites and provides analytics to negotiate fees for different types of content, including premium content. TollBit claims to have 50 websites using its services as of May, but did not disclose their names.
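As a rough illustration of how server logs could reveal this kind of non-compliance, the sketch below counts logged requests that a site's robots.txt would have disallowed for the requesting user agent. It is an assumption-laden toy, not TollBit's method, which the report does not describe: the log records, bot names, and rules are all hypothetical, and real access logs would need parsing first.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a publisher site.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /premium/
"""

# Hypothetical, already-parsed access log: (user agent, requested path).
ACCESS_LOG = [
    ("ExampleAIBot/1.0", "/news/story-1"),
    ("ExampleAIBot/1.0", "/premium/report"),
    ("FriendlyBot/2.3", "/news/story-1"),
    ("FriendlyBot/2.3", "/premium/report"),
]


def count_violations(robots_txt: str, log: list[tuple[str, str]]) -> Counter:
    """Count, per user agent, requests that robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations: Counter = Counter()
    for user_agent, path in log:
        if not parser.can_fetch(user_agent, path):
            violations[user_agent] += 1
    return violations


if __name__ == "__main__":
    for agent, count in count_violations(ROBOTS_TXT, ACCESS_LOG).items():
        print(f"{agent}: {count} disallowed request(s)")
```

A check like this only catches crawlers that identify themselves honestly, since the user-agent string is supplied by the client and can be spoofed.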
Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.
CmdrShepard
USAFRet said: Who didn't see that coming?
Yeah, when abiding by robots.txt is voluntary, and when we know the techbro mentality of "it's easier to ask for forgiveness than to ask for permission" and how toothless the fines for doing stuff like this are, there's nothing surprising about it.
What I wonder is whether, for the purposes of the DMCA, this ignoring of robots.txt would constitute circumvention of a computer system's protection. That would be a kind of creative reading of the law though, hope someone tests it in court.
Findecanor
Ignoring "opt out" is now explicitly illegal within the European Union, according to the EU AI Act and the Common Digital Single Market directive.
Although I find it unclear how violations would be classified and what possibilities for enforcement and repercussions there are.
There is also a weakness in that the EU does not mandate or recommend any specific protocols.
However, "robots.txt" is decades old and an official IETF standard, so there is no excuse for not following it.
MacZ24
Findecanor said: Ignoring "opt out" is now explicitly illegal within the European Union, according to the EU AI Act and the Common Digital Single Market directive.
The EU doesn't have directives against building tech champions, but it is as if they do.
hotaru251
honestly until there are laws made to prevent using data w/o explicit permission w/ the punishment being a massive 50%+ of the company's earnings nobody's going to care to stop as "it's just the cost of business" otherwise.
same reason companies can be caught breaking law over and over because the fines are capped out to such a low point they make more $ breaking law and paying the fines than following law.
Notton
It's almost as if the same people behind crypto and NFT money laundering are also behind these AI startups.
35below0
hotaru251 said: honestly until there are laws made to prevent using data w/o explicit permission w/ the punishment being a massive 50%+ of the company's earnings nobody's going to care to stop as "it's just the cost of business" otherwise. same reason companies can be caught breaking law over and over because the fines are capped out to such a low point they make more $ breaking law and paying the fines than following law.
Even huge fines don't really stop the practice. Being fined millions or even an occasional billion is still "cost of doing business".
It's profitable so they keep doing it.
yahrightthere
hotaru251 said: honestly until there are laws made to prevent using data w/o explicit permission w/ the punishment being a massive 50%+ of the company's earnings nobody's going to care to stop as "it's just the cost of business" otherwise. same reason companies can be caught breaking law over and over because the fines are capped out to such a low point they make more $ breaking law and paying the fines than following law.
I 100% agree, also criminally prosecute owners, CEOs and officers of the offending companies.
BTW shouldn't we all get recompense by companies that use our personal data and info!
Make that a law and see how companies like the turnabout.
hotaru251
35below0 said: It's profitable so they keep doing it.
you didn't read my post fully. that was covered & also why i said it needs 50% of earnings (not just profit) as at such high costs it no longer becomes profitable.
if it took half ur income & forced you to remove what you did illegally then they won't do it again as it becomes bad business (because you are losing more than you make from it)
yahrightthere said: BTW shouldn't we all get recompense by companies that use our personal data and info!
depends.
is content you're using free? then no as nothing is "free" and your data is the price you are paying.
if you pay for something? then yes as you are paying them thus your data shouldn't be free to harvest on top of that.