Several AI companies said to be ignoring robots.txt exclusion, scraping content without permission: report

Several AI companies are circumventing the Robots Exclusion Protocol (robots.txt) to scrape content from websites without permission, Reuters reports, citing TollBit, a content licensing startup. The issue has fueled disputes between AI firms and publishers, with Forbes accusing Perplexity of plagiarizing its content.

TollBit's letter to publishers, obtained by Reuters, reveals that many AI agents are ignoring the robots.txt standard, which site owners use to tell crawlers which parts of a site they should not crawl. The company’s analytics indicate a pattern of widespread non-compliance, as various AI systems use data for training without authorization. AI search startup Perplexity, in particular, has been accused by Forbes of using its investigative stories in AI-generated summaries without proper attribution or permission. Perplexity did not comment on these allegations.

The robots.txt protocol, created in the mid-1990s, was intended to prevent web crawlers from overloading websites. Although compliance is voluntary and has no legal enforcement mechanism, the protocol has traditionally been widely respected, at least until now, it seems. Publishers are trying to use it to block unauthorized use of their content by AI systems, which scrape content to train algorithms and generate summaries.
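
To illustrate how the protocol works in practice, here is a minimal Python sketch of the check a well-behaved crawler would run before fetching a page, using the standard library's urllib.robotparser module. The crawler name ExampleAIBot, the example.com URLs, the rules themselves, and the is_allowed helper are all invented for illustration. Nothing forces a crawler to run this check, which is the crux of the dispute.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "ExampleAIBot" is an invented crawler name.
# It is barred from the whole site; everyone else is barred only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "OtherBot"):
        for url in ("https://example.com/news/story.html",
                    "https://example.com/private/drafts.html"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

A compliant crawler would skip any URL for which this returns False; a non-compliant one simply never asks.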

"What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites," TollBit wrote, according to Reuters. "The more publisher logs we ingest, the more this pattern emerges."
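
As a rough illustration of how non-compliance might show up in publisher logs, the Python sketch below replays simplified log records against a site's robots.txt and counts the requests that the rules would have disallowed. This is not TollBit's method, which the report does not describe. The record format, the bot names, and the count_violations helper are all assumptions made for the example, and the approach only catches crawlers that identify themselves honestly in the User-Agent header.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and log records, invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# Each record is (User-Agent header, requested path), a deliberately
# simplified stand-in for a real access-log line.
SAMPLE_REQUESTS = [
    ("ExampleAIBot/1.0", "/news/story-1.html"),
    ("ExampleAIBot/1.0", "/news/story-2.html"),
    ("OtherBot/2.0", "/news/story-1.html"),
    ("OtherBot/2.0", "/private/drafts.html"),
]


def count_violations(robots_txt, requests):
    """Count, per user agent, the requests that robots_txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = Counter()
    for user_agent, path in requests:
        # can_fetch() matches on the product token before any "/version".
        if not parser.can_fetch(user_agent, path):
            violations[user_agent] += 1
    return violations


if __name__ == "__main__":
    for agent, count in count_violations(ROBOTS_TXT, SAMPLE_REQUESTS).items():
        print(f"{agent}: {count} disallowed request(s)")
```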

Some publishers, like the New York Times, have taken legal action against AI companies for copyright infringement. Others have opted to negotiate licensing deals. This ongoing debate highlights the conflicting views on the value and legality of using content to train generative AI, as many AI developers argue that accessing content without charge does not violate any laws — unless, of course, it's paid content.

The issue has gained prominence as AI-generated news summaries become more common. Google's AI product, which creates summaries in response to search queries, has heightened publishers' concerns. To keep their content from being used by Google’s AI, publishers have been blocking it using robots.txt, but this also removes their content from search results and hurts their online visibility. Meanwhile, if AI crawlers ignore robots.txt anyway, what is the point of content owners using it, only to lose online visibility for nothing?

TollBit also has a horse in this AI and editorial content race, positioning itself as an intermediary between AI companies and publishers that helps to establish licensing agreements for content usage. The startup tracks AI traffic to publisher websites and provides analytics to negotiate fees for different types of content, including premium content. TollBit claims to have 50 websites using its services as of May, but did not disclose their names.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

  • USAFRet
    Admin said:
    Some AIs bypass robots.txt protocol
    Who didn't see that coming?
  • CmdrShepard
    USAFRet said:
    Who didn't see that coming?
    Yeah, when abiding by robots.txt is voluntary, and when we know the techbro mentality of "it's easier to ask for forgiveness than to ask for permission" and how toothless the fines for doing stuff like this are, there's nothing surprising about it.

    What I wonder is whether, for the purposes of the DMCA, ignoring robots.txt would constitute circumvention of a computer system's protection. That would be a kind of creative reading of the law though, hope someone tests it in court.
  • Findecanor
    Ignoring "opt out" is now explicitly illegal within the European Union, according to the EU AI Act and the Common Digital Single Market directive.

    That said, I find it unclear how violations would be classified and what possibilities for enforcement and repercussions there are.

    There is also a weakness in that the EU does not mandate or recommend any specific protocols.
    However, "robots.txt" is decades old and an official IETF standard, so there is no excuse for not following it.
  • MacZ24
    Findecanor said:
    Ignoring "opt out" is now explicitly illegal within the European Union, according to the EU AI Act and the Common Digital Single Market directive.
    The EU doesn't have directives against building tech champions, but it is as if they do.
  • hotaru251
    honestly until there are laws made to prevent using data w/o explicit permission w/ the punishment being a massive 50%+ of the company's earnings, nobody's going to care to stop, as "it's just the cost of business" otherwise.

    same reason companies can be caught breaking the law over and over: the fines are capped at such a low point that they make more $ breaking the law and paying the fines than following the law.
  • Notton
    It's almost as if the same people behind crypto and nft money laundering are also behind these AI startups.
  • 35below0
    hotaru251 said:
    honestly until there are laws made to prevent using data w/o explicit permission w/ the punishment being a massive 50%+ of the company's earnings, nobody's going to care to stop, as "it's just the cost of business" otherwise.

    same reason companies can be caught breaking the law over and over: the fines are capped at such a low point that they make more $ breaking the law and paying the fines than following the law.
    Even huge fines don't really stop the practice. Being fined millions or even an occasional billion is still "cost of doing business".
    It's profitable so they keep doing it.
  • yahrightthere
    hotaru251 said:
    honestly until there are laws made to prevent using data w/o explicit permission w/ the punishment being a massive 50%+ of the company's earnings, nobody's going to care to stop, as "it's just the cost of business" otherwise.

    same reason companies can be caught breaking the law over and over: the fines are capped at such a low point that they make more $ breaking the law and paying the fines than following the law.
    I 100% agree, also criminally prosecute owners, CEOs and officers of the offending companies.
    BTW shouldn't we all get recompense from companies that use our personal data and info?
    Make that a law and see how companies like the turnabout.
  • hotaru251
    35below0 said:
    It's profitable so they keep doing it.
    you didn't read my post fully. that was covered & also why i said it needs 50% of earnings (not just profit) as at such high costs it no longer becomes profitable.
    if it took half ur income & forced you to remove what you did illegally then they won't do it again as it becomes bad business (because you are losing more than you make from it)
    yahrightthere said:
    BTW shouldn't we all get recompense from companies that use our personal data and info?
    depends.
    is the content you're using free? then no as nothing is "free" and your data is the price you are paying.
    if you pay for something? then yes as you are paying them thus your data shouldn't be free to harvest on top of that.