Nvidia, Apple, and others allegedly trained AI using 173,000 YouTube videos — professional creators frustrated by latest AI training scandal: Report

Some of the world's wealthiest companies, including Apple and Nvidia, are among the many parties that allegedly trained their AI models on transcripts scraped from YouTube videos. The transcripts were reportedly gathered by means that violate YouTube's Terms of Service, and some creators are seeing red. The story was first reported in a joint investigation by Proof News and Wired.

While major AI companies and producers often keep their AI training data secret, heavyweights like Apple, Nvidia, and Salesforce have revealed their use of "The Pile", an 800GB training dataset created by EleutherAI, and the YouTube Subtitles dataset within it. The YouTube Subtitles training data is made up of 173,536 YouTube plaintext transcripts scraped from the site, including 12,000+ videos which have been removed since the dataset's creation in 2020. 

Affected parties whose work was purportedly scraped for the training data include education channels like Crash Course (1,862 videos taken for training) and Philosophy Tube (146 videos taken), YouTube megastars like MrBeast (two videos) and PewDiePie (337 videos), and TechTubers like Marques Brownlee (seven videos) and Linus Tech Tips (90 videos). Proof News created a tool you can use to search the full list of YouTube videos allegedly used without consent.

EleutherAI is a respectably sized force in the AI training space. The non-profit AI research lab is one of many aiming to "democratize" AI for the masses, with its website stating a goal to "ensure that the ability to study foundation models is not restricted to a handful of companies". The Pile and YouTube Subtitles datasets were created for this purpose: to provide high-quality training data to even the scrappiest of at-home AI coders. However, The Pile, born of this idyllic dream of supporting the little guy, has become another fuel source for major corporations training AI, rather than a resource for DIYers.

The YouTube Subtitles dataset itself violates YouTube's Terms of Service, both by using YouTube's content without permission and by using "automated means" to access the data. In the research paper about The Pile and YouTube Subtitles, EleutherAI acknowledges its violation of the TOS but claims that the tools used to scrape YouTube data were already widespread enough that no additional harm was caused.

Many of those affected have reacted strongly against the use of their content. Abigail Thorn, producer of YouTube channel Philosophy Tube and actress on House of the Dragon, shared on X (formerly Twitter), "When I was told about this I lay on the floor and cried, it’s so violating, it made me want to quit writing forever. The reason I got back up was because I know my audience come to my show for real connection and ideas, not cheapfake AI garbage." 

She continued, "I’d like to see YouTube do more to prevent theft like this from happening." Thorn and other YouTubers confirm that no one ever asked their permission to scrape the videos or to later use them as training data.

Assigning fault is difficult because no one will accept blame or responsibility for the use of the transcripts. Apple and the other major tech companies that used the training data avoid blame because they weren't the ones doing the scraping, although conversations must be had within such companies about the ethical sourcing of training data. EleutherAI, the dataset's creator, has not responded to any publication's requests for comment, and it rejects any wrongdoing or harm in its initial research paper on The Pile.

The tech industry is spending on AI hardware at an unhealthy rate, with the AI market needing to turn $600 billion in profit per year to keep up with its insane hardware purchasing. As companies seek to spend less on AI, further cases of illicitly obtained data become more likely, like this YouTube theft and Google's Gemini reading files without permission. Before long, it may not be shocking to see web content end with "You have exceeded the GPT rate limit. Don't forget to smash that like button!"

Dallin Grimm
Contributing Writer

Dallin Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Dallin has a handle on all the latest tech news. 

  • abufrejoval
    This is an easy issue to fix: just boycott AI services for two years, don't pay extra for NPUs, just buy hardware as if it weren't there.

    With that the bubble might burst fast and some degree of sanity and lawfulness can return.

    Nothing like a bloody nose to teach a lesson...
  • MobileJAD
    abufrejoval said:
    This is an easy issue to fix: just boycott AI services for two years, don't pay extra for NPUs, just buy hardware as if it weren't there.

    With that the bubble might burst fast and some degree of sanity and lawfulness can return.

    Nothing like a bloody nose to teach a lesson...
    I personally have never cared for or desired a phone with an NPU. Whenever I buy a phone, it just takes up space in the SoC's silicon; Android probably uses it for some secretive data harvesting purposes tho. I sure as hell ain't planning on buying a desktop CPU with an NPU built in, and absolutely will not buy one of those trash Copilot laptops. I personally have yet to find any reason to get excited about AI. I mean, I'm sure science and medical research can make use of it. Do I care about AI enhanced search results? Nope. Do I care about AI chat bots? Nope. I do find it amusing how the industry is desperately pushing AI hard tho.
  • derekullo
    abufrejoval said:
    This is an easy issue to fix: just boycott AI services for two years, don't pay extra for NPUs, just buy hardware as if it weren't there.

    With that the bubble might burst fast and some degree of sanity and lawfulness can return.

    Nothing like a bloody nose to teach a lesson...
    Good luck getting Nvidia to remove AI from their chips. LOL

    Even if you don't use it, you still paid for it.
  • DS426
    Business.............ethics................
    *Cue Bill Madison*
    *ROFL*

    Funny how us peasants are AI washed and our concerns brushed right off as cynicism and conspiracy theories. None of these folks can be trusted -- apparently our digital existence is and will probably forever be pure digital gold to the mega corps.
  • MobileJAD
    derekullo said:
    Good luck getting Nvidia to remove AI from their chips. LOL

    Even if you don't use it, you still paid for it.
    Well at least with Nvidia there is a chance they will use the AI on their boards for something useful, like enhancing rendering quality, or for use with RTX, which I have still barely used. Either way the price for video cards is just ridiculous and as many have said, the chances are slim of video card prices going back to the "good old days", i.e. back when the GTX 1080 first launched or earlier. I got a 3rd party refurbished EVGA RTX 3080 Ti at a 'decent' price through pure luck when I had the money. Chances of me replacing that card with a newer GPU are very low; probably won't happen for a few years at least.
  • Sippincider
    Was wondering why Siri told me to be sure to subscribe to her channel...

    (Seriously, THAT didn't happen. But terms-of-service aside, not sure if YT is the reality we should want AIs learning from.)
  • CmdrShepard
    If they used subtitles to train a model that does automatic video captioning I am fine with that because it is a noble use case which will benefit everyone (creators and users, especially those with disabilities).

    However, if they used subtitles to train a model which then generates similar text in the style of those creators that's a clear copyright breach and should be sanctioned to the fullest extent of the law.
  • thestryker
    They directly violated the TOS of a service they were using, but generally speaking there's nothing that can really be done. Especially when it comes to AI, where the models have likely already been seeded and generated, so barring their complete destruction it's a dead end.

    They did use my favorite excuse though which is when I tend to hope everyone involved gets what's coming to them:
    In the research paper about The Pile and YouTube Subtitles, EleutherAI acknowledges its violation of TOS but claims that the tools used to scrape YouTube data were already widespread enough that no additional harm was caused.
  • renz496
    MobileJAD said:
    Well at least with Nvidia there is a chance they will use the AI on their boards for something useful, like enhancing rendering quality, or for use with RTX, which I have still barely used. Either way the price for video cards is just ridiculous and as many have said, the chances are slim of video card prices going back to the "good old days", i.e. back when the GTX 1080 first launched or earlier. I got a 3rd party refurbished EVGA RTX 3080 Ti at a 'decent' price through pure luck when I had the money. Chances of me replacing that card with a newer GPU are very low; probably won't happen for a few years at least.
    Slim chance? There is no chance at all. Unless those $18k 5nm prices suddenly drop down to $4k haha.