Google scraped data from YouTube to train Gemini

One of the biggest issues in AI today is data scraping. To train their models, AI companies scrape data from online sources. We recently learned that OpenAI has scraped a large amount of data from YouTube. Now it appears that even Google has been scraping data from YouTube videos.

Right now, YouTube is guarding the data on its platform. YouTube’s CEO, Neal Mohan, recently warned OpenAI against using its videos to train Sora, OpenAI’s highly realistic AI video generator.

According to a report from The New York Times, OpenAI has been scraping data from the massive video-sharing platform, but not the video itself. The company used a tool called “Whisper” to automatically transcribe the audio from YouTube videos, then used those transcripts to train GPT-4. The report states that OpenAI transcribed more than a million hours of YouTube videos.

OpenAI argues that it is using information from publicly available YouTube videos, so the practice should, ostensibly, be justified. However, YouTube states that it prohibits any unauthorized downloading or scraping of its videos. This means that OpenAI could be in violation of YouTube’s terms of use. If the dispute escalates, the companies may end up battling it out in court.

Google is also scraping YouTube videos

In a notable twist, it appears that Google is also scraping data from YouTube videos. This is significant because Google is YouTube’s parent company, and it raises questions. Does YouTube know about this? Is Google telling YouTube to stay quiet about it? Will YouTube seek any sort of legal action against its parent company?

These questions will likely remain unanswered for some time. In any case, it appears that Google has made a small change to its terms of service. This change, according to the report, allows the company to scrape data from publicly visible sources such as Google Docs, Google Sheets files, Google Maps reviews, etc. This suggests that the company wants to ramp up its data collection, which does not bode well for users who want to keep their data private.

People read companies’ terms of service to learn how their data is handled. However, knowing that does little good if companies can casually change their terms to allow themselves to scrape it.
