OpenAI trained ChatGPT on millions of hours of YouTube videos without permission

AI companies encountered obstacles in collecting high-quality training data.

Now the New York Times has detailed how companies have addressed this problem. To no one's surprise, the answer involves operating in the gray areas of copyright law as it applies to AI.

OpenAI, desperate for training data, developed an audio transcription model to transcribe more than a million hours of YouTube videos, which it then used to train GPT-4. According to the New York Times, the company knew this was legally questionable but considered it fair use.

OpenAI President Greg Brockman was personally involved in collecting the videos used.

OpenAI spokeswoman Lindsay Held said the company uses "numerous sources, including public domain data and partnerships for non-public data" and is investigating generating its own synthetic data.

The company had apparently exhausted its supplies of useful data by 2021 and discussed transcribing YouTube videos, podcasts, and audiobooks after other resources ran out. Until then, it had trained its models on data including computer code from GitHub, databases of chess moves, and schoolwork content from Quizlet.

Google also collected YouTube transcripts, according to Times sources. Google spokesperson Bryant said the company trained its models "on some YouTube content, pursuant to our agreements with YouTube creators."

Meta also hit the limits of available high-quality training data, and in recordings heard by the Times, its AI team discussed the unauthorized use of copyrighted works as it tried to catch up with OpenAI.
