Open Source AI · 4 min read

The Atlantic’s searchable database of music used to train AI

Alex Reisner made four training-music datasets searchable, including sets of 12 million and 9 million tracks; Google and Stability appear.

The Brieftide

TL;DR

  • 01 Alex Reisner made four training-music datasets searchable, including sets of 12 million and 9 million tracks; Google and Stability appear.
  • 02 The Atlantic created a searchable database that exposes four music datasets used to train AI, two of them containing 12 million and 9 million tracks, and two smaller sets of over 100,000 songs each.
  • 03 The collections have been downloaded thousands of times, and Google and Stability have confirmed use of the data in research papers.

What did The Atlantic find?

The Atlantic’s tracker uncovered four datasets used as AI training material: two enormous collections of 12 million and 9 million tracks, plus two additional sets with more than 100,000 songs each. Alex Reisner made those datasets fully searchable on The Atlantic’s AI Watchdog site so the public can see which songs and artists appear in training pools.

Reisner notes the datasets contain music by a broad range of artists, from Lady Gaga and Fred Again.. to Radiohead, Aphex Twin, Wu-Tang Clan, Bruce Springsteen, and experimental composer Hainbach. He also found that the collections have been downloaded thousands of times, though it is impossible to know exactly who has used them beyond the research-paper confirmations from Google and Stability.

How is the audio being gathered and distributed?

"Three of the datasets I found are distributed as a list of links to songs on YouTube or Spotify," Reisner wrote. AI developers then download the actual audio using automated tools, and he added that some of those tools allow developers to bypass logins, advertisements, and mechanisms that might earn money or subscribers for creators.
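To make the "list of links" format concrete, here is a minimal Python sketch that tallies which platform each entry in such a list points to. The report does not describe the datasets' actual file format, so the sample URLs, the `PLATFORM_HOSTS` mapping, and the function names are all hypothetical. The code only inspects URL strings; it fetches and downloads nothing.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical hostname-to-platform mapping (an assumption for this sketch,
# not taken from the datasets themselves).
PLATFORM_HOSTS = {
    "youtube.com": "YouTube",
    "youtu.be": "YouTube",
    "open.spotify.com": "Spotify",
    "spotify.com": "Spotify",
}


def platform_of(url: str) -> str:
    """Return the platform a URL points to, or "other" if it is not recognized."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return PLATFORM_HOSTS.get(host, "other")


def count_links_by_platform(links):
    """Count non-empty link entries per platform."""
    return Counter(platform_of(link.strip()) for link in links if link.strip())


# Placeholder links for illustration only; none refers to a real track.
sample_links = [
    "https://www.youtube.com/watch?v=EXAMPLE_ID_1",
    "https://youtu.be/EXAMPLE_ID_2",
    "https://open.spotify.com/track/EXAMPLE_ID_3",
    "https://example.org/some-track",
]

print(count_links_by_platform(sample_links))
```

A tally like this shows why such a list is small and easy to share: it holds pointers to the audio, not the audio itself.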

Those download techniques violate the platforms’ terms of service, Reisner said. One source named in the investigation is a dataset built from the Free Music Archive, whose music the reporter notes is free to stream for personal use but requires licensing for commercial applications.

Why it matters

The size of these collections matters because scale shapes what AI models learn: two datasets of 12 million and 9 million tracks represent a vast corpus of music metadata and audio likely to influence music-generation models. Public visibility changes the conversation about consent, licensing, and platform terms because researchers and the public can now see specific songs and artists included in training pools.

The confirmation that Google and Stability have used material from these datasets in research papers ties the question of dataset provenance to major industry actors. The presence of popular and commercially valuable artists in the lists highlights the tension between free-to-stream access and the legal and commercial rules that govern reuse.

What should creators and listeners know?

Creators whose work appears in these sets face different rights outcomes depending on the source: some collections draw from free-to-stream archives that still require licensing for commercial use, while other sets appear to be harvested from streaming platforms where automated downloads circumvent monetization mechanisms. The searchable index makes it possible for individuals to check whether a particular track appears in the training pools.
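As a rough illustration of what checking a track against such an index involves, the Python sketch below builds a lookup keyed on normalized artist and title strings. It is not how The Atlantic's tool works internally (the report does not say); the record layout, dataset labels, and function names are assumptions, and the sample entries are placeholders, not real dataset contents.

```python
import unicodedata


def normalize(text: str) -> str:
    """Casefold, strip accents, and collapse whitespace so near-identical spellings match."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.casefold().split())


def build_index(records):
    """Map normalized (artist, title) keys to the set of dataset labels listing that track."""
    index = {}
    for dataset, artist, title in records:
        key = (normalize(artist), normalize(title))
        index.setdefault(key, set()).add(dataset)
    return index


def datasets_containing(index, artist, title):
    """Return the sorted dataset labels that list the given track, or an empty list."""
    return sorted(index.get((normalize(artist), normalize(title)), set()))


# Hypothetical records: (dataset label, artist, title). Placeholder values only.
records = [
    ("dataset_a", "Example Artist", "Example Track"),
    ("dataset_b", "Éxample  Artist", "example track"),
    ("dataset_b", "Another Artist", "Another Track"),
]

index = build_index(records)
print(datasets_containing(index, "example artist", "Example Track"))
print(datasets_containing(index, "Missing Artist", "Missing Track"))
```

Normalizing before lookup matters because the same song can be spelled differently across sources; an exact string match would miss those entries.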

What to watch

Look for formal responses from the platforms whose links are used and for any follow-up from institutions that cite these datasets in research. Legal challenges or licensing claims tied to tracks listed in the Atlantic’s database would be a clear next milestone. Also watch whether more research papers explicitly acknowledge which public datasets they used, or whether dataset maintainers change distribution methods in response to scrutiny.

The Atlantic’s searchable tool and Reisner’s documentation bring a rare level of transparency to the opaque supply chain of music used in AI training, putting artists, platforms, and researchers on notice and giving the public an immediate way to inspect the data.

Overview of the four music training datasets identified
Dataset | Tracks | Distribution | Notes
Dataset 1 (enormous) | 12,000,000 | List of links to YouTube/Spotify | Downloaded thousands of times
Dataset 2 (enormous) | 9,000,000 | List of links to YouTube/Spotify | Google and Stability confirmed usage in research papers
Dataset 3 (large) | >100,000 | List of links to YouTube/Spotify | Contains tracks by major and experimental artists
Dataset 4 (large, Free Music Archive source) | >100,000 | Free Music Archive and other sources | Free to stream for personal use; requires licensing for commercial use

Written by The Brieftide · Source: The Verge
