The Atlantic’s searchable database of music used to train AI
Alex Reisner made four training-music datasets searchable, including sets of 12 million and 9 million tracks; Google and Stability appear.
TL;DR
- Alex Reisner made four training-music datasets searchable, including sets of 12 million and 9 million tracks; Google and Stability appear.
- The Atlantic created a searchable database that exposes four music datasets used to train AI, two of them containing 12 million and 9 million tracks, and two smaller sets of over 100,000 songs each.
- The collections have been downloaded thousands of times, and Google and Stability have confirmed use of the data in research papers.
What did The Atlantic find?
The Atlantic’s tracker uncovered four datasets used as AI training material: two enormous collections at 12 million and 9 million tracks, plus two additional sets with more than 100,000 songs each. Alex Reisner made those datasets fully searchable on the Atlantic’s AI Watchdog site so the public can see which songs and artists appear in training pools.
Reisner notes the datasets contain music by a broad range of artists, from Lady Gaga and Fred Again.. to Radiohead, Aphex Twin, Wu-Tang Clan, Bruce Springsteen, and experimental composer Hainbach. He also found that the collections have been downloaded thousands of times, though it is impossible to know exactly who has used them beyond the research-paper confirmations from Google and Stability.
How is the audio being gathered and distributed?
Three of the datasets Reisner examined are "distributed as a list of links to songs on YouTube or Spotify," he wrote; AI developers then download the actual audio using automated tools. He added that some download tools allow developers to bypass logins, advertisements, and mechanisms that might earn money or subscribers for creators.
Those download techniques violate the platforms’ terms of service, Reisner said. One of the sources named in the investigation is the Free Music Archive dataset, which the reporter notes is free to stream for personal use but requires licensing for commercial applications.
Why it matters
The size of these collections matters because scale shapes what AI models learn: two datasets of 12 million and 9 million tracks represent a vast corpus of music metadata and audio likely to influence music-generation models. Public visibility changes the conversation about consent, licensing, and platform terms because researchers and the public can now see specific songs and artists included in training pools.
The confirmation that Google and Stability have used material from these datasets in research papers ties the question of dataset provenance to major industry actors. The presence of popular and commercially valuable artists in the lists highlights the tension between free-to-stream access and the legal and commercial rules that govern reuse.
What should creators and listeners know?
Creators whose work appears in these sets are in a different rights position depending on the source: some collections draw from free-to-stream archives that still require licensing for commercial use, while other sets point to tracks on streaming platforms, where automated downloads circumvent monetization mechanisms. The searchable index makes it possible for individuals to check whether a particular track appears in the training pools.
What to watch
Look for formal responses from the platforms whose links are used and for any follow-up from institutions that cite these datasets in research. Legal challenges or licensing claims tied to tracks listed in the Atlantic’s database would be a clear next milestone. Also watch whether more research papers explicitly acknowledge which public datasets they used, or whether dataset maintainers change distribution methods in response to scrutiny.
The Atlantic’s searchable tool and Reisner’s documentation bring a rare level of transparency to the opaque supply chain of music used in AI training, putting artists, platforms, and researchers on notice and giving the public an immediate way to inspect the data.
| Dataset | Tracks | Distribution | Notes |
|---|---|---|---|
| Dataset 1 (enormous) | 12,000,000 | List of links to YouTube/Spotify | Downloaded thousands of times |
| Dataset 2 (enormous) | 9,000,000 | List of links to YouTube/Spotify | Google and Stability confirmed usage in research papers |
| Dataset 3 (large) | >100,000 | List of links to YouTube/Spotify | Contains tracks by major and experimental artists |
| Dataset 4 (large, Free Music Archive source) | >100,000 | Free Music Archive and other sources | Free to stream for personal use; requires licensing for commercial use |
Written by The Brieftide · Source: The Verge