documenting (nearly) all the music ive ever listened to

ive been listening to music all my life, it is the one thing that is always going on. theres just something in me that requires sound. ive gotten really into lastfm and logging my music a bit more than 5 years ago. and i intend to keep doing that. however, there are periods in my life that are missing data, whether that be because i didnt know lastfm existed or just that i forgot to scrobble (last.fm's term for logging a play). however, luckily i have a 32 mb json file detailing all the music ive listened to from 2015 to 2025, so im going to try to import that and my lastfm data into koito.

the 32mb file

ive had a variety of spotify and apple music accounts over the years, however, ive tried to do GDPR takeouts as much as i could to keep all of my data. those takeouts eventually made their way to a service called stats.fm, its similar to lastfm but more flashy. and they also take data backups to fill all the music youve listened to. i used them for a few years, in addition to lastfm to compare with my friends. a few months ago in 2025 i made a GDPR take out to request all of my data, and thats where the 32mb file comes from.

the file is a really just a massive json file. it follows this format:

[
  {
    /* irrelevant data */
    "id": "/* user uuid */"
    "streams": [
      {
        "id": "",
        "userId": "/* id from earlier */",
        "endTime": "ISO 8601",
        "playedMs": 0,
        "trackName": "Artist - Song Name"
      }
    ]
  }
]

overall that file has nearly everything about my profile in it, but i only care about the streams. however, koito doesn't take stats.fm exports so i need to convert it to a format that it does take. going to their import music documentation page shows the few avenues i have. i eventually settled on the spotify format as 1. it wont have any issues with song length (and i have this info from the stats.fm export) and 2. its a format i can replicate simply since i already have some spotify files to reference.

the spotify file(s)

spotify exports follow the format below. saved in json files that include Streaming_History_Audio in the name. overall not too bad to replicate

[  
  {
    "ts": "/* iso timestamp */",
    "platform": "windows",
    "ms_played": 0,
    "conn_country": "US",
    "ip_addr": "x.x.x.x",
    "master_metadata_track_name": "",
    "master_metadata_album_artist_name": "",
    "master_metadata_album_album_name": "",
    "spotify_track_uri": "",
    "episode_name": null,
    "episode_show_name": null,
    "spotify_episode_uri": null,
    "audiobook_title": null,
    "audiobook_uri": null,
    "audiobook_chapter_uri": null,
    "audiobook_chapter_title": null,
    "reason_start": "clickrow",
    "reason_end": "endplay",
    "shuffle": false,
    "skipped": true,
    "offline": false,
    "offline_timestamp": 1745874272,
    "incognito_mode": false
  },
]

but! holy crap thats a lot of data that we dont care about. however, we have a slight issue, it requires the album name, our data export from stats.fm does not include that crucial bit of info. so were going to have to look it up from a database. luckily my friend has a copy of a bunch of music metadata, from a particular streaming service that shall not be named, in a database that i can easily reference.

// once duckdb is loaded im going to run some test statements to see if my idea works. i grabbed an example scrobble from my data that lacks the artist field, bremen. however the name is unique enough that i remember the artist to be pigpen theatre. now time to find it only using the metadata that i have!

once i get the database all set, im going to do a quick test to see if i can find the album with the information i have available. the database has a tracks table that includes a name and duration_ms column that i can use to search. for the duration_ms column, im going to use a range of times, rounding up and down the duration logged in the original file. in case that the metadata i have doesnt match with the logged metadata.

im going to test using a song called Bremen by PigPen Theatre as it has an unique name. heres the following queries i ran:)

D. SELECT * FROM tracks WHERE name = 'Bremen' AND duration_ms BETWEEN 194000 AND 200000;

┌───────────┬──────────────────────┬───┬─────────────┬─────────────┬──────────┐
│   rowid   │          id          │ … │ disc_number │ duration_ms │ explicit │
│   int64   │       varchar        │   │    int64    │    int64    │  int64   │
├───────────┼──────────────────────┼───┼─────────────┼─────────────┼──────────┤
│  76109862 │ 5CSHAq8NHDcCofVzAE…  │ … │           1 │      194986 │        0 │
│  88080105 │ 5IRCU93Wvs04UHTFXZ…  │ … │           1 │      195014 │        0 │
│ 129700009 │ 2K5Tylpv0maglZYFgN…  │ … │           1 │      197299 │        0 │
│ 179353700 │ 2sEvrJa0UTLfwmG9vh…  │ … │           1 │      197299 │        0 │
│ 179386255 │ 7DYxii5xYNYTu8y0ZL…  │ … │           1 │      197299 │        0 │
│ 190351216 │ 0hQ3melkXzeGVFyZx3…  │ … │           1 │      198335 │        0 │
│ 234620527 │ 3W2oQQZ1OyT8hd6Y82…  │ … │           1 │      195014 │        0 │
├───────────┴──────────────────────┴───┴─────────────┴─────────────┴──────────┤
│ 7 rows                                                 13 columns (5 shown) │
└─────────────────────────────────────────────────────────────────────────────┘

D. SELECT album_rowid FROM tracks WHERE rowid = '76109862';

┌─────────────────┐
│   album_rowid   │
│      int64      │
├─────────────────┤
│    13850256     │
│ (13.85 million) │
└─────────────────┘

D. SELECT * FROM artist_albums WHERE album_rowid = '13850256';

┌──────────────┬─────────────┬───────────────┬────────────────────────┬────────────────┐
│ artist_rowid │ album_rowid │ is_appears_on │ is_implicit_appears_on │ index_in_album │
│    int64     │    int64    │     int64     │         int64          │     int64      │
├──────────────┼─────────────┼───────────────┼────────────────────────┼────────────────┤
│   4716903    │  13850256   │       0       │           0            │       0        │
└──────────────┴─────────────┴───────────────┴────────────────────────┴────────────────┘

D. SELECT * FROM artists WHERE rowid = 4716903;

┌─────────┬────────────────────────┬───────────────┬────────────────────┬─────────────────┬────────────┐
│  rowid  │           id           │  fetched_at   │        name        │ followers_total │ popularity │
│  int64  │        varchar         │     int64     │      varchar       │      int64      │   int64    │
├─────────┼────────────────────────┼───────────────┼────────────────────┼─────────────────┼────────────┤
│ 4716903 │ 2bp8QF1JExz4O3uINxPSPS │ 1743033600000 │ PigPen Theatre Co. │      9185       │     31     │
└─────────┴────────────────────────┴───────────────┴────────────────────┴─────────────────┴────────────┘

and from just the track name, and a duration range i was able to pinpoint the album, and the artist. while i should have access to the artist in the original metadata, theres a few (15k out of 151k) logged plays that lack an artist field, so i planned not to depend on it.

that took many more hours than i expected

ive been working on this project off and on for nearly a year now. and this has been the first time where i genuinely think its achievable, due to the metadata dump from my friend. but, that does not mean it was easy to finally get it done. i ran through at least 20 iterations of handwritten code. that code would simply not work, or would take longer than what was reasonable (predicted to finish after seven days running on a beefy mac). so ultimately, i had to change some principals, and realize im not a data science and comp sci major. but a technical theatre major, and use all the resources i had available. so begrudenly i turned to claude and gemini and they were able to pump out iterations that actually worked. and after 3 days of trial and error, running code, and giving feedback. i finally had 139k out of 151k tracks correctly tagged. and honestly, thats good enough. because there is so so so much thats lost to time. and i just want to be done with this project.

importing!!!!

then it was time to import. i loaded up docker, got koito, and put my files in. a combination of my last.fm export, and the claude / gemini helped file. and it worked. and after many many many hours of importing, its near complete. with a near perfectly accurate life time stats for (nearly) all the music ive ever listened to. in case you ever want to examine my music taste growing go here and feel free to judge me. i certainly do

im going to go cry tears of joy. good night

ps. i have lots of projects planned, combining the random collection of metadata my friend has, and my past streams to try to make an algorithm that recommends playlists (follow @jules@pony.social for more info)