My comedy archive includes hundreds of podcasts stretching back to the medium’s infancy, and I continue to actively preserve dozens of shows each week.1 I also routinely listen to and transcribe podcasts in the course of my research, so I’m constantly saving and organising local copies of audio files. Since podcasting is ~~haphazardly~~ elegantly built on top of RSS, performing such tasks manually is basically a waking nightmare.
Enter yt-dlp, a command line tool for downloading video and audio files. Originally designed to rip videos from YouTube (hence the name), it has grown to become an indispensable general-purpose media preservation tool with support for downloading media from literally thousands of websites and platforms. As command line tools go it’s exceptionally easy to use: you provide a URL, set some optional parameters, and — as long as the media is DRM-free — it dutifully downloads whatever it’s been pointed at.
For researchers dealing with podcasts, yt-dlp’s most powerful feature is its ability to batch download and process files directly from an RSS feed, reducing days or weeks of punishingly repetitive manual labour down to a single well-crafted shell command.
In case it’s helpful to podcast scholars and others, I thought I would share my workflow for slurping (i.e., batch downloading and processing) entire podcasts automatically using yt-dlp.
First, a note
I’m not a professional archivist. I am not trained in information science, and I’m sure any GLAM worker could name several ways in which this workflow goes against best practices for archiving and preservation. (If that’s you, please suggest ways I could improve it!) But as a media scholar whose work often involves saving local copies of podcasts, YouTube channels, and other digital media, it has proven invaluable so far.
In the interest of brevity I will assume you already know how to install and use command line tools in your operating system of choice. For macOS and Linux users I recommend installing yt-dlp via Homebrew, which also installs its dependencies and a few other useful components (e.g. ffmpeg), and which makes updating much easier. yt-dlp is frequently updated to accommodate changes on any of the sites it can download from, so an easy update mechanism is an absolute must.
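In case it helps, the Homebrew route looks something like this (a minimal sketch, assuming Homebrew itself is already installed; `yt-dlp` is the formula name):

```sh
# Install yt-dlp; Homebrew also pulls in its dependencies, including ffmpeg
brew install yt-dlp

# Pick up the frequent upstream updates later on
brew upgrade yt-dlp
```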
The first task when slurping a podcast is to locate its RSS feed. Usually all this requires is searching the web for “[podcast title] rss”, but recently many hosts have been eager to make it as difficult as possible to access a podcast’s raw RSS feed, for some reason. If all you can find is a listing in a directory like Apple Podcasts, online tools like GetRSSFeed can extract the URL for you. What you want is a URL pointing to the raw RSS feed, consisting of a bunch of lines of XML with tags like `<item>` and `<enclosure>`.
![A screenshot of an RSS feed](https://bradleyjdixon.me/wp-content/uploads/2024/12/Screenshot-2024-12-30-at-14.18.24-1024x726.png)
Preserving a podcast with yt-dlp
If your goal is to preserve a podcast for archival purposes, you probably want to save every piece of data associated with the podcast in as close to its original state as possible.
In that case, simply point yt-dlp at the RSS feed like so:
yt-dlp "https://www.3cr.org.au/yeahnahpasaran/itunes"
… and it will traverse the feed and save every `<enclosure>` (audio or video file) into yt-dlp’s default directory without processing them in any way.
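If you also want to capture each episode’s metadata and artwork alongside the untouched audio, yt-dlp can write those out too. A minimal sketch using the same example feed (flag names as documented in current yt-dlp releases):

```sh
# Save each episode as-is, plus its metadata as a .info.json file and any episode artwork
yt-dlp --write-info-json --write-thumbnail "https://www.3cr.org.au/yeahnahpasaran/itunes"
```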
There are some caveats to consider when preserving podcasts using this method. Firstly, it’s a problematic approach for podcasts that use Dynamic Ad Insertion (DAI), which delivers customised files to each listener in an effort to serve them tailored advertising. Since the ad insertion occurs server-side, there is no easy way to block or avoid it. DAI is a scourge for many reasons,2 but most relevant here is that it prevents archivists and researchers from saving bit-perfect copies of a podcast in its original form. So far I’ve been using Fission to manually cut ads out of DAI-riddled podcasts, but I’m experimenting with some options to automate this part of the process too.
Even in cases where DAI isn’t used, there are pitfalls to slurping a podcast using this method. By default, yt-dlp will save a copy of each file exactly as it exists on the server. Since podcasters are ~~trash humans~~ busy creatives just trying to survive in the content mines, this means you will often end up with episodes in a mixture of file formats, with inscrutable or inconsistent titles, spotty or non-existent ID3 tags, and a host of other issues. Because the metadata needed to fix these issues lives in an entirely separate RSS file, with no easy way to browse, search, or map data to the correct field, these files can be extremely difficult to work with. Helpfully, yt-dlp can fix many of these inconsistencies for you during download.
Normalising podcast metadata for ease of use
If you don’t care about perfectly preserving a podcast in its original state, you can pass yt-dlp a number of options to normalise any inconsistencies in the source files, up to and including transcoding audio if necessary. In this post I’m going to focus on the two functions I find most valuable when working with podcasts: normalising filenames and populating ID3 tags, both of which can be performed within a single yt-dlp command.
I’ll begin by presenting the yt-dlp command I use when downloading episodes of Comedy Bang! Bang!, then break down what each component of the command does.
yt-dlp -o "~/_yt-dlp/%(playlist_title)s/%(normalised_title)s.%(ext)s" --parse-metadata "playlist_title:%(artist)s" --parse-metadata "playlist_title:%(album_artist)s" --parse-metadata "playlist_title:%(album)s" --parse-metadata "description:%(meta_composer)s" --parse-metadata "%(upload_date>%Y-%m-%d)s — #%(episode_number)s\: %(title)s:%(normalised_title)s" --embed-metadata "https://cbbworld.memberfulcontent.com/rss/10368?auth=[redacted]"
`yt-dlp -o "~/_yt-dlp/%(playlist_title)s/%(normalised_title)s.%(ext)s"`

`yt-dlp` calls the program and `-o` sets the output path and filename. The text `/%(playlist_title)s/` in the path means that files are saved in a directory matching the playlist’s (i.e., the podcast’s) title, derived from the `<title>` in the feed. `%(normalised_title)s.%(ext)s` is a variable I construct later in the same yt-dlp command using the `--parse-metadata` flag, but you can use the unaltered title instead with `%(title)s.%(ext)s`.
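For instance, a pared-down version that skips the normalised title and simply reuses each episode’s own title as the filename might look like this (a sketch only, reusing the 3CR feed from earlier):

```sh
# Save each episode under the podcast's title, named with the episode's own title
yt-dlp -o "~/_yt-dlp/%(playlist_title)s/%(title)s.%(ext)s" "https://www.3cr.org.au/yeahnahpasaran/itunes"
```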
`... --parse-metadata "playlist_title:%(artist)s" --parse-metadata "playlist_title:%(album_artist)s" --parse-metadata "playlist_title:%(album)s" --parse-metadata "description:%(meta_composer)s" --parse-metadata "%(upload_date>%Y-%m-%d)s — #%(episode_number)s\: %(title)s:%(normalised_title)s"`

`--parse-metadata` is an option that tells yt-dlp how to interpret the various fields of data it reads from the RSS file; you specify how yt-dlp should interpret a field by mapping it using the format “[source field]:[destination field]”. `--parse-metadata` can be used to manipulate most of the fields relevant to podcast researchers, though they are sometimes unintuitively named. The yt-dlp documentation contains a list of the standard metadata fields it reads from a source, as well as how it derives them.
As you can see, I use `--parse-metadata` to map the podcast’s `<title>` to three fields (`artist`, `album_artist`, and `album`) for consistency. I also map each episode’s `<description>` to the `composer` field, which is probably confusing at first glance. I do this because there appears to be a bug in yt-dlp that causes data saved to the `description` field to be truncated. I couldn’t figure out what was causing this issue, but I found that using `composer` captured the whole description, so I changed my workflow to use `composer` and manually copy the data across to `description` during post-processing.3
The final `--parse-metadata` parameter constructs a normalised title for each episode using the `<pubDate>`, `<itunes:episode>` (if found), and `<title>` tags in the feed. (Note the use of a backslash `\` to escape the `:`, which would otherwise be read by yt-dlp as the separator between the source and destination fields.)

This command results in titles formatted like this: `2024-11-11 — #890: Old Money (James Acaster, Lily Sullivan, Matt Apodaca).mp3`

This normalised title is assigned to the variable `%(normalised_title)s`, ready to be written to the episode’s `title` field and used as the file’s name on disk.
`... --embed-metadata`

The `--embed-metadata` flag is easy to miss but by far the most important: it instructs yt-dlp to embed the metadata we’ve just interpreted into each output file as it saves it to disk.
... "https://cbbworld.memberfulcontent.com/rss/10368?auth=[redacted]"
Finally, we specify the URL of the feed we want yt-dlp to process. I surround the URL in quotation marks in case it contains any characters likely to be misinterpreted (such as dashes).
After running this command — and waiting for yt-dlp to traverse and download every episode — you will have a folder containing every audio file yt-dlp was able to extract from the feed, each consistently named and tagged.
Post-processing
After slurping a podcast I open the output folder in Meta for manual post-processing. Ideally this step wouldn’t be necessary, but I haven’t quite perfected my yt-dlp command so there are a couple of things that need to be cleaned up.
First, I batch copy data from each episode’s `composer` field back to `description`, where it should be. I also use find-and-replace to strip out HTML tags, tracking URLs, and ads inserted by the podcast’s hosting platform into each episode’s description.4
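This step could probably be scripted. Here is a rough, hypothetical sketch (not part of my actual workflow) that copies the composer tag into the comment and description tags for every MP3 in a folder, using the ffprobe and ffmpeg binaries that come with ffmpeg; the exact ID3 frames written may vary between tools:

```sh
# Hypothetical helper: for each mp3, read its composer tag and re-mux the file
# with that text copied into the comment and description tags (audio untouched).
for f in *.mp3; do
  desc=$(ffprobe -v quiet -show_entries format_tags=composer -of default=nw=1:nk=1 "$f")
  ffmpeg -loglevel error -i "$f" -c copy \
    -metadata comment="$desc" -metadata description="$desc" "fixed_$f"
done
```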
Finally, since on most operating systems filenames cannot contain the characters : and / — often used in episode titles — yt-dlp replaces them with valid alternatives which look ugly as hell. I use Meta or Better Rename to remove or replace these characters.
Using this workflow, downloading and tagging a complete podcast takes around 15 minutes of active attention — much more reasonable than the dozens of hours it would take to manually process a podcast with as many episodes as Comedy Bang! Bang!.5
Other handy yt-dlp options
- If a podcast episode has an especially long title it can cause issues when saving the output file. Trim filenames to a more sensible length by passing the option `--trim-filenames n`, where n is the desired number of characters. You can similarly limit the length of any variable: `%(title).nB.%(ext)s`.
- `--playlist-end n` will save only the first n items in a feed. This is useful for downloading only the episodes that are new since the last time you archived a podcast.
- The same thing can also be achieved using a filter, which is usually easier than manually counting how many episodes you want to download. If you know the `<pubDate>` of the most recent episode you downloaded, you can set yt-dlp to download only episodes published after that date by passing the option `--match-filter "upload_date>=?YYYY-MM-DD"`. (A related approach is sketched below.)
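One more option for incremental slurping, offered as a sketch rather than something from my own workflow: yt-dlp’s `--download-archive` flag records everything it has already downloaded in a text file and skips those entries on later runs, and `--dateafter` filters by publication date in YYYYMMDD format (the archive filename and date below are placeholders):

```sh
# Re-running this against the same feed and archive file fetches only new episodes
yt-dlp --download-archive archive.txt --dateafter 20241101 "https://www.3cr.org.au/yeahnahpasaran/itunes"
```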
Footnotes
1. Before Serial invented podcasts, comedy was central to the medium’s early growth through podcasts like The Ricky Gervais Show and Matt Belknap’s AST Radio. Any “Mount Podmore” without Belknap on it is, in my opinion, incomplete.
2. DAI ruins the listening experience, breaks streaming and caching, and makes it impossible to use timestamps to refer to specific moments in a podcast. It also helps unscrupulous networks screw over creatives: Gimlet used DAI to clumsily replace ads in Mystery Show after firing Starlee Kine, ditching her lovingly crafted sketches about Kind Snacks — often highlighted as exemplars of why listeners prefer host-read ads — with shitty generic spots for Squarespace.
3. For obvious reasons this sucks — if you have any insight into why this error occurs or how I can fix it, let me know.
4. Again, do not do this if downloading a podcast for archival/preservation purposes.
5. In the case of CBB specifically I also manually edit each title to include not just the performers featured in that episode but also the characters they play. In my archive episode #890 is actually titled “2024-11-11 — #890: Old Money (James Acaster, Lily Sullivan [Mrs. Lyndhurst], Matt Apodaca [Ricky Johnson])”. Doing this allows me to make smart playlist collections of each character’s appearances on the show. Somehow, it took until I was nearly 40 to be diagnosed with Autism.