
open-source audio tools we love.

most transcription companies pretend their software was invented in a vacuum. we didn't. here's what we actually use, and what you should know if you're picking a stack for your own work.

a thing about transcription tools: almost all of them are built on the same handful of open-source pieces. when you pay for transcription, you're usually paying for a wrapper around tools that are themselves free. we'd rather be honest about that — so here's the stack, with credit, and a note on which pieces we use and which we recommend you pick up directly.

audacity

audacityteam.org — the audio editor that's been on every podcaster's laptop since 2000. open source, multi-platform, and still the most-recommended free editor for cleaning up an interview before transcription.

when to reach for it: removing background hum, trimming dead air at the start and end of a recording, splitting a long file into chapters, normalizing volume across speakers who recorded at different levels. audacity won't transcribe for you, but it will do the editing pass that makes the transcript better.

one practical note: audacity's recent ownership change (acquired by Muse Group in 2021) prompted a community fork called tenacity. it's nearly identical, slightly leaner, and has fewer telemetry concerns if that matters to you.

ffmpeg

ffmpeg.org — the swiss army knife of audio and video. if an audio or video format exists, ffmpeg can read it, write it, and convert it to any other. we built our own browser tools on top of it (the extract-audio-from-video tool is ffmpeg.wasm running in your browser).

you don't have to learn ffmpeg's command line to benefit from it — almost every audio app you use is calling it under the hood. but learning a few common invocations is worth an afternoon. converting an mp4 to mp3 is ffmpeg -i input.mp4 -vn -acodec libmp3lame output.mp3. extracting a clip is ffmpeg -i input.mp3 -ss 00:01:30 -to 00:02:00 output.mp3. normalizing volume is ffmpeg -i input.mp3 -af loudnorm output.mp3. that covers maybe 80% of audio prep tasks people pay subscription tools for.
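if you'd rather script those invocations than retype them, here's a minimal python sketch that batch-converts a folder of mp4s to mp3. the folder name and helper functions are ours, not part of ffmpeg, and it assumes the ffmpeg binary is on your PATH:

```python
import subprocess
from pathlib import Path

def to_mp3_cmd(src: Path, dst: Path) -> list[str]:
    """Build the ffmpeg argument list for an mp4 -> mp3 conversion.

    -vn drops the video stream; libmp3lame is ffmpeg's mp3 encoder.
    -y overwrites the output file if it already exists.
    """
    return ["ffmpeg", "-y", "-i", str(src), "-vn",
            "-acodec", "libmp3lame", str(dst)]

def convert_folder(folder: str) -> None:
    """Convert every .mp4 in a folder to an .mp3 alongside it."""
    for src in Path(folder).glob("*.mp4"):
        subprocess.run(to_mp3_cmd(src, src.with_suffix(".mp3")), check=True)
```

keeping the command builder as a pure function makes it easy to tweak flags (or swap in the loudnorm filter) without touching the loop; `convert_folder("interviews")` would then run the whole batch.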

whisper

github.com/openai/whisper — OpenAI's open-source speech recognition model, released in 2022. trained on 680,000 hours of multilingual audio, available for download, runs on a laptop. it's the model that quietly raised the floor for the entire transcription category.

for individuals comfortable with a command line, whisper is worth picking up. pip install -U openai-whisper, point it at a file, get back a transcript. accuracy is in the same band as the major commercial tools for english, better than most for non-english audio. it has no editor, no speaker labels (you'd add diarization separately, see pyannote below), no quote-verification UX. so it's not a replacement for a finished tool — but it's the engine.
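if you want more than the one-shot command line, whisper also has a python API. a minimal sketch, assuming the pip package above is installed: the stamp/formatting helpers are ours, "turbo" is one of the published model names, and the weights download on first run, so the actual transcription call is deferred into its own function:

```python
def stamp(seconds: float) -> str:
    """Format a segment offset as h:mm:ss for a readable transcript."""
    s = int(seconds)
    return f"{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"

def format_segments(segments) -> str:
    """Turn whisper's segment dicts ('start', 'end', 'text') into stamped lines."""
    return "\n".join(f"[{stamp(seg['start'])}] {seg['text'].strip()}"
                     for seg in segments)

def transcribe(path: str) -> str:
    """Transcribe one file with the whisper python API."""
    import whisper  # pip install -U openai-whisper
    model = whisper.load_model("turbo")  # downloads weights on first run
    result = model.transcribe(path)
    return format_segments(result["segments"])
```

the segment dicts are where the value is: per-segment timestamps are what let you jump from a quote back to the audio, which plain text output throws away.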

we use whisper-large-v3-turbo for our cloud-mode transcription and a smaller distilled variant for the in-browser private mode. the same model anyone can download.

openwhispr

github.com/openwhispr/openwhispr — a desktop wrapper around whisper that turns your laptop into a dictation device. press a hotkey, speak, and the transcript pastes into whatever window you have focused. everything runs locally; nothing uploads.

we love this one because it solves a different job from ours — real-time dictation, not file transcription. for a journalist taking quick notes between interviews, or a researcher dictating fieldwork observations, or anyone with a wrist injury who finds typing painful, openwhispr is a better answer than the various paid dictation apps. it's MIT-licensed, contributor-friendly, and the maintainer ships updates regularly.

pyannote.audio

github.com/pyannote/pyannote-audio — speaker diarization, the part whisper doesn't do. given an audio file, pyannote tells you which segments are which speaker. it's what we (and most other commercial transcription tools, including the big ones) use to put speaker labels on a transcript.

for individuals: pyannote alone isn't a finished tool; it needs a wrapper that combines it with a transcription model and an editor. but if you're a researcher building a custom analysis pipeline, pyannote is the right primitive. the code itself is MIT-licensed, and the pretrained models are free but gated behind a Hugging Face access agreement.
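a sketch of what that wrapper's core looks like, assuming whisper-style transcript segments and pyannote-style speaker turns. the overlap heuristic and every name here are ours, and the token is a placeholder; the gated model call is deferred into its own function so the pure part runs anywhere:

```python
def assign_speakers(transcript_segments, speaker_turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    transcript_segments: [(start, end, text), ...]     e.g. from whisper
    speaker_turns:       [(start, end, speaker), ...]  e.g. from pyannote
    """
    labeled = []
    for t_start, t_end, text in transcript_segments:
        best, best_overlap = "UNKNOWN", 0.0
        for s_start, s_end, speaker in speaker_turns:
            overlap = min(t_end, s_end) - max(t_start, s_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append((best, text))
    return labeled

def diarize(path: str):
    """Run pyannote's pretrained diarization pipeline (gated model, needs an HF token)."""
    from pyannote.audio import Pipeline
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN")
    diarization = pipeline(path)
    return [(turn.start, turn.end, speaker)
            for turn, _, speaker in diarization.itertracks(yield_label=True)]
```

most-overlap is the simplest reasonable merge rule; real pipelines also have to decide what to do when a transcript segment straddles a speaker change.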

sox

sox.sourceforge.net — the older sibling of ffmpeg, focused specifically on audio. sox isn't actively developed in 2026 (the last release was 2015), but it still works flawlessly for the audio tasks it covers, and on linux+mac the binary is small and reliable.

where sox shines: noise reduction (build a profile from a noise-only slice with sox noise.wav -n noiseprof noise.prof, then apply it with sox in.wav out.wav noisered noise.prof), batch volume normalization, and silence trimming (sox in.wav out.wav silence 1 0.5 1%). the syntax is older-feeling than ffmpeg's but the output is predictable. for a journalist with a stack of phone-quality interview recordings, the sox noise-reduction pass is a real win before transcription.
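the noise-reduction pass is really two commands: learn a profile from a noise-only slice (a second of room tone works), then subtract it from the full recording. a minimal python sketch, assuming sox is on your PATH; the file names and the 0.21 sensitivity are illustrative:

```python
import subprocess

def noiseprof_cmd(noise_sample: str, profile: str) -> list[str]:
    """Step 1: learn a noise profile from a noise-only slice of audio."""
    return ["sox", noise_sample, "-n", "noiseprof", profile]

def noisered_cmd(src: str, dst: str, profile: str,
                 amount: float = 0.21) -> list[str]:
    """Step 2: subtract that profile; amount is sensitivity, tuned by ear."""
    return ["sox", src, dst, "noisered", profile, str(amount)]

def denoise(src: str, noise_sample: str, dst: str) -> None:
    """Full two-step pass (requires the sox binary on PATH)."""
    subprocess.run(noiseprof_cmd(noise_sample, "noise.prof"), check=True)
    subprocess.run(noisered_cmd(src, dst, "noise.prof"), check=True)
```

push the sensitivity too high and voices start to sound underwater, so it's worth trying a few values on a short clip before running the whole batch.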

kdenlive

kdenlive.org — open-source video editor that we recommend whenever a friend asks "what's the free version of premiere or final cut?" it's not perfect, but it's far less awkward than the alternatives, and the audio editing inside it is competent enough for podcast video versions and YouTube creators.

we list it here because video and audio share more workflow than people assume. for podcasters making a video version, kdenlive handles the audio waveform editing alongside the video timeline, and it imports/exports cleanly into any downstream transcription workflow.

wavesurfer.js

for the developers in the audience: if you're building your own browser tool that needs to render audio waveforms (we do, for our editor), wavesurfer.js is the canonical library. handles waveform rendering, regions, plugin architecture for audio-aware UI. used by most browser-based audio tools that have a waveform; used by us.

what we ship that isn't open source

for honesty: the closed parts of our stack are the editor itself (the workspace UI for fixing speaker labels, the quote-verification interface, the format-specific exports), and a small layer of fine-tuning on top of whisper for domain-specific vocabulary (legal, clinical, conversation analysis). everything underneath that is the open-source stack listed above.

if you're an individual or a small team doing your own transcription work and you don't need our editor, the open stack will get you most of the way. you'll spend a weekend getting comfortable with the pieces, but it'll work forever, locally, with no subscription.

we built audiohighlight for the people who'd rather not spend that weekend. but the open tools deserve the credit either way.

