benchmark · 3 min read

we measure what users actually pay for: cleanup time.

word error rate doesn't predict whether you'll spend two hours or twenty fixing a transcript. so we measure something closer to the job: minutes of cleanup per hour of audio. on a published corpus, head-to-head, reproducible.

why a new metric

the standard transcription metric is word error rate (WER) — the percentage of words the model gets wrong against a reference transcript. WER is useful for model researchers and almost useless for buyers. it tells you nothing about speaker attribution, formatting, paragraph breaks, custom vocabulary, or the editor that hosts the transcript afterwards.
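for the record, the formula itself is simple arithmetic: substitutions plus deletions plus insertions, divided by the number of reference words. a minimal sketch (the textbook levenshtein formulation, not any vendor's scorer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # classic levenshtein dynamic program over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 0.333: two deletions over six words
```

note what's missing from that number: it never sees a speaker label, a paragraph break, or a misspelled name that takes thirty seconds to hunt down.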

users don't think in WER. users think in time. how long does it take to go from a delivered transcript to one you'd publish, file in court, paste into a paper, or hand to a producer? that number — minutes of cleanup per hour of audio — is what we benchmark.

methodology

the corpus

our reference corpus is six audio files, one for each of the six jobs-to-be-done that drive most paid transcription:

- an interview you'd publish (journalism)
- a deposition you'd file in court (legal; synthetic, see the ethics note)
- a research interview you'd quote in a paper (academic)
- an episode you'd hand to a producer (podcast)
- a therapy session (synthetic, see the ethics note)
- a medical recording (synthetic, see the ethics note)

the corpus is published under creative-commons licenses; where consent for real audio could not be obtained, the file is synthetic and flagged explicitly. reference transcripts are produced by professional human transcribers and double-checked.
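for concreteness, a corpus entry might look something like this. the field names here are ours, illustrative only, not the published schema:

```python
# hypothetical manifest entry -- illustrative field names, not the published schema
corpus_entry = {
    "id": "deposition-01",
    "audio": "audio/deposition-01.flac",
    "reference": "reference/deposition-01.txt",
    "license": "CC-BY-4.0",
    "synthetic": True,  # flagged explicitly where real audio wasn't ethical to use
    "source_note": "court opinions read aloud; see the ethics note",
}
```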

the score

for each tool and each file we measure four things:

- time to fix speaker labels
- time to fix proper nouns and technical vocabulary
- time to insert correct paragraph breaks
- time to verify a random sample of 20 quotes against the audio

the sum, divided by the audio length, is the cleanup-time-to-audio-time ratio. our target: under 5%, i.e. less than three minutes of cleanup per hour of audio. temi's reported median falls in the 25–40% range.
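the arithmetic is deliberately simple. a minimal sketch of the score, assuming times are logged in minutes (the published harness will be more involved, but it reduces to this):

```python
def cleanup_ratio(fix_speakers_min: float, fix_vocab_min: float,
                  fix_paragraphs_min: float, verify_quotes_min: float,
                  audio_min: float) -> float:
    """cleanup-time-to-audio-time ratio: total editor minutes / audio minutes."""
    total_cleanup = (fix_speakers_min + fix_vocab_min
                     + fix_paragraphs_min + verify_quotes_min)
    return total_cleanup / audio_min

# e.g. a 60-minute file that took 12 + 6 + 3 + 4 = 25 minutes to clean
print(f"{cleanup_ratio(12, 6, 3, 4, 60):.0%}")  # 42% -- well above the 5% target
```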

tools tested

audiohighlight, temi, rev (AI tier), sonix, otter, plus three open-source baselines (whisper-large-v3-turbo, distil-whisper, parakeet-tdt-0.6b). the humans cleaning the transcripts are the same across all tools, and they work blind to which tool produced which transcript.
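blinding is the part most benchmarks skip. a sketch of one way to do it (`blind_assignment` is our hypothetical name, not the published harness): relabel each delivered transcript with an opaque id and keep the id-to-tool key sealed until scoring.

```python
import random
import secrets

def blind_assignment(transcripts: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """relabel tool-named transcripts with opaque ids so cleaners can't tell tools apart."""
    tools = list(transcripts)
    random.shuffle(tools)  # so ordering leaks nothing either
    key = {f"sample-{secrets.token_hex(4)}": tool for tool in tools}
    blinded = {sample_id: transcripts[tool] for sample_id, tool in key.items()}
    return blinded, key  # cleaners get `blinded`; `key` stays sealed until scoring

blinded, key = blind_assignment({"otter": "...", "sonix": "...", "rev": "..."})
```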

where the results live

the first round of results drops with launch. the corpus, reference transcripts, the cleanup harness, and the per-tool delivered transcripts will be published the same day, under permissive licenses, so anyone can reproduce or extend the benchmark.

if a tool we benchmarked believes a result is wrong, the path to a correction is simple: send us the audio file, the cleaning protocol you used, and the time it took. we'll re-run, and if our number is wrong we'll publish the correction.

ethics note

we do not benchmark on real therapy sessions, real depositions, or real medical audio under any circumstances. for those job shapes we use synthetic audio produced from transcripts of published, public-domain material (court opinions read aloud for the deposition file; a published self-help dialogue rewritten for the therapy file). this is flagged in the corpus documentation and tools are tested on the same synthetic audio.

head-to-head writeups (preview)

per-tool writeups land with the first round of results at launch.

lifetime deal while we're in beta

join the waitlist to get a lifetime deal — your first month free, plus 50% off forever. private invite when we ship; no drip campaign.