benchmark · 3 min read

we measure what users actually pay for: cleanup time.

word error rate doesn't predict whether you'll spend two hours or twenty fixing a transcript. so we measure something closer to the job: minutes of cleanup per hour of audio. on a published corpus, head-to-head, reproducible.

why a new metric

the standard transcription metric is word error rate (WER) — the percentage of words the model gets wrong against a reference transcript. WER is useful for model researchers and almost useless for buyers. it tells you nothing about speaker attribution, formatting, paragraph breaks, custom vocabulary, or the editor that hosts the transcript afterwards.
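for the record, the formula itself is simple arithmetic: substitutions plus deletions plus insertions, divided by the number of reference words. a minimal sketch (the textbook levenshtein formulation, not any vendor's scorer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # classic levenshtein dynamic program over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 0.333: two deletions over six words
```

note what's missing from that number: it never sees a speaker label, a paragraph break, or a misspelled name that takes thirty seconds to hunt down.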

users don't think in WER. users think in time. how long does it take to go from a delivered transcript to one you'd publish, file in court, paste into a paper, or hand to a producer? that number — minutes of cleanup per hour of audio — is what we benchmark.

methodology

the corpus

our reference corpus is six audio files, one for each of the six jobs-to-be-done that drive most paid transcription:

- an interview you'd publish (journalism)
- a deposition you'd file in court (legal; synthetic, see the ethics note)
- a research interview you'd quote in a paper (academic)
- an episode you'd hand to a producer (podcast)
- a therapy session (synthetic, see the ethics note)
- a medical recording (synthetic, see the ethics note)

the corpus is published under creative-commons licenses; where consent for real audio could not be obtained, the file is synthetic and flagged explicitly. reference transcripts are produced by professional human transcribers and double-checked.
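for concreteness, a corpus entry might look something like this. the field names here are ours, illustrative only, not the published schema:

```python
# hypothetical manifest entry -- illustrative field names, not the published schema
corpus_entry = {
    "id": "deposition-01",
    "audio": "audio/deposition-01.flac",
    "reference": "reference/deposition-01.txt",
    "license": "CC-BY-4.0",
    "synthetic": True,  # flagged explicitly where real audio wasn't ethical to use
    "source_note": "court opinions read aloud; see the ethics note",
}
```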

the score

for each tool and each file we measure four things:

- time to fix speaker labels
- time to fix proper nouns and technical vocabulary
- time to insert correct paragraph breaks
- time to verify a random sample of 20 quotes against the audio

the sum, divided by the audio length, is the cleanup-time-to-audio-time ratio. our target: under 5%, i.e. less than three minutes of cleanup per hour of audio. temi's reported median falls in the 25–40% range.
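the arithmetic is deliberately simple. a minimal sketch of the score, assuming times are logged in minutes (the published harness will be more involved, but it reduces to this):

```python
def cleanup_ratio(fix_speakers_min: float, fix_vocab_min: float,
                  fix_paragraphs_min: float, verify_quotes_min: float,
                  audio_min: float) -> float:
    """cleanup-time-to-audio-time ratio: total editor minutes / audio minutes."""
    total_cleanup = (fix_speakers_min + fix_vocab_min
                     + fix_paragraphs_min + verify_quotes_min)
    return total_cleanup / audio_min

# e.g. a 60-minute file that took 12 + 6 + 3 + 4 = 25 minutes to clean
print(f"{cleanup_ratio(12, 6, 3, 4, 60):.0%}")  # 42% -- well above the 5% target
```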

tools tested

audiohighlight, temi, rev (AI tier), sonix, otter, plus three open-source baselines (whisper-large-v3-turbo, distil-whisper, parakeet-tdt-0.6b). the humans cleaning the transcripts are the same across all tools, and they work blind to which tool produced which transcript.
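blinding is the part most benchmarks skip. a sketch of one way to do it (`blind_assignment` is our hypothetical name, not the published harness): relabel each delivered transcript with an opaque id and keep the id-to-tool key sealed until scoring.

```python
import random
import secrets

def blind_assignment(transcripts: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """relabel tool-named transcripts with opaque ids so cleaners can't tell tools apart."""
    tools = list(transcripts)
    random.shuffle(tools)  # so ordering leaks nothing either
    key = {f"sample-{secrets.token_hex(4)}": tool for tool in tools}
    blinded = {sample_id: transcripts[tool] for sample_id, tool in key.items()}
    return blinded, key  # cleaners get `blinded`; `key` stays sealed until scoring

blinded, key = blind_assignment({"otter": "...", "sonix": "...", "rev": "..."})
```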

where the results live

the first round of results drops with launch. the corpus, reference transcripts, the cleanup harness, and the per-tool delivered transcripts will be published the same day, under permissive licenses, so anyone can reproduce or extend the benchmark.

if a tool we benchmarked believes a result is wrong, the path to a correction is simple: send us the audio file, the cleaning protocol you used, and the time it took. we'll re-run, and if our number is wrong we'll publish the correction.

ethics note

we do not benchmark on real therapy sessions, real depositions, or real medical audio under any circumstances. for those job shapes we use synthetic audio produced from transcripts of published, public-domain material (court opinions read aloud for the deposition file; a published self-help dialogue rewritten for the therapy file). this is flagged in the corpus documentation and tools are tested on the same synthetic audio.

head-to-head writeups (preview)

per-tool writeups land with the first round of results at launch.

lifetime deal while we're in beta

join the waitlist to get a lifetime deal — your first month free, plus 50% off forever. private invite when we ship; no drip campaign.