three sources, one timeline
a video package usually has more than one audio source feeding the same timeline. transcription has to handle all of them:
- on-camera interviews. the sit-down or stand-up. lavalier or shotgun. usually one speaker per clip, sometimes two. needs a clean transcript for selecting soundbites and a caption file for the final cut.
- package narration. the reporter's voiceover, recorded in a booth or on a field recorder. clean audio, single speaker. transcribed mostly for caption generation and for the station's web copy.
- b-roll nat sound. crowd audio, ambient interviews, the protester yelling near the camera. messy, multi-speaker, often unusable as a full transcript but valuable for finding the one quotable line buried in 20 minutes of nat.
the transcript, the soundbite log, and the caption file all come from the same transcription pass. you shouldn't have to transcribe twice.
the workflow
- extract the audio. transcription doesn't need the picture, and uploading a 4K mp4 wastes upload time. run the file through the extract-audio-from-video tool first: it pulls a clean wav or m4a out of an mp4, mov, mxf, or mkv in seconds, in your browser, with nothing uploaded to us. for video podcasts and most desktop editors, the extracted audio is all the later steps need.
- upload the audio. wav, m4a, mp3, anything. up to 5 GB per file. you can also upload the original video directly; we extract the audio on our end. extracting locally first is faster on a slow connection.
- transcription runs. on a 30-minute interview, the first pass is ready in 1–2 minutes in cloud mode, or in roughly real time in on-device private mode, which is meant for embargoed packages or off-record b-roll.
- fix labels and proper nouns. "speaker 1" becomes "the senator" or the source's actual name. fix a label once and it propagates through the whole transcript. corrected names of people, places, and agencies are remembered for the next file in the same project.
- mark the soundbites. highlight the 8-second cut you want for the package. the timestamp is preserved; the transcript editor exports a soundbite log with file name, in/out timecode, and the text of the bite. paste straight into your script.
- export captions for the editor. .srt for most NLEs, .vtt for the web cut, plain-text transcript for the station's web producer. timecodes match the audio file you uploaded — drop the .srt onto the same clip in premiere or resolve and the captions land on the frame.
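a minimal sketch of what one row of a soundbite log could look like, built from the three fields named above (file name, in/out timecode, text of the bite). the `Soundbite` record, the tab-separated layout, and the `HH:MM:SS.mmm` timecode style are assumptions for illustration, not the product's actual export format.

```python
from dataclasses import dataclass


@dataclass
class Soundbite:
    # hypothetical record: field names are illustrative, not an export schema
    file_name: str
    start_s: float  # in point, seconds from the start of the audio file
    end_s: float    # out point, seconds from the start of the audio file
    text: str


def timecode(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def log_line(bite: Soundbite) -> str:
    """Build one tab-separated log row: file, in, out, text."""
    return "\t".join(
        [bite.file_name, timecode(bite.start_s), timecode(bite.end_s), bite.text]
    )


bite = Soundbite("senator_sitdown.wav", 754.2, 762.2, "we were never consulted.")
print(log_line(bite))
```

because the in and out points are measured from the start of the uploaded audio file, the same numbers line up with the clip on the timeline.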
captions for the timeline
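.srt and .vtt are the two caption formats you will meet almost everywhere. .srt is the safe choice for NLEs such as premiere, resolve, final cut, and avid; .vtt is the format html5 web players read natively. not every NLE imports .vtt and not every web player accepts .srt, so check the import list for your tool. the difference between the files is small: .vtt adds a header line, uses a period instead of a comma before the milliseconds, and has more styling support. we export both from the same transcript with one click. see /formats/captions for the long version on caption formatting, line-length rules, and reading-speed targets.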
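to make the format difference concrete, here is a rough sketch that turns basic .srt text into .vtt by adding the `WEBVTT` header and swapping the comma for a period in each timing line. `srt_to_vtt` is a hypothetical helper, not part of the product, and it assumes plain cues with no styling tags or positioning.

```python
import re

# matches an SRT timing line such as "00:00:01,000 --> 00:00:04,200"
TIMING = re.compile(
    r"^(\d{2,}:\d{2}:\d{2}),(\d{3}) --> (\d{2,}:\d{2}:\d{2}),(\d{3})\s*$"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert basic SRT text to WebVTT.

    Adds the WEBVTT header and rewrites only the timing lines, so commas
    inside the caption text are left alone. The numeric cue indexes stay
    in place; WebVTT treats them as optional cue identifiers.
    """
    out = ["WEBVTT", ""]
    # drop a leading byte-order mark, if any, then surrounding whitespace
    for line in srt_text.lstrip("\ufeff").strip().splitlines():
        m = TIMING.match(line)
        if m:
            line = f"{m.group(1)}.{m.group(2)} --> {m.group(3)}.{m.group(4)}"
        out.append(line)
    return "\n".join(out) + "\n"


srt = (
    "1\n00:00:01,000 --> 00:00:04,200\nwe were told to leave.\n\n"
    "2\n00:00:05,000 --> 00:00:07,500\nno, nobody said why.\n"
)
print(srt_to_vtt(srt))
```

the cue times are untouched apart from the separator, so both files point at the same moments in the audio.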
captions stay in sync with the original audio file. if you extracted the audio from a video first, the timecodes still align with the source video — the extraction doesn't shift the clock.
nat sound and the find-the-one-quote problem
b-roll nat sound is messy by definition. you don't need a publishable transcript of 20 minutes of crowd audio; you need to find the one shouted line that fits the package. the editor's search-and-jump feature is the tool for this: ctrl-f any phrase, jump to that second of audio, click any word to replay just that beat. you can scan 20 minutes of nat in under a minute looking for usable cuts.
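the idea behind search-and-jump can be sketched in a few lines, assuming the transcript is available as a list of words with start times. the `(word, start_seconds)` shape and the `find_phrase` helper are hypothetical; this shows the technique, not the product's internals.

```python
import re
from typing import List, Tuple

# hypothetical shape: (word as transcribed, start time in seconds)
Word = Tuple[str, float]


def normalize(token: str) -> str:
    """Lowercase and drop punctuation so "Leave!" matches "leave"."""
    return re.sub(r"[^\w']", "", token.lower())


def find_phrase(words: List[Word], phrase: str) -> List[float]:
    """Return the start time of every place the phrase occurs."""
    target = [t for t in (normalize(p) for p in phrase.split()) if t]
    if not target:
        return []
    tokens = [normalize(w) for w, _ in words]
    hits = []
    # slide a window the length of the phrase across the word list
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            hits.append(words[i][1])
    return hits


nat = [
    ("They", 12.0), ("told", 12.3), ("us", 12.5), ("to", 12.6), ("leave!", 12.8),
    ("We", 300.1), ("won't", 300.4), ("leave.", 300.9),
]
print(find_phrase(nat, "to leave"))
```

each hit is a jump point in seconds. a search like this only finds what the transcript heard, so a misheard word will not match and the listen-back still matters.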
for nat that's truly multi-speaker (multiple people talking over each other at a press scrum or a protest), the speaker labels will be approximate. clicking a word to replay it is the verification step. don't trust the label without listening.
fact-check before the package airs
every word in the transcript is linked to its second of audio. before the package goes to the desk, click each quoted word in your script — hear the source say it. this catches the homophones, the misheard names, the one-syllable-off transcription errors that auto-tools produce and that an editor wouldn't otherwise catch until air. a 90-second package usually has 6–10 quoted lines; the verification pass takes under two minutes.
pricing for video reporters
$0.25 per minute of audio. a 30-minute interview is $7.50. a 90-second package narration is under $0.40. you pay per file, not per project: a daily reporter cutting six packages in a week pays for the audio they actually transcribed, not for a monthly seat. private mode and cloud mode are the same price. for newsrooms with steady volume, batch pricing arrives after launch. write to hello@audiohighlight.com and tell us what your volume looks like.
waitlist signups get the lifetime deal: first month free, 50% off forever after.
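the per-minute arithmetic above, as a quick sketch. it assumes billing is prorated by the second and rounded to the nearest cent, which is our assumption and not a stated billing rule; rounding up to whole minutes would give a different number for short files.

```python
from decimal import Decimal, ROUND_HALF_UP

RATE_PER_MINUTE = Decimal("0.25")  # dollars per minute, from the pricing above


def cost(seconds: int) -> Decimal:
    """Cost in dollars for one file, prorated by the second (an assumption)."""
    dollars = RATE_PER_MINUTE * Decimal(seconds) / Decimal(60)
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(cost(30 * 60))  # 30-minute interview
print(cost(90))       # 90-second package narration
```

under those assumptions the 30-minute interview comes to 7.50 and the 90-second narration to 0.38, which matches the figures quoted above.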