what good caption export actually looks like
a working .srt or .vtt file has to satisfy several constraints at once. the timing has to be second-accurate to the video. each caption block has to fit on screen without overflowing: that's typically two lines of ~32 characters each, but the exact limit depends on the platform (youtube, broadcast, accessibility-team standards). line breaks have to land at sentence or clause boundaries, not mid-phrase. speaker labels need to show who's speaking whenever the speaker changes. and reading speed has to be sustainable: a viewer with an average reading rate should be able to read each caption before it disappears.
most generic .srt exports satisfy maybe one of these. the others fall on the captioner.
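to make the line-break constraint concrete, here is a minimal python sketch of splitting one caption into two lines at a clause boundary. it is an illustration under stated assumptions, not our production breaker: the 32-character limit, the conjunction list, and the names `split_two_lines` and `CLAUSE_BREAK` are all hypothetical.

```python
import re

MAX_CHARS = 32  # assumed per-line limit; the real value depends on the platform

# assumed break points: after punctuation, or just before a common conjunction
CLAUSE_BREAK = re.compile(r"(?<=[,;:.!?])\s+|\s+(?=(?:and|but|or|so|because)\b)")


def _most_balanced(pairs, max_chars):
    """return the most even two-line split where both lines fit, or None."""
    fits = [
        (abs(len(a) - len(b)), a, b)
        for a, b in pairs
        if a and b and len(a) <= max_chars and len(b) <= max_chars
    ]
    if not fits:
        return None
    _, first, second = min(fits)
    return [first, second]


def split_two_lines(text, max_chars=MAX_CHARS):
    """split caption text into at most two lines, preferring clause boundaries."""
    text = " ".join(text.split())  # normalize whitespace to single spaces
    if len(text) <= max_chars:
        return [text]
    clause_pairs = [(text[:m.start()], text[m.end():]) for m in CLAUSE_BREAK.finditer(text)]
    word_pairs = [(text[:m.start()], text[m.end():]) for m in re.finditer(" ", text)]
    # try clause boundaries first; fall back to any word boundary
    lines = _most_balanced(clause_pairs, max_chars) or _most_balanced(word_pairs, max_chars)
    if lines is None:
        raise ValueError("text does not fit in two lines; split it into two captions")
    return lines
```

a fixed character count would cut wherever the limit happens to fall; preferring the clause boundary and only then falling back to a word boundary keeps each line readable on its own.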
what we ship
- second-accurate timing from the model's word-level timestamps. each caption block starts when its first word starts and ends when its last word ends, with no drift across the file.
- configurable line length — 32 characters (broadcast / WCAG default), 42 characters (youtube), or custom. set per-project; remembered for the next file.
- clean sentence-boundary breaks so captions don't end mid-clause. when a sentence is too long to fit, the break lands at a clause boundary (after a comma, conjunction, or natural breath pause) rather than at a fixed character count.
- speaker labels in caption using the standard [NAME]: convention (toggleable). when speakers change, the new block opens with the new label.
- reading-speed governance: captions don't appear faster than the configured CPS (characters per second, default 17 for accessibility). for fast-talking source audio, the editor extends caption durations and merges short captions where possible.
- music and non-speech cues using the bracketed-italic convention: [music], [laughter], [applause], etc. detected automatically from the audio with edit-pass review.
- both .srt and .vtt from the same source transcript. .vtt also includes the cue settings for vertical position, alignment, and styling classes that some platforms accept.
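the path from word-level timestamps to both formats can be sketched in a few lines of python. this is a simplified illustration, not the shipped exporter: the `Word` and `Cue` types, the greedy `group_words` rule, and the 64-character block budget are assumptions made for the example. the timestamp layouts themselves (comma before milliseconds in .srt, period in .vtt, and the `WEBVTT` header line) follow the two formats.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Cue:
    start: float
    end: float
    text: str


def group_words(words, max_chars=64):
    """greedy grouping (assumed rule): close a cue before it would exceed max_chars."""
    cues, current = [], []
    for word in words:
        candidate = " ".join(w.text for w in current + [word])
        if current and len(candidate) > max_chars:
            cues.append(Cue(current[0].start, current[-1].end,
                            " ".join(w.text for w in current)))
            current = []
        current.append(word)
    if current:
        cues.append(Cue(current[0].start, current[-1].end,
                        " ".join(w.text for w in current)))
    return cues


def timestamp(seconds, sep):
    """HH:MM:SS,mmm for .srt (sep=',') or HH:MM:SS.mmm for .vtt (sep='.')."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{ms:03d}"


def to_srt(cues):
    """numbered blocks separated by a blank line."""
    blocks = [
        f"{i}\n{timestamp(c.start, ',')} --> {timestamp(c.end, ',')}\n{c.text}\n"
        for i, c in enumerate(cues, start=1)
    ]
    return "\n".join(blocks)


def to_vtt(cues, settings=""):
    """same cues as webvtt; settings is an optional cue-settings string."""
    suffix = f" {settings}" if settings else ""
    blocks = [
        f"{timestamp(c.start, '.')} --> {timestamp(c.end, '.')}{suffix}\n{c.text}\n"
        for c in cues
    ]
    return "WEBVTT\n\n" + "\n".join(blocks)
```

because both serializers read the same list of cues, the two files cannot disagree on timing or text. passing something like `settings="align:center line:90%"` to `to_vtt` shows where per-cue position and alignment settings would go.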
workflow
- upload the audio or video file. mp3, m4a, mp4 (audio extracted), wav, mov — anything ffmpeg reads.
- transcription runs. on a 30-minute video, the first pass is ready in 1–2 minutes.
- review caption blocks in the editor. each caption is shown as a preview alongside the transcript with its duration, character count, and reading speed. blocks that exceed your configured limits are flagged for adjustment.
- fix labels and proper nouns once and watch them propagate through every caption block. click any word to hear that second of audio for verification.
- export .srt or .vtt with your project's settings. the file imports cleanly into premiere, davinci, final cut, kdenlive, youtube studio, kapwing, and any standards-compliant captioning workflow.
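the review step's flags come down to simple arithmetic on each block. the python sketch below shows one way to compute them, assuming the defaults mentioned above (32 characters per line, two lines, 17 CPS) and assuming spaces count toward the character total; `review_flags`, `min_duration`, and this `Cue` type are hypothetical names for illustration, not the editor's internals.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    lines: list   # caption text, one string per on-screen line


def char_count(cue):
    """total characters across lines (assumption: spaces are counted)."""
    return sum(len(line) for line in cue.lines)


def review_flags(cue, max_chars=32, max_lines=2, max_cps=17.0):
    """list the configured limits this cue exceeds; empty means it passes."""
    flags = []
    if len(cue.lines) > max_lines:
        flags.append("too many lines")
    if any(len(line) > max_chars for line in cue.lines):
        flags.append("line too long")
    duration = cue.end - cue.start
    if duration <= 0 or char_count(cue) / duration > max_cps:
        flags.append("reading speed too high")
    return flags


def min_duration(cue, max_cps=17.0):
    """shortest duration in seconds that keeps the cue at or under max_cps."""
    return char_count(cue) / max_cps
```

for example, a 31-character caption shown for one second reads at 31 CPS and gets flagged; `min_duration` says it would need about 1.8 seconds on screen to come in under 17 CPS, which is the kind of extension the reading-speed governance applies.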
where this fits
- youtube creators uploading captions for accessibility and search-indexing. youtube's own auto-captions get most of the words right but blow the timing and line breaks.
- professional captioners and accessibility teams producing WCAG-AA compliant captions for clients. the cleanup pass that normally takes 4-5x video length compresses to about 1.5x with clean first-pass timing.
- video editors adding captions before publishing. premiere's auto-caption feature exists but is locked to a creative cloud subscription and produces lower-quality timing than this.
- podcasters with video versions of their episodes for youtube and instagram reels.
- educators captioning lectures, seminars, and recorded course content for accessibility compliance (especially in higher ed).
privacy
for video that can't be uploaded — internal corporate training material with confidential content, medical educational videos, legal training videos — run the file in private mode. the .srt / .vtt export works identically; the audio just stays on your laptop.
pricing for caption export
$0.25 per minute, same as everything else — including the caption export. no per-format upcharge, no subscription, no minimum. for high-volume captioning workflows, batch pricing arrives after launch.