two ways Zoom recordings exist
when you record a Zoom meeting, the file lands in one of two places, and the transcription workflow differs slightly:
- cloud recording. saved to Zoom's servers. you download it from the "recordings" section of your Zoom account. it arrives as an mp4 (video + audio) and an m4a (audio-only). either file works for transcription; if you only want the transcript, the audio-only m4a is smaller and faster.
- local recording. saved to the host's machine. zoom drops the audio into a folder named with the meeting timestamp. you'll find it as audio_only.m4a alongside the video file.
either kind of recording is just an audio file once it exists. drop it in, transcribe it, edit, export.
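for local recordings, finding the audio-only file can be scripted. a minimal sketch, assuming Zoom's default recording folder (Documents/Zoom under your home directory; Zoom lets you change this in settings, so adjust the path if yours differs). `find_zoom_audio` is a hypothetical helper name, and the `audio*.m4a` pattern is an assumption chosen to match the audio_only.m4a name mentioned above.

```python
from pathlib import Path


def find_zoom_audio(root: Path, pattern: str = "audio*.m4a") -> list[Path]:
    """Return audio files under `root` that match `pattern`, newest first."""
    if not root.is_dir():
        return []
    # Zoom creates one folder per meeting, so search recursively.
    files = [p for p in root.rglob(pattern) if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # assumed default location; not guaranteed on every machine
    zoom_root = Path.home() / "Documents" / "Zoom"
    for path in find_zoom_audio(zoom_root)[:5]:
        print(path)
```

run as a script, it prints up to five of the most recently modified audio files, which is usually enough to spot the meeting you just recorded.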
the workflow
- locate the recording. for cloud recordings: zoom.us → recordings → download. for local recordings: ~/Documents/Zoom/[meeting-name]/ on mac, %USERPROFILE%\Documents\Zoom\ on windows.
- drop the file into audiohighlight. mp4, m4a, mp3, wav, webm: anything Zoom exports works. video files have audio extracted automatically.
- transcription runs. on a 60-minute zoom recording, the first pass is ready in 1–3 minutes (cloud mode) or roughly real-time (on-device private mode for sensitive meetings).
- fix the speaker labels. zoom doesn't pass speaker identity through to the file, so the diarization is what we infer from voice patterns. relabel "speaker 1" to the actual person's name once, and the change propagates through every row.
- verify quotes against the recording. click any word in the transcript, hear that second of audio. for any meeting whose transcript becomes evidence — performance reviews, candidate interviews, product decisions, customer-success cases — this is the verification step.
- export. .docx for the meeting notes that go to the team. .srt or .vtt for adding captions to the recording before sharing. plain text for paste-into-doc workflows.
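to make the relabel and export steps concrete, here is a rough sketch of what "rename a speaker once, then write captions" amounts to. it is an illustration only, not audiohighlight's implementation: `Segment`, `relabel`, and `to_srt` are hypothetical names, the row shape (speaker, start, end, text) is assumed, and putting the speaker name inside each cue is a choice made for the example. the .srt layout itself (cue number, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is the standard SubRip format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    speaker: str
    start: float  # seconds from the start of the recording
    end: float
    text: str


def relabel(segments, names):
    """Apply a speaker-name mapping to every row; unmapped labels are kept."""
    return [replace(s, speaker=names.get(s.speaker, s.speaker)) for s in segments]


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp that .srt files use."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render rows as numbered SRT cues separated by blank lines."""
    cues = []
    for i, s in enumerate(segments, start=1):
        span = f"{srt_timestamp(s.start)} --> {srt_timestamp(s.end)}"
        cues.append(f"{i}\n{span}\n{s.speaker}: {s.text}\n")
    return "\n".join(cues)
```

usage would look like `to_srt(relabel(rows, {"speaker 1": "Dana"}))`, where "Dana" stands in for whoever speaker 1 turns out to be.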
why no bot
the dominant workflow for zoom transcription in 2026 is a bot — otter, fireflies, fathom, granola — that joins your meeting as a participant and transcribes live. for many internal team meetings, that's a perfectly good choice. for a meaningful subset of meetings, it isn't:
- candidate interviews where some candidates decline if a bot is present, and jurisdictions with two-party consent rules complicate the transcript's usability
- medical and therapy consultations where HIPAA-bound audio shouldn't pass through a third-party bot
- legal consultations and witness preparation where attorney-client privilege is at risk if a bot is in the room
- journalism interviews with sources who agreed to talk on the condition of no third-party recording
- internal investigations and HR consultations where audio handling has legal implications
- m&a, board, and strategy meetings where the audio's existence on a third-party server is itself a leak risk
for any of those, the workflow that works is: you record the call yourself (zoom's local recording does this), and you transcribe the file after, using a tool that doesn't need to be in the meeting.
private mode for sensitive Zoom recordings
for the meetings above (medical, legal, journalism, investigation, board), even uploading the recording to a cloud transcription tool after the fact can be a problem. the audio sits on the vendor's servers, where it's reachable through legal process the way any vendor-held document is reachable.
private mode runs the speech-recognition model in your browser using WebGPU. you drop the recording into the editor and the model transcribes locally: your audio is never sent over the network, never reaches our servers, never sits in any third-party storage. for the structural argument and the audit instructions, see private transcription.
handling Zoom's quirks
- voice cuts and crosstalk. zoom's audio compression occasionally clips voices when speakers overlap. the diarization can read this as a third speaker; flag those rows during cleanup.
- screen-sharing audio. if someone shared a video or audio clip during the call, it appears in the transcript as additional speakers. the editor lets you mark those passages or trim them before export.
- echo / feedback rooms. participants without headphones can create echo that the model transcribes as repeated phrases. usually fixable in the cleanup pass; for systematic echo, the recording is difficult to transcribe accurately on any tool.
- multiple speakers per machine. two people sharing one camera often get diarized as a single speaker. relabel manually after the first pass.
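several of the quirks above show up the same way: an extra "speaker" who barely talks. one way to find candidates for review is to total each speaker's talk time and flag anyone below a small share of the meeting. this is a sketch under assumptions, not a feature of the editor: `minor_speakers` is a hypothetical helper, rows are assumed to be (speaker, start, end) tuples in seconds, and the 2% threshold is an arbitrary starting point.

```python
from collections import defaultdict


def minor_speakers(rows, min_share=0.02):
    """Return speakers whose share of total talk time is below `min_share`."""
    totals = defaultdict(float)
    for speaker, start, end in rows:
        totals[speaker] += max(0.0, end - start)
    overall = sum(totals.values())
    if overall == 0:
        return set()
    return {s for s, t in totals.items() if t / overall < min_share}
```

a flagged speaker is only a hint. a quiet participant who said one sentence looks the same as a crosstalk artifact, so check the rows against the audio before merging or trimming them.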
pricing for Zoom recordings
$0.25 per minute. a 30-minute zoom call is $7.50. a 60-minute team meeting is $15. private mode and cloud mode are the same price. no subscription, no minimum. for teams with steady weekly meeting volume, batch pricing arrives after launch.
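for budgeting, the arithmetic is just the per-minute rate times the recording length. a small sketch, assuming partial minutes are billed proportionally with no rounding up (the text above doesn't say how partial minutes are handled); `transcription_cost` is a hypothetical helper.

```python
from decimal import Decimal

RATE_PER_MINUTE = Decimal("0.25")  # dollars per minute, from the pricing above


def transcription_cost(minutes) -> Decimal:
    """Cost in dollars for a recording of the given length in minutes."""
    cost = RATE_PER_MINUTE * Decimal(str(minutes))
    return cost.quantize(Decimal("0.01"))


# a 60-minute meeting and a 45-minute call
print(transcription_cost(60))  # 15.00
print(transcription_cost(45))  # 11.25
```

four 60-minute meetings a week would come to `4 * transcription_cost(60)`, or $60 at this rate.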