measured, not claimed

The benchmark. Reproduce it yourself.

Accuracy and speed for Voxmelt's local Whisper pipeline on a named consumer GPU, measured with a method anyone can rerun: references written before recording, the same normalization for every engine, per-clip tables instead of a cherry-picked average, and the scoring script published.

same words, very different thermal paths

Where does your voice actually go?

One sentence. Two journeys.

the cloud way

Cloud STT

Otter · Whisper API · Deepgram · the usual suspects

Network hops: 4

Bytes leaked: ≈ 1.2 KB

Round trip: 312 ms

Bytes leave your machine within 0.4s of you opening your mouth. Stored, indexed, embeddable, subpoena-able. Yikes.

the Voxmelt way

Local pipeline

on-device · 1 GPU · 0 hops · 0 ick

Network hops: 0

Bytes leaked: 0

Round trip: 78 ms

Zero bytes touch a remote server. The Whisper sidecar runs over stdio, and Ollama is locked to 127.0.0.1. You can verify both with netstat.

On a stock RTX 3080 Ti, Voxmelt's local Whisper large-v3 averaged 7.8 percent word error rate across five scripted real-world clips, held 3.1 percent on fast speech and on background noise, and transcribed 10 to 15 times faster than real time at a 4.2 GB VRAM peak. Audio bytes uploaded: zero.

Measured 2026-06-10. Full per-clip tables below.

accuracy

Word error rate, per clip

Lower is better. Same clips, same normalization for every engine: lowercased, punctuation stripped, spoken numbers folded to digits. The per-clip rows are the credibility; averages hide cherry-picking.

| Clip | Voxmelt · large-v3 (local GPU) | Voxmelt · medium (local GPU) |
| --- | --- | --- |
| Clean dictation | 6.8% WER · 6.8% CER | 17.1% WER · 10.6% CER |
| Technical / code speech | 16.1% WER · 7.1% CER | 40.2% WER · 26.7% CER |
| Fast natural speech with filler | 3.1% WER · 2.3% CER | 18.4% WER · 11.2% CER |
| Medical jargon | 10.9% WER · 10.7% CER | 19.6% WER · 13.9% CER |
| Background noise | 3.1% WER · 4.2% CER | 19.8% WER · 21.4% CER |
| Overall (word-weighted) | 7.8% WER · 6.1% CER | 22.8% WER · 16.7% CER |
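As a rough illustration of how numbers like these can be computed, here is a minimal sketch of WER, CER, and a word-weighted overall score. It is not Voxmelt's published scoring script. The function names are hypothetical, inputs are assumed to be already normalized and non-empty, CER here counts spaces as characters (the page does not say how spaces are handled), and "word-weighted" is read as total word errors divided by total reference words.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: fewest substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference item missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis item)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edits over reference length."""
    return edit_distance(ref, hyp) / len(ref)


def overall_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Word-weighted overall: total word errors over total reference words."""
    pairs = list(pairs)
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words
```

Weighting by words means a long clip moves the overall figure more than a short one, so the overall row will generally differ from a plain mean of the per-clip percentages.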

speed

Realtime factor and VRAM, per clip

The x realtime column is audio length divided by processing time, so anything above 1.0 is faster than real time. Cloud tools have no row here because you cannot measure what runs in someone else's building.

| Clip | Audio | Model | Processed in | x realtime | Peak VRAM |
| --- | --- | --- | --- | --- | --- |
| Clean dictation | 40s | large-v3 | 4.2s | 9.6x | 4.2 GB |
| Technical / code speech | 51s | large-v3 | 4.4s | 11.4x | 4.2 GB |
| Fast natural speech with filler | 37s | large-v3 | 3.5s | 10.5x | 4.2 GB |
| Medical jargon | 65s | large-v3 | 4.3s | 15.2x | 4.2 GB |
| Background noise | 43s | large-v3 | 3.3s | 13.0x | 4.2 GB |
| Clean dictation | 40s | medium | 3.5s | 11.3x | 2.3 GB |
| Technical / code speech | 51s | medium | 2.7s | 18.6x | 2.3 GB |
| Fast natural speech with filler | 37s | medium | 2.6s | 14.3x | 2.3 GB |
| Medical jargon | 65s | medium | 3.5s | 18.7x | 2.4 GB |
| Background noise | 43s | medium | 2.6s | 16.6x | 2.3 GB |
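To rerun the speed measurement on your own card, something like the sketch below may be a starting point. It times raw faster-whisper against a warm model, which is an assumption: it does not go through Voxmelt's shipping pipeline, so its numbers will not match the table exactly. `SpeedResult` and `time_clip` are hypothetical names. The table's x realtime figures match audio length divided by processing time, and the sketch also exposes the inverse ratio.

```python
import time
from dataclasses import dataclass


@dataclass
class SpeedResult:
    audio_seconds: float
    processing_seconds: float

    @property
    def x_realtime(self) -> float:
        # Speed-up: audio length / processing time. Above 1.0 is faster.
        return self.audio_seconds / self.processing_seconds

    @property
    def realtime_factor(self) -> float:
        # Inverse view: processing time / audio length. Below 1.0 is faster.
        return self.processing_seconds / self.audio_seconds


def time_clip(path: str, model_size: str = "large-v3") -> SpeedResult:
    """Time one clip with faster-whisper on a CUDA GPU (needs the package)."""
    from faster_whisper import WhisperModel  # third-party, imported lazily

    model = WhisperModel(model_size, device="cuda", compute_type="float16")

    # Warm-up pass so one-time load and initialization are not timed.
    warm_segments, _ = model.transcribe(path)
    for _segment in warm_segments:
        pass

    start = time.perf_counter()
    segments, info = model.transcribe(path)
    # transcribe() returns a lazy generator; decoding happens as we iterate.
    for _segment in segments:
        pass
    elapsed = time.perf_counter() - start

    return SpeedResult(audio_seconds=info.duration, processing_seconds=elapsed)
```

For example, `SpeedResult(40.0, 4.0).x_realtime` is 10.0, and its `realtime_factor` is 0.1.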

environment

Exactly what it ran on

A benchmark without its environment is an anecdote. Reproduce ours or run your own card; the numbers scale with your GPU.

GPU: NVIDIA GeForce RTX 3080 Ti (12 GB)

NVIDIA driver: 595.79

CUDA: 13.2

OS: Windows 11

Whisper model: large-v3 / medium

Compute type: float16

faster-whisper: 1.2.1 (CTranslate2 4.7.1)

References written before recording. Ground truth is the script, not a post-hoc transcript.

method

The rules we set before measuring

Five clips covering the real shapes of dictation: clean speech, technical and code speech, fast speech with natural filler, medical jargon, and background noise. Variety is the point; a benchmark on easy audio only proves you can pick easy audio.

Reference transcripts were written before recording and read aloud. The ground truth is the script, never a cleaned-up transcript of the recording.

Every engine gets the same audio files and the same normalization: lowercase, punctuation stripped, spoken numbers folded to digits. No engine is favored.

Voxmelt is scored on the output of its shipping pipeline, the same refined pass users get, with the same quality gates. We do not benchmark a lab configuration users never see.

Speed counts processing time only, against a warm model. One-time model load is reported separately, the same way cloud tools do not bill you for their server boot.

Caveats stay attached to the numbers: one speaker, English, one hardware configuration, founder-run. If a clip goes badly, it stays in the table.
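The shared normalization rule (lowercase, punctuation stripped, spoken numbers folded to digits) could look roughly like the sketch below. This is an assumption about one reasonable implementation, not the published script. The helper names are hypothetical, number folding only covers zero through ninety-nine, and stripping punctuation also removes apostrophes and decimal points. Because the same function is applied to reference and hypothesis alike, those simplifications affect every engine equally.

```python
import string

_UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
_TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def fold_numbers(words):
    """Replace spoken numbers 0-99 with digit strings ("forty two" -> "42")."""
    out = []
    i = 0
    while i < len(words):
        word = words[i]
        if word in _TENS:
            value = _TENS[word]
            nxt = words[i + 1] if i + 1 < len(words) else None
            # Absorb a following unit 1-9, as in "twenty one".
            if nxt in _UNITS and 1 <= _UNITS[nxt] <= 9:
                value += _UNITS[nxt]
                i += 1
            out.append(str(value))
        elif word in _UNITS:
            out.append(str(_UNITS[word]))
        else:
            out.append(word)
        i += 1
    return out


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, fold spoken numbers to digits."""
    text = text.lower().replace("-", " ")  # split "twenty-one" first
    text = text.translate(_STRIP_PUNCT)
    return " ".join(fold_numbers(text.split()))
```

Both the reference script and each engine's output would pass through `normalize` before any error rate is computed.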

the claim no benchmark can beat

Zero audio bytes uploaded. Check it yourself.

While every clip above was transcribed, the number of audio bytes that left the machine was zero, because there is nothing to send. Open any network monitor, dictate, and watch. That is the part of this page you do not have to take on faith, and the part no cloud tool can put in their benchmark.
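If you prefer a script to a network monitor, a check along these lines is one option. It is a sketch under assumptions: it relies on the third-party `psutil` package, the function names are hypothetical, you supply the process IDs to watch, and listing connections may need elevated privileges on some systems. It takes a point-in-time snapshot, so a monitor that records traffic over the whole dictation is the stronger test.

```python
import ipaddress
from typing import Iterable, List, Set, Tuple


def is_local(ip: str) -> bool:
    """True for loopback addresses such as 127.0.0.1 or ::1."""
    return ipaddress.ip_address(ip).is_loopback


def remote_peers(connections: Iterable[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Keep (pid, remote_ip) pairs whose remote end is not loopback."""
    return [(pid, ip) for pid, ip in connections if ip and not is_local(ip)]


def snapshot(pids: Set[int]) -> List[Tuple[int, str]]:
    """Current (pid, remote_ip) pairs for the given processes, via psutil."""
    import psutil  # third-party, imported lazily

    rows = []
    for conn in psutil.net_connections(kind="inet"):
        # raddr is empty for listening sockets with no connected peer.
        if conn.pid in pids and conn.raddr:
            rows.append((conn.pid, conn.raddr.ip))
    return rows
```

Calling `remote_peers(snapshot(pids))` while dictating should return an empty list if nothing in those processes is talking to a non-loopback address at that moment.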
