The benchmark. Reproduce it yourself.
Accuracy and speed for Voxmelt's local Whisper pipeline on a named consumer GPU, measured with a method anyone can rerun: references written before recording, the same normalization for every engine, per-clip tables instead of a cherry-picked average, and the scoring script published.
On a stock RTX 3080 Ti, Voxmelt's local Whisper large-v3 averaged 7.8 percent word error rate across five scripted real-world clips, held 3.1 percent on fast speech and on background noise, and transcribed 10 to 15 times faster than real time at a 4.2 GB VRAM peak. Audio bytes uploaded: zero.
Measured 2026-06-10. Full per-clip tables below.
Word error rate, per clip
Lower is better. Every engine is scored on the same clips with the same normalization: lowercased, punctuation stripped, spoken numbers folded to digits. The per-clip rows are what make the result credible; an average on its own can hide cherry-picking.
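For readers who want to check the scoring by hand, here is a minimal sketch of the normalization and word error rate described above. It is an illustration under stated assumptions, not the published scoring script: the function names are hypothetical, the number map covers only zero through ten, and stripping punctuation this way also removes apostrophes and hyphens.

```python
import string

# Hypothetical, deliberately tiny number map; a real fold would cover far more.
_NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> list:
    """Lowercase, strip ASCII punctuation, fold simple number words to digits."""
    words = text.lower().translate(_STRIP_PUNCT).split()
    return [_NUMBER_WORDS.get(w, w) for w in words]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")

    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because both sides pass through the same `normalize`, a hypothesis that differs from the script only in casing, punctuation, or "two" versus "2" scores zero errors, which is the point of applying one normalization to every engine.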
Realtime factor and VRAM, per clip
Realtime factor is processing time divided by audio length; a value below 1.0 means faster than real time. Cloud tools have no row here because you cannot measure what runs in someone else's building.
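As a quick sanity check on the arithmetic, the sketch below converts between realtime factor and the "times faster than real time" figure. The function names and the example durations are hypothetical, chosen only to illustrate the formula.

```python
def realtime_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio length; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio length must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many times faster than real time a given realtime factor is."""
    if rtf <= 0:
        raise ValueError("realtime factor must be positive")
    return 1.0 / rtf


# Hypothetical example: a 60-second clip processed in 5 seconds.
rtf = realtime_factor(5.0, 60.0)
print(f"RTF {rtf:.3f}, {speedup(rtf):.0f}x faster than real time")
```

Read the other way, 10 to 15 times faster than real time corresponds to a realtime factor between 0.1 and roughly 0.067.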
Exactly what it ran on
A benchmark without its environment is an anecdote. Reproduce ours or run your own card; expect the speed numbers to change with your GPU.
References written before recording. Ground truth is the script, not a post-hoc transcript.
The rules we set before measuring
Five clips covering the real shapes of dictation: clean speech, technical and code speech, fast speech with natural filler, medical jargon, and background noise. Variety is the point; a benchmark on easy audio only proves you can pick easy audio.
Reference transcripts were written before recording and read aloud. The ground truth is the script, never a cleaned-up transcript of the recording.
Every engine gets the same audio files and the same normalization: lowercase, punctuation stripped, spoken numbers folded to digits. No engine is favored.
Voxmelt is scored on the output of its shipping pipeline, the same refined pass users get, with the same quality gates. We do not benchmark a lab configuration users never see.
Speed counts processing time only, against a warm model. One-time model load is reported separately, the same way cloud tools do not bill you for their server boot.
Caveats stay attached to the numbers: one speaker, English, one hardware configuration, founder-run. If a clip goes badly, it stays in the table.
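The speed rule above (processing time only, against a warm model, with load time reported separately) can be sketched as a small timing harness. This is an assumption-laden illustration rather than Voxmelt's own code: `load_model` and `transcribe` are hypothetical callables standing in for whatever engine you are measuring.

```python
import time
from typing import Callable, Tuple


def time_load(load_model: Callable[[], object]) -> Tuple[object, float]:
    """Time the one-time model load so it can be reported on its own."""
    start = time.perf_counter()
    model = load_model()
    return model, time.perf_counter() - start


def time_warm(
    transcribe: Callable[[str], str],
    clip_path: str,
    warmup_runs: int = 1,
) -> float:
    """Return processing seconds for one clip, measured after warm-up runs."""
    for _ in range(warmup_runs):
        transcribe(clip_path)  # result discarded; absorbs lazy initialization

    start = time.perf_counter()
    transcribe(clip_path)
    return time.perf_counter() - start
```

Dividing the value returned by `time_warm` by the clip's duration gives the realtime factor for that clip; the load time from `time_load` stays out of that ratio.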
Zero audio bytes uploaded. Check it yourself.
While every clip above was transcribed, the number of audio bytes that left the machine was zero, because there is nothing to send. Open any network monitor, dictate, and watch. That is the part of this page you do not have to take on faith, and the part no cloud tool can put in its benchmark.