Record. Compose. Narrate. All local.
Two AI engines on a single consumer GPU, three creative surfaces, 80+ template operations, and hands-free controls to drive all of it. Fully offline, fully private. Here is how each piece works.
Whisper large-v3 - the recording brain
Engine one of two. Whisper large-v3 runs on your GPU via faster-whisper (CTranslate2 + CUDA). Five model sizes from tiny to large-v3, three compute precisions, 10+ languages with auto-detect. On an RTX 3080 Ti, large-v3 transcribes at 4x realtime in float16 - faster than you talk.
- Unlimited continuous dictation, overlapping chunks with auto-dedup
- Chunk duration tunable (5 / 15 / 30 / 60 s)
- Auto language detect, or pin one per session
- Sandboxed Python sidecar - stdio only, zero network access
Any Ollama model - the cleanup brain
Engine two of two. Pair Whisper's raw transcript with any local LLM Ollama can serve - Gemma 3, Qwen 3, DeepSeek-R1, Llama 4 Scout, Phi-4, Mistral, whatever. 22+ in the catalog, custom prompts unlimited. The orchestrator parses param counts from model tags and estimates VRAM automatically, so 'pull a new model and it just works' is real.
- 14 template packs with 80+ tones (dictation, rewrite, chat, email, social, video, meeting notes, coding prompts...)
- Custom system prompts for medical notes, legal briefs, meeting minutes
- Token-level streaming, cancel mid-run, run-count tracking
- Localhost-only (127.0.0.1:11434), zero telemetry, verifiable in netstat
Smart VRAM swap - why both engines coexist
This is the part no other app ships. Auto reads your card's total VRAM, measures Whisper's actual footprint, estimates the LLM's from its tag, and decides per-run whether to coexist or evict. Always-evict is safest for unknown models. Never-evict keeps recording snappy over AI speed. Auto picks the best of both for your hardware - you never tune a thing.
- Live VRAM telemetry, updated every 2 seconds
- Color-coded fit chips: green (both fit), amber (tight), red (won’t run)
- One-click Free GPU - dumps every model so you can game or render
- Idle auto-free after 1 / 5 / 15 / 30 min. Alt-tab to a game, VRAM comes back
Privacy by architecture, not by ToS
Both engines live on your machine. Whisper runs in a sandboxed sidecar talking JSON over stdio - no sockets, no shared memory. Ollama is bolted to 127.0.0.1. We don't ship telemetry, analytics, or even a license server you have to ping. Built for HIPAA, GDPR, SOC 2, and air-gapped boxes where 'we promise we won't' isn't good enough.
- No API keys. No account required to record
- No internet after install - works on an air-gapped box
- Whisper sidecar has zero network access by design
- Ollama localhost-only - receipts in netstat, not promises
Three creative surfaces built on those engines.
Record is the flagship. Compose and Voiceover turn the same local AI into a text studio and an audio studio - no mic required, no cloud, same GPU.
Compose - the text-first AI workspace.
Not everything starts with your voice. Paste any text - a rough draft, a colleague's email, meeting notes from another tool - pick a template and a tone, and let the local LLM reshape it. Same AI engine as Record, same templates, same tier access. No recording needed, no cloud, same GPU. Think of it as a private, local alternative to pasting into ChatGPT - except your text never leaves your machine.
- Paste or type any text - runs through the same 80+ template tones
- Decoupled template + tone selection - mix any action with any style
- Streaming output with copy, re-run, and Read Aloud on the result
- "Use last transcript" shortcut - bridge a recording into a different tone
- Same daily run limits, same tier gating, same model picker as Record
- Full AI Output panel - tone-shift chips, voice edit, WAV export
Voiceover - text to audio, on your GPU.
Turn any script into narration without a recording booth. Four distinct personas (Clara, Marcus, Maya, Sam) crossed with eight delivery tones (Professional, Conversational, Warm, Calm, Bright, Authoritative, Storyteller, Energetic) give you 32 voice combinations - all synthesized locally on your GPU via Kokoro ONNX. No per-character billing, no cloud API, no rate limits. Export clean WAV files straight to your Downloads folder. Unlike ElevenLabs or Play.ht, your scripts never leave your machine and there is no usage meter ticking.
- 4 personas x 8 tones = 32 voice combinations, all on-device
- Speed control from 0.7x to 1.5x (Pro+) - match any pacing need
- WAV export to Downloads - drag into your editor, timeline, or DAW
- Sentence-by-sentence playback tracking with live highlighting
- "Use last AI output" shortcut - polish in Compose, narrate in Voiceover
- No per-character billing - generate as much audio as your GPU can handle
Personas
- Clara - clear, articulate female voice (Free)
- Marcus - measured, corporate male voice (Free)
- Maya - warm, expressive female voice (Pro)
- Sam - relaxed, conversational male voice (Pro)
Delivery tones
- Professional, Conversational, Warm (Free)
- Calm, Bright, Authoritative (Pro)
- Storyteller, Energetic (Pro)
- Speed control 0.7x - 1.5x (Pro)
80+ AI operations - one click, every time.
Templates are how Voxmelt turns a raw transcript or pasted text into exactly the format you need. Each template pack contains multiple tones - not just "rewrite" but "rewrite as a casual Slack reply" or "rewrite as a formal decline email" or "rewrite as a LinkedIn post." Pick the action, pick the style, get the output. Every template works in both Record and Compose.
Quick edit
Vibe Dictation (6 tones), Rewrite (5), Proofread (4)
Tone shift
Diplomatic (5 tones), Persuasive (5)
AI and code prompts
AI Prompt Architect (5), Coding Prompt (4)
Communications
Chat Responder (6), Email (5), Social Post (4), Video Script (4)
Explain and translate
Meeting Recap (4), Notes (4), Translate (6 languages)
Custom
Write your own system prompt. Free: 1 slot, Pro: 5, Studio: 30
Vibe Dictation
Polish raw spoken thoughts, rants, or logic into structured text in any vibe.
Chat Responder
Turn raw spoken reactions into perfectly vibed replies for Slack, Teams, or text.
AI prompt
Convert spoken thought into a clean, structured LLM prompt.
Coding prompt
Developer-ready spec for AI coding assistants.
Meeting Recap
Turn a recorded meeting, call, or standup into action items, decisions, minutes, or a TL;DR.
Distill Notes
Compress a long voice memo, lecture, or brain-dump into bullets, an outline, a TL;DR, or study notes.
Signal Extractor
Strip away all the noise and keep only the dense nucleus - a one-line gist, a pure deliverables list, or a single atomic note.
Translate
DefaultTranslate the transcript to another language.
Custom prompt
Pick a persona, add your own instructions - runs 100% on your GPU, like every built-in.
Turn a spoken intent into a ready-to-send email - replies, follow-ups, declines, outreach, apologies.
Social post
Shape a raw thought into a post that lands - a LinkedIn post, an X thread, a hook, or a caption.
Rewrite
The everyday rewriter - paraphrase, shorten, expand, formalize, or sharpen any text.
Proofread
Fix and tighten - grammar, clarity, conciseness, and direct active voice.
Diplomatic
Recast a blunt or risky message so it lands well - tactful, calm, and relationship-safe, without losing the point.
Persuasive
Recast flat text into something compelling - confident, benefit-led, and hard to say no to, with the meaning intact.
Explain
Make any idea click - explain it simply, step by step, by analogy, or with a worked example.
Video script
Spoken idea to a script you can read on camera or send straight to the Voiceover page.
Two ways to drive it without touching the keyboard.
The engines do the thinking. These are how you run them - eyes off, hands free, never breaking flow.
No-Hands Mode - run it all by voice.
A second, always-on Whisper tiny model (~0.3 GB VRAM) listens in the background and fuzzy-matches your speech against short, phonetically-distinct phrases. Say "go ahead" to start dictating, "wrap up" to stop, "copy that" to grab the clean text, "warm up" to load the AI - no hotkey, no mouse, no looking at the window. The listener steps aside the instant real dictation starts so it never fights your main model for the mic or the GPU, then resumes the moment you stop. Say it, and it happens.
- Always-on tiny listener - ~0.3 GB VRAM, runs beside large-v3
- 8 built-in commands, every phrase remappable to your own
- 2-word phrases tuned for ~80%+ accuracy on the tiny model
- Auto-pauses during dictation, auto-resumes the second you stop
- Load or unload models by voice while idle ("warm up" / "cool down")
- The command listener is local too - nothing leaves your machine
Vibe coders
Hands on the keys in Cursor or Copilot - say "go ahead", dictate the next prompt, "wrap up", "copy that", paste. Ship without breaking flow to hunt for a hotkey.
Hands-busy & accessibility
Cooking, soldering, mid-workout, or resting your wrists from RSI - run dictation start to finish by voice, from across the room.
Streamers & presenters
Trigger capture and cleanup without alt-tabbing out of your scene. Your voice is the shortcut; the window can stay hidden.
Mini Mode - a pill that floats over everything.
Collapse Voxmelt into a tiny always-on-top pill you can drop anywhere and snap to any screen edge. It records in its own window with its own Whisper pipeline, mirrors your theme live, and shows the transcript and AI output in compact tabs - so you can dictate into any app without the full workbench in the way. Pair it with No-Hands Mode and the main window can stay closed in the tray while you run the whole flow from the pill. Out of the way, one tap away.
- Always-on-top, drag anywhere, snap-to-edge, optional lock
- Its own recording pipeline - keeps working with the window closed to tray
- Live transcript + AI-output tabs in a 72px pill
- Full theme sync - 5 themes, custom colors, custom opacity
- One-tap copy, paste into any app; mic mirrors the main record button
- "Ready halo" lights up when No-Hands Mode is armed
Voice-input power users
Park the pill in a corner and dictate into Slack, email, chat, or docs all day - no full workbench hogging the screen.
Small screens & dual-monitor
Keep your work full-screen; the pill rides the edge of a laptop panel or a second display, always one tap away.
Vibe coders
Float it over the IDE and talk your prompts straight into Copilot or Cursor - eyes on the code, not on a separate app.
Your silicon is ready to melt.
The local-AI-on-NVIDIA future NVIDIA and Microsoft just announced - shipping for your voice today, on the GPU you already own. RTX Spark-ready when the new hardware lands.