
Build a local Whisper-based voice input system with hotkey activation

pattern

Voice-to-text tools like WisprFlow cost $10+/month for developers who just need local dictation for coding prompts

Tags: python, macos, whisper, voice-input, dictation, hotkey, transcription

Problem

Developer voice input tools like WisprFlow and similar services charge monthly subscriptions ($10+/month) for something that can run entirely on local hardware. You just need to dictate a coding prompt or describe a task -- not run a call center. The cloud services also introduce latency and send your audio (potentially containing sensitive project details) to third-party servers. Building your own solution seems daunting, but with OpenAI's Whisper model available locally, it takes about an hour to set up a system that is free, private, and faster than most cloud alternatives for short utterances.

Solution

Build a background service that listens for a hotkey (Ctrl+backtick), records audio while the key is held, and transcribes it locally with Whisper. The transcribed text is then pasted into whichever application has focus.
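Before wiring in real audio and hotkeys, the hold-to-record flow reduces to a small state machine: press starts capture, release hands the captured chunks to a transcribe callback exactly once. A minimal sketch (the `HoldToRecord` class is hypothetical, with no real audio I/O):

```python
# Minimal sketch of the hold-to-record state machine (hypothetical class,
# no audio I/O): press starts capture, release fires the transcribe
# callback exactly once with whatever was captured.
class HoldToRecord:
    def __init__(self, transcribe):
        self.transcribe = transcribe  # called with captured chunks on release
        self.recording = False
        self.chunks = []

    def on_press(self):
        # Key-repeat fires press events continuously while the key is held;
        # the guard makes repeated presses a no-op.
        if not self.recording:
            self.recording = True
            self.chunks = []

    def capture(self, chunk):
        if self.recording:
            self.chunks.append(chunk)

    def on_release(self):
        if self.recording:
            self.recording = False
            self.transcribe(self.chunks)


results = []
recorder = HoldToRecord(results.append)
recorder.on_press()
recorder.on_press()            # key repeat: ignored
recorder.capture(b"\x00\x01")
recorder.capture(b"\x02\x03")
recorder.on_release()
# results[0] == [b"\x00\x01", b"\x02\x03"]
```

The key-repeat guard matters in practice: macOS delivers repeated press events while a key is held, so an unguarded handler would start a new recording per repeat.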

Python implementation using faster-whisper for optimized local inference:

#!/usr/bin/env python3
"""Local voice input service with hotkey activation."""

import os
import subprocess
import tempfile
import threading
import wave

import pyaudio
from pynput import keyboard
from faster_whisper import WhisperModel

# Load model once at startup (uses ~1GB RAM for the "base" model)
model = WhisperModel("base", device="cpu", compute_type="int8")

RATE = 16000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024

recording = False
frames = []
audio = pyaudio.PyAudio()
record_thread = None


def record_loop():
    """Capture audio chunks until `recording` is cleared."""
    global frames
    frames = []
    stream = audio.open(format=FORMAT, channels=CHANNELS,
                        rate=RATE, input=True, frames_per_buffer=CHUNK)
    print("Recording...")
    while recording:
        data = stream.read(CHUNK, exception_on_overflow=False)
        frames.append(data)
    # The recording thread owns the stream and closes it itself; this
    # avoids closing the stream from another thread while read() is
    # still blocked on it.
    stream.stop_stream()
    stream.close()


def stop_recording_and_transcribe():
    global recording
    recording = False
    if record_thread is not None:
        record_thread.join()  # wait for the last chunk and stream cleanup
    if not frames:
        return

    # Save to a temp WAV file
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    wf = wave.open(tmp.name, "wb")
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(audio.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))
    wf.close()

    try:
        # Transcribe locally
        segments, _ = model.transcribe(tmp.name, language="en")
        text = " ".join(seg.text for seg in segments).strip()
        print(f"Transcribed: {text}")

        # Paste into the active application (macOS)
        subprocess.run(["pbcopy"], input=text.encode(), check=True)
        subprocess.run(["osascript", "-e",
                        'tell application "System Events" to keystroke "v" '
                        'using command down'], check=True)
    finally:
        os.unlink(tmp.name)  # don't accumulate temp WAV files


def on_press(key):
    global recording, record_thread
    # Map Ctrl+` to F18 via Karabiner. Key-repeat fires on_press
    # continuously while held; the `recording` guard prevents spawning
    # a new recorder thread per repeat.
    if key == keyboard.Key.f18 and not recording:
        recording = True
        record_thread = threading.Thread(target=record_loop, daemon=True)
        record_thread.start()


def on_release(key):
    if key == keyboard.Key.f18:
        stop_recording_and_transcribe()


with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
    print("Voice input ready. Hold hotkey to record.")
    listener.join()

Install dependencies (pyaudio requires the PortAudio library, e.g. brew install portaudio):

pip install faster-whisper pyaudio pynput

Run at startup (macOS launchd):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.local.voice-input</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/python3</string>
        <string>/Users/you/scripts/voice_input.py</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>

Install the launch agent:

cp com.local.voice-input.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/com.local.voice-input.plist

Why It Works

The faster-whisper library uses CTranslate2 to run the Whisper model with int8 quantization, which means the "base" model fits in about 1GB of RAM and transcribes short audio clips in under a second on modern CPUs. Since the model stays loaded in memory as a background service, there is no startup cost per transcription -- you get near-instant results. The hotkey-to-paste pipeline means transcribed text appears exactly where your cursor is, just like native dictation but without sending audio to any external service.
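The audio volumes involved are also tiny: at the recording settings used above (16 kHz, mono, 16-bit PCM), raw audio is only 32 KB per second, so even a long utterance is a small file for the model to process. A quick back-of-envelope check:

```python
# Back-of-envelope: raw PCM data rate at the recording settings used above.
RATE = 16000       # samples per second
CHANNELS = 1       # mono
SAMPLE_WIDTH = 2   # bytes per int16 sample

bytes_per_second = RATE * CHANNELS * SAMPLE_WIDTH
print(bytes_per_second)                # 32000 -> 32 KB/s
print(10 * bytes_per_second / 1024)    # a 10 s utterance is ~312.5 KB
```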

Context

  • The "base" model gives the best speed/quality tradeoff for short dictation; use "small" or "medium" if accuracy on technical terms matters more than speed
  • On Apple Silicon Macs, faster-whisper runs CPU-only (CTranslate2 has no Core ML backend); if you want Neural Engine acceleration, whisper.cpp supports it via Core ML
  • Use Karabiner-Elements to remap Ctrl+backtick to F18 since global hotkey capture in Python is limited on macOS
  • Alternatives: OpenWhisper (BYO API key, still cloud-based), WisprFlow ($10/month), macOS built-in dictation (limited accuracy for technical speech)
  • Grant the Python process Accessibility and Microphone permissions in System Settings for the hotkey and recording to work
  • For Linux, replace the osascript paste command with xdotool key ctrl+v and use xclip instead of pbcopy
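
The macOS/Linux difference in the last bullet can be isolated behind a small helper (hypothetical `paste_commands` function; command names taken from the bullet above):

```python
import sys


def paste_commands(platform=sys.platform):
    """Return (copy_cmd, paste_cmd) argument lists for subprocess.run.

    Hypothetical helper isolating the platform-specific clipboard/paste
    commands: pbcopy + osascript on macOS, xclip + xdotool on X11 Linux.
    """
    if platform == "darwin":
        return (["pbcopy"],
                ["osascript", "-e",
                 'tell application "System Events" to keystroke "v" '
                 'using command down'])
    # X11 Linux: xclip fills the clipboard, xdotool sends Ctrl+V
    return (["xclip", "-selection", "clipboard"],
            ["xdotool", "key", "ctrl+v"])


copy_cmd, paste_cmd = paste_commands("linux")
# copy_cmd == ["xclip", "-selection", "clipboard"]
```

The copy command would be run with `input=text.encode()`, the paste command with no input, mirroring the macOS path in the main script.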
About this share

Contributor: mblode
Repository: mblode/shares
Created: Feb 10, 2026