
Build a local Whisper-based voice input system with hotkey activation

pattern

Voice-to-text tools like WisprFlow cost $10+/month for developers who just need local dictation for coding prompts

Tags: python, macos, whisper, voice-input, dictation, hotkey, transcription

Problem

Developer voice input tools like WisprFlow and similar services charge monthly subscriptions ($10+/month) for something that can run entirely on local hardware. You just need to dictate a coding prompt or describe a task -- not run a call center. The cloud services also introduce latency and send your audio (potentially containing sensitive project details) to third-party servers. Building your own solution seems daunting, but with OpenAI's Whisper model available locally, it takes about an hour to set up a system that is free, private, and faster than most cloud alternatives for short utterances.

Solution

Build a background service that listens for a hotkey (Ctrl+backtick), records audio while the key is held, and transcribes it locally with Whisper. The transcribed text is then pasted into whichever application has focus.
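Before wiring in real audio and hotkeys, the hold-to-record flow reduces to a small state machine: press starts capture, release hands the captured chunks to a transcribe callback exactly once. A minimal sketch (the `HoldToRecord` class is hypothetical, with no real audio I/O):

```python
# Minimal sketch of the hold-to-record state machine (hypothetical class,
# no audio I/O): press starts capture, release fires the transcribe
# callback exactly once with whatever was captured.
class HoldToRecord:
    def __init__(self, transcribe):
        self.transcribe = transcribe  # called with captured chunks on release
        self.recording = False
        self.chunks = []

    def on_press(self):
        # Key-repeat fires press events continuously while the key is held;
        # the guard makes repeated presses a no-op.
        if not self.recording:
            self.recording = True
            self.chunks = []

    def capture(self, chunk):
        if self.recording:
            self.chunks.append(chunk)

    def on_release(self):
        if self.recording:
            self.recording = False
            self.transcribe(self.chunks)


results = []
recorder = HoldToRecord(results.append)
recorder.on_press()
recorder.on_press()            # key repeat: ignored
recorder.capture(b"\x00\x01")
recorder.capture(b"\x02\x03")
recorder.on_release()
# results[0] == [b"\x00\x01", b"\x02\x03"]
```

The key-repeat guard matters in practice: macOS delivers repeated press events while a key is held, so an unguarded handler would start a new recording per repeat.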

Python implementation using faster-whisper for optimized local inference:

#!/usr/bin/env python3
"""Local voice input service with hotkey activation."""

import os
import subprocess
import tempfile
import threading
import wave

import pyaudio
from pynput import keyboard
from faster_whisper import WhisperModel

# Load model once at startup (uses ~1GB RAM for the "base" model)
model = WhisperModel("base", device="cpu", compute_type="int8")

RATE = 16000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024

recording = False
frames = []
audio = pyaudio.PyAudio()
record_thread = None


def record_loop():
    """Capture audio chunks until `recording` is cleared."""
    global frames
    frames = []
    stream = audio.open(format=FORMAT, channels=CHANNELS,
                        rate=RATE, input=True, frames_per_buffer=CHUNK)
    print("Recording...")
    while recording:
        data = stream.read(CHUNK, exception_on_overflow=False)
        frames.append(data)
    # The recording thread owns the stream and closes it itself; this
    # avoids closing the stream from another thread while read() is
    # still blocked on it.
    stream.stop_stream()
    stream.close()


def stop_recording_and_transcribe():
    global recording
    recording = False
    if record_thread is not None:
        record_thread.join()  # wait for the last chunk and stream cleanup
    if not frames:
        return

    # Save to a temp WAV file
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    wf = wave.open(tmp.name, "wb")
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(audio.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))
    wf.close()

    try:
        # Transcribe locally
        segments, _ = model.transcribe(tmp.name, language="en")
        text = " ".join(seg.text for seg in segments).strip()
        print(f"Transcribed: {text}")

        # Paste into the active application (macOS)
        subprocess.run(["pbcopy"], input=text.encode(), check=True)
        subprocess.run(["osascript", "-e",
                        'tell application "System Events" to keystroke "v" '
                        'using command down'], check=True)
    finally:
        os.unlink(tmp.name)  # don't accumulate temp WAV files


def on_press(key):
    global recording, record_thread
    # Map Ctrl+` to F18 via Karabiner. Key-repeat fires on_press
    # continuously while held; the `recording` guard prevents spawning
    # a new recorder thread per repeat.
    if key == keyboard.Key.f18 and not recording:
        recording = True
        record_thread = threading.Thread(target=record_loop, daemon=True)
        record_thread.start()


def on_release(key):
    if key == keyboard.Key.f18:
        stop_recording_and_transcribe()


with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
    print("Voice input ready. Hold hotkey to record.")
    listener.join()

Install dependencies (pyaudio requires the PortAudio library, e.g. brew install portaudio):

pip install faster-whisper pyaudio pynput

Run at startup (macOS launchd):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.local.voice-input</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/python3</string>
        <string>/Users/you/scripts/voice_input.py</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>

Install the launch agent:

cp com.local.voice-input.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/com.local.voice-input.plist

Why It Works

The faster-whisper library uses CTranslate2 to run the Whisper model with int8 quantization, which means the "base" model fits in about 1GB of RAM and transcribes short audio clips in under a second on modern CPUs. Since the model stays loaded in memory as a background service, there is no startup cost per transcription -- you get near-instant results. The hotkey-to-paste pipeline means transcribed text appears exactly where your cursor is, just like native dictation but without sending audio to any external service.
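The audio volumes involved are also tiny: at the recording settings used above (16 kHz, mono, 16-bit PCM), raw audio is only 32 KB per second, so even a long utterance is a small file for the model to process. A quick back-of-envelope check:

```python
# Back-of-envelope: raw PCM data rate at the recording settings used above.
RATE = 16000       # samples per second
CHANNELS = 1       # mono
SAMPLE_WIDTH = 2   # bytes per int16 sample

bytes_per_second = RATE * CHANNELS * SAMPLE_WIDTH
print(bytes_per_second)                # 32000 -> 32 KB/s
print(10 * bytes_per_second / 1024)    # a 10 s utterance is ~312.5 KB
```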

Context

  • The "base" model gives the best speed/quality tradeoff for short dictation; use "small" or "medium" if accuracy on technical terms matters more than speed
  • On Apple Silicon Macs, faster-whisper runs CPU-only (CTranslate2 has no Core ML backend); if you want Neural Engine acceleration, whisper.cpp supports it via Core ML
  • Use Karabiner-Elements to remap Ctrl+backtick to F18 since global hotkey capture in Python is limited on macOS
  • Alternatives: OpenWhisper (BYO API key, still cloud-based), WisprFlow ($10/month), macOS built-in dictation (limited accuracy for technical speech)
  • Grant the Python process Accessibility and Microphone permissions in System Settings for the hotkey and recording to work
  • For Linux, replace the osascript paste command with xdotool key ctrl+v and use xclip instead of pbcopy
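
The macOS/Linux difference in the last bullet can be isolated behind a small helper (hypothetical `paste_commands` function; command names taken from the bullet above):

```python
import sys


def paste_commands(platform=sys.platform):
    """Return (copy_cmd, paste_cmd) argument lists for subprocess.run.

    Hypothetical helper isolating the platform-specific clipboard/paste
    commands: pbcopy + osascript on macOS, xclip + xdotool on X11 Linux.
    """
    if platform == "darwin":
        return (["pbcopy"],
                ["osascript", "-e",
                 'tell application "System Events" to keystroke "v" '
                 'using command down'])
    # X11 Linux: xclip fills the clipboard, xdotool sends Ctrl+V
    return (["xclip", "-selection", "clipboard"],
            ["xdotool", "key", "ctrl+v"])


copy_cmd, paste_cmd = paste_commands("linux")
# copy_cmd == ["xclip", "-selection", "clipboard"]
```

The copy command would be run with `input=text.encode()`, the paste command with no input, mirroring the macOS path in the main script.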
About this share

Contributor: mblode
Repository: mblode/shares
Created: Feb 10, 2026