Problem
You think of a task while walking, driving, or away from your desk. Voice memos capture the thought but then you have to listen back, type it into Things 3 or Todoist, add tags, set a due date -- by which point the friction has killed the habit. Voice assistants like Siri create reminders but cannot parse complex task structures, assign projects, or update existing items. You need a pipeline that goes from speech directly to a structured task entry with no intermediate manual step. Todoist built this with Ramble, but it only works with Todoist -- you need the same pattern for any task manager.
Solution
Stream audio directly to the Gemini 2.5 Flash Live API and let it map spoken intent to structured tool calls (add_task, update_task, remove_task) that execute against your task manager.
Architecture:
Microphone Audio --(WebSocket stream)--> Gemini 2.5 Flash Live
--> Parses intent + extracts structure
--> Tool calls: add_task(...), update_task(...), remove_task(...)
--> Execute against task manager API
--> Voice confirmation back to user
Step 1: Define tool declarations for task operations
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const tools = [{
  functionDeclarations: [{
    name: "add_task",
    description: "Add a new task to the task manager",
    parameters: {
      type: "object",
      properties: {
        title: { type: "string", description: "Task title" },
        notes: { type: "string", description: "Additional details" },
        project: { type: "string", description: "Project name" },
        dueDate: { type: "string", description: "ISO date string" },
        tags: { type: "array", items: { type: "string" } },
      },
      required: ["title"],
    },
  }, {
    name: "update_task",
    description: "Update an existing task by title",
    parameters: {
      type: "object",
      properties: {
        title: { type: "string" },
        completed: { type: "boolean" },
        newDueDate: { type: "string" },
      },
      required: ["title"],
    },
  }, {
    name: "remove_task",
    description: "Complete or delete a task",
    parameters: {
      type: "object",
      properties: {
        title: { type: "string" },
        action: { type: "string", enum: ["complete", "delete"] },
      },
      required: ["title"],
    },
  }],
}];
Step 2: Connect to the Gemini Live session and handle tool calls
const session = await ai.live.connect({
  model: "gemini-2.5-flash-preview-native-audio-dialog",
  config: {
    // Live sessions take a single response modality; AUDIO gives spoken confirmations
    responseModalities: ["AUDIO"],
    tools,
    systemInstruction: "You are a task capture assistant. When the user describes something they need to do, call add_task with structured fields. Confirm each task verbally.",
  },
  callbacks: {
    // Tool calls arrive as server messages through the onmessage callback,
    // not as emitter events
    onmessage: async (message) => {
      if (!message.toolCall) return;
      for (const fc of message.toolCall.functionCalls) {
        const result = await executeTaskAction(fc.name, fc.args);
        session.sendToolResponse({
          functionResponses: [{
            name: fc.name,
            id: fc.id,
            response: result,
          }],
        });
      }
    },
  },
});
Step 3: Bridge to Things 3 via URL scheme (or Todoist via REST API)
import { execSync } from "child_process";

// Escape double quotes so a spoken title cannot break out of the AppleScript string
const esc = (s) => s.replace(/"/g, '\\"');

function executeTaskAction(name, args) {
  if (name === "add_task") {
    const params = new URLSearchParams();
    params.set("title", args.title);
    if (args.notes) params.set("notes", args.notes);
    if (args.project) params.set("list", args.project);
    if (args.dueDate) params.set("deadline", args.dueDate);
    if (args.tags?.length) params.set("tags", args.tags.join(","));
    execSync(`open "things:///add?${params.toString()}"`);
    return { success: true, action: "added", title: args.title };
  }
  if (name === "remove_task") {
    const script = `tell application "Things3" to complete to do "${esc(args.title)}"`;
    execSync(`osascript -e '${script}'`);
    return { success: true, action: "completed", title: args.title };
  }
  // update_task and anything unrecognized: report back so the model can recover
  return { success: false, error: `Unhandled action: ${name}` };
}
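For Todoist, the same dispatcher shape swaps the URL scheme for a REST call. A minimal sketch of the add_task branch, assuming a TODOIST_API_TOKEN environment variable; the field names (content, description, due_date, labels) follow Todoist's REST v2 task schema:

```typescript
// Hypothetical Todoist counterpart to executeTaskAction's add_task branch.
// Maps the tool-call args onto Todoist REST v2 task fields.
function buildTodoistPayload(args: {
  title: string;
  notes?: string;
  dueDate?: string;
  tags?: string[];
}) {
  return {
    content: args.title,        // Todoist calls the task title "content"
    description: args.notes ?? "",
    due_date: args.dueDate,     // ISO date string, as in the tool declaration
    labels: args.tags ?? [],
  };
}

// Assumed env var: TODOIST_API_TOKEN (a Todoist personal API token)
async function addTodoistTask(args: Parameters<typeof buildTodoistPayload>[0]) {
  const res = await fetch("https://api.todoist.com/rest/v2/tasks", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.TODOIST_API_TOKEN}`,
    },
    body: JSON.stringify(buildTodoistPayload(args)),
  });
  return { success: res.ok, action: "added", title: args.title };
}
```

Because the tool declarations stay identical, switching backends touches only this execution layer.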
Step 4: Run as a background listener
# Start the voice-to-task listener
node voice-task.js --listen
# Or bind to a keyboard shortcut via macOS Shortcuts
# "Run Shell Script: node ~/scripts/voice-task.js --capture"
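One piece the steps above leave implicit is feeding microphone audio into the session. The Live API expects 16 kHz, 16-bit mono PCM, base64-encoded and sent via sendRealtimeInput; a sketch of the chunk conversion (the mic npm package in the usage comment is an assumption -- any raw PCM source works):

```typescript
// Wrap a raw PCM buffer in the realtime-input payload shape the
// Live API session expects (16 kHz, 16-bit mono PCM, base64-encoded).
function pcmChunkToRealtimeInput(chunk: Buffer) {
  return {
    audio: {
      data: chunk.toString("base64"),
      mimeType: "audio/pcm;rate=16000",
    },
  };
}

// Usage sketch, assuming the `mic` npm package for capture:
// const stream = mic({ rate: "16000", bitwidth: "16", channels: "1" }).getAudioStream();
// stream.on("data", (chunk: Buffer) =>
//   session.sendRealtimeInput(pcmChunkToRealtimeInput(chunk)));
```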
Why It Works
Gemini 2.5 Flash Live processes audio as a continuous stream with sub-second latency -- there is no "wait for transcription, then parse" step. The model receives raw audio and directly outputs structured tool calls, combining speech recognition, intent parsing, and parameter extraction into a single inference pass. You can say "add a task to the backend project -- fix the auth token refresh, high priority, due Friday" and get a fully structured add_task call without any intermediate text processing. The tool-calling interface constrains the output to valid actions, so the model cannot hallucinate arbitrary commands.
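To make the single-pass claim concrete, that spoken sentence would surface as one structured call, roughly shaped like this (illustrative values, not captured model output; note the Step 1 schema has no priority field, so "high priority" can only land in notes or a tag):

```typescript
// Illustrative tool call for: "add a task to the backend project --
// fix the auth token refresh, high priority, due Friday".
// All values are hypothetical; the model fills them from audio directly.
const capturedCall = {
  name: "add_task",
  args: {
    title: "Fix the auth token refresh",
    project: "backend",
    dueDate: "2026-02-06",        // "due Friday", resolved to an ISO date
    tags: ["high-priority"],      // no priority field in the declared schema
  },
};
```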
Context
- This pattern is a generalized version of Todoist Ramble that works with any task manager that has an API or URL scheme
- The Gemini Live API uses WebSocket streaming with built-in voice activity detection, so it knows when you stop speaking
- Things 3 URL scheme (things:///add) works on macOS and iOS; for Todoist, swap to their REST API at https://api.todoist.com/rest/v2/tasks
- The audio response modality lets Gemini speak confirmations back, creating a fully hands-free loop
- Echo cancellation is built into the Live API so confirmation audio does not trigger re-capture
- The same pattern extends beyond tasks -- any CRUD API can be voice-controlled by defining the right tool declarations
- Consider running this as a menu bar app using Electron or Tauri for always-available access
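The CRUD generalization in the bullet above is mostly a matter of swapping declarations. A hedged sketch of a calendar variant (add_event and all of its fields are hypothetical names; wire the execution side to whatever calendar API you use):

```typescript
// Hypothetical tool declaration for voice-controlled calendar capture --
// same pipeline as tasks, different CRUD target.
const calendarTools = [{
  functionDeclarations: [{
    name: "add_event",
    description: "Create a calendar event",
    parameters: {
      type: "object",
      properties: {
        title: { type: "string", description: "Event title" },
        start: { type: "string", description: "ISO 8601 start datetime" },
        durationMinutes: { type: "number" },
        attendees: { type: "array", items: { type: "string" } },
      },
      required: ["title", "start"],
    },
  }],
}];
```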