
Reference for Gemini Flash 2.5 Live Audio Streaming with Tool Calls

reference

No clear guide for wiring real-time microphone audio to Gemini Live API with tool execution via WebSocket

gemini · websocket · nextjs · google-ai · typescript · real-time · tool-calling · audio-streaming

Problem

Connecting browser microphone audio to the Gemini Flash 2.5 Live API requires coordinating several moving parts: capturing audio via getUserMedia, converting Float32 samples to 16kHz Int16 PCM, streaming base64-encoded chunks over a WebSocket, and handling two different tool call response formats from the server.

The Gemini Live WebSocket protocol expects a specific setup message with model config, tools, and system instructions before accepting audio. Tool calls arrive in two formats (toolCall.functionCalls and serverContent.modelTurn.parts[].functionCall) and require a toolResponse acknowledgment to complete the loop.

Without a clear integration reference, each piece works in isolation but wiring the full pipeline — mic to PCM to WebSocket to tool execution and back — involves undocumented protocol details.

Solution

The architecture follows this data flow:

Mic (16kHz PCM) → Audio Processing (Float32→Int16→Base64) → Gemini WebSocket → Tool Execution

1. Define tools for Gemini

Declare function schemas that Gemini will call when it recognizes intent from speech:

// lib/gemini-tools.ts
export const RAMBLE_TOOLS = [
  {
    name: "add_task",
    description: "Add a new task. Extract title, due date, project, priority, tags, and duration from speech.",
    parameters: {
      type: "object",
      properties: {
        title: { type: "string", description: "The task title" },
        dueDate: { type: "string", description: "Due date in YYYY-MM-DD or natural language" },
        project: { type: "string", description: "Project to assign the task to" },
        priority: { type: "number", description: "Priority: 1 (highest) to 4 (lowest)" },
        tags: { type: "array", items: { type: "string" }, description: "Tags to assign" },
        duration: { type: "number", description: "Estimated duration in minutes" },
      },
      required: ["title"],
    },
  },
  // update_task, remove_task, end_session follow the same pattern
];

export const SYSTEM_PROMPT = `You are a voice-activated task capture assistant.
When the user mentions a task, use add_task immediately.
Parse natural language dates: "tomorrow", "next week", "on Friday".
Do NOT speak or provide audio responses. Only use tools.`;
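The `executeToolCall` helper referenced in step 3 can be a plain name-to-handler dispatch. A minimal sketch — the handler bodies here are placeholders for your own task-store logic:

```typescript
// lib/execute-tool-call.ts (sketch — handler bodies are placeholders)
type ToolArgs = Record<string, unknown>;

const handlers: Record<string, (args: ToolArgs) => void> = {
  add_task: (args) => console.log("add_task", args),
  // update_task, remove_task, end_session follow the same pattern
};

export function executeToolCall(name: string, args: ToolArgs): boolean {
  const handler = handlers[name];
  if (!handler) return false; // unknown tool — ignore rather than crash
  handler(args);
  return true;
}
```

Returning a boolean lets the WebSocket handler decide whether to acknowledge with `success: true` or report an unrecognized tool.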

2. Capture microphone audio as 16kHz PCM

Convert browser audio to the format Gemini expects:

// hooks/use-audio-stream.ts
const TARGET_SAMPLE_RATE = 16_000;
const BUFFER_SIZE = 4096;

function float32ToInt16(float32Array: Float32Array): Int16Array {
  const int16Array = new Int16Array(float32Array.length);
  for (let i = 0; i < float32Array.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    int16Array[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return int16Array;
}

function arrayBufferToBase64(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  let binary = "";
  for (let i = 0; i < bytes.byteLength; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// In the hook: capture → resample → encode → callback
const audioContext = new AudioContext({ sampleRate: TARGET_SAMPLE_RATE });
const source = audioContext.createMediaStreamSource(stream);
const processor = audioContext.createScriptProcessor(BUFFER_SIZE, 1, 1);

processor.onaudioprocess = (event) => {
  const inputData = event.inputBuffer.getChannelData(0);
  const pcm16 = float32ToInt16(inputData); // Already at 16kHz if AudioContext honors it
  const base64 = arrayBufferToBase64(pcm16.buffer);
  onAudioData(base64);
};
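If `audioContext.sampleRate` comes back as something other than 16 kHz (some browsers ignore the constructor hint), a small linear-interpolation resampler can bridge the gap before `float32ToInt16`. A sketch:

```typescript
// Fallback resampler for browsers that ignore the AudioContext sampleRate hint.
// Check audioContext.sampleRate at runtime and only resample when it differs.
function resampleTo16k(input: Float32Array, inputRate: number): Float32Array {
  if (inputRate === 16_000) return input;
  const ratio = inputRate / 16_000;
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    // Linear interpolation between the two nearest source samples
    output[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return output;
}
```

Linear interpolation is adequate for speech recognition input; a proper low-pass filter before downsampling would reduce aliasing but adds complexity.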

3. Connect to Gemini Live via WebSocket

The setup message must include model, generation config, system instruction, and tools:

// hooks/use-gemini-live.ts
ws.onopen = () => {
  ws.send(JSON.stringify({
    setup: {
      model: "projects/{id}/locations/us-central1/publishers/google/models/gemini-live-2.5-flash-native-audio",
      generation_config: {
        response_modalities: ["AUDIO"],
        temperature: 0.3,
        speech_config: {
          voice_config: { prebuilt_voice_config: { voice_name: "Puck" } },
        },
      },
      system_instruction: { parts: [{ text: SYSTEM_PROMPT }] },
      tools: [{ function_declarations: RAMBLE_TOOLS }],
    },
  }));
};

Handle the two tool call formats and send responses back:

// Tool calls arrive in two formats — handle both
ws.onmessage = async (event) => {
  const text = event.data instanceof Blob ? await event.data.text() : event.data;
  const msg = JSON.parse(text);

  // Format 1: direct toolCall
  if (msg.toolCall?.functionCalls) {
    for (const fc of msg.toolCall.functionCalls) {
      executeToolCall(fc.name, fc.args);
      // Acknowledge the tool call
      ws.send(JSON.stringify({
        toolResponse: {
          functionResponses: [{ id: fc.id, name: fc.name, response: { success: true } }],
        },
      }));
    }
  }

  // Format 2: embedded in serverContent.modelTurn.parts
  if (msg.serverContent?.modelTurn?.parts) {
    for (const part of msg.serverContent.modelTurn.parts) {
      if (part.functionCall) {
        executeToolCall(part.functionCall.name, part.functionCall.args);
        ws.send(JSON.stringify({
          toolResponse: {
            functionResponses: [{ id: part.functionCall.id, name: part.functionCall.name, response: { success: true } }],
          },
        }));
      }
    }
  }
};
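Since both shapes carry the same payload, the duplicated handling above can be collapsed with a small normalizer. A sketch, following the message shapes shown above:

```typescript
interface FunctionCall {
  id?: string;
  name: string;
  args: Record<string, unknown>;
}

// Collect function calls from either message shape into a single list.
export function extractFunctionCalls(msg: any): FunctionCall[] {
  const calls: FunctionCall[] = [...(msg.toolCall?.functionCalls ?? [])];
  for (const part of msg.serverContent?.modelTurn?.parts ?? []) {
    if (part.functionCall) calls.push(part.functionCall);
  }
  return calls;
}
```

The `onmessage` handler then loops over `extractFunctionCalls(msg)` once, executing and acknowledging each call in a single code path.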

4. Stream audio chunks to Gemini

// Send base64 PCM audio to the live session
function sendAudio(pcmBase64: string) {
  ws.send(JSON.stringify({
    realtime_input: {
      media_chunks: [{
        data: pcmBase64,
        mime_type: "audio/pcm;rate=16000",
      }],
    },
  }));
}
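One easy-to-miss failure mode: `onaudioprocess` starts firing while the socket is still connecting, and `WebSocket.send` throws on a non-open socket. A guarded variant — the `AudioSocket` interface is just for illustration, so the helper can be exercised outside the browser:

```typescript
// Minimal structural interface matching the WebSocket members we use.
interface AudioSocket {
  readyState: number;
  send(data: string): void;
}

const OPEN = 1; // WebSocket.OPEN

// Drop chunks while the socket is connecting or closing instead of throwing.
function sendAudioSafe(ws: AudioSocket, pcmBase64: string): boolean {
  if (ws.readyState !== OPEN) return false;
  ws.send(JSON.stringify({
    realtime_input: {
      media_chunks: [{ data: pcmBase64, mime_type: "audio/pcm;rate=16000" }],
    },
  }));
  return true;
}
```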

5. Session API route (Next.js)

// app/api/ramble/session/route.ts
import { NextResponse } from "next/server";

export function POST() {
  // Note: this returns the raw API key to the client — use short-lived tokens in production
  const apiKey = process.env.GOOGLE_AI_API_KEY;
  const projectId = process.env.GOOGLE_CLOUD_PROJECT;

  return NextResponse.json({
    apiKey,
    model: `projects/${projectId}/locations/us-central1/publishers/google/models/gemini-live-2.5-flash-native-audio`,
    wsUrl: "wss://us-central1-aiplatform.googleapis.com/ws/google.cloud.aiplatform.v1beta1.LlmBidiService/BidiGenerateContent",
  });
}

Why It Works

The Gemini Live API uses a bidirectional WebSocket where audio is streamed as base64-encoded PCM chunks in real time. When the setup message includes tools with function declarations, Gemini processes incoming speech and emits structured functionCall objects instead of (or alongside) audio responses. This is the same tool-calling mechanism as the REST API, but operating on a continuous audio stream.

The critical detail is that tool calls must be acknowledged with a toolResponse message containing the original call id. Without this response, Gemini stalls waiting for confirmation. The two response formats (toolCall.functionCalls at the top level and serverContent.modelTurn.parts[].functionCall nested in content) both represent the same semantic event — Gemini uses either depending on whether the call interrupts or continues a model turn.

The audio encoding pipeline (Float32 → Int16 → base64) matches Gemini's expected audio/pcm;rate=16000 format. Requesting an AudioContext at 16kHz avoids manual resampling in most browsers, though a fallback resampler handles cases where the browser ignores the sample rate hint.

Context

  • Model gemini-live-2.5-flash-native-audio on Vertex AI, region us-central1, API version v1beta1
  • Audio format: 16kHz signed Int16 PCM, 4096-sample buffer, base64-encoded for JSON transport
  • ScriptProcessorNode is deprecated — production apps should migrate to AudioWorkletNode
  • The session API route exposes the API key to the client; production should use short-lived tokens or a proxy
  • WebSocket messages may arrive as either a Blob or plain text depending on the browser — parse both
  • Setting response_modalities: ["AUDIO"] is required even when you only want tool calls; otherwise the setup message is rejected
About this share
Contributor: mblode
Repository: mblode/shares
Created: Feb 10, 2026
Environment: nextjs