
Voice-driven UI editing with speech-to-text and a translation model

pattern

Typing detailed UI change instructions is slow, and describing visual changes in text is imprecise

claude · voice · speech-to-text · ui-editing · openai · webrtc

Problem

When iterating on UI designs with an AI coding assistant, describing visual changes in text is tedious and imprecise. Instructions like "move the list over a bit" or "make that section bigger" lose meaning when typed out -- you end up over-explaining layout details that would take seconds to describe verbally. For rapid UI prototyping, the typing bottleneck kills the feedback loop.

Solution

Build a browser-based voice UI editing system with three stages: select, speak, apply.

1. Element selection via click

Inject a script that captures the DOM path of any clicked element:

// Runs only while "select mode" is active. Capture phase plus
// stopPropagation keeps the click from triggering the app's own handlers.
document.addEventListener(
  "click",
  (e) => {
    e.preventDefault();
    e.stopPropagation();
    const path = getDomPath(e.target as HTMLElement);
    setSelectedElement(path);
  },
  { capture: true },
);

function getDomPath(el: HTMLElement): string {
  const parts: string[] = [];
  while (el.parentElement) {
    const index = Array.from(el.parentElement.children).indexOf(el);
    parts.unshift(`${el.tagName.toLowerCase()}:nth-child(${index + 1})`);
    el = el.parentElement;
  }
  return parts.join(" > ");
}
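Because getDomPath emits standard :nth-child selectors, the resulting path can be resolved back to a live element with document.querySelector(path). As a sketch of the inverse operation, the hypothetical helper below (parseDomPath is not part of the original) splits a path back into its segments, which is useful for things like highlighting the selected element:

```typescript
// Hypothetical helper: parse a path produced by getDomPath back into
// { tag, index } segments. Indices are 1-based, matching :nth-child().
interface PathSegment {
  tag: string;
  index: number;
}

function parseDomPath(path: string): PathSegment[] {
  return path.split(" > ").map((part) => {
    const match = part.match(/^(\w+):nth-child\((\d+)\)$/);
    if (!match) throw new Error(`Malformed path segment: ${part}`);
    return { tag: match[1], index: Number(match[2]) };
  });
}
```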

2. Voice capture and translation

Use OpenAI's real-time API via WebRTC to capture speech and translate natural language into developer terminology:

const pc = new RTCPeerConnection();
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
stream.getTracks().forEach((track) => pc.addTrack(track, stream));

const dataChannel = pc.createDataChannel("oai-events");
dataChannel.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "response.text.done") {
    // e.g. "move this list to the right" -> "justify-content: flex-end"
    applyChange(selectedElement, event.text);
  }
};

// Omitted for brevity: create an SDP offer with pc.createOffer(), set it as
// the local description, and exchange it with OpenAI's realtime endpoint.

The real-time model translates vague human descriptions into precise CSS/layout terms. "Make it bigger" becomes font-size: 1.25rem. "Push it to the right" becomes justify-content: flex-end.
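To make that mapping concrete, here is an illustrative hard-coded lookup built from the article's own examples. The real system relies on the realtime model to handle arbitrary phrasing; PHRASE_MAP and translateFallback are hypothetical names for this sketch only:

```typescript
// Illustrative only: a fixed phrase map approximating what the translation
// model produces. The model generalizes to any phrasing; this does not.
const PHRASE_MAP: Record<string, string> = {
  "make it bigger": "font-size: 1.25rem",
  "push it to the right": "justify-content: flex-end",
  "move this list to the right": "justify-content: flex-end",
};

function translateFallback(utterance: string): string | null {
  return PHRASE_MAP[utterance.trim().toLowerCase()] ?? null;
}
```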

3. Feed precise instructions to Claude

Combine the element path and translated change into a structured prompt:

async function applyChange(elementPath: string, translatedChange: string) {
  const instruction = [
    `Target element: ${elementPath}`,
    `Change: ${translatedChange}`,
    `Apply this change to the component source file.`,
  ].join("\n");

  await sendToClaude(instruction);
}

Optional: design system awareness

Pass your design system tokens as context so the translation model maps to your actual variables:

const systemPrompt = `You translate UI change requests into developer instructions.
Use these design tokens when possible: ${JSON.stringify(designTokens)}.
Output only the technical change description, no explanation.`;
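One way to deliver this prompt is over the data channel as a session.update client event, which the Realtime API accepts with an instructions field. A minimal sketch (buildSessionUpdate is a hypothetical helper, not part of the original):

```typescript
// Build a session.update event carrying the system prompt as the
// session's instructions. Returns the JSON string to send on the channel.
function buildSessionUpdate(instructions: string): string {
  return JSON.stringify({
    type: "session.update",
    session: { instructions },
  });
}

// Usage, with systemPrompt and dataChannel from the snippets above:
// dataChannel.onopen = () => dataChannel.send(buildSessionUpdate(systemPrompt));
```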

Why It Works

The key insight is using two models with different strengths. A fast, cheap model (OpenAI real-time API) handles the latency-sensitive speech-to-text translation, converting "human-ese" into precise developer terminology in real time. Claude then handles the complex task of applying the change to actual source code. OpenAI's real-time API is ideal for the voice layer because WebRTC provides built-in echo cancellation and reliable voice activity detection -- critical for a smooth voice interaction. The element selection via click removes ambiguity about which component to modify, so the voice instruction only needs to describe the change itself.

Context

  • OpenAI real-time API used for speech because of WebRTC benefits (echo cancellation, VAD) and low latency
  • The translation model can be toggled between global comments (affecting the whole page) and targeted comments (affecting the selected element)
  • Works with any frontend framework -- the DOM path approach is framework-agnostic
  • Can evolve to support multi-modal input: click to select, voice to describe, screenshot to verify
  • The same two-model pattern applies beyond UI: use a fast model for real-time interaction, a capable model for complex execution
About this share
Contributor: mblode
Repository: mblode/shares
Created: Feb 10, 2026