Problem
When iterating on UI designs with an AI coding assistant, describing visual changes in text is tedious and imprecise. Instructions like "move the list over a bit" or "make that section bigger" lose meaning when typed out -- you end up over-explaining layout details that would take seconds to describe verbally. For rapid UI prototyping, the typing bottleneck kills the feedback loop.
Solution
Build a browser-based voice UI editing system with three stages: select, speak, apply.
1. Element selection via click
Inject a script that captures the DOM path of any clicked element:
// Selection mode: capture clicks before the app's own handlers fire,
// so clicking selects an element instead of triggering app behavior.
document.addEventListener("click", (e) => {
  e.preventDefault();
  e.stopPropagation();
  setSelectedElement(getDomPath(e.target as HTMLElement));
}, { capture: true });
// Build a CSS selector path from the root down to the clicked element,
// e.g. "body:nth-child(2) > div:nth-child(1) > ul:nth-child(3)".
function getDomPath(el: HTMLElement): string {
  const parts: string[] = [];
  while (el.parentElement) {
    const index = Array.from(el.parentElement.children).indexOf(el);
    parts.unshift(`${el.tagName.toLowerCase()}:nth-child(${index + 1})`);
    el = el.parentElement;
  }
  return parts.join(" > ");
}
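The traversal itself is pure tree-walking; a sketch of the same logic over a minimal duck-typed node interface (illustrative, structurally compatible with HTMLElement) makes the selector format easy to verify outside a browser:

```typescript
// Minimal node shape needed for path building (hypothetical interface).
interface PathNode {
  tagName: string;
  parentElement: PathNode | null;
  children: PathNode[];
}

// Same algorithm as getDomPath, minus the DOM dependency.
function domPath(el: PathNode): string {
  const parts: string[] = [];
  while (el.parentElement) {
    const index = el.parentElement.children.indexOf(el);
    parts.unshift(`${el.tagName.toLowerCase()}:nth-child(${index + 1})`);
    el = el.parentElement;
  }
  return parts.join(" > ");
}
```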
2. Voice capture and translation
Use OpenAI's Realtime API over WebRTC to capture speech and translate natural language into developer terminology:
const pc = new RTCPeerConnection();

// Stream the microphone to the model over WebRTC.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
stream.getTracks().forEach((track) => pc.addTrack(track, stream));

// Model events (transcripts, responses) arrive on a data channel.
// (The SDP offer/answer exchange with OpenAI is omitted here.)
const dataChannel = pc.createDataChannel("oai-events");
dataChannel.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "response.text.done") {
    // e.g. "move this list to the right" -> "justify-content: flex-end"
    applyChange(selectedElement, event.text);
  }
};
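The snippet above leaves out the connection handshake. A hedged sketch of the missing offer/answer exchange, following OpenAI's documented Realtime-over-WebRTC flow (the model name is an assumption, and ephemeralKey is a short-lived client token minted by your own backend, never your API key):

```typescript
// Hypothetical helper: build the Realtime endpoint URL for a given model.
const realtimeUrl = (model: string) =>
  `https://api.openai.com/v1/realtime?model=${encodeURIComponent(model)}`;

// Sketch of the WebRTC offer/answer exchange with OpenAI.
async function connectRealtime(pc: RTCPeerConnection, ephemeralKey: string) {
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const resp = await fetch(realtimeUrl("gpt-4o-realtime-preview"), {
    method: "POST",
    body: offer.sdp,
    headers: {
      Authorization: `Bearer ${ephemeralKey}`,
      "Content-Type": "application/sdp",
    },
  });
  // The server replies with an SDP answer as plain text.
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });
}
```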
The Realtime model translates vague human descriptions into precise CSS/layout terms: "make it bigger" becomes font-size: 1.25rem; "push it to the right" becomes justify-content: flex-end.
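One way to enforce that translator behavior is a session.update event sent when the data channel opens (the event shape follows the Realtime API; the instruction wording is our own assumption):

```typescript
// Build the session.update payload (shape per the Realtime API;
// the instruction text is illustrative, not a quoted prompt).
function sessionUpdateEvent(instructions: string) {
  return {
    type: "session.update",
    session: { modalities: ["text"], instructions },
  };
}

// Usage: dataChannel.onopen = () =>
//   dataChannel.send(JSON.stringify(sessionUpdateEvent(
//     "Translate vague UI requests into precise CSS. Output only the change.")));
```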
3. Feed precise instructions to Claude
Combine the element path and translated change into a structured prompt:
async function applyChange(elementPath: string, translatedChange: string) {
  const instruction = [
    `Target element: ${elementPath}`,
    `Change: ${translatedChange}`,
    `Apply this change to the component source file.`,
  ].join("\n");
  await sendToClaude(instruction);
}
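Because the prompt is deterministic, the assembly step can live in a pure helper (hypothetical name) and be unit-tested without any API client:

```typescript
// Hypothetical pure helper: assemble the structured prompt so it can be
// tested independently of sendToClaude.
function buildInstruction(elementPath: string, translatedChange: string): string {
  return [
    `Target element: ${elementPath}`,
    `Change: ${translatedChange}`,
    `Apply this change to the component source file.`,
  ].join("\n");
}
```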
Optional: design system awareness
Pass your design system tokens as context so the translation model maps to your actual variables:
const systemPrompt = `You translate UI change requests into developer instructions.
Use these design tokens when possible: ${JSON.stringify(designTokens)}.
Output only the technical change description, no explanation.`;
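Here designTokens can be as simple as a flat map from variable name to value; the tokens below are made up for illustration, so the translator can answer "make it bigger" with font-size: var(--font-size-lg) instead of a raw rem value:

```typescript
// Example token map (illustrative names and values, not a real design system).
const designTokens: Record<string, string> = {
  "--font-size-md": "1rem",
  "--font-size-lg": "1.25rem",
  "--space-4": "16px",
  "--color-accent": "#6c5ce7",
};
```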
Why It Works
The key insight is using two models with different strengths. A fast, cheap model (OpenAI's Realtime API) handles the latency-sensitive speech-to-text translation, converting "human-ese" into precise developer terminology in real time. Claude then handles the complex task of applying the change to actual source code. The Realtime API is a good fit for the voice layer because WebRTC provides built-in echo cancellation and reliable voice activity detection -- both critical for smooth voice interaction. Element selection via click removes ambiguity about which component to modify, so the voice instruction only needs to describe the change itself.
Context
- OpenAI Realtime API used for speech because of WebRTC benefits (echo cancellation, VAD) and low latency
- The translation model can be toggled between global commands (affecting the whole page) and targeted commands (affecting the selected element)
- Works with any frontend framework -- the DOM path approach is framework-agnostic
- Can evolve to support multi-modal input: click to select, voice to describe, screenshot to verify
- The same two-model pattern applies beyond UI: use a fast model for real-time interaction, a capable model for complex execution