Problem
When iterating on UI designs with an AI coding assistant, describing visual changes in text is tedious and imprecise. Instructions like "move the list over a bit" or "make that section bigger" lose meaning when typed out -- you end up over-explaining layout details that would take seconds to describe verbally. For rapid UI prototyping, the typing bottleneck kills the feedback loop.
Solution
Build a browser-based voice UI editing system with three stages: select, speak, apply.
1. Element selection via click
Inject a script that captures the DOM path of any clicked element:
// Selection mode: capture clicks before the app's own handlers fire,
// so clicking selects an element instead of triggering app behavior.
document.addEventListener("click", (e) => {
  e.preventDefault();
  e.stopPropagation();
  setSelectedElement(getDomPath(e.target as HTMLElement));
}, { capture: true });
// Build a CSS selector path from the root down to the clicked element,
// e.g. "body:nth-child(2) > div:nth-child(1) > ul:nth-child(3)".
function getDomPath(el: HTMLElement): string {
  const parts: string[] = [];
  while (el.parentElement) {
    const index = Array.from(el.parentElement.children).indexOf(el);
    parts.unshift(`${el.tagName.toLowerCase()}:nth-child(${index + 1})`);
    el = el.parentElement;
  }
  return parts.join(" > ");
}
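The traversal itself is pure tree-walking; a sketch of the same logic over a minimal duck-typed node interface (illustrative, structurally compatible with HTMLElement) makes the selector format easy to verify outside a browser:

```typescript
// Minimal node shape needed for path building (hypothetical interface).
interface PathNode {
  tagName: string;
  parentElement: PathNode | null;
  children: PathNode[];
}

// Same algorithm as getDomPath, minus the DOM dependency.
function domPath(el: PathNode): string {
  const parts: string[] = [];
  while (el.parentElement) {
    const index = el.parentElement.children.indexOf(el);
    parts.unshift(`${el.tagName.toLowerCase()}:nth-child(${index + 1})`);
    el = el.parentElement;
  }
  return parts.join(" > ");
}
```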
2. Voice capture and translation
Use OpenAI's Realtime API over WebRTC to capture speech and translate natural language into developer terminology:
const pc = new RTCPeerConnection();

// Stream the microphone to the model over WebRTC.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
stream.getTracks().forEach((track) => pc.addTrack(track, stream));

// Model events (transcripts, responses) arrive on a data channel.
// (The SDP offer/answer exchange with OpenAI is omitted here.)
const dataChannel = pc.createDataChannel("oai-events");
dataChannel.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "response.text.done") {
    // e.g. "move this list to the right" -> "justify-content: flex-end"
    applyChange(selectedElement, event.text);
  }
};
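The snippet above leaves out the connection handshake. A hedged sketch of the missing offer/answer exchange, following OpenAI's documented Realtime-over-WebRTC flow (the model name is an assumption, and ephemeralKey is a short-lived client token minted by your own backend, never your API key):

```typescript
// Hypothetical helper: build the Realtime endpoint URL for a given model.
const realtimeUrl = (model: string) =>
  `https://api.openai.com/v1/realtime?model=${encodeURIComponent(model)}`;

// Sketch of the WebRTC offer/answer exchange with OpenAI.
async function connectRealtime(pc: RTCPeerConnection, ephemeralKey: string) {
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const resp = await fetch(realtimeUrl("gpt-4o-realtime-preview"), {
    method: "POST",
    body: offer.sdp,
    headers: {
      Authorization: `Bearer ${ephemeralKey}`,
      "Content-Type": "application/sdp",
    },
  });
  // The server replies with an SDP answer as plain text.
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });
}
```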
The Realtime model translates vague human descriptions into precise CSS/layout terms: "make it bigger" becomes font-size: 1.25rem; "push it to the right" becomes justify-content: flex-end.
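One way to enforce that translator behavior is a session.update event sent when the data channel opens (the event shape follows the Realtime API; the instruction wording is our own assumption):

```typescript
// Build the session.update payload (shape per the Realtime API;
// the instruction text is illustrative, not a quoted prompt).
function sessionUpdateEvent(instructions: string) {
  return {
    type: "session.update",
    session: { modalities: ["text"], instructions },
  };
}

// Usage: dataChannel.onopen = () =>
//   dataChannel.send(JSON.stringify(sessionUpdateEvent(
//     "Translate vague UI requests into precise CSS. Output only the change.")));
```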
3. Feed precise instructions to Claude
Combine the element path and translated change into a structured prompt:
async function applyChange(elementPath: string, translatedChange: string) {
  const instruction = [
    `Target element: ${elementPath}`,
    `Change: ${translatedChange}`,
    `Apply this change to the component source file.`,
  ].join("\n");
  await sendToClaude(instruction);
}
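Because the prompt is deterministic, the assembly step can live in a pure helper (hypothetical name) and be unit-tested without any API client:

```typescript
// Hypothetical pure helper: assemble the structured prompt so it can be
// tested independently of sendToClaude.
function buildInstruction(elementPath: string, translatedChange: string): string {
  return [
    `Target element: ${elementPath}`,
    `Change: ${translatedChange}`,
    `Apply this change to the component source file.`,
  ].join("\n");
}
```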
Optional: design system awareness
Pass your design system tokens as context so the translation model maps to your actual variables:
const systemPrompt = `You translate UI change requests into developer instructions.
Use these design tokens when possible: ${JSON.stringify(designTokens)}.
Output only the technical change description, no explanation.`;
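Here designTokens can be as simple as a flat map from variable name to value; the tokens below are made up for illustration, so the translator can answer "make it bigger" with font-size: var(--font-size-lg) instead of a raw rem value:

```typescript
// Example token map (illustrative names and values, not a real design system).
const designTokens: Record<string, string> = {
  "--font-size-md": "1rem",
  "--font-size-lg": "1.25rem",
  "--space-4": "16px",
  "--color-accent": "#6c5ce7",
};
```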
Why It Works
The key insight is using two models with different strengths. A fast, cheap model (OpenAI's Realtime API) handles the latency-sensitive speech-to-text translation, converting "human-ese" into precise developer terminology in real time. Claude then handles the complex task of applying the change to actual source code. The Realtime API is a good fit for the voice layer because WebRTC provides built-in echo cancellation and reliable voice activity detection -- both critical for smooth voice interaction. Element selection via click removes ambiguity about which component to modify, so the voice instruction only needs to describe the change itself.
Context
- OpenAI Realtime API used for speech because of WebRTC benefits (echo cancellation, VAD) and low latency
- The translation model can be toggled between global commands (affecting the whole page) and targeted commands (affecting the selected element)
- Works with any frontend framework -- the DOM path approach is framework-agnostic
- Can evolve to support multi-modal input: click to select, voice to describe, screenshot to verify
- The same two-model pattern applies beyond UI: use a fast model for real-time interaction, a capable model for complex execution