Agntic Voice — a voice assistant that actually feels like talking

I use speech-to-speech to think. Not to fire off commands — to work through ideas out loud, follow a thread, research something in the middle of a sentence and keep going. For that, one thing matters more than anything else: the conversation can't break. And the apps on the market break it constantly.

01 · The problemThe conversation kept breaking

The intelligence was never the issue. Claude and ChatGPT are both excellent. What ruined them for the way I work was everything around the model — the plumbing that's supposed to be invisible. When you're using voice to think, you're holding a train of thought in your head and trusting the tool to keep up. Every time it drops the thread, that train derails. And it dropped the thread in the same few ways, over and over:

Streaming stalls

The reply starts generating, then just… stops. Cut off mid-sentence, mid-idea. No error, no recovery — the thought hangs there, unfinished.

Silent failures

I finish talking and nothing comes back at all. No answer, no indication anything's wrong. I'm left asking the air whether it even heard me.

Missed input

It doesn't register that I spoke, or mangles it badly enough that the answer is to a question I never asked. So I repeat myself — and the momentum is gone.

Trigger-happy open mic

Continuous mode is so sensitive it cuts me off the second I pause to think. A breath mid-thought gets read as "I'm done," and it barges in.

On top of all that, the voice itself was barely adjustable — flat, generic, one or two options, none of which sounded like something I wanted to talk to for an hour. Add it up and the picture is clear: none of these are intelligence problems. They're continuity problems. The model knows the answer; the experience around it keeps fumbling the handoff.

Here's what that actually feels like — a thought, interrupted versus a thought that lands:

When you're thinking out loud, an assistant that stalls isn't slow — it's a dropped call in the middle of your own sentence.

02 · The experienceStart to finish

So here's the thing itself, the way you actually move through it — from first launch to a running conversation. Every screen below is the real design. It plays through on its own; tap a step to jump.

The open mic, fixed

The continuous mode gets a specific answer to a specific failure. Instead of barging in the instant your voice drops, end-of-turn detection is tunable, and "auto-detect end of turn" is a setting you control rather than a hidden hair-trigger. You can pause to think without getting cut off — which, for ideation, is the entire point.

The voice, finally yours

And the customization the market wouldn't give me: a real roster of voices with distinct character — calm and neutral, warm and soft-spoken, a measured baritone, a bright British lilt — that you preview and choose, plus speed and auto-play controls. It sounds like something you'd actually want to talk with for an hour.

03 · The platformHow it's built

Under that quiet interface is a deliberately simple stack. It's a native iOS app — SwiftUI and AVFoundation, no backend. The phone talks straight to the Anthropic and ElevenLabs APIs, with keys held in the iOS Keychain. For a single-user personal tool, that's the least machinery that does the job, and it keeps latency down by removing a server hop entirely.

The interface follows Apple's iOS 26 "Liquid Glass" material — translucent surfaces that blur and refract what's behind them. Every control in the app, from the mode toggle to the citation chips, is built from one glass recipe; here's the actual token system underneath it:

Every spoken turn runs through the same five-stage pipeline. The trick to continuity lives in how these stages overlap rather than wait for each other:

Speak before the answer is finished

The single most important continuity decision: the app never waits for a full reply. As Claude streams its response, the text is split at sentence boundaries, and each finished sentence is synthesized and played while later sentences are still being written. You hear the first words almost immediately, and the rest arrives in rhythm behind it. Watch it happen:

The voice engine: why ElevenLabs

Both ends of the audio run on ElevenLabs. Their Scribe model handles speech-to-text going in; their v3 model handles text-to-speech coming out. v3 specifically, because it's the one model expressive enough to interpret inline delivery cues instead of reading everything in the same flat register — and flat delivery was half the problem to begin with.

Speech → text · in

ElevenLabs Scribe

Transcribes what you say, fast and accurately. It's the front door of every turn — if transcription is slow or wrong, nothing downstream recovers the moment. Getting this right is what kills the "did it even hear me?" failure.

scribe · STT

Text → speech · out

ElevenLabs v3

The expressive model — the only one that performs inline delivery cues like a soft laugh, a sigh, or a warm aside. That expressiveness is the whole reason the reply sounds alive instead of read-aloud.

v3 · expressive TTS

The expressiveness works through audio tags. Claude is instructed to weave small cues into its reply; the voice engine performs them, and they're stripped from the text you see on screen. You hear the warmth; you never read the stage direction.

Claude writes
[warmly] Oh, that's a great question. [laughs] Honestly, I wasn't sure either at first.
You hear
A warm, smiling delivery — a real laugh on the second line — instead of a monotone read.
You see
[warmly] Oh, that's a great question. [laughs] Honestly, I wasn't sure either at first.

It's a deliberate trade. v3 is expressive but alpha-stage and slower than the fast, robotic options — so we synthesize per sentence and lean on the batching above to win back the latency. Expressiveness over raw speed, because the entire point was to fix delivery. Knowing which trade-off serves the goal is most of the job.

Platform

Native iOSSwiftUI · AVFoundation

Architecture

No backenddirect device-to-API, keys in Keychain

Intelligence

Claude, streamedwith web search & fetch

Voice

ElevenLabs Scribe + v3STT in, expressive TTS out

Memory

Global & persistentdistilled after every turn

Project

XcodeGen-managedgenerated, reproducible builds

Two more pieces round it out. A cheap background pass after each turn distills durable facts about how I work and folds them into every future conversation — global memory that survives a reset. And the audio plays through the standard media path, not the phone-call path the other apps default to, which is heavily attenuated and made everything too quiet to use. Small, unglamorous, decisive.

Continuity first
Every pipeline decision protects the thread — stream, overlap, never wait for the whole answer.
Quiet canvas
Monochrome by choice. The voice is the interface, so the screen recedes and nothing competes with the conversation.
One living element
A single audio-reactive waveform carries every state through motion, not stacked colors and badges.
You're in control
Tunable end-of-turn, a real voice roster, visible sources. The settings the market hid are yours.

04 · Why this mattersThe point isn't the app

I built this for myself, but it's the clearest example I have of how Agntic works. A real frustration, named exactly — not "the voice is bad" but "the conversation keeps breaking, here are the four ways." A diagnosis that resisted the easy answer: it was never the model, it was the continuity around it. And then a stack of specific decisions — streaming overlap, tunable turn-detection, the right voice vendor, the audio path nobody thinks about — that add up to something that feels different to use.

That's the same shape as the work we do for clients. Most of the value in an AI system isn't the model you pick — it's the hundred choices around it about latency, continuity, memory, trust, and what the person actually experiences. We get specific about the problem, and we build for how it should feel, not just whether it technically works.

We built a voice assistant that actually feels like talking.