We built a voice assistant that actually feels like talking.
The voice modes already on my phone kept breaking the conversation — stalling
mid-thought, missing what I said, cutting in too early. So I built my own
speech-to-speech iOS assistant around the one thing that actually matters: continuity.
By Charles, Founder, Agntic · Internal build · native iOS
The waveform — alive, listening
I use speech-to-speech to think. Not to fire off commands — to work through ideas
out loud, follow a thread, research something in the middle of a sentence and keep
going. For that, one thing matters more than anything else: the conversation can't
break. And the apps on the market break it constantly.
01 · The problem
The conversation kept breaking
The intelligence was never the issue. Claude and ChatGPT are both excellent. What
ruined them for the way I work was everything around the model — the plumbing that's
supposed to be invisible. When you're using voice to think, you're holding a
train of thought in your head and trusting the tool to keep up. Every time it drops the
thread, that train derails. And it dropped the thread in the same few ways, over and over:
Streaming stalls
The reply starts generating, then just… stops. Cut off mid-sentence, mid-idea. No error, no recovery — the thought hangs there, unfinished.
Silent failures
I finish talking and nothing comes back at all. No answer, no indication anything's wrong. I'm left asking the air whether it even heard me.
Missed input
It doesn't register that I spoke, or mangles it badly enough that the answer is to a question I never asked. So I repeat myself — and the momentum is gone.
Trigger-happy open mic
Continuous mode is so sensitive it cuts me off the second I pause to think. A breath mid-thought gets read as "I'm done," and it barges in.
On top of all that, the voice itself was barely adjustable — flat, generic, one or two
options, none of which sounded like something I wanted to talk to for an hour. Add it up
and the picture is clear: none of these are intelligence problems. They're
continuity problems. The model knows the answer; the experience around it keeps
fumbling the handoff.
Here's what that actually feels like — a thought, interrupted versus a thought that lands:
Continuity · the difference you feel
What I had been using: Generating…
What I built: Speaking aloud, in rhythm
The stall is the whole problem. A reply that dies mid-thought doesn't just cost you an answer — it breaks the train of thought you were riding. Protecting that thread end-to-end is what the entire build is for.
When you're thinking out loud, an assistant that stalls isn't slow — it's a dropped call
in the middle of your own sentence.
02 · The experience
Start to finish
So here's the thing itself, the way you actually move through it — from first launch to
a running conversation. Every screen below is the real design.
The flow · onboarding to conversation
Talk with Agntic.
Speak naturally. Get answered out loud. Agntic can search the web when it needs fresh information.
Continue
Say hi to Agntic.
Tap the mic to start a continuous chat.
What's the latest on the Apollo mission? · Web
Explain quantum entanglement simply
Latest weather in Reykjavík · Web
Speaking
What are the latest WWDC announcements about Liquid Glass?
Apple introduced Liquid Glass as the unifying design language across iOS, macOS, and visionOS: a translucent material that reflects and refracts content. [1]
Cool. Does it ship in iOS 26?
Push-to-talk · Continuous
iOS 26 release date
Does it ship this fall?
Checking the latest sources…
Voice
Choose the voice Agntic uses to speak. Tap to preview, then select.
Astra
Calm · Neutral American
Briar
Warm · Soft-spoken
Orion
Deep · Measured baritone
Lyra
Bright · British
History
Liquid Glass and iOS 26 · Just now
Apple introduced Liquid Glass as the unifying…
2 sources
Sourdough hydration ratios · Yesterday
For a beginner-friendly loaf, aim for 70–75%…
Weather in Reykjavík next week · Tue
Highs around 4°C with mixed precipitation…
3 sources
Why is the sky blue? · Mar 12
Rayleigh scattering: shorter wavelengths…
Step 01 · Onboarding: Three calm panels set expectations and ask for the mic honestly: audio is never stored.
Step 02 · Say hi: An empty state invites a first question, including ones that show off live web search.
Step 03 · The conversation: Transcript always visible, sources cited inline. Push-to-talk or a continuous open mic.
Step 04 · Searching the web: Mid-conversation it pulls live info; a meridian line and query chip show what it's looking up.
Step 05 · Pick a voice: A real roster with distinct personalities, the customization the market wouldn't give me.
Step 06 · History & memory: Every thread kept and searchable. Underneath, long-term memory carries what matters across all of them.
The open mic, fixed
The continuous mode gets a specific answer to a specific failure. Instead of barging in
the instant your voice drops, end-of-turn detection is tunable, and "auto-detect end of
turn" is a setting you control rather than a hidden hair-trigger. You can pause to think
without getting cut off — which, for ideation, is the entire point.
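To make the idea concrete, here is a minimal sketch of tunable end-of-turn detection based on a silence timeout. The post does not describe the app's actual detector, so everything here is an assumption for illustration: the type names `TurnSettings` and `EndOfTurnDetector`, the 1.5-second default, and the 0.02 level threshold are all hypothetical.

```swift
import Foundation

/// Hypothetical settings mirroring the controls described above.
struct TurnSettings {
    var autoDetectEndOfTurn = true          // user-facing toggle
    var silenceTimeout: TimeInterval = 1.5  // longest pause allowed before the turn ends (assumed default)
    var speechThreshold: Float = 0.02       // audio level treated as speech (assumed value)
}

/// Decides when a turn has ended from a series of audio level readings.
struct EndOfTurnDetector {
    var settings: TurnSettings
    private var hasSpoken = false
    private var silenceStart: TimeInterval? = nil

    init(settings: TurnSettings) {
        self.settings = settings
    }

    /// Feed one level reading with its timestamp in seconds.
    /// Returns true once silence after speech has lasted at least `silenceTimeout`.
    mutating func process(level: Float, at time: TimeInterval) -> Bool {
        // With auto-detect off, the turn only ends when the user says so.
        guard settings.autoDetectEndOfTurn else { return false }

        if level >= settings.speechThreshold {
            // Any speech resets the pause timer, so a breath mid-thought is forgiven.
            hasSpoken = true
            silenceStart = nil
            return false
        }

        // Silence before the user has said anything never ends a turn.
        guard hasSpoken else { return false }

        let start = silenceStart ?? time
        silenceStart = start
        return time - start >= settings.silenceTimeout
    }
}
```

Someone who pauses a lot while thinking could raise `silenceTimeout` to 2.5 seconds or so, and someone who wants snappy replies could lower it. The point is that the threshold is a visible setting rather than a fixed constant.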
The voice, finally yours
And the customization the market wouldn't give me: a real roster of voices with distinct
character — calm and neutral, warm and soft-spoken, a measured baritone, a bright British
lilt — that you preview and choose, plus speed and auto-play controls. It sounds like
something you'd actually want to talk with for an hour.
03 · The platform
How it's built
Under that quiet interface is a deliberately simple stack. It's a native iOS app —
SwiftUI and AVFoundation, no backend. The phone talks straight to the
Anthropic and ElevenLabs APIs, with keys held in the iOS Keychain. For a single-user
personal tool, that's the least machinery that does the job, and it keeps latency down
by removing a server hop entirely.
The interface follows Apple's iOS 26 "Liquid Glass" material — translucent
surfaces that blur and refract what's behind them. Every control in the app, from the mode
toggle to the citation chips, is built from one glass recipe; here's the actual token
system underneath it:
Design tokens · one source of truth
Monochrome palette
Corner radii · 6–28
Spacing scale · 4–80
The whole app runs on one disciplined system — a pure black-and-white palette with glass as the only accent, a tight radii scale, and a consistent rhythm of spacing. Restraint is the aesthetic.
Every spoken turn runs through the same five-stage pipeline. The trick to continuity
lives in how these stages overlap rather than wait for each other:
The turn pipeline · mic to voice
Capture (mic tap → WAV) → Transcribe (Scribe: speech → text) → Reason (Claude, streamed) → Batch (split by sentence) → Speak (v3: per-sentence audio)
Capture and reason are streamed; batch and speak overlap with reasoning still in flight. Nothing waits for the whole answer — which is exactly what kills the dead-air stall.
Speak before the answer is finished
The single most important continuity decision: the app never waits for a full reply.
As Claude streams its response, the text is split at sentence boundaries, and each
finished sentence is synthesized and played while later sentences are still being
written. You hear the first words almost immediately, and the rest arrives in
rhythm behind it. Watch it happen:
Claude is still generating — but you're already hearing it
Apple introduced Liquid Glass as the unifying design language. · Queued
It's a translucent material that reflects and refracts content. · Queued
The depth responds to motion, with subtle parallax. · Queued
And yes, it ships in iOS 26 this fall. · Queued
Generated → written → spoken, one sentence ahead of the next. First words in about a second instead of after the whole paragraph.
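The sentence-boundary split described above can be sketched as a small buffer that takes streamed text chunks and releases each sentence the moment it is complete. This is an illustration, not the app's actual code: `SentenceBatcher` is a hypothetical name, and the boundary rule (a terminator followed by whitespace) is a deliberately naive assumption that will mis-split abbreviations such as "e.g. this".

```swift
import Foundation

/// Accumulates streamed text and emits each sentence as soon as it is complete.
struct SentenceBatcher {
    private var buffer = ""
    private let terminators: Set<Character> = [".", "!", "?"]

    init() {}

    /// Append one streamed chunk. Returns the sentences that chunk completed, in order.
    mutating func append(_ delta: String) -> [String] {
        buffer += delta
        var sentences: [String] = []
        var current = ""
        var lastWasTerminator = false

        for ch in buffer {
            if lastWasTerminator && ch.isWhitespace {
                // A terminator followed by whitespace closes the sentence.
                let sentence = current.trimmingCharacters(in: .whitespacesAndNewlines)
                if !sentence.isEmpty { sentences.append(sentence) }
                current = ""
                lastWasTerminator = false
                continue
            }
            current.append(ch)
            lastWasTerminator = terminators.contains(ch)
        }

        // Whatever is left is an unfinished sentence; keep it for the next chunk.
        buffer = current
        return sentences
    }

    /// Call when the stream ends to release any remaining text.
    mutating func flush() -> String? {
        let rest = buffer.trimmingCharacters(in: .whitespacesAndNewlines)
        buffer = ""
        return rest.isEmpty ? nil : rest
    }
}
```

Each returned sentence would be handed to speech synthesis immediately while the model keeps streaming. Waiting for whitespace after the terminator means a chunk that ends in "3." is held back until the next chunk shows whether it was "3.5" or the end of a sentence.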
The voice engine: why ElevenLabs
Both ends of the audio run on ElevenLabs. Their Scribe
model handles speech-to-text going in; their v3 model handles
text-to-speech coming out. v3 specifically, because it's the one model expressive enough
to interpret inline delivery cues instead of reading everything in the same flat register
— and flat delivery was half the problem to begin with.
Speech → text · in
ElevenLabs Scribe
Transcribes what you say, fast and accurately. It's the front door of every turn — if transcription is slow or wrong, nothing downstream recovers the moment. Getting this right is what kills the "did it even hear me?" failure.
scribe · STT
Text → speech · out
ElevenLabs v3
The expressive model — the only one that performs inline delivery cues like a soft laugh, a sigh, or a warm aside. That expressiveness is the whole reason the reply sounds alive instead of read-aloud.
v3 · expressive TTS
The expressiveness works through audio tags. Claude is instructed to weave small
cues into its reply; the voice engine performs them, and they're stripped from the text
you see on screen. You hear the warmth; you never read the stage direction.
Claude writes
[warmly] Oh, that's a great question. [laughs] Honestly, I wasn't sure either at first.
You hear
A warm, smiling delivery — a real laugh on the second line — instead of a monotone read.
You see
Oh, that's a great question. Honestly, I wasn't sure either at first.
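Stripping the cues for display could be as small as one regular expression. This is a sketch under assumptions: the function name `displayText(from:)` is hypothetical, and it assumes cues are short bracketed phrases made of letters and spaces, which the post's examples fit but does not guarantee.

```swift
import Foundation

/// Removes bracketed delivery cues such as [warmly] or [laughs] so the
/// transcript shows only the words, while the tagged text goes to the voice engine.
func displayText(from spoken: String) -> String {
    // Match a bracketed cue made of letters and spaces, plus any whitespace after it.
    // Digits are excluded on purpose so citation markers like [1] survive.
    let withoutCues = spoken.replacingOccurrences(
        of: #"\[[A-Za-z ]+\]\s*"#,
        with: "",
        options: .regularExpression
    )
    return withoutCues.trimmingCharacters(in: .whitespaces)
}
```

The same string is used twice: the tagged version is sent for synthesis, and the result of `displayText(from:)` is what lands in the on-screen transcript.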
It's a deliberate trade. v3 is expressive but alpha-stage and slower than the fast,
robotic options — so we synthesize per sentence and lean on the batching above to win
back the latency. Expressiveness over raw speed, because the entire point was to fix
delivery. Knowing which trade-off serves the goal is most of the job.
Platform: Native iOS (SwiftUI · AVFoundation)
Architecture: No backend (direct device-to-API, keys in Keychain)
Intelligence: Claude, streamed (with web search & fetch)
Voice: ElevenLabs Scribe + v3 (STT in, expressive TTS out)
Memory: Global & persistent (distilled after every turn)
Project: XcodeGen-managed (generated, reproducible builds)
Two more pieces round it out. A cheap background pass after each turn distills durable
facts about how I work and folds them into every future conversation — global memory that
survives a reset. And the audio plays through the standard media path, not the phone-call
path the other apps default to, which is heavily attenuated and made everything too quiet
to use. Small, unglamorous, decisive.
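For the audio path, one plausible way to get media-style loudness while still recording from the mic is the audio session configuration below. The post does not say which category or mode the app uses, so treat this as an assumption: the function name is hypothetical, and the choice of `.default` mode with `.defaultToSpeaker` is a guess at what "the standard media path" means in practice.

```swift
import AVFoundation

/// Hypothetical setup: record and play at the same time, but route output
/// to the loudspeaker instead of the quiet earpiece receiver.
func configureAudioSessionForConversation() throws {
    let session = AVAudioSession.sharedInstance()

    // .playAndRecord sends output to the receiver by default;
    // .defaultToSpeaker overrides that when no headphones are connected.
    // .default mode avoids the call-oriented processing of .voiceChat.
    try session.setCategory(.playAndRecord, mode: .default, options: [.defaultToSpeaker])
    try session.setActive(true)
}
```

This runs only on a device or simulator, and it should be called before the first recording or playback starts.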
Continuity first
Every pipeline decision protects the thread — stream, overlap, never wait for the whole answer.
Quiet canvas
Monochrome by choice. The voice is the interface, so the screen recedes and nothing competes with the conversation.
One living element
A single audio-reactive waveform carries every state through motion, not stacked colors and badges.
You're in control
Tunable end-of-turn, a real voice roster, visible sources. The settings the market hid are yours.
04 · Why this matters
The point isn't the app
I built this for myself, but it's the clearest example I have of how Agntic works. A real
frustration, named exactly — not "the voice is bad" but "the conversation keeps breaking,
here are the four ways." A diagnosis that resisted the easy answer: it was never the
model, it was the continuity around it. And then a stack of specific decisions — streaming
overlap, tunable turn-detection, the right voice vendor, the audio path nobody thinks about
— that add up to something that feels different to use.
That's the same shape as the work we do for clients. Most of the value in an AI system
isn't the model you pick — it's the hundred choices around it about latency, continuity,
memory, trust, and what the person actually experiences. We get specific about the problem,
and we build for how it should feel, not just whether it technically works.
Have a problem worth getting specific about?
We help businesses get their data AI-ready and turn real frustrations into working automation — scoped to specifics, built around your systems.