Behind the Build · 14 May 2026 · 7 min read

Today's AI Specialist: Coach Speak. The Agent That Decides Whether Sophie Sounds Human.


Sophie has a voice. The voice is what the learner actually hears. The agent that produces the voice is Coach Speak, and the engineering problem behind her is the hardest one in the entire practice surface.

The problem is not "make a synthesised voice." That has been a solved problem for years. The problem is making a synthesised voice that lands inside Sophie's two-second total response window, sounds warm enough for a learner to want to talk to, and carries the deliberate imperfections that keep her on the human side of the uncanny valley.

Latency. Warmth. Imperfection. Three constraints, all in tension, none optional.

This is the build story of Coach Speak. Why it had to be a separate agent, what the latency design actually does, and the counterintuitive decision to deliberately make Sophie sound slightly imperfect.

The problem Coach Speak solves

Sophie has roughly two seconds from the moment the learner stops speaking to the moment her response has to be audible. The model call to generate the response text takes most of that budget. The remaining margin has to cover everything else: turning the text into audio, getting the first bytes to the learner's device, and beginning playback.

In a naive architecture, the TTS step runs after the text generation completes. The full text is sent to a TTS service, the service returns an audio file, and the file is played. This sequence works. It also blows the latency budget by up to two seconds, because TTS rendering of a full sentence takes one to two seconds on top of everything else.

The fix is streaming. Coach Speak does not wait for Sophie to finish generating the text. As tokens arrive from the model, Coach Speak takes them in chunks, renders the audio for each chunk, and starts playback before the full response is complete. By the time Sophie has finished generating the final word, Coach Speak has already played the first half of the sentence.

This compresses the perceived latency from "wait for the whole response" to "wait for the first word." The first word lands inside the two-second window. The rest of the sentence lands in real time as the learner listens.

The streaming is the latency design. Without it, Sophie cannot exist as a real-time conversational partner. With it, the conversation works.
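A minimal sketch in Python may make the ordering concrete. It is an illustration under assumptions, not Coach Speak's actual code: `chunk_at_punctuation`, `speak_streaming`, `fake_render` and `fake_play` are hypothetical names, the render and playback steps are stubs, and chunks are cut at punctuation only to keep the example short.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

CHUNK_END = (",", ".", "?", "!", ";", ":")


def chunk_at_punctuation(tokens: Iterable[str]) -> Iterator[str]:
    """Group a token stream into chunks, closing a chunk at punctuation."""
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        if token.endswith(CHUNK_END):
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the model stops
        yield " ".join(buffer)


def speak_streaming(
    tokens: Iterable[str],
    render: Callable[[str], str],
    play: Callable[[str], None],
) -> None:
    """Render and play each chunk as soon as it is complete."""
    for chunk in chunk_at_punctuation(tokens):
        play(render(chunk))


def demo() -> List[Tuple[str, str]]:
    """Record the order in which tokens are generated and chunks are played."""
    events: List[Tuple[str, str]] = []

    def model_tokens() -> Iterator[str]:
        # Stand-in for the model's token stream (hypothetical response text).
        text = "That is a good question, and I think you already know the answer."
        for token in text.split():
            events.append(("token", token))
            yield token

    def fake_render(chunk: str) -> str:
        return f"<audio:{chunk}>"  # stub for the TTS call

    def fake_play(audio: str) -> None:
        events.append(("play", audio))  # stub for device playback

    speak_streaming(model_tokens(), fake_render, fake_play)
    return events


if __name__ == "__main__":
    for kind, value in demo():
        print(kind, value)
```

In the recorded event log, the first chunk is played after five tokens, while the remaining eight tokens have not yet been generated. That interleaving is the whole point: playback starts before the response is complete.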

The chunk size decision

The most interesting technical decision in Coach Speak is the chunk size at which the stream is broken.

Too small a chunk (say, one or two words at a time) produces a stuttering, robotic delivery, because the TTS model does not have enough context to render intonation correctly. The result sounds artificial in a way that breaks the practice partner illusion.

Too large a chunk (say, full sentences) loses the latency benefit, because the agent has to wait for the chunk boundary before it can render and start streaming. The latency creeps back up.

The right chunk size is calibrated to natural speech boundaries: roughly six to twelve words. Coach Speak breaks the stream at clause boundaries, so the TTS model has enough context to render intonation correctly for the clause, and the latency stays inside the window.
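The latency side of that trade-off can be put into rough numbers. The model below is a back-of-envelope sketch, not a measurement: only the two-second window comes from the post, and every default value (the delay before the first token, the generation speed, the per-word render cost, one token per word) is an assumption chosen for illustration.

```python
BUDGET_S = 2.0  # Sophie's total response window, from the post


def time_to_first_audio(
    chunk_words: int,
    first_token_delay_s: float = 0.4,       # assumed: delay before the first token arrives
    seconds_per_word: float = 0.05,         # assumed: generation speed, one token per word
    render_seconds_per_word: float = 0.03,  # assumed: TTS cost grows with chunk length
) -> float:
    """Estimate the seconds from end of learner speech to first audible audio."""
    # The agent cannot render until the whole first chunk has been generated.
    wait_for_chunk = first_token_delay_s + chunk_words * seconds_per_word
    render_chunk = chunk_words * render_seconds_per_word
    return wait_for_chunk + render_chunk


if __name__ == "__main__":
    for words in (2, 8, 12, 25):
        estimate = time_to_first_audio(words)
        verdict = "inside" if estimate <= BUDGET_S else "outside"
        print(f"{words:>2} words: {estimate:.2f}s ({verdict} the window)")
```

With these made-up numbers, an 8-word chunk fits the window and a 25-word chunk does not. The model says nothing about intonation quality, which is the cost at the small end of the range.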

The boundary detection is the agent's smallest, most carefully tuned piece of logic. It has to identify clause boundaries in a stream of tokens, in real time, often before the punctuation that would make the boundary obvious has been generated. The agent uses syntactic features (the appearance of a coordinating conjunction, a pause-implying word, a verb phrase boundary) to predict where the clause will land. It is wrong sometimes. The wrongness is audible as occasional slightly off intonation. The trade-off is acceptable because the latency benefit is large.
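A toy version of that predictor, assuming a much simpler feature set than the real one, might look like the Python below. The word list, the `clause_chunks` name and the `min_words` and `max_words` thresholds are all hypothetical. The sketch closes a chunk just before a conjunction or pause-implying word, closes it after punctuation when punctuation does arrive, and caps the chunk length so a long clause cannot stall the stream.

```python
from typing import Iterable, Iterator, List

# Hypothetical feature list: coordinating conjunctions plus a few pause-implying words.
BOUNDARY_WORDS = {"and", "but", "or", "so", "yet", "because", "although", "which", "while"}
PUNCTUATION = (".", ",", ";", ":", "!", "?")


def clause_chunks(
    tokens: Iterable[str], min_words: int = 3, max_words: int = 12
) -> Iterator[str]:
    """Split a token stream at predicted clause boundaries."""
    buffer: List[str] = []
    for token in tokens:
        word = token.lower().strip(".,;:!?")
        # Predictive cut: a boundary word suggests the previous clause just ended.
        if word in BOUNDARY_WORDS and len(buffer) >= min_words:
            yield " ".join(buffer)
            buffer = []
        buffer.append(token)
        # Confirmed cut: punctuation arrived, or the chunk has grown too long.
        ends_clause = token.endswith(PUNCTUATION) and len(buffer) >= min_words
        if ends_clause or len(buffer) >= max_words:
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush the tail when the stream ends
        yield " ".join(buffer)


if __name__ == "__main__":
    sentence = (
        "You handled the opening well but the second question needs "
        "a slower pace because the listener was still thinking."
    )
    for chunk in clause_chunks(sentence.split()):
        print(repr(chunk))
```

The sample sentence splits into three clauses without needing any mid-sentence punctuation. The heuristic also shows the failure mode described above: "Please pass the salt and pepper." is cut at "and", inside a noun phrase, which is the kind of wrong guess that would surface as slightly off intonation.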

Decision one: deliberate imperfection

The counterintuitive design decision is that Sophie does not sound perfect. She sounds slightly imperfect, by design.

A technically perfect synthesised voice (flawless intonation, no breath, no micro-pauses, every phoneme cleanly articulated) reads as artificial to a human listener. The uncanny valley is real, and voice falls into it harder than visuals do, because listeners are exquisitely tuned to the human voice. A voice that is 99% perfect lands worse than a voice that is 95% perfect with the right kind of imperfection.

Sophie has been tuned to have small, deliberate imperfections that read as human. An occasional micro-pause mid-sentence: the kind a person takes when they are thinking. Breath sounds at the beginning of a long utterance. Slight intonation variability across similar sentences, so that her tenth "good question" of the session does not sound identical to her first. A very occasional swallowed syllable on a function word.

None of these are mistakes. All of them are tuned parameters in the voice profile. The result is a voice that adult learners describe, in feedback, as "warm" and "a real person" rather than as "an AI", and they describe it that way even when they know, intellectually, that Sophie is an AI. The imperfection is what keeps them inside the practice partner relationship rather than treating the session as an interaction with a system.
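To make "tuned parameters" concrete, here is a hypothetical sketch of what such a profile could look like in Python. None of the field names or default values come from Sophie's real voice profile; `VoiceProfile` and `plan_delivery` are invented for illustration, and the output is a plain list of delivery directives rather than input for any particular TTS engine.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Directive = Tuple[str, object]


@dataclass(frozen=True)
class VoiceProfile:
    micro_pause_probability: float = 0.15  # assumed: chance of one mid-sentence pause
    micro_pause_ms: int = 180              # assumed: length of that pause
    breath_min_words: int = 12             # assumed: utterances this long open with a breath
    pitch_jitter: float = 0.05             # assumed: +/- variation around baseline pitch


def plan_delivery(text: str, profile: VoiceProfile, rng: random.Random) -> List[Directive]:
    """Turn text into delivery directives, adding the profile's imperfections."""
    words = text.split()
    plan: List[Directive] = []
    if len(words) >= profile.breath_min_words:
        plan.append(("breath", None))
    # Vary pitch slightly so repeated phrases do not sound identical.
    pitch = 1.0 + rng.uniform(-profile.pitch_jitter, profile.pitch_jitter)
    plan.append(("pitch", round(pitch, 3)))
    # Optionally place one pause before a word that is not the first.
    pause_before: Optional[int] = None
    if len(words) > 2 and rng.random() < profile.micro_pause_probability:
        pause_before = rng.randrange(1, len(words))
    for index, word in enumerate(words):
        if index == pause_before:
            plan.append(("pause_ms", profile.micro_pause_ms))
        plan.append(("say", word))
    return plan


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so a run can be repeated
    for _ in range(2):
        print(plan_delivery("Good question, let us look at that.", VoiceProfile(), rng))
```

The design point is that every imperfection is a number someone chose. Setting `micro_pause_probability` to zero removes the pauses entirely, which is what distinguishes a parameter from an accident.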

This decision took longer to make than to implement. Every instinct in voice synthesis pushes toward fewer artefacts. The decision to keep some, deliberately, is the design call that separated Sophie's voice from generic AI voices.

Decision two: the fallback to text

There is a fallback in Coach Speak for cases where the audio cannot be delivered inside the latency window: TTS service slow, network slow, learner's device under load. The fallback displays Sophie's response as text on screen, the learner reads rather than listens, and the conversation continues.

The fallback is rare. Under 1% of turns in production fall back to text. But the existence of the fallback is what lets the rest of the system commit to the two-second window. If there were no fallback, every TTS slowdown would either break the session or extend the latency past where the conversational rhythm survives.

The fallback is a degraded experience. The learner is reading rather than listening, which is the opposite of the practice mode they came for. But it is a working experience. The session continues. The learner can return to audio on the next turn, when the TTS layer recovers.

The principle behind the fallback is that the system should fail to a degraded mode rather than fail to a broken mode. Coach Speak's job is to deliver audio inside two seconds; when it cannot, the fallback prevents the alternative: a long pause that breaks the conversation entirely.
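The shape of that fallback can be sketched with a timeout around the audio path. This is an assumption about structure, not the production code: `deliver_turn`, `fast_tts` and `slow_tts` are hypothetical names, and the deadline value stands in for whatever margin remains after the model call. The only library call used is Python's standard `asyncio.wait_for`.

```python
import asyncio
from typing import Awaitable, Callable, Tuple


async def deliver_turn(
    text: str,
    synthesize: Callable[[str], Awaitable[bytes]],
    audio_deadline_s: float,
) -> Tuple[str, object]:
    """Return audio if it arrives before the deadline, otherwise fall back to text."""
    try:
        audio = await asyncio.wait_for(synthesize(text), timeout=audio_deadline_s)
        return ("audio", audio)
    except asyncio.TimeoutError:
        # Degraded mode: show the response on screen and keep the session alive.
        return ("text", text)


async def fast_tts(text: str) -> bytes:
    await asyncio.sleep(0)  # stub: a healthy TTS service
    return text.encode("utf-8")


async def slow_tts(text: str) -> bytes:
    await asyncio.sleep(1.0)  # stub: a TTS service having a bad moment
    return text.encode("utf-8")


async def main() -> None:
    print(await deliver_turn("Tell me more.", fast_tts, audio_deadline_s=0.1))
    print(await deliver_turn("Tell me more.", slow_tts, audio_deadline_s=0.1))


if __name__ == "__main__":
    asyncio.run(main())
```

The decision is made per turn, so a slow response on one turn does not stop the next turn from trying audio again.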

TL;DR

Coach Speak turns Sophie's text into audio inside her two-second total response window. The latency design is token-by-token streaming, broken at natural clause boundaries, which compresses perceived latency from "wait for the whole response" to "wait for the first word." The chunk-boundary detection is the agent's most carefully tuned logic. The counterintuitive design decision is deliberate imperfection: Sophie is tuned to have small, intentional micro-pauses, breath sounds, and intonation variability, because technically perfect voices fall into the uncanny valley. There is a text fallback for the fewer than 1% of turns where TTS cannot deliver audio inside the latency window. The principle is to fail to a degraded mode rather than a broken one. Shipping deliberate imperfection over machine-perfect speech is the kind of trade-off the Strategic Council surfaces every week. Optimising for what sounds best in a vendor demo would have bought cleaner audio at the cost of the trust signal that makes the system work.

Learning Materials

Key Vocabulary

latency · noun · C1

The delay between an input and the system's response, especially in real-time audio or network systems.

“Coach Speak has roughly 800 milliseconds of latency budget after the model call returns.”

stream (verb) · verb · C1

To send or receive data, especially audio or video, in a continuous flow that begins playing before the transmission is complete.

“Coach Speak streams audio as the tokens arrive from the model.”

chunk · noun · B2

A small piece broken off from something larger; in software, a unit of data processed together as a group.

“The right chunk size is calibrated to natural speech boundaries — six to twelve words.”

clause boundary · noun phrase · C1

The point in a sentence where one clause ends and another begins, often marked by a conjunction or natural pause.

“Coach Speak breaks the stream at clause boundaries so the TTS model has enough context for intonation.”

render (verb) · verb · C1

To process raw data into a finished form, especially turning text into audio or graphics into pixels.

“The TTS model renders the audio for each chunk as it arrives.”

playback · noun · B2

The act of playing recorded or generated audio or video on a device.

“Coach Speak begins playback before the full response is complete.”

uncanny valley · noun phrase (idiom) · C2

The unsettling effect produced when something almost — but not quite — looks or sounds human; the closer it gets to perfect without arriving, the more it disturbs the observer.

“A 99% perfect voice falls into the uncanny valley harder than a 95% perfect one with the right kind of imperfection.”

synthesise / synthesised voice · verb / noun phrase · C1

To produce sound, especially speech, artificially by combining electronic signals; a voice produced this way.

“A technically perfect synthesised voice reads as artificial to a human listener.”

intonation · noun · C1

The rise and fall of pitch in the voice when speaking, which carries meaning and emotion.

“Too small a chunk gives the TTS model too little context to render intonation correctly.”

breath sound · noun phrase · C1

An audible intake or release of air that a human speaker produces, especially before a long utterance.

“Sophie has been tuned to include breath sounds at the start of long utterances.”

micro-pause · noun · C1

A very short pause in speech, often mid-sentence, which signals that the speaker is thinking.

“An occasional micro-pause reads as human thinking, not as a fault in the system.”

artefact (vs mistake) · noun · C1

In voice synthesis, an audible by-product of the synthesis process. An artefact is not the same as a mistake: some artefacts are deliberately preserved because they carry humanness, while a mistake is an error nobody chose.

“Sophie's micro-pauses are tuned parameters, not artefacts to be removed.”

tuned parameter · noun phrase · C1

A setting in a model that has been deliberately adjusted to produce a specific behaviour or output quality.

“The micro-pauses, breath sounds, and intonation variability are tuned parameters in Sophie's voice profile.”

degraded mode (vs broken mode) · noun phrase · C1

A failure mode in which the system still works but at reduced quality; contrasted with a broken mode, in which the system stops working entirely.

“The text fallback is a degraded mode — the learner reads instead of listens, but the session continues.”

compress (perceived latency) · verb · C1

To reduce the apparent length of a delay, especially by changing when the user first notices a response, without necessarily reducing the absolute processing time.

“Token-by-token streaming compresses perceived latency from 'wait for the whole response' to 'wait for the first word.'”

Grammar Notes

'X is in tension with Y, none is optional' — listing constraints that all must hold at once

This pattern lists items as one-word sentences, then names the relationship between them with two short clauses: how they conflict ('all in tension'), and what cannot be dropped ('none optional'). It is a fast way to set up a problem where the engineering interest is in the trade-off, not in any single constraint. Note the elliptical second clause: 'none [is] optional', with the verb dropped for compression.

“'Latency. Warmth. Imperfection. Three constraints, all in tension, none optional.'”

Common mistake: Writing it as a flowing sentence — 'There are three constraints (latency, warmth, and imperfection), all of which are in tension with each other, and none of which is optional' — loses the impact. The pattern only works when the items are isolated as short sentences and the relationship is named in clipped clauses.

'Too small a chunk → A, too large → B' — the Goldilocks conditional framing

This is a conditional pattern that brackets a tuning decision between two failure modes. 'Too X → bad outcome A. Too Y → bad outcome B. The right value is in between.' The construction uses inverted-determiner phrases ('too small a chunk', not 'a too small chunk'), which is the correct English order when 'too' modifies an adjective in front of a noun.

“'Too small a chunk (say, one or two words at a time) produces a stuttering, robotic delivery... Too large a chunk (say, full sentences) loses the latency benefit.'”

Common mistake: Writing 'a too small chunk', which is ungrammatical. The order is 'too + adjective + a/an + noun': 'too small a chunk', 'too long a sentence', 'too high a price'. The 'a' moves to between the adjective and the noun.

'X is engineered, not Y' — the engineered-imperfection pattern

This pattern reframes something that looks like a flaw as a deliberate design choice. The form is: state the feature, then explicitly deny the false reading, then assert the true reading. It is the linguistic move that makes engineered imperfection legible — without it, the reader assumes the imperfections are bugs the team did not fix yet.

“'The imperfection is engineered.' / 'None of these are mistakes. All of them are tuned parameters in the voice profile.'”

Common mistake: Stating only the positive ('The imperfection is engineered') without denying the false reading first ('None of these are mistakes') leaves the reader holding the bug interpretation. The denial of the false reading is what makes the assertion of the true reading land.

Comprehension Questions

1. Why does the post argue that Coach Speak had to be a separate agent rather than part of Sophie?
2. What are the three constraints Coach Speak has to satisfy at once, and why does the post say none of them is optional?
3. Why is the chunk size set to roughly six to twelve words rather than smaller or larger?
4. Why does the post insist that Sophie's micro-pauses, breath sounds, and intonation variability are not mistakes?
5. Think of an AI voice you have heard (from a navigation app, a smart speaker, a voicemail system, or a phone assistant) that felt slightly unsettling, almost human but not quite. Based on the post, what is most likely missing in that voice that Sophie's voice has?

