
1. Listen (intake raw audio)
When a person speaks, the client device (whether it is a laptop, phone,
etc.) records raw audio (1's & 0's at the core of it).
This raw audio has to either be transcribed on the client device
itself, or get shipped off to a server somewhere to be turned into
transcript text.
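
As a rough illustration, the streaming version of this step might look like the sketch below. `mic_stream` and `asr_client` are hypothetical stand-ins for a real capture library and a streaming speech-to-text engine; the sample rate and chunk size are just common defaults, not requirements.

```python
import asyncio

SAMPLE_RATE = 16_000   # 16 kHz mono PCM is a common ASR input format
CHUNK_MS = 20          # ~20 ms frames keep per-hop latency low
CHUNK_BYTES = SAMPLE_RATE * 2 * CHUNK_MS // 1000  # 16-bit samples -> 2 bytes each

async def listen(mic_stream, asr_client):
    """Forward raw microphone bytes to the ASR engine as they arrive,
    yielding partial transcripts instead of waiting for the user to finish."""
    async for chunk in mic_stream:              # raw audio frames (the 1's & 0's)
        await asr_client.send_audio(chunk)      # hypothetical streaming ASR call
        partial = await asr_client.poll_partial()
        if partial:
            yield partial                       # emit text as soon as it stabilizes
```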
2. Run an LLM
That transcript text then gets fed into a prompt & run through an LLM
(LLM inference). The LLM is the core intelligence
that simulates a person behind the scenes.
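
Streaming matters here too: rather than waiting for the full completion, the pipeline should start consuming tokens as they are generated. Below is a minimal sketch using the OpenAI Python SDK (v1.x) as one example of a token-streaming chat API; the model name is illustrative, and any LLM backend with streaming works the same way.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_llm(transcript: str):
    """Stream the model's reply token-by-token so the speak stage can start early."""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model choice
        messages=[
            {"role": "system", "content": "You are a helpful voice assistant."},
            {"role": "user", "content": transcript},
        ],
        stream=True,           # yield tokens as they are generated
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta        # hand partial text to the speak stage immediately
```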
3. Speak (text → raw audio)
The LLM outputs text that now must be spoken. That text is turned back
into raw audio (again, 1's & 0's) that is playable on the user's
device.
This process can also happen either on the user's device itself or on a
server somewhere (with the raw speech audio then shipped back to the user).
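
A sketch of the speak phase, again with hypothetical `tts_client` and `player` stand-ins: text fragments go in as soon as the LLM produces them, and synthesized audio frames are played (or shipped back to the client) as soon as they come out.

```python
async def speak(text_chunks, tts_client, player):
    """Convert streamed text back into raw audio and play it immediately."""
    async for text in text_chunks:                             # partial sentences from the LLM
        async for audio_frame in tts_client.synthesize(text):  # hypothetical streaming TTS API
            await player.play(audio_frame)                     # or send frames back over the network
```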
The idea is to perform each phase in realtime (sensitive down to the 50-100ms level), streaming between every layer. Ideally the whole voice-to-voice flow clocks in at <500-700ms.
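
Putting the three stages together, one conversational turn is just the three sketches above chained back-to-back, with the clock started when the user stops speaking and stopped when the first synthesized audio frame plays. Everything here reuses the hypothetical clients from the earlier sketches, so treat it as a shape of the loop rather than a working implementation.

```python
import time

async def voice_loop(mic_stream, asr_client, tts_client, player):
    """One conversational turn: listen -> LLM -> speak, fully streamed."""
    transcript = ""
    async for partial in listen(mic_stream, asr_client):
        transcript = partial                    # keep the latest stable transcript

    turn_start = time.monotonic()               # user finished speaking
    first_audio_at = None

    for text_delta in run_llm(transcript):      # tokens stream out of the LLM
        async for frame in tts_client.synthesize(text_delta):  # hypothetical TTS API
            if first_audio_at is None:
                first_audio_at = time.monotonic()
                latency_ms = (first_audio_at - turn_start) * 1000
                print(f"voice-to-voice latency: {latency_ms:.0f} ms")  # target: <500-700ms
            await player.play(frame)
```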