# Core Models

> The three core components of Vapi's voice AI pipeline.

At its core, Vapi is an orchestration layer over three modules: the **transcriber**, the **model**, and the **voice**.

<Frame>
  <img src="https://mintlify.s3-us-west-1.amazonaws.com/vapi/static/images/quickstart/quickstart-banner.png" />
</Frame>

These three modules can be swapped out with **any provider** of your choosing: OpenAI, Groq, Deepgram, ElevenLabs, PlayHT, etc. You can even plug in your own server to act as the LLM.
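As a rough illustration of what "swappable" means here, the sketch below models an assistant as a plain Python dictionary with one entry per module. The field names (`transcriber`, `model`, `voice`, `provider`), the provider identifiers, and the `swap_provider` helper are assumptions made for illustration, not a confirmed Vapi schema; check the API reference for the exact shape.

```python
from copy import deepcopy

# Hypothetical assistant configuration: one entry per module.
# Field names and provider identifiers are illustrative assumptions.
DEFAULT_ASSISTANT = {
    "transcriber": {"provider": "deepgram"},
    "model": {"provider": "openai"},
    "voice": {"provider": "elevenlabs"},
}

MODULES = ("transcriber", "model", "voice")


def swap_provider(assistant: dict, module: str, provider: str) -> dict:
    """Return a copy of `assistant` with one module pointed at another provider."""
    if module not in MODULES:
        raise ValueError(f"unknown module: {module!r}")
    updated = deepcopy(assistant)
    # Keep any other settings on the module; change only the provider.
    updated[module] = {**updated[module], "provider": provider}
    return updated


if __name__ == "__main__":
    # Swap only the LLM; the transcriber and voice stay as they were.
    print(swap_provider(DEFAULT_ASSISTANT, "model", "groq"))
```

The point of the sketch is that each module is configured independently, so changing one provider leaves the other two untouched.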

Vapi takes these three modules, optimizes the latency, manages the scaling & streaming, and orchestrates the conversation flow to make it sound human.

<Steps titleSize="h3">
  <Step title="Listen (intake raw audio)">
    <div>
      When a person speaks, the client device (whether it is a laptop, phone,
      etc.) will record raw audio (1s & 0s at the core of it).
    </div>

    <div>
      This raw audio will either have to be transcribed on the client device
      itself, or get shipped off to a server somewhere to be turned into
      transcript text.
    </div>
  </Step>

  <Step title="Run an LLM">
    <div>
      That transcript text will then get fed into a prompt & run through an LLM
      ([LLM inference](/glossary#inference)). The LLM is the core intelligence
      that simulates a person behind the scenes.
    </div>
  </Step>

  <Step title="Speak (text → raw audio)">
    <div>
      The LLM outputs text that now must be spoken. That text is turned back
      into raw audio (again, 1s & 0s) that can be played back on the user’s
      device.
    </div>

    <div>
      This process can also happen either on the user’s device itself, or on a
      server somewhere (with the raw speech audio then shipped back to the user).
    </div>
  </Step>
</Steps>
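The three steps above can be sketched as one function that chains three interchangeable stages. This is a minimal sketch with toy stand-ins: `fake_transcriber`, `fake_model`, `fake_voice`, and `run_turn` are hypothetical names, and none of them call a real speech-to-text, LLM, or text-to-speech service.

```python
from typing import Callable

# Each stage is just a function, so any provider with the same shape can be swapped in.
Transcriber = Callable[[bytes], str]
Model = Callable[[str], str]
Voice = Callable[[str], bytes]


def fake_transcriber(audio: bytes) -> str:
    """Pretend speech-to-text: treat the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")


def fake_model(transcript: str) -> str:
    """Pretend LLM: wrap the transcript in a canned reply."""
    return f"You said: {transcript}"


def fake_voice(text: str) -> bytes:
    """Pretend text-to-speech: encode the reply text as bytes."""
    return text.encode("utf-8")


def run_turn(
    audio_in: bytes, transcriber: Transcriber, model: Model, voice: Voice
) -> bytes:
    """One conversational turn: listen, run the LLM, speak."""
    transcript = transcriber(audio_in)  # Listen: raw audio -> transcript text
    reply = model(transcript)           # Run an LLM: transcript text -> reply text
    return voice(reply)                 # Speak: reply text -> raw audio


if __name__ == "__main__":
    print(run_turn(b"hello", fake_transcriber, fake_model, fake_voice))
```

A real pipeline would stream partial results between the stages rather than wait for each one to finish; the sequential version here only shows the order of the hand-offs.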

<Info>The idea is to perform each phase in realtime (sensitive down to the 50-100ms level), streaming between every layer. Ideally, the whole [voice-to-voice](/glossary#voice-to-voice) flow clocks in at \<500-700ms.</Info>
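To make the latency target concrete, the sketch below adds up a per-stage "time to first output" and compares the total against the upper end of the 500-700ms range. The stage names and numbers are placeholders chosen for illustration, not measurements, and the helper names are hypothetical.

```python
# Upper end of the voice-to-voice target mentioned above, in milliseconds.
VOICE_TO_VOICE_BUDGET_MS = 700


def voice_to_voice_ms(stage_latencies_ms: dict) -> float:
    """Sum each stage's time-to-first-output to approximate voice-to-voice latency."""
    return sum(stage_latencies_ms.values())


def within_budget(
    stage_latencies_ms: dict, budget_ms: float = VOICE_TO_VOICE_BUDGET_MS
) -> bool:
    """True if the summed latency fits inside the budget."""
    return voice_to_voice_ms(stage_latencies_ms) <= budget_ms


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measurements.
    example = {"transcriber": 150, "model": 300, "voice": 150, "network": 50}
    total = voice_to_voice_ms(example)
    print(f"{total} ms, within budget: {within_budget(example)}")
```

Because every stage draws on the same budget, a delay of 50-100ms in any single layer is a noticeable share of the total, which is why each phase is latency-sensitive at that scale.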

Vapi pulls all these pieces together, ensuring a smooth & responsive conversation (in addition to providing you with a simple set of tools to manage these inner workings).
