How to Build a Realtime AI Phone Agent Using OpenAI’s Realtime API

Soheil H

In this guide, we’ll walk through how to build a real-time conversational phone agent using:

  • 📞 Twilio for telephony
  • 🔁 A WebSocket server to stream audio
  • 🧠 OpenAI’s new Realtime API for STT → LLM → TTS
  • 🔧 Structured JSON output and conditional call workflows

By the end, your AI bot will answer calls, understand natural speech, speak back in real-time, and trigger custom actions—all in under a second of latency.

Step 1: The Stack You’ll Need

  • Twilio Voice: receives the phone call and streams audio via <Stream>
  • WebSocket server: the middle layer that forwards audio to OpenAI and returns audio responses
  • OpenAI Realtime API: handles transcription, reasoning, and speech in one stream
  • Audio encoding: μ-law at 8 kHz, for low latency with no resampling
  • Redis (optional): stores call state and partial user data between turns
  • App logic: routes responses, triggers SMS or handoffs, logs transcripts

Step 2: Set Up Twilio Media Streaming

Provision a Twilio number and configure its voice webhook to return this TwiML:

<Response>
  <Connect>
    <!-- Bidirectional <Connect> streams always carry 8 kHz μ-law audio,
         so no format or track attributes are needed here. -->
    <Stream url="wss://yourserver.com/audio" />
  </Connect>
</Response>
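For reference, here is a minimal sketch of the webhook that serves this TwiML, assuming a Node/Express server (the /voice route and port are placeholders):

import express from "express";

const app = express();

// Twilio hits this webhook when the number receives a call.
// We answer with TwiML that opens a bidirectional media stream.
app.post("/voice", (_req, res) => {
  res.type("text/xml").send(`
<Response>
  <Connect>
    <Stream url="wss://yourserver.com/audio" />
  </Connect>
</Response>`.trim());
});

app.listen(3000, () => console.log("TwiML webhook listening on :3000"));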

Step 3: Build a WebSocket Server

This server accepts Twilio’s media messages and sends them to OpenAI’s Realtime API.

Key tasks:

  • Parse Twilio’s JSON media messages and extract the base64 μ-law payload
  • Forward that audio into OpenAI’s Realtime stream
  • Listen for partial transcripts and streaming audio responses
  • Wrap the response audio in Twilio media messages and send it back

Because both ends speak μ-law at 8 kHz, the audio payloads can pass through unchanged; no decoding or re-encoding is required. The ws library handles the WebSocket plumbing on both sides, while an HTTP client like axios is only needed for side calls such as sending SMS.
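Here is a minimal relay sketch using ws (the model name, port, and /audio path are assumptions to adapt; setting both session audio formats to g711_ulaw is what makes the passthrough work):

import WebSocket, { WebSocketServer } from "ws";

const MODEL = "gpt-4o-realtime-preview"; // assumed model name; swap in your own

const wss = new WebSocketServer({ port: 8080, path: "/audio" });

wss.on("connection", (twilio) => {
  let streamSid = "";

  // One OpenAI Realtime session per call.
  const openai = new WebSocket(`wss://api.openai.com/v1/realtime?model=${MODEL}`, {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  });

  openai.on("open", () => {
    // Match Twilio's wire format so audio passes through untouched.
    openai.send(JSON.stringify({
      type: "session.update",
      session: {
        input_audio_format: "g711_ulaw",
        output_audio_format: "g711_ulaw",
        turn_detection: { type: "server_vad" },
      },
    }));
  });

  // Twilio → OpenAI: forward each media frame's base64 payload as-is.
  twilio.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.event === "start") streamSid = msg.start.streamSid;
    if (msg.event === "media" && openai.readyState === WebSocket.OPEN) {
      openai.send(JSON.stringify({ type: "input_audio_buffer.append", audio: msg.media.payload }));
    }
  });

  // OpenAI → Twilio: audio deltas are already base64 μ-law; wrap and return.
  openai.on("message", (raw) => {
    const evt = JSON.parse(raw.toString());
    if (evt.type === "response.audio.delta") {
      twilio.send(JSON.stringify({ event: "media", streamSid, media: { payload: evt.delta } }));
    }
  });

  twilio.on("close", () => openai.close());
});

With server_vad turn detection enabled, the API decides when the caller has finished a turn and begins responding on its own.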

Step 4: Stream to OpenAI’s Realtime API

OpenAI’s Realtime API handles:

  • Speech-to-text (STT) – partial transcripts as the user talks
  • LLM reasoning – real-time intent extraction
  • Text-to-speech (TTS) – spoken responses as tokens stream back
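
Each of these surfaces as a typed server event on the WebSocket. A small listener sketch (note that caller-side transcripts only arrive if you enable input_audio_transcription in your session.update):

import WebSocket from "ws";

// Attach to the OpenAI socket from the Step 3 relay.
function logTranscripts(openai: WebSocket) {
  openai.on("message", (raw) => {
    const evt = JSON.parse(raw.toString());
    if (evt.type === "conversation.item.input_audio_transcription.completed") {
      console.log("caller:", evt.transcript); // finished STT for the caller's turn
    }
    if (evt.type === "response.audio_transcript.delta") {
      process.stdout.write(evt.delta); // text of the bot's reply as its audio streams
    }
  });
}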

You can optionally request structured output like:

{
  "intent": "schedule_appointment",
  "name": "Samantha",
  "email": "sam@example.com"
}
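The Realtime API won’t emit a bare JSON payload like this on its own; the usual route is to declare a function tool in the session and let the model call it with structured arguments. A sketch, with an illustrative schedule_appointment tool:

import WebSocket from "ws";

// Declare a function tool so the model emits structured arguments
// instead of free text. The tool name and fields here are illustrative.
function enableStructuredOutput(openai: WebSocket) {
  openai.send(JSON.stringify({
    type: "session.update",
    session: {
      tools: [{
        type: "function",
        name: "schedule_appointment",
        description: "Book an appointment once the caller gives a name and email.",
        parameters: {
          type: "object",
          properties: {
            intent: { type: "string" },
            name: { type: "string" },
            email: { type: "string" },
          },
          required: ["intent", "name", "email"],
        },
      }],
    },
  }));

  // When the model calls the tool, its arguments arrive as a JSON string.
  openai.on("message", (raw) => {
    const evt = JSON.parse(raw.toString());
    if (evt.type === "response.function_call_arguments.done") {
      const args = JSON.parse(evt.arguments); // { intent, name, email }
      console.log("tool call:", args);        // hand off to your flow router (Step 5)
    }
  });
}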

Step 5: Add a State Machine or Flow Router

When JSON responses come back from OpenAI, your server can:

  • Confirm names or spelled-out words (“J as in John…” → “Jake”)
  • Trigger SMS follow-ups using Twilio API
  • Escalate the call to a live agent using <Dial>
  • Log transcripts into a CRM or backend

Add simple FSM logic with conditions based on intent, confidence, or user corrections.
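
A minimal router sketch (the states, intents, and helper names are hypothetical placeholders for your own logic):

// A tiny state machine for routing structured results from the model.
type CallState = "collecting_details" | "confirming" | "handoff" | "done";

interface CallSession {
  state: CallState;
  name?: string;
  email?: string;
}

function route(session: CallSession, result: { intent: string; name?: string; email?: string }): CallSession {
  switch (result.intent) {
    case "schedule_appointment":
      // All details collected: confirm, then e.g. sendConfirmationSms() (hypothetical helper).
      return { ...session, state: "confirming", name: result.name, email: result.email };
    case "speak_to_human":
      // Escalate to a live agent by returning <Dial> TwiML (hypothetical transferToAgent()).
      return { ...session, state: "handoff" };
    default:
      // Keep gathering info; re-prompt on low confidence or user corrections.
      return { ...session, state: "collecting_details" };
  }
}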

Step 6: Keep It Realtime (Latency Tips)

To make the AI sound natural and fast:

  • Stick to μ-law @ 8000 Hz throughout—no transcoding
  • Keep TTS responses short (2–5s of audio)
  • Use chunked audio input (20–40ms frames)
  • Stream text → audio tokens as soon as they’re ready (don’t wait for full sentences)
  • Support barge-in: cancel the in-flight response when the caller interrupts

Aim for < 1 second roundtrip from user speaking to bot reply.
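
Barge-in deserves its own snippet: when the API detects the caller starting to speak, cancel the in-flight response and tell Twilio to drop any audio it hasn’t played yet. A sketch, reusing the sockets from the Step 3 relay:

import WebSocket from "ws";

// Wire up barge-in: stop the model mid-sentence and flush Twilio's buffer.
function enableBargeIn(openai: WebSocket, twilio: WebSocket, streamSid: string) {
  openai.on("message", (raw) => {
    const evt = JSON.parse(raw.toString());
    if (evt.type === "input_audio_buffer.speech_started") {
      openai.send(JSON.stringify({ type: "response.cancel" }));          // stop generating
      twilio.send(JSON.stringify({ event: "clear", streamSid }));        // drop unplayed audio
    }
  });
}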

Bonus: Add Error Handling

Include guardrails like:

  • If OpenAI errors out or responds too slowly, fall back to plain IVR logic
  • Validate and parse all returned JSON before acting on it
  • Log call metadata with unique StreamSid or CallSid per session
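
For the JSON guardrail, a small defensive-parsing sketch (fallbackToIvr is a hypothetical helper for your IVR logic):

// Defensive parsing for tool-call arguments; returns null on anything malformed.
interface AppointmentArgs {
  intent: string;
  name?: string;
  email?: string;
}

function safeParse(argsJson: string): AppointmentArgs | null {
  try {
    const parsed = JSON.parse(argsJson);
    return typeof parsed.intent === "string" ? parsed : null;
  } catch {
    return null; // malformed JSON: never act on it
  }
}

// Usage inside the Step 4 handler:
//   const args = safeParse(evt.arguments);
//   if (!args) fallbackToIvr(streamSid); // hypothetical IVR fallback helper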

Architecture Summary

Caller → Twilio → WebSocket Server → OpenAI Realtime API
                         ↑                     ↓
                    Speech out ← TTS ← LLM ← Transcript
                         ↘
                      Your App (FSM, DB, Routing)

Want to See It Working?

We’ve battle-tested the stack and tuned it to production standards. If you’d like to skip the dev work and get a live AI agent today, try TalkTaps.