Voice pipeline with a backend orchestrator coordinating optional STT/LLM/TTS microservices. Whisper-based STT with VAD handles transcription, PydanticAI drives LLM inference with tool-calling (DuckDuckGo search) on vLLM or Ollama backends, and TTS runs through Kokoro (CPU/ONNX) or Chatterbox (GPU, multilingual). The WebSocket layer streams tokens and audio chunks, manages turn detection and barge-in, and tracks 11 timestamps that feed 9 derived latency metrics. The Next.js 16 frontend captures microphone audio through AudioWorklet processors running off the main thread, batches 48 kHz PCM into ~42 ms chunks, plays back TTS through a second worklet, and renders the full metrics dashboard with color-coded latency thresholds. Each microservice has its own Docker setup with GPU, CPU, and Apple Silicon profiles.
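The derived-metrics idea can be sketched as a pure function over the raw timestamps. The timestamp names and the specific deltas below are illustrative assumptions, not the project's actual keys or its full set of 11 timestamps and 9 metrics:

```python
def derive_metrics(ts: dict[str, float]) -> dict[str, float]:
    """Compute per-hop latencies (ms) from absolute timestamps (seconds).

    Key names are hypothetical; the real pipeline tracks 11 timestamps
    and derives 9 metrics from them.
    """
    def delta(start: str, end: str) -> float:
        return round((ts[end] - ts[start]) * 1000, 1)

    return {
        "stt_latency_ms": delta("speech_end", "transcript_ready"),
        "llm_first_token_ms": delta("transcript_ready", "llm_first_token"),
        "tts_first_audio_ms": delta("llm_first_token", "tts_first_chunk"),
        "end_to_end_ms": delta("speech_end", "tts_first_chunk"),
    }


ts = {
    "speech_end": 10.000,
    "transcript_ready": 10.350,
    "llm_first_token": 11.100,
    "tts_first_chunk": 11.600,
}
print(derive_metrics(ts))
```

Computing everything from absolute timestamps (rather than accumulating deltas) keeps the metrics consistent even when stages overlap or a turn is cancelled mid-stream.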
Whisper-based speech-to-text with VAD and word-level timestamps, exposed over WebSocket/HTTP. Supports batch file processing and live microphone modes alongside the streaming API.
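To illustrate how VAD-based segmentation gates what reaches Whisper, here is a toy energy-threshold gate over 16-bit PCM frames. A real deployment would use a proper VAD model; the threshold, frame shape, and hangover logic here are assumptions for illustration only:

```python
import struct


def frame_energy(pcm_bytes: bytes) -> float:
    """Mean absolute amplitude of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(pcm_bytes) // 2}h", pcm_bytes)
    return sum(abs(s) for s in samples) / max(len(samples), 1)


def vad_segments(frames, threshold=500, hangover=3):
    """Yield (start, end) frame indices of detected speech runs.

    `hangover` keeps the gate open through a few quiet frames so short
    pauses inside a sentence don't split one utterance into two.
    """
    start, quiet = None, 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover:
                yield (start, i - quiet + 1)
                start, quiet = None, 0
    if start is not None:
        yield (start, len(frames))
```

Only the frames inside a yielded segment need to be transcribed, which is what keeps the streaming path cheap between utterances.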
WebSocket LLM service backed by vLLM (OpenAI-compatible) or Ollama, with a PydanticAI agent layer, DuckDuckGo tool calling, conversation-history trimming, and mid-stream request cancellation.
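The history-trimming step can be sketched as: keep the system prompt, then admit the most recent turns that fit a token budget. The 4-chars-per-token heuristic and the message shape are assumptions, not the project's actual implementation:

```python
def trim_history(messages: list[dict], max_tokens: int = 2048) -> list[dict]:
    """Keep system messages plus the newest turns that fit the budget."""
    def approx_tokens(msg: dict) -> int:
        # Rough heuristic: ~4 characters per token.
        return max(1, len(msg["content"]) // 4)

    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(approx_tokens(m) for m in system)
    kept: list[dict] = []
    for msg in reversed(turns):  # walk newest-first
        cost = approx_tokens(msg)
        if budget - cost < 0:
            break
        budget -= cost
        kept.append(msg)
    return system + list(reversed(kept))
```

Walking newest-first guarantees the latest user turn always survives trimming, which matters more than keeping older context intact.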
Async WebSocket handlers coordinate STT/LLM/TTS streams with structured payloads and VAD-based turn control. PydanticAI wraps the LLM layer with typed tool calling and conversation history trimming. The frontend uses two AudioWorklet processors — one for PCM capture with 2048-sample batching and an 8-byte header (timestamp + TTS-playing flag), another for buffered TTS playback — keeping audio processing off the main thread entirely. Docker Compose profiles and per-service Dockerfiles (GPU, CPU, Apple Silicon) handle the deployment matrix without duplicating configs.
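The capture chunk framing can be sketched in a few lines. The exact field layout of the 8-byte header is an assumption here (float32 capture timestamp + uint32 TTS-playing flag); only the 8-byte size, the two fields, and the 2048-sample batch come from the design above:

```python
import struct

SAMPLE_RATE = 48_000
BATCH_SAMPLES = 2048  # 2048 / 48000 s ≈ 42.7 ms per chunk
# Assumed layout: little-endian float32 timestamp + uint32 flag = 8 bytes.
HEADER = struct.Struct("<fI")


def pack_chunk(timestamp_s: float, tts_playing: bool, pcm: bytes) -> bytes:
    """Prefix a PCM batch with the 8-byte capture header."""
    return HEADER.pack(timestamp_s, 1 if tts_playing else 0) + pcm


def unpack_chunk(data: bytes) -> tuple[float, bool, bytes]:
    """Split an incoming message back into (timestamp, tts_playing, pcm)."""
    timestamp_s, flag = HEADER.unpack_from(data)
    return timestamp_s, bool(flag), data[HEADER.size:]
```

The TTS-playing flag is what makes barge-in detection cheap on the backend: the server knows whether captured audio overlapped its own playback without a separate control message.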
Low-latency voice pipeline that runs locally or as microservices, with vLLM or Ollama LLM backends and agentic tool calling via PydanticAI. Hits sub-2 s first-token and sub-1 s TTS targets across GPU, CPU, and Apple Silicon. The frontend streams audio through AudioWorklet processors and renders a live metrics dashboard with per-metric color thresholds, so you can actually see where latency is going per hop.