Voice pipeline with a backend orchestrator coordinating optional STT/LLM/TTS microservices. Whisper-based STT with VAD handles transcription, PydanticAI drives LLM inference with tool-calling (DuckDuckGo search) on vLLM or Ollama backends, and TTS runs through Kokoro (CPU/ONNX) or Chatterbox (GPU, multilingual). The WebSocket layer streams tokens and audio chunks, manages turn detection and barge-in, and tracks 11 timestamps that feed 9 derived latency metrics. The Next.js 16 frontend captures microphone audio through AudioWorklet processors running off the main thread, batches 48 kHz PCM into ~42 ms chunks, plays back TTS through a second worklet, and renders the full metrics dashboard with color-coded latency thresholds. Each microservice has its own Docker setup with GPU, CPU, and Apple Silicon profiles.
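The derived-metrics idea can be sketched as a pure function over the raw timestamps. The timestamp names and the specific deltas below are illustrative assumptions, not the project's actual keys or its full set of 11 timestamps and 9 metrics:

```python
def derive_metrics(ts: dict[str, float]) -> dict[str, float]:
    """Compute per-hop latencies (ms) from absolute timestamps (seconds).

    Key names are hypothetical; the real pipeline tracks 11 timestamps
    and derives 9 metrics from them.
    """
    def delta(start: str, end: str) -> float:
        return round((ts[end] - ts[start]) * 1000, 1)

    return {
        "stt_latency_ms": delta("speech_end", "transcript_ready"),
        "llm_first_token_ms": delta("transcript_ready", "llm_first_token"),
        "tts_first_audio_ms": delta("llm_first_token", "tts_first_chunk"),
        "end_to_end_ms": delta("speech_end", "tts_first_chunk"),
    }


ts = {
    "speech_end": 10.000,
    "transcript_ready": 10.350,
    "llm_first_token": 11.100,
    "tts_first_chunk": 11.600,
}
print(derive_metrics(ts))
```

Computing everything from absolute timestamps (rather than accumulating deltas) keeps the metrics consistent even when stages overlap or a turn is cancelled mid-stream.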
Whisper-based speech-to-text with VAD and word-level timestamps, exposed over WebSocket/HTTP. Supports batch file processing and live microphone modes alongside the streaming API.
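To illustrate how VAD-based segmentation gates what reaches Whisper, here is a toy energy-threshold gate over 16-bit PCM frames. A real deployment would use a proper VAD model; the threshold, frame shape, and hangover logic here are assumptions for illustration only:

```python
import struct


def frame_energy(pcm_bytes: bytes) -> float:
    """Mean absolute amplitude of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(pcm_bytes) // 2}h", pcm_bytes)
    return sum(abs(s) for s in samples) / max(len(samples), 1)


def vad_segments(frames, threshold=500, hangover=3):
    """Yield (start, end) frame indices of detected speech runs.

    `hangover` keeps the gate open through a few quiet frames so short
    pauses inside a sentence don't split one utterance into two.
    """
    start, quiet = None, 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover:
                yield (start, i - quiet + 1)
                start, quiet = None, 0
    if start is not None:
        yield (start, len(frames))
```

Only the frames inside a yielded segment need to be transcribed, which is what keeps the streaming path cheap between utterances.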
WebSocket LLM service backed by vLLM (OpenAI-compatible) or Ollama, with a PydanticAI agent layer, DuckDuckGo tool calling, conversation-history trimming, and mid-stream request cancellation.
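The history-trimming step can be sketched as: keep the system prompt, then admit the most recent turns that fit a token budget. The 4-chars-per-token heuristic and the message shape are assumptions, not the project's actual implementation:

```python
def trim_history(messages: list[dict], max_tokens: int = 2048) -> list[dict]:
    """Keep system messages plus the newest turns that fit the budget."""
    def approx_tokens(msg: dict) -> int:
        # Rough heuristic: ~4 characters per token.
        return max(1, len(msg["content"]) // 4)

    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(approx_tokens(m) for m in system)
    kept: list[dict] = []
    for msg in reversed(turns):  # walk newest-first
        cost = approx_tokens(msg)
        if budget - cost < 0:
            break
        budget -= cost
        kept.append(msg)
    return system + list(reversed(kept))
```

Walking newest-first guarantees the latest user turn always survives trimming, which matters more than keeping older context intact.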
Async WebSocket handlers coordinate STT/LLM/TTS streams with structured payloads and VAD-based turn control. PydanticAI wraps the LLM layer with typed tool calling and conversation history trimming. The frontend uses two AudioWorklet processors — one for PCM capture with 2048-sample batching and an 8-byte header (timestamp + TTS-playing flag), another for buffered TTS playback — keeping audio processing off the main thread entirely. Docker Compose profiles and per-service Dockerfiles (GPU, CPU, Apple Silicon) handle the deployment matrix without duplicating configs.
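The capture chunk framing can be sketched in a few lines. The exact field layout of the 8-byte header is an assumption here (float32 capture timestamp + uint32 TTS-playing flag); only the 8-byte size, the two fields, and the 2048-sample batch come from the design above:

```python
import struct

SAMPLE_RATE = 48_000
BATCH_SAMPLES = 2048  # 2048 / 48000 s ≈ 42.7 ms per chunk
# Assumed layout: little-endian float32 timestamp + uint32 flag = 8 bytes.
HEADER = struct.Struct("<fI")


def pack_chunk(timestamp_s: float, tts_playing: bool, pcm: bytes) -> bytes:
    """Prefix a PCM batch with the 8-byte capture header."""
    return HEADER.pack(timestamp_s, 1 if tts_playing else 0) + pcm


def unpack_chunk(data: bytes) -> tuple[float, bool, bytes]:
    """Split an incoming message back into (timestamp, tts_playing, pcm)."""
    timestamp_s, flag = HEADER.unpack_from(data)
    return timestamp_s, bool(flag), data[HEADER.size:]
```

The TTS-playing flag is what makes barge-in detection cheap on the backend: the server knows whether captured audio overlapped its own playback without a separate control message.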
Low-latency voice pipeline that runs locally or as microservices, with vLLM or Ollama LLM backends and agentic tool calling via PydanticAI. Hits sub-2 s first-token and sub-1 s TTS targets across GPU, CPU, and Apple Silicon. The frontend streams audio through AudioWorklet processors and renders a live metrics dashboard with per-metric color thresholds, so you can actually see where latency is going per hop.