wavekat

AI Pipeline

🚧 Coming soon. Architecture deep-dive lands with the first backend release.

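WaveKat Voice streams audio through four stages (VAD, ASR, LLM, TTS). Adjacent stages overlap at every boundary, so each stage starts working on partial output from the one before it. The target is a p95 end-to-end latency under 400 ms: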

caller audio ──▶ VAD ──▶ ASR ──▶ LLM ──▶ TTS ──▶ caller audio
                                 └─▶ tool calls (booking, lookup, transfer)
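
Until the deep-dive is published, the stage overlap in the diagram above can be pictured with a small sketch. This is not the WaveKat implementation: the `Frame` enum, `spawn_stage`, `run_pipeline`, the amplitude threshold of 500, and every stage body are hypothetical placeholders, and the tool-call branch is left out. It only shows one way to get overlap, with each stage running on its own thread and forwarding frames downstream as soon as they arrive.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Hypothetical frame type (assumption: the real Frame model is not published yet).
#[derive(Debug, PartialEq)]
enum Frame {
    Audio(Vec<i16>), // raw PCM chunk
    Text(String),    // transcript or LLM output
    End,             // end-of-stream marker
}

/// Run one stage on its own thread. Each frame is transformed and forwarded
/// as soon as it arrives, so downstream stages never wait for the full stream.
fn spawn_stage<F>(rx: Receiver<Frame>, f: F) -> Receiver<Frame>
where
    F: Fn(Frame) -> Option<Frame> + Send + 'static,
{
    let (tx, out) = channel();
    thread::spawn(move || {
        for frame in rx {
            if frame == Frame::End {
                let _ = tx.send(Frame::End);
                break;
            }
            // `None` means the stage dropped the frame (e.g. silence).
            if let Some(next) = f(frame) {
                if tx.send(next).is_err() {
                    break;
                }
            }
        }
    });
    out
}

/// Wire VAD -> ASR -> LLM -> TTS with placeholder logic in every stage.
fn run_pipeline(chunks: Vec<Vec<i16>>) -> Vec<Frame> {
    let (mic_tx, mic_rx) = channel();

    // Toy amplitude gate standing in for a real VAD backend.
    let vad = spawn_stage(mic_rx, |f| match f {
        Frame::Audio(s) => {
            let peak = s.iter().map(|x| x.unsigned_abs()).max().unwrap_or(0);
            if peak >= 500 { Some(Frame::Audio(s)) } else { None }
        }
        other => Some(other),
    });
    // Placeholder ASR: reports the chunk size instead of transcribing.
    let asr = spawn_stage(vad, |f| match f {
        Frame::Audio(s) => Some(Frame::Text(format!("heard {} samples", s.len()))),
        other => Some(other),
    });
    // Placeholder LLM: echoes the transcript with a prefix.
    let llm = spawn_stage(asr, |f| match f {
        Frame::Text(t) => Some(Frame::Text(format!("reply: {}", t))),
        other => Some(other),
    });
    // Placeholder TTS: one silent sample per character of text.
    let tts = spawn_stage(llm, |f| match f {
        Frame::Text(t) => Some(Frame::Audio(vec![0i16; t.len()])),
        other => Some(other),
    });

    for chunk in chunks {
        mic_tx.send(Frame::Audio(chunk)).expect("VAD stage hung up");
    }
    mic_tx.send(Frame::End).expect("VAD stage hung up");

    // Collect synthesized audio until the end-of-stream marker.
    tts.iter().take_while(|f| *f != Frame::End).collect()
}

fn main() {
    let out = run_pipeline(vec![vec![0, 0], vec![1000, -2000, 500]]);
    println!("{} audio frame(s) returned to the caller", out.len());
}
```

Channels are used here only because they need nothing beyond the standard library. An async runtime or a lock-free queue would follow the same shape.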

Planned topics:

  • The Frame model (a Rust port of Pipecat’s streaming Frame concept)
  • VAD backend choices (WebRTC, Silero, TEN-VAD, FireRedVAD)
  • Turn detection vs. VAD: why they’re different
  • ASR backends: local (Whisper.cpp, SenseVoice) vs cloud (Deepgram, OpenAI Realtime)
  • LLM provider abstraction (OpenAI-compatible, llama.cpp, Ollama, Anthropic)
  • TTS backends and voice cloning
  • Barge-in and interruption handling
  • Latency budget breakdown
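
To make the "p95 under 400 ms" target concrete ahead of the latency breakdown, here is a minimal sketch of checking a set of measured end-to-end latencies against it. The function names and the nearest-rank percentile definition are assumptions chosen for illustration, and the sample values in `main` are made up, not measurements.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `p` percent
/// of samples are less than or equal to it. Returns `None` for empty input
/// or a `p` outside (0, 100].
fn percentile(samples_ms: &[u64], p: f64) -> Option<u64> {
    if samples_ms.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_unstable();
    // 1-based rank, rounded up.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// True when the p95 latency is strictly under the target.
fn meets_target(samples_ms: &[u64], target_ms: u64) -> bool {
    matches!(percentile(samples_ms, 95.0), Some(v) if v < target_ms)
}

fn main() {
    // Made-up latencies: 19 fast turns and one slow outlier.
    let mut samples = vec![200u64; 19];
    samples.push(900);
    println!("p95 = {:?} ms", percentile(&samples, 95.0));
    println!("meets 400 ms target: {}", meets_target(&samples, 400));
}
```

With 20 samples, a single slow turn still passes under this definition while two slow turns do not, which is what a p95 target permits: up to 5% of turns may exceed the budget.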