# AvatarCore_STT Plugin

Speech-to-text plugin for Unreal Engine. Audio flows through a linear chain of modules:

```
STTRecorder → [STTPreprocessor, ...] → STTProcessor → transcription result
```

The chain is assembled in `STTManagerBase::InitSTTManager` using `BindUFunction` with the string name `"OnChunkReceived"`. The UE reflection system resolves the correct virtual override at bind time, so the manager code does not need to change when signatures change.

---

## ESTTChainState: Pipeline Signal

Every audio chunk carries an `ESTTChainState` (defined in `Public/STTStructs.h`):

| Value | Meaning |
|-------|---------|
| `Processing` | Normal audio: buffer, process, pass through |
| `Finalizing` | End of utterance: flush buffers and trigger transcription |
| `Discarding` | BLOCKED/abort: clear all buffers, cancel in-flight requests |

**Rules:**

- Recorders always emit `Processing`; they have no concept of "final".
- PTT preprocessor emits `Finalizing` when the button is released (→ SILENCE) and `Discarding` when the system becomes BLOCKED.
- VAD preprocessor emits `Finalizing` on the last postroll silence chunk, then calls `UserSpeechStateChanged(SILENCE)` for UI purposes only. It emits `Discarding` from `OnUserSpeechStateChanged` when BLOCKED.
- Pass-through preprocessors (Converter, Debugger, SpeexDSP, WebRTC) forward `ChainState` unchanged. They must pass `Finalizing`/`Discarding` through even when `PCMData` is empty.
- Buffer preprocessor reacts to `ChainState` in-band, with no `OnSpeechStateChanged` subscription.
- Processors react to `ChainState` only, with no `OnSpeechStateChanged` subscriptions.

**Processors still call `UserSpeechStateChanged`** after transcription completes (for UI state), but they do NOT subscribe to it.

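
The rules above can be illustrated with a small standalone sketch. It is plain standard C++, not the plugin's code: `EChainState`, `FPcm`, `FPassThroughStage`, and `FSinkStage` are hypothetical stand-ins for `ESTTChainState`, the PCM array, a pass-through preprocessor, and a processor, and the 16-bit sample type is an assumption. It shows only how an empty-data guard can still forward `Finalizing`/`Discarding`, and how a terminal stage can react to the signal alone.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Stand-in for ESTTChainState (the real enum lives in Public/STTStructs.h).
enum class EChainState { Processing, Finalizing, Discarding };

// Assumption: 16-bit PCM samples; the plugin's actual element type may differ.
using FPcm = std::vector<std::int16_t>;
using FNextStage = std::function<void(const FPcm&, EChainState)>;

// Hypothetical pass-through preprocessor: forwards ChainState unchanged.
struct FPassThroughStage {
    FNextStage Next;

    void OnChunkReceived(const FPcm& Pcm, EChainState State) {
        assert(Next != nullptr);
        if (Pcm.empty()) {
            // Early-return guard: drop empty audio, but never swallow a control signal.
            if (State != EChainState::Processing) {
                Next(Pcm, State);
            }
            return;
        }
        Next(Pcm, State);  // A real stage would convert or filter Pcm here.
    }
};

// Hypothetical processor: buffers audio and reacts to ChainState only.
struct FSinkStage {
    FPcm Buffered;
    int Transcriptions = 0;
    int Discards = 0;

    void OnChunkReceived(const FPcm& Pcm, EChainState State) {
        if (State == EChainState::Discarding) {
            Buffered.clear();  // Abort: drop everything collected so far.
            ++Discards;
            return;
        }
        Buffered.insert(Buffered.end(), Pcm.begin(), Pcm.end());
        if (State == EChainState::Finalizing) {
            if (!Buffered.empty()) {
                ++Transcriptions;  // Placeholder for starting a transcription.
            }
            Buffered.clear();
        }
    }
};
```

In this sketch an empty `Finalizing` chunk still reaches the sink and triggers one transcription of the buffered audio, while an empty `Processing` chunk is dropped at the guard.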

---

## Delegate Types

```cpp
// STTRecorderBase.h
DECLARE_DELEGATE_ThreeParams(FDelegateUnprocessedChunkReceived, TArray, FAudioInformation, ESTTChainState);

// STTPreprocessorBase.h
DECLARE_DELEGATE_ThreeParams(FDelegateProcessedChunk, TArray, FAudioInformation, ESTTChainState);
```

---

## Module Types

### STTRecorder (`Public/Recorder/STTRecorderBase.h`)

Produces audio chunks from a source (microphone, file, pixel stream). Fires `OnChunkReceived` delegate with `ESTTChainState::Processing`. Has no knowledge of speech state.

Implementations: `STTRecorderMicrophone` (PortAudio), `STTRecorderPrimaryMicrophone`, `STTRecorderUnrealMicrophone`, `STTRecorderDebugFile`, `STTRecorderAudioData`.

### STTPreprocessor (`Public/Preprocessor/STTPreprocessorBase.h`)

Chained in sequence. Each receives `OnChunkReceived` and fires `OnChunkProcessed` to the next stage. Both delegates carry `ESTTChainState`.

| Class | Role |
|-------|------|
| `STTPreprocessorConverter` | Stereo→mono, resample to target rate |
| `STTPreprocessorWebRTC` | WebRTC APM (echo cancel, noise suppress, AGC) |
| `STTPreprocessorSpeexDSP` | Speex noise suppression / echo cancel |
| `STTPreprocessorPTT` | Gates audio by PTT button state; emits Finalizing/Discarding |
| `STTPreprocessorVAD` | Voice activity detection; emits Finalizing after postroll, Discarding on BLOCKED |
| `STTPreprocessorBuffer` | Accumulates chunks to a fixed buffer size before forwarding; `bDiscardWhenNotFilledFullyOnce` drops short utterances |
| `STTPreprocessorDebugger` | Writes audio passing through it to a WAV file |

**STTPreprocessorBuffer: `bDiscardWhenNotFilledFullyOnce`.** When enabled, if a `Finalizing` signal arrives before the buffer has ever dispatched a full-size `Processing` chunk in the current utterance, it sends `Discarding` instead. This silently drops very short accidental utterances without sending them to the transcription service.

### STTProcessor (`Public/Processor/STTProcessorBase.h`)

Receives the final audio.
On `Finalizing`: trigger transcription. On `Discarding`: cancel/clear everything.

| Class | Backend |
|-------|---------|
| `STTProcessorAzure` | Microsoft Azure Cognitive Services (streaming, continuous) |
| `STTProcessorWhisper` | OpenAI Whisper / GPT-4o Transcribe (batch HTTP) |
| `STTParakeetProcessorBase` | Local NVIDIA NeMo Parakeet via TCP (JSON protocol) |
| `STTProcessorRealtimeAPI` | OpenAI Realtime API (forwards audio directly) |
| `STTProcessorDebugSaveWav` | Saves all received audio to a WAV file |

---

## Configuration

All modules are configured via `USTTBaseProcessorConfig` (a UObject subclass per processor type). Base settings are in `FSTTBaseSettings` (`Public/STTStructs.h`):

- `bUsePTT`: Push-to-talk vs. freespeech (VAD) mode
- `bCanInterrupt`: Whether user speech can interrupt the avatar
- `FreespeechPostRollTime`: Seconds of silence after speech before `Finalizing` is emitted
- `PTTPostRollTime`: Seconds after PTT release before `Finalizing` (currently unused; PTT emits `Finalizing` immediately on release)
- `MaxTalkingTime`: Hard timeout on PTT press duration
- `VADSettings`: Mode, min speech time, min amplitude threshold, speech-while-blocked threshold
- `WebRTCSettings`: Echo cancellation, noise suppression, AGC flags
- `SpeexDSPSettings`: Speex processing entries
- `STTReplacements`: Word replacement pairs applied to final transcription
- `STTSpecialWords`: Hints passed to transcription service for uncommon words

---

## Key Files

```
Public/
  STTStructs.h                         - ESTTChainState, ESTTTalkingState, FAudioInformation, FSTTBaseSettings
  STTManagerBase.h/.cpp                - Pipeline assembly, state machine, delegate wiring
  Recorder/STTRecorderBase.h           - FDelegateUnprocessedChunkReceived
  Preprocessor/STTPreprocessorBase.h   - FDelegateProcessedChunk, virtual OnChunkReceived
  Processor/STTProcessorBase.h         - virtual OnChunkReceived, OnTranscriptionResult helpers

Private/
  STTManagerBase.cpp                   - InitSTTManager (BindUFunction chain), UserSpeechStateChanged
  Preprocessor/STTPreprocessorPTT.cpp  - Finalizing on SILENCE, Discarding on BLOCKED
  Preprocessor/STTPreprocessorVAD.cpp  - Finalizing after postroll, Discarding on BLOCKED
  Preprocessor/STTPreprocessorBuffer.cpp - ChainState-driven flush, bDiscardWhenNotFilledFullyOnce
  Processor/Azure/STTProcessorAzure.cpp - Streaming Azure recognition
  Processor/Parakeet/STTParakeetProcessorBase.cpp - TCP JSON protocol to Python server
  Processor/Whisper/STTProcessorWhisper.cpp - Batch HTTP to OpenAI
```

---

## State Machine (ESTTTalkingState)

Used for UI and for VAD/PTT internal logic only. Processors do NOT subscribe to `OnSpeechStateChanged`.

```
SILENCE  ──(VAD/PTT detects speech)──▶ TALKING
TALKING  ──(VAD postroll / PTT release)──▶ SILENCE   [Finalizing propagates through chain]
TALKING  ──(SetBlocked)──▶ BLOCKED   [Discarding propagates through chain]
BLOCKED  ──(SetBlocked false / interrupt)──▶ SILENCE
ANY      ──(transcription complete)──▶ SILENCE
```

`TRANSCRIBING` is a transitional state set by Whisper before sending an HTTP request; other processors do not use it.

---

## Common Pitfalls

- **Pass-through preprocessors must forward `Finalizing`/`Discarding` even on empty `PCMData`.** The Converter, SpeexDSP, and WebRTC all have early-return guards for empty/misaligned data. These guards check `ChainState != Processing` before returning so control signals are not swallowed.
- **PTT emits an empty `TArray` with `Finalizing`.** Processors must guard against transcribing zero-length audio (they already do via `BufferedPCMData.Num() == 0` checks).
- **Azure runs a background thread (`FAzureRunnable`).** `StopRecognition(false)` signals a graceful stop; the runnable delivers the final result via `OnRecognized`/`OnRunnableEnded` callbacks on the game thread. `StopRecognition(true)` is a forced abort (used on `Discarding`).
- **Parakeet communicates over TCP with a local Python server** (`ParakeetSTT.bat`).
  In editor (`bKeepAlive=true`) the Python process is kept alive between PIE sessions to avoid restart overhead.
- **`BindUFunction` matches by string name and delegate parameter types.** All `OnChunkReceived` overrides must have exactly the same signature as the base UFUNCTION or the bind will fail at runtime.
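
As a closing illustration of in-band `ChainState` handling, the `bDiscardWhenNotFilledFullyOnce` behaviour described for `STTPreprocessorBuffer` can be sketched in plain standard C++. This is not the plugin's implementation: `FBufferStage`, `EChainState`, `FPcm`, the `BufferSize` member, and the 16-bit sample type are hypothetical, and forwarding the partial tail together with `Finalizing` is an assumption about how a flush might look.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Stand-in for ESTTChainState (the real enum lives in Public/STTStructs.h).
enum class EChainState { Processing, Finalizing, Discarding };

// Assumption: 16-bit PCM samples; the plugin's actual element type may differ.
using FPcm = std::vector<std::int16_t>;

// Hypothetical buffer stage: forwards fixed-size chunks, reacts to ChainState in-band.
class FBufferStage {
public:
    std::size_t BufferSize = 4;                  // Samples per forwarded chunk (illustrative).
    bool bDiscardWhenNotFilledFullyOnce = true;  // Mirrors the documented flag.
    std::function<void(const FPcm&, EChainState)> Next;

    void OnChunkReceived(const FPcm& Pcm, EChainState State) {
        assert(Next != nullptr);
        if (State == EChainState::Discarding) {
            Reset();
            Next(FPcm{}, EChainState::Discarding);
            return;
        }
        Pending.insert(Pending.end(), Pcm.begin(), Pcm.end());
        // Dispatch every complete chunk as normal Processing audio.
        while (Pending.size() >= BufferSize) {
            const auto Split = Pending.begin() + static_cast<std::ptrdiff_t>(BufferSize);
            const FPcm Full(Pending.begin(), Split);
            Pending.erase(Pending.begin(), Split);
            bFilledOnce = true;
            Next(Full, EChainState::Processing);
        }
        if (State == EChainState::Finalizing) {
            // Utterance ended: drop it if no full chunk was ever dispatched.
            const bool bTooShort = bDiscardWhenNotFilledFullyOnce && !bFilledOnce;
            const FPcm Tail = bTooShort ? FPcm{} : Pending;
            Reset();
            Next(Tail, bTooShort ? EChainState::Discarding : EChainState::Finalizing);
        }
    }

private:
    void Reset() {
        Pending.clear();
        bFilledOnce = false;  // The flag is tracked per utterance.
    }

    FPcm Pending;
    bool bFilledOnce = false;
};
```

With `BufferSize = 4`, two samples followed by `Finalizing` produce a single `Discarding` downstream, whereas five samples produce one full `Processing` chunk and then a one-sample tail carrying `Finalizing`.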