AvatarCore_STT Plugin
Speech-to-text plugin for Unreal Engine. Audio flows through a linear chain of modules:
STTRecorder → [STTPreprocessor, ...] → STTProcessor → transcription result
The chain is assembled in `STTManagerBase::InitSTTManager` using `BindUFunction` with the string name `"OnChunkReceived"`. The UE reflection system resolves the correct virtual override at bind time, so the manager's wiring code does not need to change when the chunk signature changes; only the delegate declarations and the `OnChunkReceived` overrides do.
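The wiring idea can be sketched outside the engine. This is a minimal, engine-free illustration that uses plain `std::function` and virtual dispatch instead of UE delegates and `BindUFunction`; `FStage`, `FRecordingSink`, `FAudioInfoSketch`, and `AssembleChain` are hypothetical names, not the plugin's classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Mirrors the three chain states described in this document.
enum class ESTTChainState { Processing, Finalizing, Discarding };

// Hypothetical stand-in for FAudioInformation (fields are assumptions).
struct FAudioInfoSketch { int SampleRate = 16000; int NumChannels = 1; };

using FChunkCallback =
    std::function<void(std::vector<std::int16_t>, FAudioInfoSketch, ESTTChainState)>;

// One stage of the chain. The default behaviour is a pure pass-through.
struct FStage {
    virtual ~FStage() = default;
    virtual void OnChunkReceived(std::vector<std::int16_t> Pcm,
                                 FAudioInfoSketch Info, ESTTChainState State) {
        if (Next) { Next(std::move(Pcm), Info, State); }
    }
    FChunkCallback Next;  // plays the role of the "chunk processed" delegate
};

// Final stage: records what reaches the end of the chain.
struct FRecordingSink : FStage {
    void OnChunkReceived(std::vector<std::int16_t> Pcm,
                         FAudioInfoSketch, ESTTChainState State) override {
        SamplesSeen += Pcm.size();
        States.push_back(State);
    }
    std::size_t SamplesSeen = 0;
    std::vector<ESTTChainState> States;
};

// Connects each stage's Next to the following stage's OnChunkReceived and
// returns the entry point a recorder would call. The call goes through the
// base pointer, so the most-derived override runs without the wiring code
// knowing the concrete stage types.
FChunkCallback AssembleChain(const std::vector<FStage*>& Stages) {
    assert(!Stages.empty());
    for (std::size_t i = 0; i + 1 < Stages.size(); ++i) {
        FStage* Downstream = Stages[i + 1];
        Stages[i]->Next = [Downstream](std::vector<std::int16_t> Pcm,
                                       FAudioInfoSketch Info, ESTTChainState State) {
            Downstream->OnChunkReceived(std::move(Pcm), Info, State);
        };
    }
    FStage* First = Stages.front();
    return [First](std::vector<std::int16_t> Pcm,
                   FAudioInfoSketch Info, ESTTChainState State) {
        First->OnChunkReceived(std::move(Pcm), Info, State);
    };
}
```

Calling the returned callback with a chunk walks it through every stage in order; the stages must outlive the callback because it holds raw pointers to them.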
ESTTChainState — Pipeline Signal
Every audio chunk carries an ESTTChainState (defined in Public/STTStructs.h):
| Value | Meaning |
|---|---|
| `Processing` | Normal audio: buffer, process, pass through |
| `Finalizing` | End of utterance: flush buffers and trigger transcription |
| `Discarding` | BLOCKED/abort: clear all buffers, cancel in-flight requests |
Rules:
- Recorders always emit `Processing`; they have no concept of "final".
- PTT preprocessor emits `Finalizing` when the button is released (→ SILENCE) and `Discarding` when the system becomes BLOCKED.
- VAD preprocessor emits `Finalizing` on the last postroll silence chunk, then calls `UserSpeechStateChanged(SILENCE)` for UI purposes only. It emits `Discarding` from `OnUserSpeechStateChanged` when BLOCKED.
- Pass-through preprocessors (Converter, Debugger, SpeexDSP, WebRTC) forward `ChainState` unchanged. They must pass `Finalizing`/`Discarding` through even when `PCMData` is empty.
- Buffer preprocessor reacts to `ChainState` in-band (no `OnSpeechStateChanged` subscription).
- Processors react to `ChainState` only (no `OnSpeechStateChanged` subscriptions).
Processors still call UserSpeechStateChanged after transcription completes (for UI state), but they do NOT subscribe to it.
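The pass-through rule is the easiest one to get wrong, so here is a minimal sketch of it. It assumes a simple stereo-to-mono averaging step; `FChunk` and `DownmixStereoToMono` are hypothetical names, and the real Converter may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

enum class ESTTChainState { Processing, Finalizing, Discarding };

// Hypothetical container for what a stage forwards downstream.
struct FChunk {
    std::vector<std::int16_t> Pcm;
    ESTTChainState State;
};

// Averages interleaved L/R samples into mono.
// Returns true when Out should be forwarded to the next stage.
bool DownmixStereoToMono(const std::vector<std::int16_t>& Interleaved,
                         ESTTChainState State, FChunk& Out) {
    const bool bUnusable = Interleaved.empty() || (Interleaved.size() % 2 != 0);
    if (bUnusable) {
        // Plain audio with nothing usable in it can be dropped silently.
        if (State == ESTTChainState::Processing) { return false; }
        // Control signals must survive: forward them with an empty payload.
        Out.Pcm.clear();
        Out.State = State;
        return true;
    }

    std::vector<std::int16_t> Mono;
    Mono.reserve(Interleaved.size() / 2);
    for (std::size_t i = 0; i + 1 < Interleaved.size(); i += 2) {
        // Widen to int so the sum of two int16 values cannot overflow.
        const int Sum = static_cast<int>(Interleaved[i]) + static_cast<int>(Interleaved[i + 1]);
        Mono.push_back(static_cast<std::int16_t>(Sum / 2));
    }
    Out.Pcm = std::move(Mono);
    Out.State = State;  // ChainState is forwarded unchanged
    return true;
}
```

The key line is the early-return guard: it only swallows the chunk when the state is `Processing`.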
Delegate Types
// STTRecorderBase.h
DECLARE_DELEGATE_ThreeParams(FDelegateUnprocessedChunkReceived, TArray<int16>, FAudioInformation, ESTTChainState);
// STTPreprocessorBase.h
DECLARE_DELEGATE_ThreeParams(FDelegateProcessedChunk, TArray<int16>, FAudioInformation, ESTTChainState);
Module Types
STTRecorder (Public/Recorder/STTRecorderBase.h)
Produces audio chunks from a source (microphone, file, pixel stream). Fires OnChunkReceived delegate with ESTTChainState::Processing. Has no knowledge of speech state.
Implementations: STTRecorderMicrophone (PortAudio), STTRecorderPrimaryMicrophone, STTRecorderUnrealMicrophone, STTRecorderDebugFile, STTRecorderAudioData.
STTPreprocessor (Public/Preprocessor/STTPreprocessorBase.h)
Chained in sequence. Each receives OnChunkReceived and fires OnChunkProcessed to the next stage. Both delegates carry ESTTChainState.
| Class | Role |
|---|---|
| `STTPreprocessorConverter` | Stereo→mono, resample to target rate |
| `STTPreprocessorWebRTC` | WebRTC APM (echo cancel, noise suppress, AGC) |
| `STTPreprocessorSpeexDSP` | Speex noise suppression / echo cancel |
| `STTPreprocessorPTT` | Gates audio by PTT button state; emits `Finalizing`/`Discarding` |
| `STTPreprocessorVAD` | Voice activity detection; emits `Finalizing` after postroll, `Discarding` on BLOCKED |
| `STTPreprocessorBuffer` | Accumulates chunks to a fixed buffer size before forwarding; `bDiscardWhenNotFilledFullyOnce` drops short utterances |
| `STTPreprocessorDebugger` | Writes audio passing through it to a WAV file |
STTPreprocessorBuffer — bDiscardWhenNotFilledFullyOnce:
When enabled, if a Finalizing signal arrives before the buffer has ever dispatched a full-size Processing chunk in the current utterance, it sends Discarding instead. This silently drops very short accidental utterances without sending them to the transcription service.
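The behaviour described above can be sketched as follows. This is an engine-free illustration under stated assumptions: `FBufferStage` and `FChunk` are hypothetical names, the stage returns the chunks it would forward instead of firing a delegate, and the remainder is assumed to be flushed together with `Finalizing`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class ESTTChainState { Processing, Finalizing, Discarding };

struct FChunk {
    std::vector<std::int16_t> Pcm;
    ESTTChainState State;
};

class FBufferStage {
public:
    FBufferStage(std::size_t InBufferSize, bool bInDiscardWhenNotFilledFullyOnce)
        : BufferSize(InBufferSize),
          bDiscardWhenNotFilledFullyOnce(bInDiscardWhenNotFilledFullyOnce) {
        assert(BufferSize > 0);
    }

    // Returns every chunk this stage would forward for one incoming chunk.
    std::vector<FChunk> OnChunkReceived(const std::vector<std::int16_t>& Pcm,
                                        ESTTChainState State) {
        std::vector<FChunk> Out;
        if (State == ESTTChainState::Discarding) {
            Reset();
            Out.push_back(FChunk{{}, ESTTChainState::Discarding});
            return Out;
        }

        Pending.insert(Pending.end(), Pcm.begin(), Pcm.end());
        // Dispatch as many full-size Processing chunks as are available.
        while (Pending.size() >= BufferSize) {
            const auto End = Pending.begin() + static_cast<std::ptrdiff_t>(BufferSize);
            Out.push_back(FChunk{std::vector<std::int16_t>(Pending.begin(), End),
                                 ESTTChainState::Processing});
            Pending.erase(Pending.begin(), End);
            bFilledOnce = true;
        }

        if (State == ESTTChainState::Finalizing) {
            if (bDiscardWhenNotFilledFullyOnce && !bFilledOnce) {
                // Utterance never filled the buffer once: drop it.
                Out.push_back(FChunk{{}, ESTTChainState::Discarding});
            } else {
                // Flush whatever is left together with the control signal.
                Out.push_back(FChunk{Pending, ESTTChainState::Finalizing});
            }
            Reset();  // the next utterance starts from a clean state
        }
        return Out;
    }

private:
    void Reset() { Pending.clear(); bFilledOnce = false; }

    std::size_t BufferSize;
    bool bDiscardWhenNotFilledFullyOnce;
    bool bFilledOnce = false;
    std::vector<std::int16_t> Pending;
};
```

With a buffer size of 4, a two-sample utterance followed by `Finalizing` yields a single `Discarding` chunk, while a five-sample utterance yields one full `Processing` chunk and then a `Finalizing` chunk carrying the remaining sample.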
STTProcessor (Public/Processor/STTProcessorBase.h)
Receives the final audio. On Finalizing: trigger transcription. On Discarding: cancel/clear everything.
| Class | Backend |
|---|---|
| `STTProcessorAzure` | Microsoft Azure Cognitive Services (streaming, continuous) |
| `STTProcessorWhisper` | OpenAI Whisper / GPT-4o Transcribe (batch HTTP) |
| `STTParakeetProcessorBase` | Local NVIDIA NeMo Parakeet via TCP (JSON protocol) |
| `STTProcessorRealtimeAPI` | OpenAI Realtime API (forwards audio directly) |
| `STTProcessorDebugSaveWav` | Saves all received audio to a WAV file |
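The processor contract (transcribe on `Finalizing`, clear on `Discarding`) can be sketched for a batch-style backend as below. `FBatchProcessorSketch` and its callback are hypothetical; the member name `BufferedPCMData` follows the guard mentioned under Common Pitfalls, and a real processor would start an HTTP or TCP request where this sketch invokes the callback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

enum class ESTTChainState { Processing, Finalizing, Discarding };

class FBatchProcessorSketch {
public:
    using FTranscribeFn = std::function<void(const std::vector<std::int16_t>&)>;

    explicit FBatchProcessorSketch(FTranscribeFn InTranscribe)
        : Transcribe(std::move(InTranscribe)) {
        assert(static_cast<bool>(Transcribe));
    }

    void OnChunkReceived(const std::vector<std::int16_t>& Pcm, ESTTChainState State) {
        if (State == ESTTChainState::Discarding) {
            // Abort: drop everything collected for this utterance.
            BufferedPCMData.clear();
            return;
        }

        BufferedPCMData.insert(BufferedPCMData.end(), Pcm.begin(), Pcm.end());

        if (State == ESTTChainState::Finalizing) {
            // Guard against zero-length audio, e.g. PTT released immediately.
            if (!BufferedPCMData.empty()) {
                Transcribe(BufferedPCMData);
            }
            BufferedPCMData.clear();
        }
    }

    std::size_t NumBuffered() const { return BufferedPCMData.size(); }

private:
    FTranscribeFn Transcribe;
    std::vector<std::int16_t> BufferedPCMData;
};
```

Note that the processor never looks at the talking state; the chain state carried with each chunk is the only trigger.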
Configuration
All modules are configured via USTTBaseProcessorConfig (a UObject subclass per processor type). Base settings are in FSTTBaseSettings (Public/STTStructs.h):
- `bUsePTT`: Push-to-talk vs. freespeech (VAD) mode
- `bCanInterrupt`: Whether user speech can interrupt the avatar
- `FreespeechPostRollTime`: Seconds of silence after speech before `Finalizing` is emitted
- `PTTPostRollTime`: Seconds after PTT release before `Finalizing` (currently unused; PTT emits `Finalizing` immediately on release)
- `MaxTalkingTime`: Hard timeout on PTT press duration
- `VADSettings`: Mode, min speech time, min amplitude threshold, speech-while-blocked threshold
- `WebRTCSettings`: Echo cancellation, noise suppression, AGC flags
- `SpeexDSPSettings`: Speex processing entries
- `STTReplacements`: Word replacement pairs applied to final transcription
- `STTSpecialWords`: Hints passed to transcription service for uncommon words
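As an illustration of what `STTReplacements` does to a final transcription, here is a minimal sketch. The matching rules are assumptions: it performs case-sensitive substring replacement in list order, and `ApplyReplacements` is a hypothetical helper. The plugin may match whole words or ignore case instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Applies each (from, to) pair to the text, in the order given.
std::string ApplyReplacements(
    std::string Text,
    const std::vector<std::pair<std::string, std::string>>& Pairs) {
    for (const auto& Pair : Pairs) {
        const std::string& From = Pair.first;
        const std::string& To = Pair.second;
        if (From.empty()) { continue; }  // an empty key would match everywhere

        std::size_t Pos = 0;
        while ((Pos = Text.find(From, Pos)) != std::string::npos) {
            Text.replace(Pos, From.size(), To);
            // Skip past the inserted text so a replacement is never re-scanned.
            Pos += To.size();
        }
    }
    return Text;
}
```

A typical use is correcting names the transcription service consistently mishears, for example mapping `"open a i"` to `"OpenAI"`.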
Key Files
Public/
STTStructs.h — ESTTChainState, ESTTTalkingState, FAudioInformation, FSTTBaseSettings
STTManagerBase.h/.cpp — Pipeline assembly, state machine, delegate wiring
Recorder/STTRecorderBase.h — FDelegateUnprocessedChunkReceived
Preprocessor/STTPreprocessorBase.h — FDelegateProcessedChunk, virtual OnChunkReceived
Processor/STTProcessorBase.h — virtual OnChunkReceived, OnTranscriptionResult helpers
Private/
STTManagerBase.cpp — InitSTTManager (BindUFunction chain), UserSpeechStateChanged
Preprocessor/STTPreprocessorPTT.cpp — Finalizing on SILENCE, Discarding on BLOCKED
Preprocessor/STTPreprocessorVAD.cpp — Finalizing after postroll, Discarding on BLOCKED
Preprocessor/STTPreprocessorBuffer.cpp — ChainState-driven flush, bDiscardWhenNotFilledFullyOnce
Processor/Azure/STTProcessorAzure.cpp — Streaming Azure recognition
Processor/Parakeet/STTParakeetProcessorBase.cpp — TCP JSON protocol to Python server
Processor/Whisper/STTProcessorWhisper.cpp — Batch HTTP to OpenAI
State Machine (ESTTTalkingState)
Used for UI and for VAD/PTT internal logic only. Processors do NOT subscribe to OnSpeechStateChanged.
SILENCE ──(VAD/PTT detects speech)──▶ TALKING
TALKING ──(VAD postroll / PTT release)──▶ SILENCE [Finalizing propagates through chain]
TALKING ──(SetBlocked)──▶ BLOCKED [Discarding propagates through chain]
BLOCKED ──(SetBlocked false / interrupt)──▶ SILENCE
ANY ──(transcription complete)──▶ SILENCE
TRANSCRIBING is a transitional state set by Whisper before sending an HTTP request; other processors do not use it.
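The transitions in the diagram above can be written as a small pure function. This sketch is an assumption-laden model, not the manager's code: `ESpeechEvent`, `FTransition`, and `Step` are hypothetical names, and any state/event pair not listed in the diagram is treated as a no-op, which may differ from the plugin.

```cpp
#include <cassert>

enum class ESTTTalkingState { SILENCE, TALKING, BLOCKED, TRANSCRIBING };
enum class ESTTChainState { Processing, Finalizing, Discarding };

// Hypothetical event names for the arrows in the diagram.
enum class ESpeechEvent {
    SpeechDetected,        // VAD/PTT detects speech
    SpeechEnded,           // VAD postroll elapsed or PTT released
    SetBlockedTrue,        // SetBlocked
    SetBlockedFalse,       // SetBlocked false / interrupt
    TranscriptionComplete
};

struct FTransition {
    ESTTTalkingState Next;
    bool bEmitsSignal;      // true when a control signal enters the chain
    ESTTChainState Signal;  // meaningful only when bEmitsSignal is true
};

FTransition Step(ESTTTalkingState Current, ESpeechEvent Event) {
    FTransition Result{Current, false, ESTTChainState::Processing};

    // ANY --(transcription complete)--> SILENCE
    if (Event == ESpeechEvent::TranscriptionComplete) {
        Result.Next = ESTTTalkingState::SILENCE;
        return Result;
    }

    switch (Current) {
    case ESTTTalkingState::SILENCE:
        if (Event == ESpeechEvent::SpeechDetected) {
            Result.Next = ESTTTalkingState::TALKING;
        }
        break;
    case ESTTTalkingState::TALKING:
        if (Event == ESpeechEvent::SpeechEnded) {
            Result.Next = ESTTTalkingState::SILENCE;
            Result.bEmitsSignal = true;
            Result.Signal = ESTTChainState::Finalizing;
        } else if (Event == ESpeechEvent::SetBlockedTrue) {
            Result.Next = ESTTTalkingState::BLOCKED;
            Result.bEmitsSignal = true;
            Result.Signal = ESTTChainState::Discarding;
        }
        break;
    case ESTTTalkingState::BLOCKED:
        if (Event == ESpeechEvent::SetBlockedFalse) {
            Result.Next = ESTTTalkingState::SILENCE;
        }
        break;
    case ESTTTalkingState::TRANSCRIBING:
        break;  // only TranscriptionComplete leaves this state in the sketch
    }
    return Result;
}
```

Keeping the talking state and the emitted chain signal in one return value makes the coupling explicit: only the two transitions out of TALKING put a control signal into the chain.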
Common Pitfalls
- Pass-through preprocessors must forward `Finalizing`/`Discarding` even on empty `PCMData`. The Converter, SpeexDSP, and WebRTC all have early-return guards for empty/misaligned data; these guards check `ChainState != Processing` before returning so control signals are not swallowed.
- PTT emits an empty `TArray<int16>` with `Finalizing`. Processors must guard against transcribing zero-length audio (they already do via `BufferedPCMData.Num() == 0` checks).
- Azure runs a background thread (`FAzureRunnable`). `StopRecognition(false)` signals a graceful stop; the runnable delivers the final result via `OnRecognized`/`OnRunnableEnded` callbacks on the game thread. `StopRecognition(true)` is a forced abort (used on `Discarding`).
- Parakeet communicates over TCP with a local Python server (`ParakeetSTT.bat`). In editor (`bKeepAlive=true`) the Python process is kept alive between PIE sessions to avoid restart overhead.
- `BindUFunction` matches by string name and delegate parameter types. All `OnChunkReceived` overrides must have exactly the same signature as the base UFUNCTION or the bind will fail at runtime.