AvatarCore_STT Plugin

Speech-to-text plugin for Unreal Engine. Audio flows through a linear chain of modules:

STTRecorder → [STTPreprocessor, ...] → STTProcessor → transcription result

The chain is assembled in STTManagerBase::InitSTTManager using BindUFunction with the string name "OnChunkReceived". Because the binding is by name, the UE reflection system resolves the correct virtual override at bind time, and the manager code does not need to change when the OnChunkReceived signature changes, provided the delegate declarations and every override are updated together.
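The linear wiring can be illustrated outside Unreal with plain C++. This is a minimal sketch under stated assumptions, not the plugin's code: std::function stands in for the UE delegates and BindUFunction, and the names Stage, ChunkCallback, and AssembleChain are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Stand-in for the plugin's ESTTChainState (values as listed in this README).
enum class ChainState { Processing, Finalizing, Discarding };

// Hypothetical callback type: std::function replaces the UE delegate here.
using ChunkCallback = std::function<void(const std::vector<int16_t>&, ChainState)>;

// Hypothetical stage: records its name, then forwards the chunk downstream.
struct Stage {
    std::string Name;
    ChunkCallback Next;               // bound to the following stage, if any
    std::vector<std::string>* Trace;  // shared log of visited stages

    void OnChunkReceived(const std::vector<int16_t>& Pcm, ChainState State) {
        assert(Trace != nullptr);
        Trace->push_back(Name);
        if (Next) Next(Pcm, State);
    }
};

// Wire stage i to stage i+1 so a chunk visits every stage in order.
// The vector must not be resized afterwards: the lambdas hold raw pointers.
inline void AssembleChain(std::vector<Stage>& Stages) {
    for (std::size_t i = 0; i + 1 < Stages.size(); ++i) {
        Stage* Downstream = &Stages[i + 1];
        Stages[i].Next = [Downstream](const std::vector<int16_t>& Pcm, ChainState State) {
            Downstream->OnChunkReceived(Pcm, State);
        };
    }
}
```

In this sketch every stage exposes the same OnChunkReceived entry point, which mirrors why a single name is enough for the manager to bind the whole chain.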

ESTTChainState — Pipeline Signal

Every audio chunk carries an ESTTChainState (defined in Public/STTStructs.h):

| Value | Meaning |
| --- | --- |
| `Processing` | Normal audio: buffer, process, pass through |
| `Finalizing` | End of utterance: flush buffers and trigger transcription |
| `Discarding` | BLOCKED/abort: clear all buffers, cancel in-flight requests |

Rules:

Recorders always emit Processing — they have no concept of "final".
PTT preprocessor emits Finalizing when the button is released (→ SILENCE) and Discarding when the system becomes BLOCKED.
VAD preprocessor emits Finalizing on the last postroll silence chunk, then calls UserSpeechStateChanged(SILENCE) for UI purposes only. It emits Discarding from OnUserSpeechStateChanged when BLOCKED.
Pass-through preprocessors (Converter, Debugger, SpeexDSP, WebRTC) forward ChainState unchanged. They must pass Finalizing/Discarding through even when PCMData is empty.
Buffer preprocessor reacts to ChainState in-band — no OnSpeechStateChanged subscription.
Processors react to ChainState only — no OnSpeechStateChanged subscriptions.

Processors still call UserSpeechStateChanged after transcription completes (for UI state), but they do NOT subscribe to it.
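The PTT rule above can be sketched in standalone C++. This is an illustration only: PttGate, Emitted, and the method names are hypothetical, the Out vector stands in for the OnChunkProcessed delegate, and sending an empty payload with Discarding is an assumption (the README only states that Finalizing carries an empty array).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for the plugin's ESTTChainState.
enum class ChainState { Processing, Finalizing, Discarding };

// One emitted chunk: payload plus in-band control signal.
struct Emitted {
    std::vector<int16_t> Pcm;
    ChainState State;
};

// Hypothetical PTT gate: forwards audio only while the button is held.
class PttGate {
public:
    void PressButton() { bHeld = true; }

    // Release ends the utterance: emit an empty Finalizing chunk.
    void ReleaseButton() {
        if (!bHeld) return;
        bHeld = false;
        Out.push_back(Emitted{{}, ChainState::Finalizing});
    }

    // BLOCKED aborts the utterance: emit an empty Discarding chunk (assumed payload).
    void SetBlocked() {
        bHeld = false;
        Out.push_back(Emitted{{}, ChainState::Discarding});
    }

    // Recorder chunks always arrive as Processing; drop them while not held.
    void OnChunkReceived(const std::vector<int16_t>& Pcm) {
        if (bHeld) Out.push_back(Emitted{Pcm, ChainState::Processing});
    }

    std::vector<Emitted> Out;  // stands in for the OnChunkProcessed delegate

private:
    bool bHeld = false;
};
```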

Delegate Types

// STTRecorderBase.h
DECLARE_DELEGATE_ThreeParams(FDelegateUnprocessedChunkReceived, TArray<int16>, FAudioInformation, ESTTChainState);

// STTPreprocessorBase.h
DECLARE_DELEGATE_ThreeParams(FDelegateProcessedChunk, TArray<int16>, FAudioInformation, ESTTChainState);

Module Types

STTRecorder (`Public/Recorder/STTRecorderBase.h`)

Produces audio chunks from a source (microphone, file, pixel stream). Fires OnChunkReceived delegate with ESTTChainState::Processing. Has no knowledge of speech state.

Implementations: STTRecorderMicrophone (PortAudio), STTRecorderPrimaryMicrophone, STTRecorderUnrealMicrophone, STTRecorderDebugFile, STTRecorderAudioData.

STTPreprocessor (`Public/Preprocessor/STTPreprocessorBase.h`)

Chained in sequence. Each receives OnChunkReceived and fires OnChunkProcessed to the next stage. Both delegates carry ESTTChainState.

| Class | Role |
| --- | --- |
| `STTPreprocessorConverter` | Stereo→mono, resample to target rate |
| `STTPreprocessorWebRTC` | WebRTC APM (echo cancellation, noise suppression, AGC) |
| `STTPreprocessorSpeexDSP` | Speex noise suppression / echo cancellation |
| `STTPreprocessorPTT` | Gates audio by PTT button state; emits Finalizing/Discarding |
| `STTPreprocessorVAD` | Voice activity detection; emits Finalizing after postroll, Discarding on BLOCKED |
| `STTPreprocessorBuffer` | Accumulates chunks to a fixed buffer size before forwarding; `bDiscardWhenNotFilledFullyOnce` drops short utterances |
| `STTPreprocessorDebugger` | Writes audio passing through it to a WAV file |

STTPreprocessorBuffer — bDiscardWhenNotFilledFullyOnce:
When enabled, if a Finalizing signal arrives before the buffer has ever dispatched a full-size Processing chunk in the current utterance, it sends Discarding instead. This silently drops very short accidental utterances without sending them to the transcription service.
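The short-utterance rule can be sketched as follows. This is a simplified standalone model, not the plugin's implementation: BufferStage, Emitted, and Out are hypothetical names, and details such as forwarding the leftover samples together with Finalizing are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

enum class ChainState { Processing, Finalizing, Discarding };

struct Emitted {
    std::vector<int16_t> Pcm;
    ChainState State;
};

// Hypothetical model of a fixed-size buffering stage.
class BufferStage {
public:
    BufferStage(std::size_t InBufferSize, bool bInDiscardShort)
        : BufferSize(InBufferSize), bDiscardWhenNotFilledFullyOnce(bInDiscardShort) {
        assert(BufferSize > 0);  // a zero size would never drain the buffer
    }

    void OnChunkReceived(const std::vector<int16_t>& Pcm, ChainState State) {
        if (State == ChainState::Discarding) {
            Reset();
            Out.push_back(Emitted{{}, ChainState::Discarding});
            return;
        }
        Pending.insert(Pending.end(), Pcm.begin(), Pcm.end());
        // Dispatch every complete buffer as a Processing chunk.
        while (Pending.size() >= BufferSize) {
            const auto End = Pending.begin() + static_cast<std::ptrdiff_t>(BufferSize);
            std::vector<int16_t> Full(Pending.begin(), End);
            Pending.erase(Pending.begin(), End);
            Out.push_back(Emitted{std::move(Full), ChainState::Processing});
            bFilledOnce = true;
        }
        if (State == ChainState::Finalizing) {
            if (bDiscardWhenNotFilledFullyOnce && !bFilledOnce) {
                // Utterance never filled one buffer: drop it instead of transcribing.
                Out.push_back(Emitted{{}, ChainState::Discarding});
            } else {
                Out.push_back(Emitted{std::move(Pending), ChainState::Finalizing});
            }
            Reset();
        }
    }

    std::vector<Emitted> Out;  // stands in for the OnChunkProcessed delegate

private:
    void Reset() { Pending.clear(); bFilledOnce = false; }

    std::size_t BufferSize;
    bool bDiscardWhenNotFilledFullyOnce;
    bool bFilledOnce = false;
    std::vector<int16_t> Pending;
};
```

With a buffer size of 4, an utterance of 2 samples followed by Finalizing yields a single Discarding chunk, while an utterance of 5 samples yields one full Processing chunk and a 1-sample Finalizing chunk.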

STTProcessor (`Public/Processor/STTProcessorBase.h`)

Receives the fully preprocessed audio at the end of the chain. On Finalizing, it triggers transcription. On Discarding, it cancels any in-flight request and clears its buffers.
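A batch-style processor's reaction to the three signals might look roughly like this. The sketch is standalone C++ under stated assumptions: BatchProcessor and TranscribeFn are hypothetical, the callback stands in for a real backend request, and the zero-length guard follows the pitfall noted later in this README.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

enum class ChainState { Processing, Finalizing, Discarding };

// Hypothetical batch processor: buffers audio, transcribes on Finalizing.
class BatchProcessor {
public:
    using TranscribeFn = std::function<void(const std::vector<int16_t>&)>;

    explicit BatchProcessor(TranscribeFn InTranscribe)
        : Transcribe(std::move(InTranscribe)) {
        assert(static_cast<bool>(Transcribe));
    }

    void OnChunkReceived(const std::vector<int16_t>& Pcm, ChainState State) {
        switch (State) {
        case ChainState::Processing:
            Buffered.insert(Buffered.end(), Pcm.begin(), Pcm.end());
            break;
        case ChainState::Finalizing:
            // The final chunk may be empty (PTT) or carry trailing audio.
            Buffered.insert(Buffered.end(), Pcm.begin(), Pcm.end());
            if (!Buffered.empty()) Transcribe(Buffered);  // skip zero-length audio
            Buffered.clear();
            break;
        case ChainState::Discarding:
            Buffered.clear();  // a real backend would also cancel in-flight work
            break;
        }
    }

    std::size_t BufferedSamples() const { return Buffered.size(); }

private:
    TranscribeFn Transcribe;
    std::vector<int16_t> Buffered;
};
```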

| Class | Backend |
| --- | --- |
| `STTProcessorAzure` | Microsoft Azure Cognitive Services (streaming, continuous) |
| `STTProcessorWhisper` | OpenAI Whisper / GPT-4o Transcribe (batch HTTP) |
| `STTParakeetProcessorBase` | Local NVIDIA NeMo Parakeet via TCP (JSON protocol) |
| `STTProcessorRealtimeAPI` | OpenAI Realtime API (forwards audio directly) |
| `STTProcessorDebugSaveWav` | Saves all received audio to a WAV file |

Configuration

All modules are configured via USTTBaseProcessorConfig (a UObject subclass per processor type). Base settings are in FSTTBaseSettings (Public/STTStructs.h):

bUsePTT — Push-to-talk vs. freespeech (VAD) mode
bCanInterrupt — Whether user speech can interrupt the avatar
FreespeechPostRollTime — Seconds of silence after speech before Finalizing is emitted
PTTPostRollTime — Seconds after PTT release before Finalizing (currently unused — PTT emits Finalizing immediately on release)
MaxTalkingTime — Hard timeout on PTT press duration
VADSettings — Mode, min speech time, min amplitude threshold, speech-while-blocked threshold
WebRTCSettings — Echo cancellation, noise suppression, AGC flags
SpeexDSPSettings — Speex processing entries
STTReplacements — Word replacement pairs applied to final transcription
STTSpecialWords — Hints passed to transcription service for uncommon words

Key Files

Public/
  STTStructs.h                        — ESTTChainState, ESTTTalkingState, FAudioInformation, FSTTBaseSettings
  STTManagerBase.h                    — Pipeline assembly, state machine, delegate wiring
  Recorder/STTRecorderBase.h          — FDelegateUnprocessedChunkReceived
  Preprocessor/STTPreprocessorBase.h  — FDelegateProcessedChunk, virtual OnChunkReceived
  Processor/STTProcessorBase.h        — virtual OnChunkReceived, OnTranscriptionResult helpers

Private/
  STTManagerBase.cpp                  — InitSTTManager (BindUFunction chain), UserSpeechStateChanged
  Preprocessor/STTPreprocessorPTT.cpp — Finalizing on SILENCE, Discarding on BLOCKED
  Preprocessor/STTPreprocessorVAD.cpp — Finalizing after postroll, Discarding on BLOCKED
  Preprocessor/STTPreprocessorBuffer.cpp — ChainState-driven flush, bDiscardWhenNotFilledFullyOnce
  Processor/Azure/STTProcessorAzure.cpp       — Streaming Azure recognition
  Processor/Parakeet/STTParakeetProcessorBase.cpp — TCP JSON protocol to Python server
  Processor/Whisper/STTProcessorWhisper.cpp   — Batch HTTP to OpenAI

State Machine (ESTTTalkingState)

Used for UI and for VAD/PTT internal logic only. Processors do NOT subscribe to OnSpeechStateChanged.

SILENCE ──(VAD/PTT detects speech)──▶ TALKING
TALKING ──(VAD postroll / PTT release)──▶ SILENCE   [Finalizing propagates through chain]
TALKING ──(SetBlocked)──▶ BLOCKED                   [Discarding propagates through chain]
BLOCKED ──(SetBlocked false / interrupt)──▶ SILENCE
ANY     ──(transcription complete)──▶ SILENCE

TRANSCRIBING is a transitional state set by Whisper before sending an HTTP request; other processors do not use it.
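The transitions in the diagram can be written as a small transition function. This is a sketch only: TalkingState, TalkEvent, and NextState are hypothetical names, and leaving the state unchanged for any pair not shown in the diagram is an assumption. Entering TRANSCRIBING is not modelled, since the README only says Whisper sets it before an HTTP request.

```cpp
#include <cassert>

// Stand-in for ESTTTalkingState.
enum class TalkingState { Silence, Talking, Blocked, Transcribing };

// Hypothetical events, one per arrow label in the diagram.
enum class TalkEvent {
    SpeechDetected,        // VAD/PTT detects speech
    UtteranceEnded,        // VAD postroll elapsed or PTT released
    Blocked,               // SetBlocked
    Unblocked,             // SetBlocked false / interrupt
    TranscriptionComplete
};

inline TalkingState NextState(TalkingState Current, TalkEvent Event) {
    // "ANY -> SILENCE" applies regardless of the current state.
    if (Event == TalkEvent::TranscriptionComplete) return TalkingState::Silence;

    switch (Current) {
    case TalkingState::Silence:
        if (Event == TalkEvent::SpeechDetected) return TalkingState::Talking;
        break;
    case TalkingState::Talking:
        if (Event == TalkEvent::UtteranceEnded) return TalkingState::Silence;
        if (Event == TalkEvent::Blocked) return TalkingState::Blocked;
        break;
    case TalkingState::Blocked:
        if (Event == TalkEvent::Unblocked) return TalkingState::Silence;
        break;
    case TalkingState::Transcribing:
        break;  // only TranscriptionComplete leaves this state in the sketch
    }
    return Current;  // assumed: undocumented pairs leave the state unchanged
}
```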

Common Pitfalls

Pass-through preprocessors must forward Finalizing/Discarding even on empty PCMData. The Converter, SpeexDSP, and WebRTC preprocessors all have early-return guards for empty or misaligned data. Each guard returns early only when ChainState is Processing; a Finalizing or Discarding chunk is still forwarded, so control signals are not swallowed.
PTT emits an empty TArray<int16> with Finalizing. Processors must guard against transcribing zero-length audio (they already do via BufferedPCMData.Num() == 0 checks).
Azure runs a background thread (FAzureRunnable). StopRecognition(false) signals a graceful stop; the runnable delivers the final result via OnRecognized/OnRunnableEnded callbacks on the game thread. StopRecognition(true) is a forced abort (used on Discarding).
Parakeet communicates over TCP with a local Python server (ParakeetSTT.bat). In editor (bKeepAlive=true) the Python process is kept alive between PIE sessions to avoid restart overhead.
BindUFunction matches by string name and delegate parameter types. All OnChunkReceived overrides must have exactly the same signature as the base UFUNCTION or the bind will fail at runtime.
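The first pitfall, an early-return guard that must not swallow control signals, can be illustrated with a simplified stereo-to-mono stage. This is a standalone sketch, not the Converter's actual code: StereoToMono, Emitted, and Out are hypothetical, and treating an odd sample count as "misaligned" is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

enum class ChainState { Processing, Finalizing, Discarding };

struct Emitted {
    std::vector<int16_t> Pcm;
    ChainState State;
};

// Hypothetical pass-through stage: averages interleaved L/R samples.
class StereoToMono {
public:
    void OnChunkReceived(const std::vector<int16_t>& Pcm, ChainState State) {
        const bool bUnusable = Pcm.empty() || (Pcm.size() % 2 != 0);
        if (bUnusable) {
            // Drop unusable audio, but still forward Finalizing/Discarding.
            if (State != ChainState::Processing) {
                Out.push_back(Emitted{{}, State});
            }
            return;
        }

        std::vector<int16_t> Mono;
        Mono.reserve(Pcm.size() / 2);
        for (std::size_t i = 0; i + 1 < Pcm.size(); i += 2) {
            // Widen before adding so the sum cannot overflow int16_t.
            const int32_t Sum = static_cast<int32_t>(Pcm[i]) + static_cast<int32_t>(Pcm[i + 1]);
            Mono.push_back(static_cast<int16_t>(Sum / 2));
        }
        assert(Mono.size() * 2 == Pcm.size());
        Out.push_back(Emitted{std::move(Mono), State});  // ChainState passes through unchanged
    }

    std::vector<Emitted> Out;  // stands in for the OnChunkProcessed delegate
};
```

A guard written as a bare `if (Pcm.empty()) return;` would lose the empty Finalizing chunk that PTT emits, and the processor would never start transcription.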
