Project for SPIE: avatar for safety briefing / management event

AvatarCore_STT Plugin

Speech-to-text plugin for Unreal Engine. Audio flows through a linear chain of modules:

STTRecorder → [STTPreprocessor, ...] → STTProcessor → transcription result

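The chain is assembled in STTManagerBase::InitSTTManager using BindUFunction with the string name "OnChunkReceived". Because the function is looked up by name through the UE reflection system, each call dispatches to the module's own virtual override. The manager's binding code therefore stays the same when the chunk signature changes, as long as the delegate declarations and every OnChunkReceived override are updated together (see Common Pitfalls).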
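The linear wiring can be illustrated without the engine. The following is a minimal sketch in standard C++, assuming std::function as a stand-in for the Unreal delegates; the names Chunk, Stage, and BuildChain are hypothetical and do not come from the plugin, which binds by name via BindUFunction instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Stand-in for ESTTChainState from Public/STTStructs.h.
enum class ChainState { Processing, Finalizing, Discarding };

// Hypothetical chunk type: PCM samples plus the in-band pipeline signal.
struct Chunk {
    std::vector<int16_t> pcm;
    ChainState state;
};

using ChunkHandler = std::function<void(const Chunk&)>;

// Hypothetical stage: records that it ran, then forwards the chunk unchanged.
struct Stage {
    std::string name;
    ChunkHandler next;                // plays the role of OnChunkProcessed
    std::vector<std::string>* trace;  // shared log of visited stage names

    void OnChunkReceived(const Chunk& chunk) {
        trace->push_back(name);
        if (next) next(chunk);
    }
};

// Wires stage[i] to stage[i + 1]; the last stage hands the chunk to `sink`.
// Returns the entry point a recorder would call. The handlers hold pointers
// into `stages`, so the vector must outlive them and must not be resized.
ChunkHandler BuildChain(std::vector<Stage>& stages, ChunkHandler sink) {
    if (stages.empty()) return sink;
    for (std::size_t i = 0; i < stages.size(); ++i) {
        if (i + 1 < stages.size()) {
            Stage* following = &stages[i + 1];
            stages[i].next = [following](const Chunk& c) { following->OnChunkReceived(c); };
        } else {
            stages[i].next = sink;
        }
    }
    Stage* first = &stages.front();
    return [first](const Chunk& c) { first->OnChunkReceived(c); };
}
```

In this sketch the recorder only needs the returned entry point, which mirrors how each module in the plugin only knows the delegate it fires, not the concrete class behind it.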


ESTTChainState — Pipeline Signal

Every audio chunk carries an ESTTChainState (defined in Public/STTStructs.h):

| Value | Meaning |
| --- | --- |
| Processing | Normal audio: buffer, process, pass through |
| Finalizing | End of utterance: flush buffers and trigger transcription |
| Discarding | BLOCKED/abort: clear all buffers, cancel in-flight requests |

Rules:

  • Recorders always emit Processing — they have no concept of "final".
  • PTT preprocessor emits Finalizing when the button is released (→ SILENCE) and Discarding when the system becomes BLOCKED.
  • VAD preprocessor emits Finalizing on the last postroll silence chunk, then calls UserSpeechStateChanged(SILENCE) for UI purposes only. It emits Discarding from OnUserSpeechStateChanged when BLOCKED.
  • Pass-through preprocessors (Converter, Debugger, SpeexDSP, WebRTC) forward ChainState unchanged. They must pass Finalizing/Discarding through even when PCMData is empty.
  • Buffer preprocessor reacts to ChainState in-band — no OnSpeechStateChanged subscription.
  • Processors react to ChainState only — no OnSpeechStateChanged subscriptions.

Processors still call UserSpeechStateChanged after transcription completes (for UI state), but they do NOT subscribe to it.
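The PTT rules above can be sketched as a small gate. This is an illustrative model in standard C++, not the plugin's STTPreprocessorPTT; the class PttGate, its methods, and the Emit callback are hypothetical names, and the sketch assumes that a BLOCKED signal always emits Discarding regardless of the previous state.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

enum class ChainState { Processing, Finalizing, Discarding };
enum class TalkingState { Silence, Talking, Blocked };

using Emit = std::function<void(const std::vector<int16_t>&, ChainState)>;

// Hypothetical push-to-talk gate, loosely modeled on the PTT rules.
class PttGate {
public:
    explicit PttGate(Emit emit) : emit_(std::move(emit)) {}

    // Audio only passes while the button is held.
    void OnChunkReceived(const std::vector<int16_t>& pcm) {
        if (state_ == TalkingState::Talking) emit_(pcm, ChainState::Processing);
    }

    void Press() {
        if (state_ == TalkingState::Silence) state_ = TalkingState::Talking;
    }

    // Releasing the button ends the utterance: an empty chunk carries Finalizing.
    void Release() {
        if (state_ != TalkingState::Talking) return;
        state_ = TalkingState::Silence;
        emit_(std::vector<int16_t>{}, ChainState::Finalizing);
    }

    // Becoming BLOCKED tells downstream stages to abort (assumption: always emitted).
    void SetBlocked(bool blocked) {
        if (blocked) {
            state_ = TalkingState::Blocked;
            emit_(std::vector<int16_t>{}, ChainState::Discarding);
        } else if (state_ == TalkingState::Blocked) {
            state_ = TalkingState::Silence;
        }
    }

private:
    Emit emit_;
    TalkingState state_ = TalkingState::Silence;
};
```

Note that the recorder side never chooses a state here; the gate is the first stage that turns plain audio into Finalizing or Discarding signals.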


Delegate Types

// STTRecorderBase.h
DECLARE_DELEGATE_ThreeParams(FDelegateUnprocessedChunkReceived, TArray<int16>, FAudioInformation, ESTTChainState);

// STTPreprocessorBase.h
DECLARE_DELEGATE_ThreeParams(FDelegateProcessedChunk, TArray<int16>, FAudioInformation, ESTTChainState);

Module Types

STTRecorder (Public/Recorder/STTRecorderBase.h)

Produces audio chunks from a source (microphone, file, pixel stream). Fires OnChunkReceived delegate with ESTTChainState::Processing. Has no knowledge of speech state.

Implementations: STTRecorderMicrophone (PortAudio), STTRecorderPrimaryMicrophone, STTRecorderUnrealMicrophone, STTRecorderDebugFile, STTRecorderAudioData.

STTPreprocessor (Public/Preprocessor/STTPreprocessorBase.h)

Chained in sequence. Each receives OnChunkReceived and fires OnChunkProcessed to the next stage. Both delegates carry ESTTChainState.

| Class | Role |
| --- | --- |
| STTPreprocessorConverter | Stereo→mono, resample to target rate |
| STTPreprocessorWebRTC | WebRTC APM (echo cancel, noise suppress, AGC) |
| STTPreprocessorSpeexDSP | Speex noise suppression / echo cancel |
| STTPreprocessorPTT | Gates audio by PTT button state; emits Finalizing/Discarding |
| STTPreprocessorVAD | Voice activity detection; emits Finalizing after postroll, Discarding on BLOCKED |
| STTPreprocessorBuffer | Accumulates chunks to a fixed buffer size before forwarding; bDiscardWhenNotFilledFullyOnce drops short utterances |
| STTPreprocessorDebugger | Writes audio passing through it to a WAV file |

STTPreprocessorBuffer, bDiscardWhenNotFilledFullyOnce:
When enabled, if a Finalizing signal arrives before the buffer has dispatched at least one full-size Processing chunk in the current utterance, the buffer forwards Discarding instead of Finalizing. This silently drops very short, accidental utterances so they are never sent to the transcription service.
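The flag's behavior can be sketched as follows. This is a simplified model in standard C++ under stated assumptions: the class ChunkBuffer and its members are hypothetical, the buffer size is assumed to be greater than zero, and on a normal Finalizing the remaining partial block is assumed to be flushed along with the signal.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

enum class ChainState { Processing, Finalizing, Discarding };
using Emit = std::function<void(const std::vector<int16_t>&, ChainState)>;

// Hypothetical fixed-size accumulator; bufferSize must be greater than zero.
class ChunkBuffer {
public:
    ChunkBuffer(std::size_t bufferSize, bool discardWhenNotFilledFullyOnce, Emit emit)
        : size_(bufferSize), discardShort_(discardWhenNotFilledFullyOnce), emit_(std::move(emit)) {}

    void OnChunkReceived(const std::vector<int16_t>& pcm, ChainState state) {
        if (state == ChainState::Discarding) {
            Reset();
            emit_(std::vector<int16_t>{}, ChainState::Discarding);
            return;
        }
        pending_.insert(pending_.end(), pcm.begin(), pcm.end());
        // Forward every complete block as a normal Processing chunk.
        while (pending_.size() >= size_) {
            const auto blockEnd = pending_.begin() + static_cast<std::ptrdiff_t>(size_);
            std::vector<int16_t> block(pending_.begin(), blockEnd);
            pending_.erase(pending_.begin(), blockEnd);
            filledOnce_ = true;
            emit_(block, ChainState::Processing);
        }
        if (state == ChainState::Finalizing) {
            // An utterance that never filled the buffer becomes Discarding.
            const bool drop = discardShort_ && !filledOnce_;
            std::vector<int16_t> rest = drop ? std::vector<int16_t>{} : pending_;
            Reset();
            emit_(rest, drop ? ChainState::Discarding : ChainState::Finalizing);
        }
    }

private:
    void Reset() {
        pending_.clear();
        filledOnce_ = false;
    }

    std::size_t size_;
    bool discardShort_;
    Emit emit_;
    std::vector<int16_t> pending_;
    bool filledOnce_ = false;
};
```

With a buffer size of 4, two samples followed by Finalizing yield a single Discarding, while five samples followed by Finalizing yield one Processing block and then Finalizing with the leftover sample.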

STTProcessor (Public/Processor/STTProcessorBase.h)

Receives the final audio. On Finalizing: trigger transcription. On Discarding: cancel/clear everything.
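A batch-style processor's reaction to the three states might look like the sketch below. It is written in standard C++ with a caller-supplied transcription function standing in for a real backend; BatchProcessor, Transcribe, Results, and BufferedSamples are hypothetical names, not the plugin's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

enum class ChainState { Processing, Finalizing, Discarding };

// Hypothetical batch processor: buffers audio, transcribes on Finalizing.
class BatchProcessor {
public:
    using Transcribe = std::function<std::string(const std::vector<int16_t>&)>;

    explicit BatchProcessor(Transcribe transcribe) : transcribe_(std::move(transcribe)) {}

    void OnChunkReceived(const std::vector<int16_t>& pcm, ChainState state) {
        switch (state) {
        case ChainState::Processing:
            buffered_.insert(buffered_.end(), pcm.begin(), pcm.end());
            break;
        case ChainState::Finalizing:
            buffered_.insert(buffered_.end(), pcm.begin(), pcm.end());
            // Guard against transcribing zero-length audio.
            if (!buffered_.empty()) results_.push_back(transcribe_(buffered_));
            buffered_.clear();
            break;
        case ChainState::Discarding:
            // Abort: drop everything collected so far.
            buffered_.clear();
            break;
        }
    }

    const std::vector<std::string>& Results() const { return results_; }
    std::size_t BufferedSamples() const { return buffered_.size(); }

private:
    Transcribe transcribe_;
    std::vector<int16_t> buffered_;
    std::vector<std::string> results_;
};
```

A streaming backend would forward each Processing chunk immediately instead of buffering it, but the handling of Finalizing and Discarding would follow the same split.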

| Class | Backend |
| --- | --- |
| STTProcessorAzure | Microsoft Azure Cognitive Services (streaming, continuous) |
| STTProcessorWhisper | OpenAI Whisper / GPT-4o Transcribe (batch HTTP) |
| STTParakeetProcessorBase | Local NVIDIA NeMo Parakeet via TCP (JSON protocol) |
| STTProcessorRealtimeAPI | OpenAI Realtime API (forwards audio directly) |
| STTProcessorDebugSaveWav | Saves all received audio to a WAV file |

Configuration

All modules are configured via USTTBaseProcessorConfig (a UObject subclass per processor type). Base settings are in FSTTBaseSettings (Public/STTStructs.h):

  • bUsePTT — Push-to-talk vs. freespeech (VAD) mode
  • bCanInterrupt — Whether user speech can interrupt the avatar
  • FreespeechPostRollTime — Seconds of silence after speech before Finalizing is emitted
  • PTTPostRollTime — Seconds after PTT release before Finalizing (currently unused — PTT emits Finalizing immediately on release)
  • MaxTalkingTime — Hard timeout on PTT press duration
  • VADSettings — Mode, min speech time, min amplitude threshold, speech-while-blocked threshold
  • WebRTCSettings — Echo cancellation, noise suppression, AGC flags
  • SpeexDSPSettings — Speex processing entries
  • STTReplacements — Word replacement pairs applied to final transcription
  • STTSpecialWords — Hints passed to transcription service for uncommon words
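To show how a postroll setting such as FreespeechPostRollTime could interact with an amplitude threshold, here is a rough sketch in standard C++. The struct VadSettings, its field names and types, and the class PostRollDetector are assumptions made for illustration; the real FSTTBaseSettings layout and the plugin's VAD logic are not shown in this document and may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

enum class ChainState { Processing, Finalizing, Discarding };

// Hypothetical subset of the settings listed above.
struct VadSettings {
    double postRollSeconds;  // plays the role of FreespeechPostRollTime
    int minAmplitude;        // a chunk counts as speech if any |sample| >= this
    int sampleRate;          // mono samples per second
};

class PostRollDetector {
public:
    explicit PostRollDetector(const VadSettings& settings)
        : settings_(settings),
          postRollSamples_(static_cast<std::size_t>(settings.postRollSeconds * settings.sampleRate)) {}

    // Returns true and sets outState if the chunk should be forwarded.
    bool Classify(const std::vector<int16_t>& pcm, ChainState& outState) {
        const bool loud = IsSpeech(pcm);
        if (!talking_) {
            if (!loud) return false;  // silence before speech is dropped
            talking_ = true;
        }
        silentSamples_ = loud ? 0 : silentSamples_ + pcm.size();
        if (silentSamples_ >= postRollSamples_) {
            // Last postroll chunk: end of utterance.
            talking_ = false;
            silentSamples_ = 0;
            outState = ChainState::Finalizing;
            return true;
        }
        outState = ChainState::Processing;
        return true;
    }

private:
    bool IsSpeech(const std::vector<int16_t>& pcm) const {
        for (int16_t sample : pcm) {
            if (std::abs(static_cast<int>(sample)) >= settings_.minAmplitude) return true;
        }
        return false;
    }

    VadSettings settings_;
    std::size_t postRollSamples_;
    std::size_t silentSamples_ = 0;
    bool talking_ = false;
};
```

Counting silence in samples rather than accumulating floating-point seconds keeps the threshold comparison exact, which is a design choice of this sketch rather than something the source specifies.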

Key Files

Public/
  STTStructs.h                        — ESTTChainState, ESTTTalkingState, FAudioInformation, FSTTBaseSettings
  STTManagerBase.h/.cpp               — Pipeline assembly, state machine, delegate wiring
  Recorder/STTRecorderBase.h          — FDelegateUnprocessedChunkReceived
  Preprocessor/STTPreprocessorBase.h  — FDelegateProcessedChunk, virtual OnChunkReceived
  Processor/STTProcessorBase.h        — virtual OnChunkReceived, OnTranscriptionResult helpers

Private/
  STTManagerBase.cpp                  — InitSTTManager (BindUFunction chain), UserSpeechStateChanged
  Preprocessor/STTPreprocessorPTT.cpp — Finalizing on SILENCE, Discarding on BLOCKED
  Preprocessor/STTPreprocessorVAD.cpp — Finalizing after postroll, Discarding on BLOCKED
  Preprocessor/STTPreprocessorBuffer.cpp — ChainState-driven flush, bDiscardWhenNotFilledFullyOnce
  Processor/Azure/STTProcessorAzure.cpp       — Streaming Azure recognition
  Processor/Parakeet/STTParakeetProcessorBase.cpp — TCP JSON protocol to Python server
  Processor/Whisper/STTProcessorWhisper.cpp   — Batch HTTP to OpenAI

State Machine (ESTTTalkingState)

Used for UI and for VAD/PTT internal logic only. Processors do NOT subscribe to OnSpeechStateChanged.

SILENCE ──(VAD/PTT detects speech)──▶ TALKING
TALKING ──(VAD postroll / PTT release)──▶ SILENCE   [Finalizing propagates through chain]
TALKING ──(SetBlocked)──▶ BLOCKED                   [Discarding propagates through chain]
BLOCKED ──(SetBlocked false / interrupt)──▶ SILENCE
ANY     ──(transcription complete)──▶ SILENCE

TRANSCRIBING is a transitional state set by Whisper before sending an HTTP request; other processors do not use it.
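The transitions in the diagram can be written as a pure function. This is a sketch in standard C++ that covers only the arrows shown above; the Event enum and the Next function are hypothetical, events that have no arrow from the current state are assumed to leave the state unchanged, and the entry into TRANSCRIBING is left out because the diagram does not specify it.

```cpp
// Stand-in for ESTTTalkingState.
enum class TalkingState { Silence, Talking, Blocked, Transcribing };

// Hypothetical events corresponding to the arrow labels in the diagram.
enum class Event {
    SpeechDetected,        // VAD/PTT detects speech
    SpeechEnded,           // VAD postroll elapsed or PTT released
    Blocked,               // SetBlocked(true)
    Unblocked,             // SetBlocked(false) or interrupt
    TranscriptionComplete  // a processor finished transcribing
};

// Returns the next state; events with no matching arrow keep the current state.
TalkingState Next(TalkingState current, Event event) {
    switch (event) {
    case Event::SpeechDetected:
        return current == TalkingState::Silence ? TalkingState::Talking : current;
    case Event::SpeechEnded:
        return current == TalkingState::Talking ? TalkingState::Silence : current;
    case Event::Blocked:
        // The diagram only shows TALKING -> BLOCKED.
        return current == TalkingState::Talking ? TalkingState::Blocked : current;
    case Event::Unblocked:
        return current == TalkingState::Blocked ? TalkingState::Silence : current;
    case Event::TranscriptionComplete:
        // ANY -> SILENCE, which also covers leaving Transcribing.
        return TalkingState::Silence;
    }
    return current;
}
```

Keeping the transition table free of side effects makes it easy to check each arrow in isolation; emitting Finalizing or Discarding into the chain would happen in the preprocessor that triggers the event.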


Common Pitfalls

  • Pass-through preprocessors must forward Finalizing/Discarding even on empty PCMData. The Converter, SpeexDSP, and WebRTC all have early-return guards for empty/misaligned data. Before returning, these guards check whether ChainState != Processing and, if so, forward the signal downstream so that control signals are not swallowed.
  • PTT emits an empty TArray<int16> with Finalizing. Processors must guard against transcribing zero-length audio (they already do via BufferedPCMData.Num() == 0 checks).
  • Azure runs a background thread (FAzureRunnable). StopRecognition(false) signals a graceful stop; the runnable delivers the final result via OnRecognized/OnRunnableEnded callbacks on the game thread. StopRecognition(true) is a forced abort (used on Discarding).
  • Parakeet communicates over TCP with a local Python server (ParakeetSTT.bat). In editor (bKeepAlive=true) the Python process is kept alive between PIE sessions to avoid restart overhead.
  • BindUFunction matches by string name and delegate parameter types. All OnChunkReceived overrides must have exactly the same signature as the base UFUNCTION or the bind will fail at runtime.
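The first pitfall is easiest to see in code. Below is a minimal sketch in standard C++ of a stereo-to-mono stage with such a guard; StereoToMono and Emit are hypothetical names, the averaging downmix is an assumption, and the plugin's STTPreprocessorConverter may differ in detail.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

enum class ChainState { Processing, Finalizing, Discarding };
using Emit = std::function<void(const std::vector<int16_t>&, ChainState)>;

// Hypothetical converter: averages interleaved L/R samples into mono.
class StereoToMono {
public:
    explicit StereoToMono(Emit emit) : emit_(std::move(emit)) {}

    void OnChunkReceived(const std::vector<int16_t>& pcm, ChainState state) {
        const bool unusable = pcm.empty() || pcm.size() % 2 != 0;
        if (unusable) {
            // Drop the unusable audio, but never swallow a control signal.
            if (state != ChainState::Processing) emit_(std::vector<int16_t>{}, state);
            return;
        }
        std::vector<int16_t> mono;
        mono.reserve(pcm.size() / 2);
        for (std::size_t i = 0; i + 1 < pcm.size(); i += 2) {
            const int sum = static_cast<int>(pcm[i]) + static_cast<int>(pcm[i + 1]);
            mono.push_back(static_cast<int16_t>(sum / 2));
        }
        emit_(mono, state);  // ChainState is forwarded unchanged
    }

private:
    Emit emit_;
};
```

Without the inner check, the empty Finalizing chunk that PTT sends on release would stop at this stage and the processor would never trigger transcription.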