# AvatarCore_STT Plugin

Speech-to-text plugin for Unreal Engine. Audio flows through a linear chain of modules:

```
STTRecorder → [STTPreprocessor, ...] → STTProcessor → transcription result
```

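The chain is assembled in `STTManagerBase::InitSTTManager`, which links each stage to the next using `BindUFunction` with the string name `"OnChunkReceived"`. Because the function is looked up by name and invoked through the UE reflection system, the most-derived override runs, and the manager's bind calls never spell out parameter types. The delegate declarations and every `OnChunkReceived` override must still agree on the signature (see Common Pitfalls).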
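The linking idea can be sketched without any engine headers. This is a minimal illustration, not the plugin's code: `std::function` stands in for the UE delegates, `std::vector<int16_t>` for `TArray<int16>`, `FAudioInformation` is left out, and `FStage`, `FHalveGain`, `FCollectingProcessor`, and `LinkChain` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

enum class ESTTChainState { Processing, Finalizing, Discarding };

using FPcm = std::vector<int16_t>; // stand-in for TArray<int16>
using FChunkHandler = std::function<void(const FPcm&, ESTTChainState)>;

// A stage receives a chunk and hands its output to whatever is bound as Next.
struct FStage {
    FChunkHandler Next;
    virtual ~FStage() = default;
    virtual void OnChunkReceived(const FPcm& Pcm, ESTTChainState State) {
        if (Next) Next(Pcm, State); // default behaviour: pass through unchanged
    }
};

// Hypothetical preprocessor that halves every sample.
struct FHalveGain : FStage {
    void OnChunkReceived(const FPcm& Pcm, ESTTChainState State) override {
        FPcm Out = Pcm;
        for (int16_t& Sample : Out) Sample = static_cast<int16_t>(Sample / 2);
        if (Next) Next(Out, State);
    }
};

// Hypothetical terminal stage that records what reached the end of the chain.
struct FCollectingProcessor : FStage {
    FPcm Received;
    ESTTChainState LastState = ESTTChainState::Processing;
    void OnChunkReceived(const FPcm& Pcm, ESTTChainState State) override {
        Received.insert(Received.end(), Pcm.begin(), Pcm.end());
        LastState = State;
    }
};

// Binds Stages[i]->Next to Stages[i + 1]->OnChunkReceived and returns the
// entry point that a recorder would fire. The virtual call is resolved when
// a chunk arrives, so derived overrides run without the linker knowing them.
inline FChunkHandler LinkChain(const std::vector<FStage*>& Stages) {
    assert(!Stages.empty());
    for (std::size_t i = 0; i + 1 < Stages.size(); ++i) {
        FStage* Target = Stages[i + 1];
        Stages[i]->Next = [Target](const FPcm& P, ESTTChainState S) {
            Target->OnChunkReceived(P, S);
        };
    }
    FStage* First = Stages.front();
    return [First](const FPcm& P, ESTTChainState S) { First->OnChunkReceived(P, S); };
}
```

Adding or removing a preprocessor only changes the list passed to `LinkChain`; no stage needs to know its neighbours.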

---

## ESTTChainState — Pipeline Signal

Every audio chunk carries an `ESTTChainState` (defined in `Public/STTStructs.h`):

| Value | Meaning |
|-------|---------|
| `Processing` | Normal audio — buffer, process, pass through |
| `Finalizing` | End of utterance — flush buffers and trigger transcription |
| `Discarding` | BLOCKED/abort — clear all buffers, cancel in-flight requests |

**Rules:**
- Recorders always emit `Processing` — they have no concept of "final".
- PTT preprocessor emits `Finalizing` when the button is released (→ SILENCE) and `Discarding` when the system becomes BLOCKED.
- VAD preprocessor emits `Finalizing` on the last postroll silence chunk, then calls `UserSpeechStateChanged(SILENCE)` for UI purposes only. It emits `Discarding` from `OnUserSpeechStateChanged` when BLOCKED.
- Pass-through preprocessors (Converter, Debugger, SpeexDSP, WebRTC) forward `ChainState` unchanged. They must pass `Finalizing`/`Discarding` through even when `PCMData` is empty.
- Buffer preprocessor reacts to `ChainState` in-band — no `OnSpeechStateChanged` subscription.
- Processors react to `ChainState` only — no `OnSpeechStateChanged` subscriptions.

**Processors still call `UserSpeechStateChanged`** after transcription completes (for UI state), but they do NOT subscribe to it.
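As an illustration of the PTT rule above, the following engine-free sketch gates recorder audio by button state and emits the two control signals. It is a simplification under stated assumptions: `FPttGateSketch` and its `Press`/`Release`/`Block` methods are hypothetical, `std::function` replaces the delegate, and `FAudioInformation` is omitted.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

enum class ESTTChainState { Processing, Finalizing, Discarding };

using FPcm = std::vector<int16_t>; // stand-in for TArray<int16>

// Hypothetical, simplified PTT gate; not the plugin's STTPreprocessorPTT.
class FPttGateSketch {
public:
    std::function<void(const FPcm&, ESTTChainState)> Next;

    void Press() { bHeld = true; }

    // Button released: signal end of utterance with an empty chunk.
    void Release() {
        if (!bHeld) return;
        bHeld = false;
        Emit(FPcm{}, ESTTChainState::Finalizing);
    }

    // System became BLOCKED: tell downstream stages to drop everything.
    void Block() {
        bHeld = false;
        Emit(FPcm{}, ESTTChainState::Discarding);
    }

    // Recorder audio (always Processing) only passes while the button is held.
    void OnChunkReceived(const FPcm& Pcm, ESTTChainState /*FromRecorder*/) {
        if (bHeld) Emit(Pcm, ESTTChainState::Processing);
    }

private:
    void Emit(const FPcm& Pcm, ESTTChainState State) {
        if (Next) Next(Pcm, State);
    }

    bool bHeld = false;
};
```

Downstream stages never need to know that a button exists; they only see `Processing` chunks followed by one `Finalizing` or `Discarding` signal.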

---

## Delegate Types

```cpp
// STTRecorderBase.h
DECLARE_DELEGATE_ThreeParams(FDelegateUnprocessedChunkReceived, TArray<int16>, FAudioInformation, ESTTChainState);

// STTPreprocessorBase.h
DECLARE_DELEGATE_ThreeParams(FDelegateProcessedChunk, TArray<int16>, FAudioInformation, ESTTChainState);
```

---

## Module Types

### STTRecorder (`Public/Recorder/STTRecorderBase.h`)

Produces audio chunks from a source (microphone, file, pixel stream). Fires the `OnChunkReceived` delegate with `ESTTChainState::Processing`. Has no knowledge of speech state.

Implementations: `STTRecorderMicrophone` (PortAudio), `STTRecorderPrimaryMicrophone`, `STTRecorderUnrealMicrophone`, `STTRecorderDebugFile`, `STTRecorderAudioData`.

### STTPreprocessor (`Public/Preprocessor/STTPreprocessorBase.h`)

Chained in sequence. Each preprocessor receives chunks via `OnChunkReceived` and fires `OnChunkProcessed` to hand its output to the next stage. Both delegates carry `ESTTChainState`.

| Class | Role |
|-------|------|
| `STTPreprocessorConverter` | Stereo→mono, resample to target rate |
| `STTPreprocessorWebRTC` | WebRTC APM (echo cancel, noise suppress, AGC) |
| `STTPreprocessorSpeexDSP` | Speex noise suppression / echo cancel |
| `STTPreprocessorPTT` | Gates audio by PTT button state; emits `Finalizing`/`Discarding` |
| `STTPreprocessorVAD` | Voice activity detection; emits `Finalizing` after postroll, `Discarding` on BLOCKED |
| `STTPreprocessorBuffer` | Accumulates chunks to a fixed buffer size before forwarding; `bDiscardWhenNotFilledFullyOnce` drops short utterances |
| `STTPreprocessorDebugger` | Writes audio passing through it to a WAV file |

**STTPreprocessorBuffer — `bDiscardWhenNotFilledFullyOnce`:**  
When enabled, if a `Finalizing` signal arrives before the buffer has dispatched at least one full-size `Processing` chunk in the current utterance, the buffer forwards `Discarding` instead of `Finalizing`. Very short, accidental utterances are therefore dropped silently and never reach the transcription service.
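A minimal, engine-free sketch of this behaviour is shown below. It is not the plugin's implementation: `FBufferSketch` is a hypothetical name, `std::function` replaces the delegate, and forwarding the leftover partial buffer together with `Finalizing` is an assumption about how the tail is flushed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

enum class ESTTChainState { Processing, Finalizing, Discarding };

using FPcm = std::vector<int16_t>; // stand-in for TArray<int16>

// Hypothetical model of the buffer preprocessor's ChainState handling.
class FBufferSketch {
public:
    FBufferSketch(std::size_t InBufferSize, bool bInDiscardWhenNotFilledFullyOnce)
        : BufferSize(InBufferSize),
          bDiscardWhenNotFilledFullyOnce(bInDiscardWhenNotFilledFullyOnce) {
        assert(BufferSize > 0);
    }

    std::function<void(const FPcm&, ESTTChainState)> Next;

    void OnChunkReceived(const FPcm& Pcm, ESTTChainState State) {
        if (State == ESTTChainState::Discarding) {
            Reset();
            Emit(FPcm{}, ESTTChainState::Discarding);
            return;
        }
        Pending.insert(Pending.end(), Pcm.begin(), Pcm.end());
        // Forward as many full-size chunks as the accumulated audio allows.
        while (Pending.size() >= BufferSize) {
            const auto End = Pending.begin() + static_cast<std::ptrdiff_t>(BufferSize);
            FPcm Full(Pending.begin(), End);
            Pending.erase(Pending.begin(), End);
            bFilledOnce = true;
            Emit(Full, ESTTChainState::Processing);
        }
        if (State == ESTTChainState::Finalizing) {
            // An utterance that never filled the buffer is turned into a discard.
            const bool bTooShort = bDiscardWhenNotFilledFullyOnce && !bFilledOnce;
            FPcm Tail;
            if (!bTooShort) Tail.swap(Pending); // assumption: flush the remainder
            Reset();
            Emit(Tail, bTooShort ? ESTTChainState::Discarding : ESTTChainState::Finalizing);
        }
    }

private:
    void Reset() {
        Pending.clear();
        bFilledOnce = false;
    }
    void Emit(const FPcm& Pcm, ESTTChainState State) {
        if (Next) Next(Pcm, State);
    }

    std::size_t BufferSize;
    bool bDiscardWhenNotFilledFullyOnce;
    bool bFilledOnce = false;
    FPcm Pending;
};
```

The per-utterance flag is reset on both `Finalizing` and `Discarding`, so each new utterance has to fill the buffer at least once on its own.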

### STTProcessor (`Public/Processor/STTProcessorBase.h`)

Receives the final audio. On `Finalizing`: trigger transcription. On `Discarding`: cancel/clear everything.
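The reaction to each signal can be sketched as follows. This is an engine-free illustration under assumptions: `FProcessorSketch` and its `Transcribe`/`Cancel` hooks are hypothetical stand-ins for a real backend, and only the member name `BufferedPCMData` is taken from this document.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

enum class ESTTChainState { Processing, Finalizing, Discarding };

using FPcm = std::vector<int16_t>; // stand-in for TArray<int16>

// Hypothetical processor skeleton driven only by ChainState.
class FProcessorSketch {
public:
    std::function<void(const FPcm&)> Transcribe; // hypothetical backend hook
    std::function<void()> Cancel;                // hypothetical abort hook

    void OnChunkReceived(const FPcm& Pcm, ESTTChainState State) {
        if (State == ESTTChainState::Discarding) {
            // Abort: drop buffered audio and cancel anything in flight.
            BufferedPCMData.clear();
            if (Cancel) Cancel();
            return;
        }
        BufferedPCMData.insert(BufferedPCMData.end(), Pcm.begin(), Pcm.end());
        if (State == ESTTChainState::Finalizing) {
            FPcm Utterance;
            Utterance.swap(BufferedPCMData); // leaves the buffer empty for the next utterance
            // Guard against zero-length audio, e.g. an empty Finalizing chunk.
            if (!Utterance.empty() && Transcribe) Transcribe(Utterance);
        }
    }

    std::size_t BufferedSamples() const { return BufferedPCMData.size(); }

private:
    FPcm BufferedPCMData;
};
```

A streaming backend would forward each `Processing` chunk immediately instead of buffering it; the handling of `Finalizing` and `Discarding` keeps the same shape.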

| Class | Backend |
|-------|---------|
| `STTProcessorAzure` | Microsoft Azure Cognitive Services (streaming, continuous) |
| `STTProcessorWhisper` | OpenAI Whisper / GPT-4o Transcribe (batch HTTP) |
| `STTParakeetProcessorBase` | Local NVIDIA NeMo Parakeet via TCP (JSON protocol) |
| `STTProcessorRealtimeAPI` | OpenAI Realtime API (forwards audio directly) |
| `STTProcessorDebugSaveWav` | Saves all received audio to a WAV file |

---

## Configuration

All modules are configured via `USTTBaseProcessorConfig` (a UObject subclass per processor type). Base settings are in `FSTTBaseSettings` (`Public/STTStructs.h`):

- `bUsePTT` — Push-to-talk vs. freespeech (VAD) mode
- `bCanInterrupt` — Whether user speech can interrupt the avatar
- `FreespeechPostRollTime` — Seconds of silence after speech before `Finalizing` is emitted
- `PTTPostRollTime` — Seconds after PTT release before `Finalizing` (currently unused — PTT emits Finalizing immediately on release)
- `MaxTalkingTime` — Hard timeout on PTT press duration
- `VADSettings` — Mode, min speech time, min amplitude threshold, speech-while-blocked threshold
- `WebRTCSettings` — Echo cancellation, noise suppression, AGC flags
- `SpeexDSPSettings` — Speex processing entries
- `STTReplacements` — Word replacement pairs applied to final transcription
- `STTSpecialWords` — Hints passed to transcription service for uncommon words
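To make the `STTReplacements` entry concrete, here is a small standalone sketch of applying replacement pairs to a finished transcription. The matching rules are assumptions, not the plugin's documented behaviour: pairs are applied in order, matching is case-sensitive, and plain substring search is used (so a pair such as `hal` would also match inside `shall`). `ApplyReplacements` is a hypothetical helper, and `std::string` stands in for `FString`.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using FReplacement = std::pair<std::string, std::string>; // {search, replace-with}

// Applies each pair in order to the whole text and returns the result.
inline std::string ApplyReplacements(std::string Text,
                                     const std::vector<FReplacement>& Pairs) {
    for (const FReplacement& Pair : Pairs) {
        const std::string& From = Pair.first;
        const std::string& To = Pair.second;
        if (From.empty()) continue; // an empty search term would never advance
        std::size_t Pos = 0;
        while ((Pos = Text.find(From, Pos)) != std::string::npos) {
            Text.replace(Pos, From.size(), To);
            Pos += To.size(); // continue after the inserted text
        }
    }
    return Text;
}
```

Advancing past the inserted text prevents a pair whose replacement contains its own search term from looping forever.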

---

## Key Files

```
Public/
  STTStructs.h                        — ESTTChainState, ESTTTalkingState, FAudioInformation, FSTTBaseSettings
  STTManagerBase.h/.cpp               — Pipeline assembly, state machine, delegate wiring
  Recorder/STTRecorderBase.h          — FDelegateUnprocessedChunkReceived
  Preprocessor/STTPreprocessorBase.h  — FDelegateProcessedChunk, virtual OnChunkReceived
  Processor/STTProcessorBase.h        — virtual OnChunkReceived, OnTranscriptionResult helpers

Private/
  STTManagerBase.cpp                  — InitSTTManager (BindUFunction chain), UserSpeechStateChanged
  Preprocessor/STTPreprocessorPTT.cpp — Finalizing on SILENCE, Discarding on BLOCKED
  Preprocessor/STTPreprocessorVAD.cpp — Finalizing after postroll, Discarding on BLOCKED
  Preprocessor/STTPreprocessorBuffer.cpp — ChainState-driven flush, bDiscardWhenNotFilledFullyOnce
  Processor/Azure/STTProcessorAzure.cpp       — Streaming Azure recognition
  Processor/Parakeet/STTParakeetProcessorBase.cpp — TCP JSON protocol to Python server
  Processor/Whisper/STTProcessorWhisper.cpp   — Batch HTTP to OpenAI
```

---

## State Machine (ESTTTalkingState)

Used for UI and for VAD/PTT internal logic only. Processors do NOT subscribe to `OnSpeechStateChanged`.

```
SILENCE ──(VAD/PTT detects speech)──▶ TALKING
TALKING ──(VAD postroll / PTT release)──▶ SILENCE   [Finalizing propagates through chain]
TALKING ──(SetBlocked)──▶ BLOCKED                   [Discarding propagates through chain]
BLOCKED ──(SetBlocked false / interrupt)──▶ SILENCE
ANY     ──(transcription complete)──▶ SILENCE
```

`TRANSCRIBING` is a transitional state set by Whisper before sending an HTTP request; other processors do not use it.
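The transitions in the diagram can be written as a small pure function. This is a sketch under assumptions: the `ESttEvent` enum and both function names are hypothetical, the enumerator order of `ESTTTalkingState` is not taken from the source, and any state/event pair that the diagram does not list is treated as a no-op.

```cpp
enum class ESTTChainState { Processing, Finalizing, Discarding };
enum class ESTTTalkingState { SILENCE, TALKING, BLOCKED, TRANSCRIBING };

// Hypothetical event names for the arrows in the diagram.
enum class ESttEvent { SpeechDetected, SpeechEnded, Blocked, Unblocked, TranscriptionComplete };

// Returns the state after Event; unlisted combinations keep the current state.
inline ESTTTalkingState NextState(ESTTTalkingState Current, ESttEvent Event) {
    switch (Event) {
    case ESttEvent::SpeechDetected:
        return Current == ESTTTalkingState::SILENCE ? ESTTTalkingState::TALKING : Current;
    case ESttEvent::SpeechEnded: // VAD postroll elapsed or PTT released
        return Current == ESTTTalkingState::TALKING ? ESTTTalkingState::SILENCE : Current;
    case ESttEvent::Blocked:
        return Current == ESTTTalkingState::TALKING ? ESTTTalkingState::BLOCKED : Current;
    case ESttEvent::Unblocked:
        return Current == ESTTTalkingState::BLOCKED ? ESTTTalkingState::SILENCE : Current;
    case ESttEvent::TranscriptionComplete:
        return ESTTTalkingState::SILENCE; // allowed from any state
    }
    return Current;
}

// Reports which in-band signal, if any, accompanies a transition.
// Returns false when the transition sends nothing through the chain.
inline bool ChainSignalFor(ESTTTalkingState Current, ESttEvent Event, ESTTChainState& OutSignal) {
    if (Current != ESTTTalkingState::TALKING) return false;
    if (Event == ESttEvent::SpeechEnded) {
        OutSignal = ESTTChainState::Finalizing;
        return true;
    }
    if (Event == ESttEvent::Blocked) {
        OutSignal = ESTTChainState::Discarding;
        return true;
    }
    return false;
}
```

Keeping the talking state and the chain signal in separate functions mirrors the split described above: the state serves the UI and the VAD/PTT logic, while processors only ever see the signal.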

---

## Common Pitfalls

- **Pass-through preprocessors must forward `Finalizing`/`Discarding` even on empty `PCMData`.** The Converter, SpeexDSP, and WebRTC all have early-return guards for empty or misaligned data. Each guard drops the chunk only when `ChainState` is `Processing`; a `Finalizing` or `Discarding` signal is still forwarded so control signals are not swallowed.
- **PTT emits an empty `TArray<int16>` with `Finalizing`.** Processors must guard against transcribing zero-length audio (they already do via `BufferedPCMData.Num() == 0` checks).
- **Azure runs a background thread (`FAzureRunnable`).** `StopRecognition(false)` signals a graceful stop; the runnable delivers the final result via `OnRecognized`/`OnRunnableEnded` callbacks on the game thread. `StopRecognition(true)` is a forced abort (used on `Discarding`).
- **Parakeet communicates over TCP with a local Python server** (`ParakeetSTT.bat`). In the editor (`bKeepAlive=true`) the Python process is kept alive between PIE sessions to avoid restart overhead.
- **`BindUFunction` looks the function up by string name, so neither a typo nor a signature mismatch is caught at compile time.** Every `OnChunkReceived` override must have exactly the same signature as the base UFUNCTION and match the delegate's parameter types; otherwise the problem only surfaces at runtime.
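
The first pitfall, the early-return guard in a pass-through stage, can be sketched with a stereo-to-mono step. This is an engine-free illustration, not the Converter's actual code: `FStereoToMonoSketch` is a hypothetical name, `std::function` replaces the delegate, and averaging the two channels is an assumed downmix.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

enum class ESTTChainState { Processing, Finalizing, Discarding };

using FPcm = std::vector<int16_t>; // stand-in for TArray<int16>

// Hypothetical pass-through stage: interleaved stereo in, mono out.
struct FStereoToMonoSketch {
    std::function<void(const FPcm&, ESTTChainState)> Next;

    void OnChunkReceived(const FPcm& Pcm, ESTTChainState State) {
        const bool bUnusable = Pcm.empty() || Pcm.size() % 2 != 0;
        if (bUnusable) {
            // Nothing to convert. Drop plain audio, but never swallow a
            // control signal: forward it with an empty payload.
            if (State != ESTTChainState::Processing && Next) Next(FPcm{}, State);
            return;
        }
        FPcm Mono;
        Mono.reserve(Pcm.size() / 2);
        for (std::size_t i = 0; i + 1 < Pcm.size(); i += 2) {
            const int32_t Sum = static_cast<int32_t>(Pcm[i]) + Pcm[i + 1];
            Mono.push_back(static_cast<int16_t>(Sum / 2));
        }
        if (Next) Next(Mono, State);
    }
};
```

Without the `State != Processing` check, an empty `Finalizing` chunk from the PTT preprocessor would stop at this stage and the processor would never start transcription.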