The audio pipeline

From microphone to PAD score in under 100 milliseconds.

Every FluentPlay game shares the same browser-side audio engine. Your microphone feeds a real-time analysis pipeline that classifies every audio frame (about 60 per second) into a feature stream, and that stream feeds the PAD scorer. No audio is recorded. Nothing leaves the browser except one cloud call for phoneme-level pronunciation assessment.

🎙 Microphone (browser capture) → Analysis Engine (real-time · 60 fps) → DFS (Disfluency Feature Stream) → PAD Scorer (per-syllable PAD score) → PAD Score (output · per syllable)
01 · Live playback

Watch the pipeline process a real utterance.

[Interactive demo, pre-recorded and illustrative. Panels: Waveform; DFS state (Silent, Building, Voiced, Delayed-onset); Features: RMS peak (intensity), Onsets (count), Voiced (ms), Delayed-onset (flag); PAD score (per syllable).]
02 · What the pipeline produces

Four features per frame. Each one feeds a PAD component.

01 / 04
RMS Intensity

Root-mean-square energy of the audio frame. Tracks how much acoustic energy the speaker is producing. Drops during delayed-onset events, spikes during forced articulation. A stable RMS across syllables is a smoothness signal.

Feeds → A (Acoustic component)
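
As a rough illustration of the definition above, root-mean-square energy for a single frame can be computed in a few lines. This is a minimal sketch, not the FluentPlay engine: the function name `frameRms` and the assumption that samples arrive as a `Float32Array` normalized to [-1, 1] are ours.

```typescript
// Sketch only: `frameRms` is a hypothetical helper, not FluentPlay's engine code.
// Assumes one frame of mono samples normalized to the range [-1, 1].
function frameRms(samples: Float32Array): number {
  if (samples.length === 0) return 0; // avoid dividing by zero on an empty frame
  const sumOfSquares = samples.reduce((acc, s) => acc + s * s, 0);
  return Math.sqrt(sumOfSquares / samples.length);
}

const silentFrame = new Float32Array(4); // all zeros
const steadyFrame = new Float32Array([0.5, -0.5, 0.5, -0.5]);
console.log(frameRms(silentFrame)); // 0
console.log(frameRms(steadyFrame)); // 0.5
```

How the Acoustic component consumes this value is not described here; the sketch stops at the per-frame measurement.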
02 / 04
Onset Count

Cumulative count of voiced onsets detected in the session. Each time the pipeline sees a transition from silent or building to voiced, the counter increments. Several onsets within the same syllable window indicate a repeated-onset event.

Feeds → P (Prediction component)
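
The counting rule above can be sketched as a pass over a sequence of frame states. The `DfsState` type and `countOnsets` function are hypothetical names for illustration, and the sketch counts only the two transitions the text names (silent to voiced, building to voiced).

```typescript
// Sketch only: names are illustrative, not the shipped API.
type DfsState = "silent" | "building" | "voiced" | "delayed-onset";

// Counts transitions from "silent" or "building" into "voiced".
function countOnsets(states: readonly DfsState[]): number {
  let onsets = 0;
  for (let i = 1; i < states.length; i++) {
    const prev = states[i - 1];
    const curr = states[i];
    if (curr === "voiced" && (prev === "silent" || prev === "building")) {
      onsets++;
    }
  }
  return onsets;
}

const session: DfsState[] = ["silent", "building", "voiced", "voiced", "silent", "voiced"];
console.log(countOnsets(session)); // 2
```

Grouping onsets by syllable window, which is what would surface a repeated-onset event, needs syllable boundaries that this sketch does not model.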
03 / 04
Voiced Duration

Cumulative time spent in the voiced state, in milliseconds. The ratio of voiced time to total time tracks productive speech output. Long stretches in the building state that never reach voiced indicate motor-planning stalls in the pre-articulatory window.

Feeds → G (Gate component)
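
A minimal sketch of the accumulation described above, assuming each frame carries a state label and a duration in milliseconds. The `TimedFrame` shape and `voicedStats` function are hypothetical, introduced only for this example.

```typescript
// Sketch only: the frame shape is an assumption, not the engine's data model.
interface TimedFrame {
  state: "silent" | "building" | "voiced" | "delayed-onset";
  durationMs: number;
}

// Accumulates voiced time and total time, then derives the voiced-to-total ratio.
function voicedStats(frames: readonly TimedFrame[]): {
  voicedMs: number;
  totalMs: number;
  voicedRatio: number;
} {
  let voicedMs = 0;
  let totalMs = 0;
  for (const frame of frames) {
    totalMs += frame.durationMs;
    if (frame.state === "voiced") voicedMs += frame.durationMs;
  }
  // Report a ratio of 0 for an empty session instead of dividing by zero.
  return { voicedMs, totalMs, voicedRatio: totalMs > 0 ? voicedMs / totalMs : 0 };
}

const stats = voicedStats([
  { state: "building", durationMs: 10 },
  { state: "voiced", durationMs: 10 },
  { state: "voiced", durationMs: 10 },
  { state: "voiced", durationMs: 10 },
]);
console.log(stats.voicedMs, stats.totalMs, stats.voicedRatio); // 30 40 0.75
```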
04 / 04
Delayed-onset Flag

Binary flag that fires when the pipeline detects a sustained building state exceeding a duration threshold. The state is not silence — it is active motor effort without articulation. When the flag fires, the PAD scorer weights that syllable window accordingly.

Feeds → λ (attenuation parameter)
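
One way to picture the flag is a running timer that grows while the state stays in building and resets on any other state. The sketch below assumes a fixed frame duration and takes the threshold as a parameter, because the actual threshold value is not given here; `delayedOnsetFlags` is a hypothetical name.

```typescript
// Sketch only: threshold and frame duration are caller-supplied assumptions.
// Returns one boolean per frame: true while the current unbroken run of
// "building" frames has lasted longer than thresholdMs.
function delayedOnsetFlags(
  states: readonly ("silent" | "building" | "voiced")[],
  frameMs: number,
  thresholdMs: number,
): boolean[] {
  let buildingMs = 0;
  return states.map((state) => {
    // Extend the run while building; any other state resets it.
    buildingMs = state === "building" ? buildingMs + frameMs : 0;
    return buildingMs > thresholdMs;
  });
}

const flags = delayedOnsetFlags(
  ["silent", "building", "building", "building", "voiced"],
  10, // each frame lasts 10 ms in this example
  15, // illustrative threshold, not the product's value
);
console.log(flags); // [ false, false, true, true, false ]
```

How the scorer then weights the flagged syllable window through λ is outside what this sketch covers.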
03 · Architecture

Design constraints that are features.

🌐
Browser-native

Runs in any modern browser. No install, no plugin, no app store. Single-file HTML deployments via Netlify Drop.

🔒
No audio recorded

Audio is analyzed in real time and discarded frame by frame. Nothing is stored. Nothing leaves the device except one cloud call for phoneme-level pronunciation assessment.

Sub-100ms latency

From speech onset to scored feature output in under 100 milliseconds. Frame rate of ~60 fps. Fast enough for real-time visual feedback during practice.
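
For a sense of scale, the two figures above can be related with simple arithmetic: at 60 frames per second a frame lasts about 16.7 ms, so a 100 ms budget spans six whole frames. The helper `framesWithinBudget` is a hypothetical name, and nothing here implies the engine's latency is determined by frame count alone.

```typescript
// Sketch only: plain arithmetic on the figures quoted above.
function framesWithinBudget(budgetMs: number, framesPerSecond: number): number {
  // Multiply before dividing so the 100 ms / 60 fps case stays exact.
  return Math.floor((budgetMs * framesPerSecond) / 1000);
}

const framePeriodMs = 1000 / 60;
console.log(framePeriodMs.toFixed(1)); // "16.7"
console.log(framesWithinBudget(100, 60)); // 6
```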

🔌
Separable layers

The audio pipeline and the PAD scorer are architecturally independent. License the pipeline, the scoring framework, or the full integrated stack.

The full picture

The first platform built around the pre-articulatory window.

The audio pipeline captures pre-articulatory timing instability in real time. PAD scores it per syllable. Every game in the FluentPlay library ships with both layers hardwired in. The pipeline and the scoring framework are architecturally independent — licensable separately or as an integrated stack. Patent pending under U.S. Provisional 64/016,001.

Request a meeting

Tell us what you're working on.

Whether you're evaluating the audio pipeline, the PAD scoring framework, or the full integrated stack — describe your use case and we'll schedule a founder call.