The audio pipeline

From microphone to PAD score in under 100 milliseconds.

Every FluentPlay game shares the same browser-side audio engine. Your microphone feeds a real-time analysis pipeline that classifies every audio frame, about 60 per second, and emits the results as a feature stream. That stream feeds the PAD scorer. No audio is recorded. Nothing leaves the browser except one cloud call for phoneme-level pronunciation assessment.

🎙 Microphone (browser capture) → Analysis Engine (real-time · 60 fps) → DFS (Disfluency Feature Stream) → PAD Scorer (per-syllable D score) → D Score (output · per syllable)
01 · Live playback

Watch the pipeline process a real utterance.

[Interactive demo · pre-recorded, illustrative. Panels: Waveform · DFS state (Silent, Building, Voiced, Block) · Features: RMS peak (intensity), Onsets (count), Voiced (ms), Block (flag) · PAD score (D value)]
02 · What the pipeline produces

Four features per frame. Each one feeds a PAD component.

01 / 04
RMS Intensity

Root-mean-square energy of the audio frame. Tracks how much acoustic energy the speaker is producing. Drops during blocks, spikes during forced articulation. A stable RMS across syllables is a fluency signal.

Feeds → A (Acoustic component)
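
As a rough illustration of the RMS feature described above, here is a minimal TypeScript sketch. The function name `frameRms` is hypothetical, and the assumption that samples arrive as floats in the range -1 to 1 (the format the Web Audio API uses) is ours; nothing here is taken from FluentPlay's actual implementation.

```typescript
// Hypothetical helper: root-mean-square energy of one audio frame.
// Assumes samples are floats in [-1, 1].
function frameRms(samples: Float32Array): number {
  if (samples.length === 0) {
    return 0; // an empty frame carries no energy
  }
  // Sum of squared sample values across the frame.
  const sumSquares = samples.reduce((acc, s) => acc + s * s, 0);
  // Mean of the squares, then the square root.
  return Math.sqrt(sumSquares / samples.length);
}
```

A frame of digital silence yields 0, and louder frames yield values closer to 1, so comparing the result across syllables gives one simple way to see whether intensity is stable.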
02 / 04
Onset Count

Cumulative count of voiced onsets detected in the session. Each time the pipeline sees a transition from silent or building to voiced, the counter increments. Repeated onsets on the same syllable window signal repetition-type disfluency.

Feeds → P (Prediction component)
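
The onset rule above can be sketched as a small state-transition counter. This is an illustration only: `DfsState` and `countOnsets` are hypothetical names, the state labels are borrowed from the page's DFS state legend, and treating the stream as starting from silence is our assumption.

```typescript
// State labels borrowed from the DFS state legend: Silent, Building, Voiced, Block.
type DfsState = "silent" | "building" | "voiced" | "block";

// Hypothetical helper: counts transitions from silent or building into voiced.
// Assumption: the stream is treated as starting from silence.
function countOnsets(states: DfsState[]): number {
  let onsets = 0;
  let previous: DfsState = "silent";
  for (const current of states) {
    const cameFromRest = previous === "silent" || previous === "building";
    if (current === "voiced" && cameFromRest) {
      onsets += 1; // a new voiced onset
    }
    previous = current;
  }
  return onsets;
}
```

Consecutive voiced frames count as one onset, so the counter only moves when voicing restarts. Several restarts inside one syllable window are what the text above calls repetition-type disfluency.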
03 / 04
Voiced Duration

Cumulative time spent in the voiced state, in milliseconds. The ratio of voiced-to-total time tracks productive speech output. Long stretches in the building state with no voiced frames indicate motor-planning stalls: the signal that precedes the block.

Feeds → G (Gate component)
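
A minimal sketch of the voiced-duration bookkeeping described above, assuming each frame carries one state label and lasts a fixed number of milliseconds. The names `VoicedTotals` and `voicedTotals` are hypothetical, and the default frame length of 1000 / 60 ms simply mirrors the stated ~60 fps rate.

```typescript
// State labels borrowed from the DFS state legend: Silent, Building, Voiced, Block.
type DfsState = "silent" | "building" | "voiced" | "block";

interface VoicedTotals {
  voicedMs: number;    // cumulative time in the voiced state
  totalMs: number;     // cumulative time across all states
  voicedRatio: number; // voicedMs / totalMs, or 0 for an empty stream
}

// Hypothetical helper. Assumes every frame has the same duration.
function voicedTotals(states: DfsState[], frameMs: number = 1000 / 60): VoicedTotals {
  const voicedFrames = states.filter((s) => s === "voiced").length;
  const voicedMs = voicedFrames * frameMs;
  const totalMs = states.length * frameMs;
  // Guard against dividing by zero when no frames have arrived yet.
  const voicedRatio = totalMs === 0 ? 0 : voicedMs / totalMs;
  return { voicedMs, totalMs, voicedRatio };
}
```

A stream that spends half its frames voiced returns a ratio of 0.5; a falling ratio alongside long building runs is the pattern the text associates with motor-planning stalls.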
04 / 04
Block Flag

Binary flag that fires when the pipeline detects a sustained building state exceeding a duration threshold. A block is not a silence — it is active motor effort without articulation. When the flag fires, the PAD scorer weights that syllable window accordingly.

Feeds → λ (attenuation parameter)
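
One way to picture the block flag is as a run-length check on the building state. The sketch below is an assumption-laden illustration: `FrameState`, `blockFlags`, and the `thresholdMs` parameter are hypothetical, and the page does not state the real duration threshold.

```typescript
// Per-frame states before block detection is applied.
type FrameState = "silent" | "building" | "voiced";

// Hypothetical helper: returns one flag per frame. A flag is true once the
// current unbroken run of building frames has lasted longer than thresholdMs.
// thresholdMs is a placeholder; the real threshold is not given in the source.
function blockFlags(states: FrameState[], frameMs: number, thresholdMs: number): boolean[] {
  let buildingRunMs = 0;
  return states.map((state) => {
    // Any non-building frame (silent or voiced) resets the run.
    buildingRunMs = state === "building" ? buildingRunMs + frameMs : 0;
    return buildingRunMs > thresholdMs;
  });
}
```

Because silent frames reset the run, a plain pause never raises the flag, which matches the distinction above between a block and a silence.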
03 · Architecture

Design constraints that are features.

🌐
Browser-native

Runs in any modern browser. No install, no plugin, no app store. Single-file HTML deployments via Netlify Drop.

🔒
No audio recorded

Audio is analyzed in real time and discarded frame by frame. Nothing is stored. Nothing leaves the device except one cloud call for phoneme-level pronunciation assessment.

Sub-100ms latency

From speech onset to scored feature output in under 100 milliseconds. Frame rate of ~60 fps. Fast enough for real-time visual feedback during practice.
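
To relate the two figures above, the short sketch below works out the frame arithmetic. The helper names are hypothetical, and the calculation says nothing about how the pipeline actually spends its latency budget; it only shows how many frames fit inside it.

```typescript
// Hypothetical helper: duration of one frame in milliseconds.
function frameDurationMs(fps: number): number {
  return 1000 / fps;
}

// Hypothetical helper: whole frames that fit inside a latency budget.
function framesWithinBudget(fps: number, budgetMs: number): number {
  return Math.floor((budgetMs * fps) / 1000);
}
```

At 60 fps each frame lasts about 16.7 ms, so a 100 ms budget spans six frames.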

🔌
Separable layers

The audio pipeline and the PAD scorer are architecturally independent. License the pipeline, the scoring framework, or the full integrated stack.

The full picture

The first platform to measure what happens before the block.

The audio pipeline captures pre-articulatory timing instability in real time. PAD scores it per syllable. Every game in the FluentPlay library ships with both layers hardwired in. The pipeline and the scoring framework are architecturally independent — licensable separately or as an integrated stack. Patent pending under U.S. Provisional 64/016,001.

Request a meeting

Tell us what you're working on.

Whether you're evaluating the audio pipeline, the PAD scoring framework, or the full integrated stack — describe your use case and we'll schedule a founder call.