Building a Vocal Analysis Pipeline to Replace My Ear
I have a perforated right eardrum — about 50% hearing loss on that side. When I started learning to sing in 2025, I couldn't reliably tell whether I was hitting the notes.
So I built a tool that could.
The problem
When you're learning to sing, the feedback loop is everything: you sing, you hear, you adjust. But when your ear is damaged, that loop is broken.
A vocal coach can tell you what they hear in a lesson, but they're not there for every take at 11 PM in your bedroom.
I needed an objective replacement for the ear I couldn't trust. Not a pitch tuner — those show you one note at a time. I needed a system that could evaluate an entire performance the way a producer would: pitch accuracy, yes, but also tone, dynamics, phrasing, consistency, and presence.
What I built
vocal-tools is a Python pipeline that analyzes recordings across 70+ metrics and produces a composite score (0–100) on my internal rubric, calibrated against a reference set of professional recordings. It's built on NumPy, SciPy, parselmouth, and librosa — no ML models, no black boxes. Every metric is deterministic and reproducible.
The core command is simple:
$ python vocal_take_analyzer.py take.wav --outdir analysis/ --style mix
It produces a full report: composite score, per-metric breakdowns, phrase-level data, and detailed JSON output for downstream processing.
The scoring system
The composite score uses sigmoid-normalized metrics with weighted category aggregation across five areas:
- Pitch control (~30%): intonation deviation (cents), pitch stability, vibrato regularity, register consistency
- Tonal quality (~25%): tone clarity, formant positioning (F1/F2), singer's-formant proxy ("ring"), harmonic-to-noise ratio
- Dynamic expression (~20%): phrase-level dynamic range, verse–chorus arc, micro-dynamics within phrases
- Vocal presence (~15%): projection/forward-placement proxy via roughly 2.5–3.5 kHz energy consistency
- Technical execution (~10%): onset quality, breath management proxies, consonant clarity, phrase endings
Each metric passes through a sigmoid curve with a calibrated midpoint per style, compressing outliers and centering scoring around the discriminative range. In my rubric, 85+ is competent amateur work. 90+ is semi-professional. The pitch-corrected commercial reference track I calibrated against scores 91.2.
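The exact midpoints, steepnesses, and weights are calibrated per style inside the tool. The shape of the idea, with hypothetical calibration values and a made-up function name, looks like this:

```python
import math

def sigmoid_score(value, midpoint, steepness, invert=False):
    """Map a raw metric onto 0-100 with a calibrated midpoint.

    invert=True is for metrics where lower is better
    (e.g. intonation deviation in cents). At the midpoint the
    score is exactly 50; outliers are compressed toward 0/100.
    """
    x = steepness * (value - midpoint)
    if invert:
        x = -x
    return 100.0 / (1.0 + math.exp(-x))

# Hypothetical per-style calibration: metric -> (midpoint, steepness, invert)
CALIBRATION = {
    "intonation_cents": (18.0, 0.25, True),   # lower deviation is better
    "presence_ratio":   (0.09, 60.0, False),  # higher ring proxy is better
}

# Category weights from the rubric above
CATEGORY_WEIGHTS = {"pitch": 0.30, "tone": 0.25, "dynamics": 0.20,
                    "presence": 0.15, "technique": 0.10}

def composite(category_scores):
    """Weighted aggregation of per-category scores (each already 0-100)."""
    return sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items())
```

The `invert` flag matters: half the metrics (deviation, jitter) improve downward, the other half upward, and both need to land on the same 0–100 scale before aggregation.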
Signal processing details
The pipeline processes audio in stages:
- Pitch tracking: frame-by-frame F0 estimation via Praat's cross-correlation method (parselmouth, 10 ms hop), plus voiced/unvoiced classification.
- Phrase detection: segmentation using silence gaps (≥300 ms below −45 dBFS) and voiced boundaries.
- Formant analysis: LPC-based estimates (F1–F4) and a singer's-formant proxy (roughly 2.5–3.5 kHz energy relative to lower bands). This often tracks the "ring" and forward placement that trained singers develop.
- Register indicator: a heuristic proxy from spectral shape, used to flag where I tend to lose pitch control (mix range for my voice).
- Composite scoring: sigmoid aggregation so improvements matter more in the problem zone than near perfection.
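Of these stages, phrase detection is the simplest to show in full. A minimal sketch of silence-gap segmentation using the thresholds above (the function name and framing details are mine, not the tool's):

```python
import numpy as np

def detect_phrases(samples, sr, silence_db=-45.0, min_gap_s=0.3,
                   frame_s=0.01):
    """Segment mono audio into phrases using silence gaps.

    A frame is 'silent' when its RMS level falls below silence_db
    (dBFS, full scale = 1.0); a run of silent frames lasting at
    least min_gap_s ends a phrase. Returns (start_s, end_s) tuples.
    """
    frame = int(sr * frame_s)
    n = len(samples) // frame
    # Frame-wise RMS in dBFS, floored to avoid log(0) on pure silence
    rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    db = 20 * np.log10(np.maximum(rms, 1e-10))
    loud = db >= silence_db

    min_gap = int(min_gap_s / frame_s)
    phrases, start, gap = [], None, 0
    for i, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = i          # phrase begins at first loud frame
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:     # silence long enough: close the phrase
                phrases.append((start * frame_s, (i - gap + 1) * frame_s))
                start, gap = None, 0
    if start is not None:          # audio ended mid-phrase
        phrases.append((start * frame_s, n * frame_s))
    return phrases
```

Everything downstream — per-phrase scores, comp selection, phrase-ending metrics — hangs off these boundaries, so getting the gap threshold right matters more than it looks.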
What the data taught me
The biggest insight had nothing to do with code.
Monitoring was the biggest pitch lever
My early takes averaged more than 25 cents of intonation deviation — objectively rough. I spent weeks working on technique. The number barely moved.
Then I changed my monitoring setup: I replaced headphones with a Forbrain bone-conduction headset (reducing reliance on the damaged eardrum for monitoring), and added a little reverb to make pitch relationships easier to perceive.
My intonation dropped to 14–15 cents overnight.
For me, the bottleneck wasn't talent. It was that I couldn't hear myself accurately.
The drone breakthrough
With a full backing track, my pitch scattered — too many harmonic layers for one good ear to track cleanly.
So I stripped everything away and sang with a single B3 sine-wave drone (root note) generated in REAPER.
Singing with nothing but the drone, my intonation matched my best full-mix takes: 15.3 cents. One tone was enough.
That made it clear: pitch accuracy, in my case, was about reference signal clarity as much as it was about technique.
Single takes beat comps
I built a companion tool (vocal_comp_builder.py) that selects the best phrases from multiple takes and stitches them together. The greedy algorithm picks the highest-scoring segment at each phrase boundary.
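The selection logic is nothing exotic: an argmax per phrase slot. A sketch, assuming all takes have already been aligned to shared phrase boundaries (the data layout here is illustrative, not the tool's actual format):

```python
def build_comp(takes):
    """Greedy comp: for each phrase slot, pick the highest-scoring take.

    takes: dict mapping take name -> list of per-phrase scores,
           with every take aligned to the same phrase boundaries.
    Returns (selection, scores): the take chosen for each phrase,
    and that phrase's score.
    """
    names = list(takes)
    n_phrases = len(takes[names[0]])
    selection, scores = [], []
    for i in range(n_phrases):
        best = max(names, key=lambda name: takes[name][i])
        selection.append(best)
        scores.append(takes[best][i])
    return selection, scores
```

Note what's missing: there is no term rewarding staying on the same take across adjacent phrases. Each slot is optimized in isolation.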
The automated comp scored 88.4. The best single take scored 88.1 — nearly identical — and the comp sounded worse.
The algorithm switched takes in 31 of 43 phrases. The result was technically proficient, but emotionally flat: it optimized per-phrase metrics while destroying continuity.
A single committed performance with light correction consistently outperformed the technically perfect comp.
Formant targets
The singer's-formant proxy became the most diagnostic single metric in my data.
My best raw take (91.5) had a presence score of 100 and a singer's-formant proxy of 0.120. Takes where I pulled back and listened carefully: 0.058. Takes where I opened up and projected: 0.082–0.086.
For this song and this vocal shape, my best takes cluster around:
- F1 ~ 630 Hz (jaw opening proxy)
- F2 ~ 1790 Hz (tongue position proxy)
- singer's-formant proxy > 0.12
These aren't abstractions. They're measurements I check after every take.
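The proxy itself is cheap to approximate. The pipeline's exact formula isn't reproduced here, but one plausible definition — ring-band energy relative to everything below it — fits the numbers' scale:

```python
import numpy as np

def presence_proxy(samples, sr, ring_band=(2500.0, 3500.0)):
    """Singer's-formant proxy: spectral energy in ~2.5-3.5 kHz
    relative to the energy below that band.

    This is one reasonable definition, not necessarily the
    pipeline's; absolute values depend on the exact bands chosen.
    """
    spec = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sr)
    ring = spec[(freqs >= ring_band[0]) & (freqs < ring_band[1])].sum()
    below = spec[freqs < ring_band[0]].sum()
    return float(ring / max(below, 1e-12))
```

A voice with real "ring" concentrates a visible bump of energy in that band; a pulled-back, careful take doesn't, which is exactly the 0.12-versus-0.058 split above.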
Consonant flicking
After weeks of stalling around 88, I found a mechanical bottleneck I hadn't considered: consonants.
I was chewing them — clamping my jaw shut on every B, T, and K, collapsing the resonant chamber mid-phrase. The pitch tracker showed intonation wobbling at every consonant boundary, and the formant ratio dipped each time my jaw closed.
The fix was what my vocal coach would call "flicking" — minimal jaw movement, tongue does the work, jaw stays open. The resonant chamber stays intact between notes.
The data confirmed it immediately: intonation dropped from 15.5 to 13.2 cents, pitch control jumped from 81.8 to 85.6, and the raw score went from 88.1 to 90.7.
The pipeline caught something I couldn't feel happening. That's the whole point of building it.
Close your eyes
After an hour of declining scores — takes in the 70s, mix voice pitch scattering at 28 cents — I stopped looking at the screen. Closed my eyes. Sang like I already knew how.
The next take: 80.9. 100% mix register. Intonation at 18 cents without trying.
The three-step protocol and the formant targets are training tools. They teach the body what to do. But during a take, conscious monitoring of the metrics interferes with muscle memory.
The reps build the skill. The skill executes without thinking.
The data is for between takes, not during them.
The arc
I started in April 2025 with a Maono mic, Audacity, and no idea what I was doing. By February 2026 — after vocal lessons, a complete recording-chain rebuild, and a measurement system that could tell me what my ear couldn't — here's what 10 months and 323 takes look like.
The era-by-era progression, showing mean and best scores on my internal rubric:
And the moments where it clicked:
The pipeline didn't just measure progress. It identified what to change. Every session, I ran the analysis, found the weakest metric, and knew exactly what to work on next.
The data replaced the feedback loop my ear couldn't provide.
The pattern
vocal-tools started because I have a damaged ear. My other tools started the same way: a constraint creates a gap.
You can accept the gap, or close it. I build things that close it — and I build them with enough rigor that the evidence stands on its own.