Building a Vocal Analysis Pipeline to Replace My Ear
I have a perforated right eardrum — about 50% hearing loss on that side. When I started learning to sing in 2025, I couldn't reliably tell whether I was hitting the notes.
So I built a tool that could.
The problem
When you're learning to sing, the feedback loop is everything: you sing, you hear, you adjust. But when your ear is damaged, that loop is broken.
A vocal coach can tell you what they hear in a lesson, but they're not there for every take at 11 PM in your bedroom.
I needed an objective replacement for the ear I couldn't trust. Not a pitch tuner — those show you one note at a time. I needed a system that could evaluate an entire performance the way a producer would: pitch accuracy, yes, but also tone, dynamics, phrasing, consistency, and presence.
What I built
vocal-tools is a Python pipeline that analyzes recordings across 70+ metrics and produces a composite score (0–100) on my internal rubric, calibrated against a reference set of professional recordings. It's built on NumPy, SciPy, parselmouth, and librosa — no ML models, no black boxes. Every metric is deterministic and reproducible.
The core command is simple:
$ python vocal_take_analyzer.py take.wav --outdir analysis/ --style mix
It produces a full report: composite score, per-metric breakdowns, phrase-level data, and detailed JSON output for downstream processing.
The scoring system
The composite score uses sigmoid-normalized metrics with weighted category aggregation across five areas:
- Pitch control (~30%): intonation deviation (cents), pitch stability, vibrato regularity, register consistency
- Tonal quality (~25%): tone clarity, formant positioning (F1/F2), singer's-formant proxy ("ring"), harmonic-to-noise ratio
- Dynamic expression (~20%): phrase-level dynamic range, verse–chorus arc, micro-dynamics within phrases
- Vocal presence (~15%): projection/forward-placement proxy via roughly 2.5–3.5 kHz energy consistency
- Technical execution (~10%): onset quality, breath management proxies, consonant clarity, phrase endings
Each metric passes through a sigmoid curve with a calibrated midpoint per style, compressing outliers and centering scoring around the discriminative range. In my rubric, 85+ is competent amateur work. 90+ is semi-professional. The pitch-corrected commercial reference track I calibrated against scores 91.2.
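The exact midpoints, steepnesses, and weights are calibrated per style inside the tool. The shape of the idea, with hypothetical calibration values and a made-up function name, looks like this:

```python
import math

def sigmoid_score(value, midpoint, steepness, invert=False):
    """Map a raw metric onto 0-100 with a calibrated midpoint.

    invert=True is for metrics where lower is better
    (e.g. intonation deviation in cents). At the midpoint the
    score is exactly 50; outliers are compressed toward 0/100.
    """
    x = steepness * (value - midpoint)
    if invert:
        x = -x
    return 100.0 / (1.0 + math.exp(-x))

# Hypothetical per-style calibration: metric -> (midpoint, steepness, invert)
CALIBRATION = {
    "intonation_cents": (18.0, 0.25, True),   # lower deviation is better
    "presence_ratio":   (0.09, 60.0, False),  # higher ring proxy is better
}

# Category weights from the rubric above
CATEGORY_WEIGHTS = {"pitch": 0.30, "tone": 0.25, "dynamics": 0.20,
                    "presence": 0.15, "technique": 0.10}

def composite(category_scores):
    """Weighted aggregation of per-category scores (each already 0-100)."""
    return sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items())
```

The `invert` flag matters: half the metrics (deviation, jitter) improve downward, the other half upward, and both need to land on the same 0–100 scale before aggregation.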
Signal processing details
The pipeline processes audio in stages:
- Pitch tracking: frame-by-frame F0 estimation via Praat's cross-correlation method (parselmouth, 10 ms hop), plus voiced/unvoiced classification.
- Phrase detection: segmentation using silence gaps (≥300 ms below −45 dBFS) and voiced boundaries.
- Formant analysis: LPC-based estimates (F1–F4) and a singer's-formant proxy (roughly 2.5–3.5 kHz energy relative to lower bands). This often tracks the "ring" and forward placement that trained singers develop.
- Register indicator: a heuristic proxy from spectral shape, used to flag where I tend to lose pitch control (mix range for my voice).
- Composite scoring: sigmoid aggregation so improvements matter more in the problem zone than near perfection.
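Of these stages, phrase detection is the simplest to show in full. A minimal sketch of silence-gap segmentation using the thresholds above (the function name and framing details are mine, not the tool's):

```python
import numpy as np

def detect_phrases(samples, sr, silence_db=-45.0, min_gap_s=0.3,
                   frame_s=0.01):
    """Segment mono audio into phrases using silence gaps.

    A frame is 'silent' when its RMS level falls below silence_db
    (dBFS, full scale = 1.0); a run of silent frames lasting at
    least min_gap_s ends a phrase. Returns (start_s, end_s) tuples.
    """
    frame = int(sr * frame_s)
    n = len(samples) // frame
    # Frame-wise RMS in dBFS, floored to avoid log(0) on pure silence
    rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    db = 20 * np.log10(np.maximum(rms, 1e-10))
    loud = db >= silence_db

    min_gap = int(min_gap_s / frame_s)
    phrases, start, gap = [], None, 0
    for i, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = i          # phrase begins at first loud frame
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:     # silence long enough: close the phrase
                phrases.append((start * frame_s, (i - gap + 1) * frame_s))
                start, gap = None, 0
    if start is not None:          # audio ended mid-phrase
        phrases.append((start * frame_s, n * frame_s))
    return phrases
```

Everything downstream — per-phrase scores, comp selection, phrase-ending metrics — hangs off these boundaries, so getting the gap threshold right matters more than it looks.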
What the data taught me
The biggest insight had nothing to do with code.
Monitoring was the biggest pitch lever
My early takes averaged more than 25 cents of intonation deviation — objectively rough. I spent weeks working on technique. The number barely moved.
Then I changed my monitoring setup: I replaced headphones with a Forbrain bone-conduction headset (reducing reliance on the damaged eardrum for monitoring), and added a little reverb to make pitch relationships easier to perceive.
My intonation dropped to 14–15 cents overnight.
For me, the bottleneck wasn't talent. It was that I couldn't hear myself accurately.
The drone breakthrough
With a full backing track, my pitch scattered — too many harmonic layers for one good ear to track cleanly.
So I stripped everything away and sang with a single B3 sine-wave drone (root note) generated in REAPER.
Singing with nothing but the drone, my intonation matched my best full-mix takes: 15.3 cents. One tone was enough.
That made it clear: pitch accuracy, in my case, was about reference signal clarity as much as it was about technique.
Single takes beat comps
I built a companion tool (vocal_comp_builder.py) that selects the best phrases from multiple takes and stitches them together. The greedy algorithm picks the highest-scoring segment at each phrase boundary.
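The selection logic is nothing exotic: an argmax per phrase slot. A sketch, assuming all takes have already been aligned to shared phrase boundaries (the data layout here is illustrative, not the tool's actual format):

```python
def build_comp(takes):
    """Greedy comp: for each phrase slot, pick the highest-scoring take.

    takes: dict mapping take name -> list of per-phrase scores,
           with every take aligned to the same phrase boundaries.
    Returns (selection, scores): the take chosen for each phrase,
    and that phrase's score.
    """
    names = list(takes)
    n_phrases = len(takes[names[0]])
    selection, scores = [], []
    for i in range(n_phrases):
        best = max(names, key=lambda name: takes[name][i])
        selection.append(best)
        scores.append(takes[best][i])
    return selection, scores
```

Note what's missing: there is no term rewarding staying on the same take across adjacent phrases. Each slot is optimized in isolation.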
The automated comp scored 88.4. The best single take scored 88.1 — nearly identical — and the comp sounded worse.
The algorithm switched takes in 31 of 43 phrases. The result was technically proficient, but emotionally flat: it optimized per-phrase metrics while destroying continuity.
A single committed performance with light correction consistently outperformed the technically perfect comp.
Formant targets
The singer's-formant proxy became the most diagnostic single metric in my data.
My best raw take (91.5) had a presence score of 100 and a singer's-formant proxy of 0.120. Takes where I pulled back and listened carefully: 0.058. Takes where I opened up and projected: 0.082–0.086.
For this song and this vocal shape, my best takes cluster around:
- F1 ~ 630 Hz (jaw opening proxy)
- F2 ~ 1790 Hz (tongue position proxy)
- singer's-formant proxy > 0.12
These aren't abstractions. They're measurements I check after every take.
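The proxy itself is cheap to approximate. The pipeline's exact formula isn't reproduced here, but one plausible definition — ring-band energy relative to everything below it — fits the numbers' scale:

```python
import numpy as np

def presence_proxy(samples, sr, ring_band=(2500.0, 3500.0)):
    """Singer's-formant proxy: spectral energy in ~2.5-3.5 kHz
    relative to the energy below that band.

    This is one reasonable definition, not necessarily the
    pipeline's; absolute values depend on the exact bands chosen.
    """
    spec = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sr)
    ring = spec[(freqs >= ring_band[0]) & (freqs < ring_band[1])].sum()
    below = spec[freqs < ring_band[0]].sum()
    return float(ring / max(below, 1e-12))
```

A voice with real "ring" concentrates a visible bump of energy in that band; a pulled-back, careful take doesn't, which is exactly the 0.12-versus-0.058 split above.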
Consonant flicking
After weeks of stalling around 88, I found a mechanical bottleneck I hadn't considered: consonants.
I was chewing them — clamping my jaw shut on every B, T, and K, collapsing the resonant chamber mid-phrase. The pitch tracker showed intonation wobbling at every consonant boundary, and the formant ratio dipped each time my jaw closed.
The fix was what my vocal coach would call "flicking" — minimal jaw movement, tongue does the work, jaw stays open. The resonant chamber stays intact between notes.
The data confirmed it immediately: intonation dropped from 15.5 to 13.2 cents, pitch control jumped from 81.8 to 85.6, and the raw score went from 88.1 to 90.7.
The pipeline caught something I couldn't feel happening. That's the whole point of building it.
Close your eyes
After an hour of declining scores — takes in the 70s, mix voice pitch scattering at 28 cents — I stopped looking at the screen. Closed my eyes. Sang like I already knew how.
The next take: 80.9. 100% mix register. Intonation at 18 cents without trying.
The three-step protocol and the formant targets are training tools. They teach the body what to do. But during a take, conscious monitoring of the metrics interferes with muscle memory.
The reps build the skill. The skill executes without thinking.
The data is for between takes, not during them.
The arc
I started in April 2025 with a Maono mic, Audacity, and no idea what I was doing. By February 2026 — after vocal lessons, a complete recording-chain rebuild, and a measurement system that could tell me what my ear couldn't — here's what 10 months and 323 takes look like.
The era-by-era progression, showing mean and best scores on my internal rubric:
And the moments where it clicked:
The pipeline didn't just measure progress. It identified what to change. Every session, I ran the analysis, found the weakest metric, and knew exactly what to work on next.
The data replaced the feedback loop my ear couldn't provide.
The pattern
vocal-tools started because I have a damaged ear. My other tools started the same way: a constraint creates a gap.
You can accept the gap, or close it. I build things that close it — and I build them with enough rigor that the evidence stands on its own.