Methodology

How Musavox transcribes music

Musavox uses a multi-stage AI pipeline specifically designed for music — not speech, not podcasts, not meetings. Each stage is purpose-built to handle the challenges that make music transcription fundamentally different from spoken word: overlapping vocals, ad-libs, heavy instrumentation, regional slang, and bilingual code-switching.

Pipeline Overview

Audio Ingestion → Vocal Isolation → Speech Recognition → Contextual Enhancement → Quality Scoring

Stage 1 — Vocal Isolation

Before any transcription occurs, the audio is processed through a source separation model that isolates the vocal track from the instrumental mix. This step is critical — standard speech recognition models are trained on clean voice recordings and fail dramatically when processing full music mixes with bass, drums, and effects.

The isolation model separates audio into stems (vocals, drums, bass, other), retaining only the vocal track for downstream processing. Vocal isolation materially raises the transcription floor on tracks with heavy 808s, autotune, and layered vocal production common in Latin urban music; downstream region-aware post-processing then refines the output for the dialect and genre.
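The document does not name the separation model; as an illustrative stand-in, the sketch below uses the open-source Demucs CLI, which splits a mix into the same four stems (vocals, drums, bass, other):

```python
# Illustrative stand-in using the open-source Demucs CLI, not necessarily
# the separation model Musavox runs in production.
import subprocess
from pathlib import Path

def isolate_vocals(track: Path, out_dir: Path) -> Path:
    """Separate a full mix into four stems and return the vocals stem."""
    subprocess.run(["demucs", "-o", str(out_dir), str(track)], check=True)
    # Demucs writes <out_dir>/<model>/<track>/{vocals,drums,bass,other}.wav
    model_dir = next(out_dir.iterdir())  # e.g. "htdemucs"
    return model_dir / track.stem / "vocals.wav"
```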

Stage 2 — Speech Recognition

The isolated vocal track is processed by a large-scale automatic speech recognition (ASR) model optimized for multilingual audio. The model produces:

  • Full text transcription of the vocal performance
  • Word-level timestamps (start and end time for each word)
  • Language detection (English, Spanish, or mixed)

The word-level timestamps enable synchronized lyric export (LRC format) and are preserved through all subsequent processing stages.
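Musavox does not disclose which ASR model it uses; a minimal sketch with open-source Whisper, which returns all three outputs listed above, looks like this:

```python
import whisper  # open-source stand-in; the production ASR model is not named

model = whisper.load_model("large-v3")
result = model.transcribe("vocals.wav", word_timestamps=True)

print(result["language"])  # detected language code, e.g. "es"
print(result["text"])      # full text of the vocal performance
for segment in result["segments"]:
    for word in segment["words"]:  # word-level timing used for LRC export
        print(f'{word["start"]:7.2f} {word["end"]:7.2f} {word["word"]}')
```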

Stage 3 — Contextual Enhancement

Raw ASR output is accurate but lacks musical context. A large language model post-processes the transcription with domain-specific intelligence:

Regional dialect correction

A curated lexicon of 302+ terms covers eight regional varieties: Puerto Rican, Mexican, Colombian, Dominican, Argentine, Chilean, Venezuelan, and U.S. Latin Spanish. Coverage expands as new territories enter the platform. The model corrects ASR errors using linguistic context — for example, distinguishing "jevo" (Puerto Rican slang for a romantic partner) from "huevo" (egg), or recognizing "corillo" as a social group rather than a transcription error.
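As a rough illustration of how a glossary can steer the post-processing model (the real 302-term lexicon and prompt wording are not published), Phase 1 of the dialect engine embeds entries like these directly in the enhancement prompt:

```python
# Tiny illustrative excerpt; entry format and prompt text are assumptions.
LEXICON = {
    "puerto_rican": {
        "jevo": "romantic partner (often misheard as 'huevo')",
        "corillo": "close circle of friends, not a transcription error",
    },
}

def build_enhancement_prompt(transcript: str, region: str) -> str:
    glossary = "\n".join(
        f"- {term}: {gloss}" for term, gloss in LEXICON[region].items()
    )
    return (
        "Correct likely ASR errors in these lyrics using the regional "
        "glossary below. Preserve line breaks and timestamps.\n\n"
        f"Glossary ({region}):\n{glossary}\n\nLyrics:\n{transcript}"
    )
```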

Section detection

Identifies verse, pre-chorus, chorus, bridge, and outro boundaries based on lyrical repetition patterns and structural cues.
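A minimal sketch of the repetition cue; the production detector presumably combines several signals, and this shows only the core idea:

```python
# Lines that recur nearly verbatim are strong chorus candidates.
from collections import Counter

def chorus_candidates(lines: list[str], min_repeats: int = 2) -> set[str]:
    normalized = [" ".join(l.lower().split()) for l in lines if l.strip()]
    counts = Counter(normalized)
    return {line for line, n in counts.items() if n >= min_repeats}
```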

Ad-lib classification

Separates ad-libs (exclamations, vocal effects, producer tags) from primary lyrics. Each ad-lib is tagged with its type and timestamp. This matters for publishing: ad-libs are not copyrightable lyrics.
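A sketch of what such a tag could look like as a data record; the field names are illustrative, not Musavox's actual schema:

```python
from dataclasses import dataclass

@dataclass
class AdLib:
    text: str    # e.g. "¡wuh!"
    kind: str    # "exclamation" | "vocal_effect" | "producer_tag"
    start: float # seconds
    end: float

    def is_copyrightable(self) -> bool:
        return False  # ad-libs are excluded from the registered lyric body
```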

Code-switch detection

When an artist switches language mid-verse (common in Latin music and hip-hop), the system marks the transition point. This affects copyright registration metadata and translation workflows.
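Assuming each word carries a language label from the recognition stage, marking transitions reduces to finding where consecutive labels differ; a minimal sketch:

```python
def code_switch_points(word_langs: list[tuple[str, str]]) -> list[int]:
    """word_langs: [(word, lang), ...] -> indices where the language flips."""
    return [
        i for i in range(1, len(word_langs))
        if word_langs[i][1] != word_langs[i - 1][1]
    ]

# e.g. code_switch_points([("dime", "es"), ("baby", "en")]) -> [1]
```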

Producer tag identification

Recognizes and labels producer tags that appear at the beginning or within tracks, keeping them separate from lyrical content.

Stage 4 — Quality Scoring

Every line of the transcription receives a confidence score from 0.0 to 1.0. Lines scoring below 0.7 are flagged for human review. The editor displays these as highlighted lines with alternative suggestions — the model's second- and third-best interpretations of what was sung.

This scoring system means a reviewer can focus only on uncertain lines instead of reading the entire transcription, materially reducing review time per track.
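A sketch of the review-queue filter implied by that threshold; the structure of a scored line is assumed, not documented:

```python
REVIEW_THRESHOLD = 0.7

def lines_needing_review(lines: list[dict]) -> list[dict]:
    """lines: [{"text": ..., "confidence": ..., "alternatives": [...]}]"""
    return [l for l in lines if l["confidence"] < REVIEW_THRESHOLD]
```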

Stage 5 — Adaptive Dialect Engine

The system improves over time. When a user corrects a transcription in the editor, the correction is recorded with its context: the original ASR output, the corrected text, the language, and the regional variant.
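A sketch of such a correction record, with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class CorrectionRecord:
    original: str   # raw ASR output, e.g. "huevo"
    corrected: str  # user's fix, e.g. "jevo"
    language: str   # "es", "en", or "mixed"
    region: str     # e.g. "puerto_rican"
    context: str    # surrounding lyric line
```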

Phase 1 (current)

Static curated glossary embedded in the enhancement prompt.

Phase 2 (planned)

Corrections accumulate as training data, enabling region-specific accuracy improvements.

Phase 3 (future)

Custom model fine-tuned on accumulated corrections — a proprietary dataset that no competitor can replicate.

Export Formats

Processed transcriptions can be exported in industry-standard formats:

  • TXT: Plain text with section markers. Universal compatibility.
  • LRC: Synchronized lyrics with timestamps. Used by DSPs, karaoke systems, and music players for real-time lyric display (see the sketch below).
  • SRT (planned): Subtitle format for music videos.
  • JSON (planned): Structured data for API consumers and integrations.
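For concreteness, a minimal LRC writer, assuming per-line start times carried over from the recognition stage; LRC's conventional form is "[mm:ss.xx]lyric":

```python
def to_lrc(lines: list[tuple[float, str]]) -> str:
    out = []
    for start, text in lines:
        m, s = divmod(start, 60.0)
        out.append(f"[{int(m):02d}:{s:05.2f}]{text}")
    return "\n".join(out)

# to_lrc([(12.5, "corillo, llegó la noche")]) -> "[00:12.50]corillo, llegó la noche"
```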

Processing Time

A typical 3.5-minute track completes processing in 30–90 seconds. Real-time pipeline visibility shows each stage as it progresses: uploading, isolating vocals, transcribing, enhancing, complete.

Deduplication

If the same audio file is uploaded twice (detected by comparing a content hash computed before upload), the system returns the existing transcription instantly, at zero cost and with zero processing time.
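A sketch of that check, assuming a SHA-256 content hash and a hypothetical lookup endpoint:

```python
import hashlib

def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical backend lookup, run before any upload:
# existing = api.get(f"/transcriptions/by-hash/{file_sha256('track.mp3')}")
# if existing: return existing  # zero cost, zero processing time
```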

Questions about our methodology? Contact us at support@musavox.io