# Mega-ASR

**Repository Path**: cellinlab/Mega-ASR

## Basic Information

- **Project Name**: Mega-ASR
- **Description**: First foundation ASR built for the real world - 7 atomic acoustic conditions, 54 compound scenarios, 2.6M samples, and up to ~30% gains over SOTA where every other model falls apart. **You'll come back to MEGA-ASR, after the rest fail in the wild. ⭐**
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-05-26
- **Last Updated**: 2026-05-26

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

Mega-ASR Logo

Mega-ASR: Towards In-the-Wild^2 Speech Recognition via Scaling Up Real-world Acoustic Simulation

We introduce **MEGA-ASR**, the first foundation ASR model to target **full-scenario robust speech recognition in the wild** through systematic training on **7 atomic acoustic conditions** and **54 compound acoustic scenarios**. Built on **2.6M training samples** covering **noise, far-field speech, obstruction, echo and reverberation, recording artifacts, electronic distortion, and transmission dropout**, MEGA-ASR uses **A2S-SFT** and **DG-WGPO based RL** to achieve **gains of up to nearly 30%** over leading open-source and closed-source SOTA models in challenging acoustic environments. If you like the project, please give us a star ✨.

You’ll come back to Mega-ASR — after finding the rest fail in the real world.

Technical Report 📖 / Voices-in-the-wild-2M 🤗 / Mega-ASR Weights 🤗 / Voices-in-the-Wild-Bench 🏆

### Comparison with SOTA open-source and closed-source models

#### Sample 1

| Ground Truth | Mega-ASR (Ours) | Qwen3-ASR | Gemini-3-Pro | Seed-ASR | Whisper |
| --- | --- | --- | --- | --- | --- |
| ...and said to him let us go and eat some honey. Whose honey? inquired Kobay cautiously. My father's, Soongoora replied. Oh, all right, I'm with you, said the tortoise eagerly, and away they went. | He said to him let's go and eat some honey. It's honey? inquired very cautiously. My father is Superabundant — oh, all right, I will, said to her eagerly, and away they went. | `<empty>` | But tell me, that's how she met my father's sister. Oh, all right. I wish... I really... | My father is. Oh, all right, I wish you can. | ...to him... some honey... oh yeah... |
| Reference | WER: 47.1 ✅ | WER: 100.0 🔴 | WER: 86.1 🔴 | WER: 85.3 🔴 | WER: 92.5 🔴 |

More examples (Sample 2 – 6)

#### Sample 2

| Ground Truth | Mega-ASR (Ours) | Qwen3-ASR | Gemini-3-Pro | Seed-ASR | Whisper |
| --- | --- | --- | --- | --- | --- |
| To waste, I skip forty years, said the baker in tears, and proceed without further remark to the day when you took me aboard your ship to help you in hunting the snark. | To witness, I skip forty years, said the baker in tears, and proceed without further remark to the day when you took me aboard of your ship to help you in hunting the snark. | I skipped 40 years. Second day in here. Ever since you left, I've been a monk... | I spent forty years at sea and never seen a rougher than the day that you took me aboard your ship... | To wait. I skip forty years. Saturday and years. And proceed without further remark... | I skip forty years... to the day you took me on a ship... to hunt the shark. |
| Reference | WER: 5.9 ✅ | WER: 64.7 🟠 | WER: 64.7 🟠 | WER: 38.2 🟡 | WER: 71.5 🟠 |

#### Sample 3

| Ground Truth | Mega-ASR (Ours) | Qwen3-ASR | Gemini-3-Pro | Seed-ASR | Whisper |
| --- | --- | --- | --- | --- | --- |
| The friendly gang left the drug store. | The friendly gang left the drug store. | It's a friendly gang. That's the drug gang. | Friendly gang left the drugs. | The friendly gang left the drugstore. | A friendly young man left the drug store. |
| Reference | WER: 8.0 ✅ | WER: 57.1 🟠 | WER: 42.9 🟡 | WER: 28.6 🟢 | WER: 62.3 🟠 |

#### Sample 4

| Ground Truth | Mega-ASR (Ours) | Qwen3-ASR | Gemini-3-Pro | Seed-ASR | Whisper |
| --- | --- | --- | --- | --- | --- |
| The set of china hit the floor with a crash. | The set of china hit the floor with a crash. | The bed is fine. It hit the floor with a crash. | He said it's fine I hit the forward slash. | The sound of china hits the floor with a crash. | The chef of China hit the floor with a clash. |
| Reference | WER: 8.0 ✅ | WER: 40.0 🟡 | WER: 100.0 🔴 | WER: 20.0 🟢 | WER: 55.0 🟠 |

#### Sample 5

| Ground Truth | Mega-ASR (Ours) | Qwen3-ASR | Gemini-3-Pro | Seed-ASR | Whisper |
| --- | --- | --- | --- | --- | --- |
| Among export-led electrical and computer makers, Japan Victor Company fell fifty to two thousand three hundred twenty. | Among export-led (missing: electrical and) computer makers, Japan Victor Company fell fifty to two thousand three hundred twenty. | Among export-led (missing: electrical and) computer makers, Japan VictorNet sold fifty-two thousand three hundred fifty. | Among export-led (missing: electrical and) computer makers, Japan Victor Co. fell 50 to 2,350 yen. | Among export-led in computer makers, Japan Victor Company sell 50 to 2300 unit. | Among exporters, computer makers in Japan victor companies sold fifty... |
| Reference | WER: 11.1 ✅ | WER: 38.9 🟡 | WER: 35.7 🟡 | WER: 50.0 🟠 | WER: 66.7 🟠 |

#### Sample 6

| Ground Truth | Mega-ASR (Ours) | Qwen3-ASR | Gemini-3-Pro | Seed-ASR | Whisper |
| --- | --- | --- | --- | --- | --- |
| Has exposure really been reduced? | Has exposure really been reduced. | Has exposure really done you? | Has the closure really affected you? | Has exposure to beauty products. | Have those who really been refused? |
| Reference | WER: 8.0 ✅ | WER: 40.0 🟡 | WER: 80.0 🔴 | WER: 60.0 🟠 | WER: 78.5 🔴 |

## 🔥News

- [Coming]: We are going to release the RL code and optimize the WebUI.
- [Coming]: The dataset and benchmark will be reformatted to be clearer.
- [Coming]: We will release the full data processing pipeline.
- **May 20, 2026**: 🔥 We release **Voices-in-the-Wild-Bench**, a benchmark for in-the-wild ASR robustness evaluation.
- **May 20, 2026**: 🔥 We release **Voices-in-the-Wild-2M**.
- **May 20, 2026**: 🔥 We release the **Mega-ASR Inference and Training Codebase**.
- **May 19, 2026**: 🔥 **Mega-ASR** model weights are now available on Hugging Face.
- **May 19, 2026**: 🔥 We release the **Mega-ASR Technical Report**.

## Overview

* **[Quick Start](#quick-start)**
* **[Introduction](#introduction)**
* **[Inference and deployment](#inference)**
* **[Finetuning](#finetune)**
* **[Evaluation](#evaluation)**
* **[Citation and licence](#citation)**

## Quick Start

Mega-ASR is trained on a large volume of inherently high-WER data, which slightly degrades its basic recognition capability. To address this, **we equip the system with a router** that decides, for each audio input, whether Mega-ASR should be activated, that is, whether the LoRA weights are mounted.

**Installation**

```bash
git clone https://github.com/xzf-thu/Mega-ASR.git
cd Mega-ASR

conda create -n mega-asr python=3.10 -y
conda activate mega-asr

pip install -r requirements.txt
```

**Download Weights**

```bash
python scripts/download.py
```

**Offline Inference**

```bash
# Infer with the default audio
bash scripts/inference.sh

# Use your own audio
bash scripts/inference.sh --audio /path/to/audio.wav
```

## Introduction

**MEGA-ASR** is purpose-built for **full-scenario robust ASR in the wild** and is particularly strong at **semantic recovery** and **local keyword reconstruction** under severe acoustic degradation. It substantially reduces common failure modes such as **hallucinations**, **empty outputs**, and **dropped utterances**, making speech recognition reliable in truly challenging real-world environments.

Results

### Features

✅ **One model for the messy real world**: Covers **7 atomic acoustic conditions** and **54 compound acoustic scenarios** in a single model.

✅ **Stronger recovery under severe distortion**: Excels at **semantic recovery** and **local keyword reconstruction**, greatly reducing **hallucinations**, **empty outputs**, and **dropped utterances**.

✅ **SOTA robust ASR performance**: Achieves gains of up to nearly **30%** over leading open-source and closed-source SOTA models in challenging acoustic environments.

## Finetuning

You can further fine-tune Mega-ASR on your own scenarios and data. You can also use this repository to train Qwen3-ASR directly.

### A2S-SFT

`src/MegaASR/A2S-SFT` contains the core training code for Mega-ASR A2S-SFT.

```text
src/MegaASR/A2S-SFT/
├── arguments.py      # Defines command-line arguments and training hyperparameters.
├── checkpointing.py  # Saves base-model metadata and required processor/tokenizer files for LoRA reuse.
├── dataloader.py     # Loads JSONL data, reads audio, builds model inputs, and masks non-target tokens.
├── finetune.py       # Main entry point for launching A2S-SFT training.
├── modeling.py       # Loads Qwen3-ASR and defines LoRA injection scopes.
└── trainer.py        # Defines MegaASRTrainer with adapter-only saving and module-wise learning rates.
```

Training data is in JSONL format. The record below is pretty-printed for readability; in the file, each record occupies a single line:

```json
{
  "audio": ".../wavs/test-clean/61/70968/61-70968-0000.wav",
  "text": "language English<asr_text>THE TRANSCRIPT TEXT",
  "prompt": ""
}
```

Launch training with the following command:
```bash
torchrun --nproc_per_node=2 A2S_SFT/finetune.py \
  --model_path Qwen3-ASR-1.7B --train_file ${TRAIN_JSONL} \
  --eval_file ${VAL_JSONL} --output_dir ${OUT_DIR} \
  --batch_size 8 --grad_acc 8 \
  --lr 1e-6 --lr_encoder 1e-6 --lr_aligner 1e-6 --lr_llm 1e-6 \
  --epochs 2 --save_steps 200 --save_total_limit 300 --use_lora 1 \
  --lora_scope all --lora_r 8 --lora_alpha 16 --lora_dropout 0.05 \
  --warmup_ratio 0.05 --max_grad_norm 1.0 --weight_decay 0.01 \
  --run_name ${RUN_NAME} --report_to wandb \
  2>&1 | tee -a ${LOG_FILE}
```

The DG-WGPO reinforcement learning module will be released in a future update.

## Evaluation

We provide a simple evaluation script for running Mega-ASR inference and computing WER/CER.

The input file should be a JSONL file. Each line requires only two fields:

```json
{"audio": "examples/audio/noise.wav", "answer": "I usually take the quieter road home because the main street gets crowded after work."}
```

The script keeps all original fields and appends the following fields to the output JSONL:

```text
prediction   # model transcription
metric       # "wer" for English samples, "cer" for Chinese samples
wer          # WER/CER score value; CER is also stored in this field for compatibility
num_edits    # edit distance between prediction and ground truth
ref_len      # number of reference words or characters
```

The script reuses the same Mega-ASR wrapper as `infer.py`, loading the base model, LoRA, and router from `ckpt/Mega-ASR`.

```bash
python src/MegaASR/eval/evaluate_wer.py \
  --ckpt_dir ckpt/Mega-ASR \
  --input_jsonl examples/test.jsonl \
  --output_jsonl outputs/pred_with_wer.jsonl
```
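To make the `wer`, `num_edits`, and `ref_len` fields concrete, here is a minimal sketch of how such per-sample scores can be computed. It is not the code in `evaluate_wer.py`: the function names, the text normalization (lowercasing, dropping punctuation), the CJK-based choice between `wer` and `cer`, and the percentage scale are all assumptions made for illustration.

```python
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def contains_cjk(text):
    """Assumed heuristic: any CJK ideograph means the sample is scored with CER."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def tokenize(text, metric):
    """Assumed normalization: lowercase and drop punctuation."""
    text = text.lower()
    if metric == "cer":
        return [ch for ch in text if ch.isalnum()]
    return re.findall(r"[a-z0-9']+", text)


def score_sample(answer, prediction):
    """Return the score fields for one sample, with the score as a percentage."""
    metric = "cer" if contains_cjk(answer) else "wer"
    ref = tokenize(answer, metric)
    hyp = tokenize(prediction, metric)
    num_edits = edit_distance(ref, hyp)
    ref_len = len(ref)
    score = 100.0 * num_edits / ref_len if ref_len else 0.0
    return {"metric": metric, "wer": score, "num_edits": num_edits, "ref_len": ref_len}
```

Under these assumptions, scoring the Seed-ASR hypothesis from Sample 3 (`"The friendly gang left the drugstore."`) against the reference gives 2 edits over 7 reference words, or about 28.6, which agrees with the value shown in that table. An empty prediction scores 100 because every reference token counts as a deletion.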

Mega-ASR Training

**Mega-ASR** is trained with an acoustic-to-semantic progressive supervised fine-tuning strategy. It first curriculum-trains the encoder and aligner on increasingly difficult samples (WER<30%, then WER<50%, then WER<70%), then fine-tunes the LLM on WER<70% data to strengthen semantic recovery, and finally jointly fine-tunes the full encoder-aligner-LLM stack for end-to-end alignment.

On top of Mega-ASR-Base, DG-WGPO further optimizes the model with WER-gated policy learning: low-WER samples emphasize token-level acoustic refinement, while high-WER samples emphasize sentence-level semantic reconstruction to reduce hallucinations, omissions, and off-audio outputs. The final reward combines a static WER-based accuracy signal with an anti-repetition gate and a dynamic dual-granularity reward, using fixed hyperparameters τ=0.3, αs=0.4, and αdyn=0.6.

Run Mega-ASR inference without routing if you want to force the LoRA on every sample:

```bash
python src/MegaASR/eval/evaluate_wer.py \
  --ckpt_dir ckpt/Mega-ASR \
  --input_jsonl examples/test.jsonl \
  --output_jsonl outputs/pred_with_wer.jsonl \
  --no-routing
```

Each input line requires `audio` or `audio_path`, plus `answer` as the ground-truth transcription.

**Mega-ASR** is evaluated across three benchmark families: classical academic test sets, robustness benchmarks, and our own in-the-wild compound benchmark.
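To summarize an evaluation run, the per-sample `num_edits` and `ref_len` fields in the output JSONL can be pooled into one corpus-level score per metric. The sketch below is an assumption-laden illustration, not part of the repository: `load_jsonl` and `aggregate` are hypothetical helper names, and whether the project's reported numbers are pooled this way or averaged per sample is not stated here.

```python
import json
from collections import defaultdict


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def aggregate(records):
    """Pool edits and reference lengths per metric and return percentages.

    Summing edits before dividing (micro-averaging) weights long utterances
    more heavily than a plain mean of per-sample scores would.
    """
    totals = defaultdict(lambda: [0, 0])  # metric -> [total edits, total ref length]
    for rec in records:
        bucket = totals[rec["metric"]]
        bucket[0] += rec["num_edits"]
        bucket[1] += rec["ref_len"]
    return {
        metric: 100.0 * edits / ref_len
        for metric, (edits, ref_len) in totals.items()
        if ref_len > 0
    }
```

A typical call would be `aggregate(load_jsonl("outputs/pred_with_wer.jsonl"))`, which returns a dictionary with a `"wer"` entry for English samples and a `"cer"` entry for Chinese samples when both are present.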

Mega-ASR Results

## Acknowledgements

We sincerely thank the creators, maintainers, and contributors of the public datasets used in this work, including MUSAN, DNS Challenge, ESC-50, UrbanSound8K, LibriSpeech, Common Voice, WenetSpeech, and AISHELL-1. We also sincerely thank the Qwen3-ASR Team for developing such an excellent foundation model, which provides a strong backbone for this work.

## Licence, Citation and stars

This project will be released under the **Apache-2.0 License**. You can do everything with Mega-ASR 🎉

**Citation**: You can cite Mega-ASR using the following BibTeX entry. Thank you for your kindness 🙂

```bibtex
@misc{xie2026megaasrinthewild2speechrecognition,
  title={Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation},
  author={Zhifei Xie and Kaiyu Pang and Haobin Zhang and Deheng Ye and Xiaobin Hu and Shuicheng Yan and Chunyan Miao},
  year={2026},
  eprint={2605.19833},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2605.19833},
}
```