---
datasets:
  - openslr/librispeech_asr
  - facebook/multilingual_librispeech
  - mozilla-foundation/common_voice_17_0
  - speechcolab/gigaspeech
  - facebook/voxpopuli
  - agkphysics/AudioSet
language:
  - en
library_name: transformers
license: bsd-3-clause
pipeline_tag: feature-extraction
tags:
  - automatic-speech-recognition
  - audio-classification
  - audio
  - speech
  - music
---

# USAD: Universal Speech and Audio Representation via Distillation

The model was presented in the paper USAD: Universal Speech and Audio Representation via Distillation.

The abstract of the paper is as follows:

Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types - speech, sound, and music - into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on SUPERB and HEAR benchmarks.

Universal Speech and Audio Distillation (USAD) is a unified speech, sound, and music encoder distilled from domain-specific teachers. Trained on 126k hours of mixed data, USAD delivers competitive performance across diverse benchmarks (SUPERB, HEAR, and AudioSet) with a single model.

👀 [Read Full Paper](https://arxiv.org/abs/2506.18843)

Code: [MIT-SLS/USAD](https://github.com/MIT-SLS/USAD)


πŸ—‚οΈ Models

USAD models are all Transformer encoders operating at a 50 Hz frame rate. The teacher models are WavLM Base+ and ATST Frame.
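For a quick sense of what the 50 Hz frame rate means for output lengths, here is a back-of-the-envelope sketch (the exact frame count may differ by a frame or two depending on padding):

```python
# Rough relationship between input samples (16 kHz) and encoder frames (50 Hz).
# This is only an estimate; the model's exact frame count may differ slightly.
sample_rate = 16_000                       # expected input sampling rate (Hz)
frame_rate = 50                            # encoder output frame rate (Hz)
hop_samples = sample_rate // frame_rate    # 320 input samples per output frame

duration_s = 10                            # e.g. a 10-second clip
approx_frames = duration_s * frame_rate    # ~500 encoder frames
approx_samples = duration_s * sample_rate  # 160,000 input samples
print(hop_samples, approx_frames, approx_samples)
```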

| Model | Parameters | Dim | Layers | Checkpoint |
|---|---|---|---|---|
| USAD Small | 24M | 384 | 12 | link |
| USAD Base | 94M | 768 | 12 | link |
| USAD Large | 330M | 1024 | 24 | link |
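If you want to double-check which variant you have loaded, counting parameters and comparing against the table above is a quick sanity check (a minimal sketch; `MIT-SLS/USAD-Base` is the only repository ID used in this card):

```python
from transformers import AutoModel

# Load the Base checkpoint and count its parameters; per the table above this
# should come out to roughly 94M.
model = AutoModel.from_pretrained("MIT-SLS/USAD-Base", trust_remote_code=True)
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters")
```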

## 🚀 How To Use

### Installation

```bash
pip install -U transformers
```

### Load Model and Extract Features

```python
import torch
from transformers import AutoModel

# Load pre-trained model
model = AutoModel.from_pretrained("MIT-SLS/USAD-Base", trust_remote_code=True).cuda().eval()

# Load audio and resample to 16 kHz
wav = model.load_audio("path/to/audio").unsqueeze(0)  # (batch_size, wav_len)
# wav is a float tensor on the same device as the model
# You can also load waveforms directly with torchaudio.load (see the sketch after this block)

# Extract features
with torch.no_grad():
    results = model(wav)

# results["x"]:              model final output (batch_size, seq_len)
# results["mel"]:            mel fbank (batch_size, seq_len * 2, mel_dim)
# results["hidden_states"]:  list of (batch_size, seq_len, encoder_dim)
# results["ffn"]:            list of (batch_size, seq_len, encoder_dim)
```

See usad_model.py for more details about the model.
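Downstream recipes on SSL encoders (e.g. the SUPERB-style setups mentioned above) often combine the per-layer features in `results["hidden_states"]` rather than using only the final output. Below is a generic, minimal sketch of a learnable weighted sum over layers; it is a common pattern, not an API prescribed by this model card:

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learnable softmax-weighted sum over a list of per-layer features."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of (batch_size, seq_len, encoder_dim) tensors
        stacked = torch.stack(hidden_states, dim=0)        # (num_layers, B, T, D)
        norm_weights = torch.softmax(self.weights, dim=0)  # (num_layers,)
        return (norm_weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, T, D)

# Example: pool USAD's layer outputs (from `results` above) into one feature sequence
layer_pool = WeightedLayerSum(num_layers=len(results["hidden_states"])).cuda()
features = layer_pool(results["hidden_states"])  # (batch_size, seq_len, encoder_dim)
```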


## 📖 Citation

```bibtex
@article{chang2025usad,
  title={{USAD}: Universal Speech and Audio Representation via Distillation},
  author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
  journal={arXiv preprint arXiv:2506.18843},
  year={2025}
}
```

πŸ™ Acknowledgement

Our implementation is based on the awesome facebookresearch/fairseq, cwx-worst-one/EAT, and sooftware/conformer repositories.