How to use LocalDoc/azerbaijani-whisper-small with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="LocalDoc/azerbaijani-whisper-small")
# Load model directly
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
processor = AutoProcessor.from_pretrained("LocalDoc/azerbaijani-whisper-small")
model = AutoModelForSpeechSeq2Seq.from_pretrained("LocalDoc/azerbaijani-whisper-small")

Fine-tuned openai/whisper-small for Azerbaijani automatic speech recognition.
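The pipeline helper above is created but never called. A minimal sketch of calling it might look like the following. The `transcribe` wrapper and its `pipe` parameter are hypothetical names introduced for illustration, and passing `language` and `task` through `generate_kwargs` assumes a reasonably recent Transformers release. Decoding a file path inside the pipeline also assumes `ffmpeg` is available on the system.

```python
def transcribe(path, pipe=None):
    """Transcribe one audio file and return the stripped text.

    `pipe` is a hypothetical parameter: pass an already-built ASR pipeline
    to avoid reloading the model on every call.
    """
    if pipe is None:
        # Imported lazily so the wrapper can also be used with a stand-in callable.
        from transformers import pipeline
        pipe = pipeline(
            "automatic-speech-recognition",
            model="LocalDoc/azerbaijani-whisper-small",
        )
    # Assumption: recent Transformers versions accept these Whisper options here.
    result = pipe(path, generate_kwargs={"language": "az", "task": "transcribe"})
    return result["text"].strip()
```

Building the pipeline once and passing it in as `pipe` is the cheaper option when transcribing many files.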
| Model | Params | WER | CER |
|---|---|---|---|
| whisper-small (baseline) | 242M | 52.17% | 14.52% |
| whisper-medium (baseline) | 769M | 34.54% | 9.00% |
| whisper-large-v3 (baseline) | 1543M | 21.00% | 5.51% |
| azerbaijani-whisper-small | 242M | 20.54% | 5.72% |
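This model achieves a lower WER than whisper-large-v3 (20.54% vs. 21.00%) at a slightly higher CER (5.72% vs. 5.51%), while being roughly 6x smaller (242M vs. 1543M parameters).
Evaluated on the FLEURS Azerbaijani test set.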
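The size and accuracy comparison can be checked directly from the table. The short sketch below uses only the numbers reported above; the variable names are illustrative.

```python
# Parameter counts (millions) and WER (%) copied from the table above.
params_m = {"whisper-small": 242, "whisper-large-v3": 1543, "azerbaijani-whisper-small": 242}
wer_pct = {"whisper-small": 52.17, "whisper-large-v3": 21.00, "azerbaijani-whisper-small": 20.54}

# How many times larger whisper-large-v3 is than the fine-tuned model.
size_ratio = params_m["whisper-large-v3"] / params_m["azerbaijani-whisper-small"]

# Relative WER reduction of the fine-tuned model over its own baseline.
rel_wer_reduction = (
    wer_pct["whisper-small"] - wer_pct["azerbaijani-whisper-small"]
) / wer_pct["whisper-small"]

print(f"size ratio: {size_ratio:.1f}x")                     # size ratio: 6.4x
print(f"relative WER reduction: {rel_wer_reduction:.1%}")   # relative WER reduction: 60.6%
```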
pip install --upgrade transformers
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import soundfile as sf
import numpy as np
processor = WhisperProcessor.from_pretrained("LocalDoc/azerbaijani-whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("LocalDoc/azerbaijani-whisper-small")
audio, sr = sf.read("audio.wav")
# Downmix multi-channel audio to mono
if len(audio.shape) > 1:
    audio = audio.mean(axis=1)
audio = librosa.resample(np.asarray(audio, dtype=np.float32), orig_sr=sr, target_sr=16000)
sr = 16000
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
forced_ids = processor.get_decoder_prompt_ids(language="az", task="transcribe")
with torch.no_grad():
    ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)
text = processor.batch_decode(ids, skip_special_tokens=True)[0]
print(text)
Note: Audio must be 16 kHz mono. If your audio has a different sample rate, use
librosa.resample() as shown above. Passing audio without resampling will produce incorrect results.
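To check up front whether a file already meets the 16 kHz mono requirement, the WAV header can be inspected before loading the model. This is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files only. The names `wav_format`, `needs_conversion`, and `TARGET_SR` are hypothetical helpers, not part of this model's API.

```python
import wave

TARGET_SR = 16000  # sample rate required by the note above


def wav_format(path):
    """Return (sample_rate, channels) read from a PCM WAV header."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate(), wf.getnchannels()


def needs_conversion(path):
    """Return True if the file is not already 16 kHz mono."""
    sample_rate, channels = wav_format(path)
    return sample_rate != TARGET_SR or channels != 1
```

A file for which `needs_conversion` returns `True` should go through the downmix and resample steps shown in the usage code before being passed to the processor.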
pip install transformers torch soundfile librosa
All models were evaluated on the FLEURS Azerbaijani test split (921 samples) with the same text normalization (lowercased, punctuation removed).
| Model | Params | WER | CER | RTF (GPU) |
|---|---|---|---|---|
| whisper-tiny | 38M | 104.48% | 53.93% | 0.033 |
| whisper-base | 73M | 82.63% | 30.35% | 0.032 |
| whisper-small | 242M | 52.17% | 14.52% | 0.053 |
| whisper-medium | 769M | 34.54% | 9.00% | 0.097 |
| whisper-large-v3 | 1543M | 21.00% | 5.51% | 0.129 |
| whisper-large-v3-turbo | 809M | 22.99% | 6.55% | 0.024 |
| azerbaijani-whisper-small | 242M | 20.54% | 5.72% | ~0.05 |
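The metrics in the table can be sketched as follows. This is an illustrative implementation under stated assumptions, not the evaluation script used for the results above. The exact normalizer is not given, so `normalize` is a guess at "lowercase, no punctuation". RTF is taken in its conventional sense of processing time divided by audio duration. All function names are hypothetical.

```python
import unicodedata


def normalize(text):
    """Lowercase and drop punctuation (assumed normalization).

    str.lower() is locale-independent, so the Azerbaijani I/ı and İ/i
    pairs get no special handling here.
    """
    text = text.lower()
    kept = (ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join("".join(kept).split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one utterance; assumes a non-empty reference."""
    ref_words = normalize(reference).split()
    return edit_distance(ref_words, normalize(hypothesis).split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one utterance; spaces count as characters here."""
    ref_chars = normalize(reference)
    return edit_distance(ref_chars, normalize(hypothesis)) / len(ref_chars)


def rtf(processing_seconds, audio_seconds):
    """Real-time factor: values below 1 mean faster than real time."""
    return processing_seconds / audio_seconds
```

Because insertions count as errors, WER can exceed 100%, which is consistent with the 104.48% reported for whisper-tiny. Corpus-level scores are usually computed by summing edits and reference lengths over all utterances rather than averaging per-utterance rates.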
Apache 2.0