# VibeVoice ONNX INT4 - Browser TTS - TESTS FAILED - need to reduce the model size or simplify the generation mode and double diffusion
An ONNX INT4-quantized version of Microsoft VibeVoice-Realtime-0.5B for browser deployment with transformers.js and ONNX Runtime Web.
## Model Components

| File | Size | Description |
|---|---|---|
| `tts_llm_int4.onnx` | 702 MB | Qwen2-based language model (text → hidden states) |
| `vocoder_int4.onnx` | 339 MB | σ-VAE decoder (latents → audio) |
| `diffusion_head_int4.onnx` | 25 MB | DDPM diffusion head (hidden states → latents) |
| `diffusion_head_int8.onnx` | 40 MB | INT8 variant for comparison |
| **Total** | **~1.07 GB** | Down from 3.1 GB FP32 |
## Architecture

```
Text Input
    ↓
[Tokenizer] → Token IDs (vocab: 151936)
    ↓
[TTS LLM] → Hidden States (896-dim)
    ↓
[Diffusion Head] → Noise (20 steps, cosine schedule)
    ↓
Acoustic Latents (64-dim)
    ↓
[Vocoder]
    ↓
Audio Waveform (24 kHz)
```
## Usage with ONNX Runtime Web

```js
import * as ort from 'onnxruntime-web';

// Load the three exported graphs
const llm = await ort.InferenceSession.create('tts_llm_int4.onnx');
const diffusion = await ort.InferenceSession.create('diffusion_head_int4.onnx');
const vocoder = await ort.InferenceSession.create('vocoder_int4.onnx');

// Run the LLM: token IDs -> 896-dim hidden states
const inputIds = new ort.Tensor('int64', tokenizedText, [1, seqLen]);
const attentionMask = new ort.Tensor('int64', ones, [1, seqLen]);
const positionIds = new ort.Tensor('int64', positions, [1, seqLen]);
const { hidden_states } = await llm.run({
  input_ids: inputIds,
  attention_mask: attentionMask,
  position_ids: positionIds
});

// Run the diffusion head (20 steps: timesteps 999, 949, ..., 49)
let latent = noise;
for (let t = 999; t >= 0; t -= 50) {
  const { v_prediction } = await diffusion.run({
    noisy_latent: latent,
    timestep: new ort.Tensor('float32', [t], [1]),
    hidden_states: hidden_states
  });
  latent = denoise(latent, v_prediction, t); // DDPM v-prediction update (see sketch below)
}

// Run the vocoder: 64-dim acoustic latents -> 24 kHz waveform
const { audio } = await vocoder.run({ latents: latent });
```
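The `denoise` helper and the initial `noise` tensor are not part of the exported graphs and must be supplied by the caller (`noise` is standard Gaussian noise with the acoustic-latent shape expected by the diffusion head). Below is a minimal sketch of a deterministic DDIM-style (η = 0) update for a v-prediction model under the standard cosine ᾱ schedule; the schedule constants (1000 training steps, s = 0.008) are assumptions, so verify them against the original pipeline before relying on the output.

```js
const NUM_TRAIN_STEPS = 1000; // assumed length of the training noise schedule
const STEP = 50;              // loop stride above: 1000 train steps / 20 inference steps

// Cosine schedule (Nichol & Dhariwal): cumulative alpha_bar(t)
function alphaBar(t) {
  const f = (x) => Math.cos(((x / NUM_TRAIN_STEPS + 0.008) / 1.008) * (Math.PI / 2)) ** 2;
  return f(t) / f(0);
}

// One deterministic (DDIM, eta = 0) update for a v-prediction model.
// Takes and returns ort.Tensor objects so it drops into the loop above.
function denoise(latentTensor, vTensor, t) {
  const tPrev = t - STEP;                        // next (less noisy) timestep; < 0 on the final step
  const ab = alphaBar(t);
  const abPrev = tPrev >= 0 ? alphaBar(tPrev) : 1.0;
  const x = latentTensor.data;
  const v = vTensor.data;
  const out = new Float32Array(x.length);
  for (let i = 0; i < x.length; i++) {
    // Recover x0 and eps from the v-parameterization: v = sqrt(ab)*eps - sqrt(1-ab)*x0
    const x0 = Math.sqrt(ab) * x[i] - Math.sqrt(1 - ab) * v[i];
    const eps = Math.sqrt(1 - ab) * x[i] + Math.sqrt(ab) * v[i];
    out[i] = Math.sqrt(abPrev) * x0 + Math.sqrt(1 - abPrev) * eps;
  }
  return new ort.Tensor('float32', out, latentTensor.dims);
}
```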
## Configuration

```json
{
  "audio": {
    "sample_rate": 24000,
    "vae_dim": 64
  },
  "llm": {
    "hidden_size": 896,
    "num_hidden_layers": 20,
    "vocab_size": 151936
  },
  "diffusion": {
    "num_inference_steps": 20,
    "beta_schedule": "cosine",
    "prediction_type": "v_prediction"
  }
}
```
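To hear the result in the browser, the vocoder output can be copied into a Web Audio `AudioBuffer` at the configured 24 kHz sample rate. A small sketch, assuming `audio` is the mono float32 tensor returned by the vocoder above (its exact output shape is an assumption):

```js
// Play the generated waveform (mono, 24 kHz per the config above)
function playAudio(audioTensor) {
  const samples = Float32Array.from(audioTensor.data); // flatten e.g. [1, 1, N] -> N samples
  const ctx = new AudioContext({ sampleRate: 24000 });
  const buffer = ctx.createBuffer(1, samples.length, 24000);
  buffer.copyToChannel(samples, 0);
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);
  source.start();
}

playAudio(audio);
```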
## Quantization Details

- Method: INT4 MatMulNBits (block_size=32)
- Embeddings: FP32 (left in full precision; not covered by the MatMulNBits pass)
- Linear layers: INT4
- Reduction: 65.9% smaller than FP32
## Original Model
- Source: microsoft/VibeVoice-Realtime-0.5B
- License: MIT
- Paper: VibeVoice
## Limitations

- Work in progress: KV cache support and autoregressive generation are still being implemented
- English only
- Requires ~1GB download for browser
- No streaming support (full sequence generation)
- Tokenizer not included (use a Qwen2 tokenizer; see the sketch below)
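One way to obtain the `tokenizedText`, `ones`, and `positions` inputs used in the usage example is transformers.js. The sketch below assumes the stock `Qwen/Qwen2-0.5B` tokenizer is a drop-in fit, which may not account for any special tokens the original VibeVoice pipeline adds:

```js
import { AutoTokenizer } from '@huggingface/transformers';

// Assumption: the stock Qwen2 tokenizer; swap in this repo's own tokenizer files if/when bundled.
const tokenizer = await AutoTokenizer.from_pretrained('Qwen/Qwen2-0.5B');
const { input_ids } = await tokenizer('Hello from VibeVoice in the browser!');

const tokenizedText = input_ids.data;   // BigInt64Array of token IDs for the 'int64' ort.Tensor
const seqLen = tokenizedText.length;
const ones = new BigInt64Array(seqLen).fill(1n);                               // attention_mask
const positions = BigInt64Array.from({ length: seqLen }, (_, i) => BigInt(i)); // position_ids
```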