# VibeVoice ONNX INT4 - Browser TTS - TESTS FAILED - need to reduce the model size or simplify the generation mode and double diffusion
An ONNX INT4-quantized version of Microsoft VibeVoice-Realtime-0.5B for browser deployment with transformers.js and ONNX Runtime Web.
## Model Components

| File | Size | Description |
|---|---|---|
| `tts_llm_int4.onnx` | 702 MB | Qwen2-based language model (text → hidden states) |
| `vocoder_int4.onnx` | 339 MB | σ-VAE decoder (latents → audio) |
| `diffusion_head_int4.onnx` | 25 MB | DDPM diffusion head (hidden states → latents) |
| `diffusion_head_int8.onnx` | 40 MB | INT8 variant for comparison |
| **Total** | **~1.07 GB** | Down from 3.1 GB FP32 |
## Architecture

```
Text Input
    ↓
[Tokenizer] → Token IDs (vocab: 151936)
    ↓
[TTS LLM] → Hidden States (896-dim)
    ↓
[Diffusion Head] → Noise (20 steps, cosine schedule)
    ↓
Acoustic Latents (64-dim)
    ↓
[Vocoder]
    ↓
Audio Waveform (24 kHz)
```
## Usage with ONNX Runtime Web

```js
import * as ort from 'onnxruntime-web';

// Load the three exported graphs
const llm = await ort.InferenceSession.create('tts_llm_int4.onnx');
const diffusion = await ort.InferenceSession.create('diffusion_head_int4.onnx');
const vocoder = await ort.InferenceSession.create('vocoder_int4.onnx');

// Run the LLM: token IDs -> 896-dim hidden states
const inputIds = new ort.Tensor('int64', tokenizedText, [1, seqLen]);
const attentionMask = new ort.Tensor('int64', ones, [1, seqLen]);
const positionIds = new ort.Tensor('int64', positions, [1, seqLen]);
const { hidden_states } = await llm.run({
  input_ids: inputIds,
  attention_mask: attentionMask,
  position_ids: positionIds
});

// Run the diffusion head (20 steps: timesteps 999, 949, ..., 49)
let latent = noise;
for (let t = 999; t >= 0; t -= 50) {
  const { v_prediction } = await diffusion.run({
    noisy_latent: latent,
    timestep: new ort.Tensor('float32', [t], [1]),
    hidden_states: hidden_states
  });
  latent = denoise(latent, v_prediction, t); // DDPM v-prediction update (see sketch below)
}

// Run the vocoder: 64-dim acoustic latents -> 24 kHz waveform
const { audio } = await vocoder.run({ latents: latent });
```
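The `denoise` helper and the initial `noise` tensor are not part of the exported graphs and must be supplied by the caller (`noise` is standard Gaussian noise with the acoustic-latent shape expected by the diffusion head). Below is a minimal sketch of a deterministic DDIM-style (η = 0) update for a v-prediction model under the standard cosine ᾱ schedule; the schedule constants (1000 training steps, s = 0.008) are assumptions, so verify them against the original pipeline before relying on the output.

```js
const NUM_TRAIN_STEPS = 1000; // assumed length of the training noise schedule
const STEP = 50;              // loop stride above: 1000 train steps / 20 inference steps

// Cosine schedule (Nichol & Dhariwal): cumulative alpha_bar(t)
function alphaBar(t) {
  const f = (x) => Math.cos(((x / NUM_TRAIN_STEPS + 0.008) / 1.008) * (Math.PI / 2)) ** 2;
  return f(t) / f(0);
}

// One deterministic (DDIM, eta = 0) update for a v-prediction model.
// Takes and returns ort.Tensor objects so it drops into the loop above.
function denoise(latentTensor, vTensor, t) {
  const tPrev = t - STEP;                        // next (less noisy) timestep; < 0 on the final step
  const ab = alphaBar(t);
  const abPrev = tPrev >= 0 ? alphaBar(tPrev) : 1.0;
  const x = latentTensor.data;
  const v = vTensor.data;
  const out = new Float32Array(x.length);
  for (let i = 0; i < x.length; i++) {
    // Recover x0 and eps from the v-parameterization: v = sqrt(ab)*eps - sqrt(1-ab)*x0
    const x0 = Math.sqrt(ab) * x[i] - Math.sqrt(1 - ab) * v[i];
    const eps = Math.sqrt(1 - ab) * x[i] + Math.sqrt(ab) * v[i];
    out[i] = Math.sqrt(abPrev) * x0 + Math.sqrt(1 - abPrev) * eps;
  }
  return new ort.Tensor('float32', out, latentTensor.dims);
}
```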
## Configuration

```json
{
  "audio": {
    "sample_rate": 24000,
    "vae_dim": 64
  },
  "llm": {
    "hidden_size": 896,
    "num_hidden_layers": 20,
    "vocab_size": 151936
  },
  "diffusion": {
    "num_inference_steps": 20,
    "beta_schedule": "cosine",
    "prediction_type": "v_prediction"
  }
}
```
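To hear the result in the browser, the vocoder output can be copied into a Web Audio `AudioBuffer` at the configured 24 kHz sample rate. A small sketch, assuming `audio` is the mono float32 tensor returned by the vocoder above (its exact output shape is an assumption):

```js
// Play the generated waveform (mono, 24 kHz per the config above)
function playAudio(audioTensor) {
  const samples = Float32Array.from(audioTensor.data); // flatten e.g. [1, 1, N] -> N samples
  const ctx = new AudioContext({ sampleRate: 24000 });
  const buffer = ctx.createBuffer(1, samples.length, 24000);
  buffer.copyToChannel(samples, 0);
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);
  source.start();
}

playAudio(audio);
```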
## Quantization Details

- Method: INT4 MatMulNBits (block_size=32)
- Embeddings: FP32 (left in full precision; not covered by the MatMulNBits pass)
- Linear layers: INT4
- Reduction: 65.9% smaller than FP32
## Original Model
- Source: microsoft/VibeVoice-Realtime-0.5B
- License: MIT
- Paper: VibeVoice
## Limitations

- Work in progress: KV cache support and autoregressive generation are still being implemented
- English only
- Requires ~1GB download for browser
- No streaming support (full sequence generation)
- Tokenizer not included (use a Qwen2 tokenizer; see the sketch below)
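One way to obtain the `tokenizedText`, `ones`, and `positions` inputs used in the usage example is transformers.js. The sketch below assumes the stock `Qwen/Qwen2-0.5B` tokenizer is a drop-in fit, which may not account for any special tokens the original VibeVoice pipeline adds:

```js
import { AutoTokenizer } from '@huggingface/transformers';

// Assumption: the stock Qwen2 tokenizer; swap in this repo's own tokenizer files if/when bundled.
const tokenizer = await AutoTokenizer.from_pretrained('Qwen/Qwen2-0.5B');
const { input_ids } = await tokenizer('Hello from VibeVoice in the browser!');

const tokenizedText = input_ids.data;   // BigInt64Array of token IDs for the 'int64' ort.Tensor
const seqLen = tokenizedText.length;
const ones = new BigInt64Array(seqLen).fill(1n);                               // attention_mask
const positions = BigInt64Array.from({ length: seqLen }, (_, i) => BigInt(i)); // position_ids
```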