VibeVoice ONNX INT4 - Browser TTS - TESTS FAILED - need to reduce the model size or simplify the generation mode and double diffusion

An INT4-quantized ONNX export of Microsoft VibeVoice-Realtime-0.5B for browser deployment with transformers.js.

Model Components

File                       Size      Description
tts_llm_int4.onnx          702 MB    Qwen2-based language model (text → hidden states)
vocoder_int4.onnx          339 MB    σ-VAE decoder (latents → audio)
diffusion_head_int4.onnx   25 MB     DDPM diffusion head (hidden states → latents)
diffusion_head_int8.onnx   40 MB     INT8 variant for comparison
Total                      ~1.07 GB  Down from 3.1 GB FP32

Architecture

Text Input
    ↓
[Tokenizer] → Token IDs (vocab: 151936)
    ↓
[TTS LLM] → Hidden States (896-dim)
    ↓
[Diffusion Head] ← Noise (20 steps, cosine schedule)
    ↓
Acoustic Latents (64-dim)
    ↓
[Vocoder]
    ↓
Audio Waveform (24 kHz)

Usage with ONNX Runtime Web

import * as ort from 'onnxruntime-web';

// Load models
const llm = await ort.InferenceSession.create('tts_llm_int4.onnx');
const diffusion = await ort.InferenceSession.create('diffusion_head_int4.onnx');
const vocoder = await ort.InferenceSession.create('vocoder_int4.onnx');
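
// Session options are supported if you need to pick a backend. Assumption:
// the default wasm execution provider is used here; webgpu availability
// depends on the onnxruntime-web build and browser, e.g.:
//   await ort.InferenceSession.create('tts_llm_int4.onnx',
//     { executionProviders: ['webgpu', 'wasm'] });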

// Run LLM
const inputIds = new ort.Tensor('int64', tokenizedText, [1, seqLen]);
const attentionMask = new ort.Tensor('int64', ones, [1, seqLen]);
const positionIds = new ort.Tensor('int64', positions, [1, seqLen]);
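
// Note: 'int64' tensors require BigInt64Array data in onnxruntime-web, e.g.
//   const ones = new BigInt64Array(seqLen).fill(1n);
//   const positions = BigInt64Array.from({ length: seqLen }, (_, i) => BigInt(i));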

const { hidden_states } = await llm.run({
  input_ids: inputIds,
  attention_mask: attentionMask,
  position_ids: positionIds
});

// Run diffusion (20 steps)
let latent = noise;
for (let t = 999; t >= 0; t -= 50) {
  const { v_prediction } = await diffusion.run({
    noisy_latent: latent,
    timestep: new ort.Tensor('float32', [t], [1]),
    hidden_states: hidden_states
  });
  latent = denoise(latent, v_prediction, t);
}

// Run vocoder
const { audio } = await vocoder.run({ latents: latent });
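
The denoise helper is not part of the exported graphs and must be implemented host-side. Below is a minimal sketch of a deterministic DDIM-style update from a v-prediction, assuming the cosine ᾱ schedule named in the config and 1000 training timesteps; the exact scheduler constants of the original model may differ.

function alphaBar(t, T = 1000, s = 0.008) {
  // cosine schedule: ᾱ(t) = cos²(((t/T + s) / (1 + s)) · π/2)
  const f = Math.cos(((t / T + s) / (1 + s)) * Math.PI / 2);
  return f * f;
}

function denoise(latent, v_prediction, t, stride = 50) {
  const ab = alphaBar(t);
  const abPrev = alphaBar(Math.max(t - stride, 0));
  const out = new Float32Array(latent.data.length);
  for (let i = 0; i < out.length; i++) {
    const xt = latent.data[i];
    const v = v_prediction.data[i];
    const x0 = Math.sqrt(ab) * xt - Math.sqrt(1 - ab) * v;   // predicted clean latent
    const eps = Math.sqrt(1 - ab) * xt + Math.sqrt(ab) * v;  // predicted noise
    out[i] = Math.sqrt(abPrev) * x0 + Math.sqrt(1 - abPrev) * eps; // DDIM step to t - stride
  }
  return new ort.Tensor('float32', out, latent.dims);
}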

Configuration

{
  "audio": {
    "sample_rate": 24000,
    "vae_dim": 64
  },
  "llm": {
    "hidden_size": 896,
    "num_hidden_layers": 20,
    "vocab_size": 151936
  },
  "diffusion": {
    "num_inference_steps": 20,
    "beta_schedule": "cosine",
    "prediction_type": "v_prediction"
  }
}
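
For reference, the 20 inference steps stride across the training schedule. Assuming 1000 training timesteps (a common DDPM default that is not spelled out in the config), the schedule reduces to the t -= 50 loop shown above:

const numTrainTimesteps = 1000;  // assumption: standard DDPM training length
const numInferenceSteps = 20;    // config: diffusion.num_inference_steps
const stride = Math.floor(numTrainTimesteps / numInferenceSteps); // 50
const timesteps = Array.from(
  { length: numInferenceSteps },
  (_, i) => numTrainTimesteps - 1 - i * stride
); // [999, 949, ..., 49]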

Quantization Details

  • Method: INT4 MatMulNBits (block_size=32; see the dequantization sketch after this list)
  • Embeddings: kept in FP32 (MatMulNBits only targets MatMul/linear weights)
  • Linear layers: INT4
  • Size reduction: 65.9% smaller than FP32
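
For intuition, MatMulNBits packs two 4-bit weight codes per byte and stores one scale per 32-element block; dequantization amounts to the sketch below (a hypothetical helper, not the actual onnxruntime kernel; the default zero point of 8 is an assumption):

function dequantBlock(packed, scale, zeroPoint = 8) {
  // packed: Uint8Array of 16 bytes = 32 four-bit codes (low nibble first)
  const out = new Float32Array(32);
  for (let i = 0; i < 16; i++) {
    out[2 * i]     = scale * ((packed[i] & 0x0f) - zeroPoint);
    out[2 * i + 1] = scale * ((packed[i] >> 4) - zeroPoint);
  }
  return out;
}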

Original Model

Microsoft VibeVoice-Realtime-0.5B, built on the Qwen/Qwen2.5-0.5B LLM backbone.

Limitations

  • Work in progress: KV-cache support and autoregressive generation are not implemented yet
  • English only
  • Requires a ~1 GB download in the browser
  • No streaming support (full-sequence generation)
  • Tokenizer not included (use a Qwen2 tokenizer; see the sketch below)
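
Since no tokenizer ships with this repo, one option is to load the Qwen2 tokenizer through transformers.js. The sketch below assumes the Qwen/Qwen2.5-0.5B repo id and the v3 package name, both worth verifying:

import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B');
const { input_ids } = await tokenizer('Hello from VibeVoice!');
// input_ids.data is a BigInt64Array, matching the int64 tensors above
const inputIds = new ort.Tensor('int64', input_ids.data, input_ids.dims);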