VoxMorph: Scalable Zero-Shot Voice Identity Morphing via Disentangled Embeddings
VoxMorph is a zero-shot framework that produces high-fidelity voice morphs from as little as five seconds of audio per subject, with no model retraining. The method disentangles vocal traits into separate prosody and timbre embeddings, which allows speaking style and identity to be interpolated independently and at fine granularity. The subjects' embeddings are fused via Spherical Linear Interpolation (Slerp), and the morphed speech is then synthesized by an autoregressive language model coupled with a Conditional Flow Matching network.
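To make the fusion step concrete, here is a minimal sketch of Slerp between two embedding vectors, written in plain Python. It is not the VoxMorph implementation: the function names `slerp` and `morph_embeddings`, the dictionary keys, the unit-normalization of the inputs, and the use of separate weights for prosody and timbre are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def slerp(a: Sequence[float], b: Sequence[float], t: float, eps: float = 1e-8) -> List[float]:
    """Spherical linear interpolation between two vectors, t in [0, 1].

    Assumption: inputs are normalized to unit length first, so the
    result lies on the unit hypersphere.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a < eps or norm_b < eps:
        raise ValueError("cannot interpolate a zero-length embedding")
    ua = [x / norm_a for x in a]
    ub = [x / norm_b for x in b]

    # Angle between the two unit vectors (dot product clamped for safety).
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(ua, ub))))
    omega = math.acos(dot)
    sin_omega = math.sin(omega)

    if sin_omega < eps:
        # Nearly parallel or antiparallel: Slerp is ill-conditioned,
        # so fall back to plain linear interpolation.
        return [(1.0 - t) * x + t * y for x, y in zip(ua, ub)]

    wa = math.sin((1.0 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(ua, ub)]


def morph_embeddings(
    subject_a: Dict[str, Sequence[float]],
    subject_b: Dict[str, Sequence[float]],
    t_prosody: float = 0.5,
    t_timbre: float = 0.5,
) -> Dict[str, List[float]]:
    """Hypothetical helper: interpolate prosody and timbre independently."""
    return {
        "prosody": slerp(subject_a["prosody"], subject_b["prosody"], t_prosody),
        "timbre": slerp(subject_a["timbre"], subject_b["timbre"], t_timbre),
    }
```

With `t = 0` the result equals the first subject's (normalized) embedding and with `t = 1` the second's; intermediate values move along the great-circle arc between them, which keeps the vector norm constant where a plain linear blend would shrink it.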
This repository hosts the official model checkpoints for the paper "VoxMorph: Scalable Zero-Shot Voice Identity Morphing via Disentangled Embeddings" (ICASSP 2026). It contains the checkpoint files (s3gen.pt and t3_cfg.pt) for VoxMorph, a zero-shot TTS framework built on top of Resemble AI's frozen Chatterbox-TTS backbone.
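The checkpoint files can likely be fetched with the `huggingface_hub` client. The sketch below assumes that the repository id is `BharathK333/VoxMorph-Models`, that both files sit at the repository root, and that `huggingface_hub` is installed; the helper name `download_checkpoints` is hypothetical. How the files are then loaded into the Chatterbox-TTS backbone is not covered here.

```python
from pathlib import Path
from typing import Dict, Sequence

# Assumption: repository id of this model card on the Hugging Face Hub.
REPO_ID = "BharathK333/VoxMorph-Models"
# Checkpoint file names as listed in the model card.
CHECKPOINT_FILES = ("s3gen.pt", "t3_cfg.pt")


def download_checkpoints(
    repo_id: str = REPO_ID,
    filenames: Sequence[str] = CHECKPOINT_FILES,
) -> Dict[str, Path]:
    """Download each checkpoint into the local Hub cache and return its path.

    Assumes the files are stored at the repository root.
    """
    # Imported lazily so the module can be imported without the dependency.
    from huggingface_hub import hf_hub_download

    return {
        name: Path(hf_hub_download(repo_id=repo_id, filename=name))
        for name in filenames
    }
```

Calling `download_checkpoints()` requires network access and returns a mapping from file name to the cached local path.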
Citation
If you find this work useful in your research, please consider citing the ICASSP 2026 paper or its arXiv preprint:
@INPROCEEDINGS{11462383,
author={Krishnamurthy, Bharath and Rattani, Ajita},
booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={VoxMorph: Scalable Zero-Shot Voice Identity Morphing via Disentangled Embeddings},
year={2026},
volume={},
number={},
pages={13332-13336},
keywords={Filtering;Filters;Deepfakes;Vocoders;Videos;Protocols;HTTP;Wide area networks;Communication equipment;Communication systems;Voice morphing;text-to-speech;zero-shot learning;speaker embedding;interpolation;speech synthesis},
doi={10.1109/ICASSP55912.2026.11462383}
}
@article{krishnamurthy2026voxmorph_arxiv,
title={VoxMorph: Scalable Zero-Shot Voice Identity Morphing via Disentangled Embeddings},
author={Krishnamurthy, Bharath and Rattani, Ajita},
journal={arXiv preprint arXiv:2601.20883},
year={2026}
}
Model tree for BharathK333/VoxMorph-Models: base model ResembleAI/chatterbox