Question about MiniMax Speech Report - Speaker Encoder Training
Hi MiniMax team! Thank you so much for the excellent MiniMax speech report. While I understand you're working with a closed-source model, I was wondering if you'd be willing to discuss some technical details from your paper.
I'm particularly curious about your speaker encoder architecture:
- What type of model architecture are you using for the speaker encoder?
- If it's a traditional SV (Speaker Verification) model, do you use the complete reference audio during end-to-end training? I'm asking because this could affect pooling operations and padding handling (a toy example of what I mean is sketched after this list).
- Or are you using a learnable query-based model?
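To clarify the pooling/padding concern, here's a toy example of the kind of SV-style encoder I have in mind, where padded frames have to be excluded from the pooling step. All names, shapes, and hyperparameters are my own illustration, not assumptions about your implementation:

```python
# Toy SV-style speaker encoder: illustrates why padding handling matters when
# pooling variable-length reference audio into a fixed-size speaker embedding.
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class ToySpeakerEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, mels: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, n_mels), zero-padded to the longest clip in the batch
        # lengths: (batch,) true frame counts
        pad_mask = torch.arange(mels.size(1))[None, :] >= lengths[:, None]  # True = padding
        h = self.encoder(self.proj(mels), src_key_padding_mask=pad_mask)
        # Masked mean pooling: padded frames are zeroed and excluded from the average,
        # otherwise the embedding would depend on how much padding each clip received.
        h = h.masked_fill(pad_mask.unsqueeze(-1), 0.0)
        return h.sum(dim=1) / lengths.clamp(min=1).unsqueeze(-1)


enc = ToySpeakerEncoder()
mels = torch.randn(2, 300, 80)             # two clips, padded to 300 frames
emb = enc(mels, torch.tensor([300, 180]))  # -> (2, 256) fixed-size embeddings
```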
Regarding information leakage: You mentioned that reference audio needs to be different from target audio to prevent information leakage. However, I've observed some interesting patterns:
- Models with global pooling don't seem to suffer much from information leakage.
- Perceiver sampler architectures (with stronger representation capabilities) do show information leakage, but, surprisingly, learning the speaker conditioning directly from reference audio doesn't perform as well as using a pretrained speaker verification model (a sketch of what I mean by a Perceiver-style sampler follows this list).
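For reference, this is roughly what I mean by a learnable query / Perceiver-style sampler: a fixed set of learned queries cross-attends to the reference frames, so the conditioning has a fixed size but carries more detail than a single pooled vector. Again, purely an illustration, not your architecture:

```python
# What I mean by a learnable query / Perceiver-style sampler: a fixed number of
# learned queries cross-attend to the reference frames, giving a fixed-size but
# richer conditioning than one pooled vector. Illustrative assumptions only.
import torch
import torch.nn as nn


class LearnableQuerySampler(nn.Module):
    def __init__(self, d_model: int = 256, num_queries: int = 32, nhead: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, ref_feats: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # ref_feats: (batch, frames, d_model); pad_mask: (batch, frames), True = padding
        q = self.queries.unsqueeze(0).expand(ref_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, ref_feats, ref_feats, key_padding_mask=pad_mask)
        # (batch, num_queries, d_model): more capacity than a single embedding, which is
        # also why this kind of conditioning can carry (and leak) more reference detail.
        return out


sampler = LearnableQuerySampler()
feats = torch.randn(2, 300, 256)
mask = torch.zeros(2, 300, dtype=torch.bool)  # no padding in this toy batch
cond = sampler(feats, mask)                   # -> (2, 32, 256)
```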
Would you be open to discussing these technical details? I'd really appreciate any insights you could share!
Thanks again for the great work!
We are using a module based on a Transformer encoder rather than an SV model. This module is initialized from scratch and jointly trained with the AR Transformer. While global pooling can help reduce information leakage, it may not completely eliminate the issue. We have not specifically compared the performance of Perceiver sampler architectures. It is possible that these methods encode more information, which could lead to increased information leakage and result in worse performance than a pretrained SV model.
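A minimal sketch of this kind of setup, i.e. a from-scratch Transformer speaker module trained end-to-end with an AR Transformer. The mean pooling below is only a placeholder (the compression step is not specified above), and the names, sizes, and wiring are illustrative assumptions rather than the actual MiniMax-Speech implementation:

```python
# Minimal sketch of a from-scratch speaker module trained jointly with an AR
# Transformer, assuming the speaker embedding is injected as a conditioning prefix.
# The mean pooling is only a placeholder (the actual compression step is not
# specified above); names, sizes, and wiring are illustrative assumptions.
import torch
import torch.nn as nn


class JointTrainingSketch(nn.Module):
    def __init__(self, vocab_size: int = 1024, d_model: int = 256):
        super().__init__()
        spk_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.speaker_encoder = nn.TransformerEncoder(spk_layer, num_layers=2)  # trained from scratch
        ar_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.ar_backbone = nn.TransformerEncoder(ar_layer, num_layers=4)       # decoder-only AR stand-in
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ref_feats: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
        # ref_feats: (batch, frames, d_model) reference-audio features
        # target_tokens: (batch, seq) discrete target tokens, teacher-forced
        spk = self.speaker_encoder(ref_feats).mean(dim=1, keepdim=True)  # placeholder pooling
        x = torch.cat([spk, self.token_emb(target_tokens)], dim=1)       # speaker prefix + tokens
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        h = self.ar_backbone(x, mask=causal)
        return self.head(h[:, :-1])  # position i predicts target token i


model = JointTrainingSketch()
tokens = torch.randint(0, 1024, (2, 50))
logits = model(torch.randn(2, 200, 256), tokens)  # (2, 50, 1024)
loss = nn.functional.cross_entropy(logits.reshape(-1, 1024), tokens.reshape(-1))
loss.backward()  # one loss updates the speaker encoder and the AR backbone together
```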
Thank you for your detailed explanation. I have two follow-up questions to clarify my understanding:
Since you're using a Transformer encoder initialized from scratch, what specific method do you employ to compress the variable-length encoded features into fixed-dimensional representations?
Regarding your training setup, are you using:
- Audio from the same speaker but different utterances as reference audio for training?
- Or are you splitting a single audio file into segments (e.g., front and back portions) and using them as reference-target pairs? (Both options are sketched below.)
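To make the second question concrete, here is a rough sketch of the two pairing strategies I have in mind. These are hypothetical helpers for illustration only, not your actual data pipeline:

```python
# Rough sketch of the two reference/target pairing strategies I'm asking about.
# Hypothetical helpers for illustration, not the actual data pipeline.
import random


def cross_utterance_pair(utts_by_speaker: dict, speaker: str) -> tuple:
    """Option 1: reference and target are different utterances from the same speaker."""
    ref, tgt = random.sample(utts_by_speaker[speaker], 2)
    return ref, tgt


def split_utterance_pair(frames: list, ref_fraction: float = 0.4) -> tuple:
    """Option 2: one utterance is cut into a reference segment and a target segment."""
    cut = int(len(frames) * ref_fraction)
    return frames[:cut], frames[cut:]  # e.g. front 40% as reference, rest as target


ref, tgt = cross_utterance_pair({"spk1": ["a.wav", "b.wav", "c.wav"]}, "spk1")
front, back = split_utterance_pair(list(range(500)))  # 500-frame utterance split 200/300
```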
Understanding these implementation details would completely resolve my questions about your approach. Thank you for taking the time to clarify these points!