Peter Szemraj PRO
Much needed and very cool work!
Btw, one related idea I've had in the back of my mind: there's a class of synthetic audio scenarios that don't really occur naturally but are both tricky and relevant to real deployments.
Most ASR models lately seem geared toward clean meeting transcription. But if you run a Granola-style setup that pulls every audio stream on your machine and mixes them together, things get messy fast. The system audio from the meeting, your physical hardware mic, and sometimes your own voice echoing back through the meeting can all end up mixed, delayed, and noisy in the same track the model sees. (I use VibeVoice for this myself since I figured it would be more robust; it's held up okay, but I haven't done a real comparison.)
Mixed multi-stream audio like that feels like a natural fit for the kind of robustness/"real world scenario" this benchmark is measuring, even though it's a synthetic condition rather than a recorded room.
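To make the idea concrete, here's a rough sketch of how such a mixed track could be synthesized from two clean mono clips. Everything in it is an assumption for illustration: the function name `mix_streams`, the 16 kHz sample rate, and the default delay, echo gain, and noise level are all made up, and a single delayed copy of the mic signal is only a crude stand-in for real echo through a meeting client.

```python
import numpy as np


def mix_streams(system_audio, mic_audio, sample_rate=16000,
                echo_delay_s=0.25, echo_gain=0.3, noise_std=0.005, seed=0):
    """Mix system audio, mic audio, and a delayed mic echo into one mono track.

    All parameter defaults are illustrative assumptions, not measured values.
    Inputs are 1-D float arrays, nominally in [-1, 1].
    """
    rng = np.random.default_rng(seed)
    system_audio = np.asarray(system_audio, dtype=np.float32)
    mic_audio = np.asarray(mic_audio, dtype=np.float32)

    # The echo trails the mic by `delay` samples, so leave room for its tail.
    delay = int(round(echo_delay_s * sample_rate))
    total = max(len(system_audio), len(mic_audio)) + delay

    def pad(x):
        out = np.zeros(total, dtype=np.float32)
        out[:len(x)] = x
        return out

    # Hypothetical echo path: one attenuated, delayed copy of the mic signal.
    echo = np.zeros(total, dtype=np.float32)
    echo[delay:delay + len(mic_audio)] = echo_gain * mic_audio

    # Additive Gaussian noise as a stand-in for hardware/room noise.
    noise = rng.normal(0.0, noise_std, size=total).astype(np.float32)

    mixed = pad(system_audio) + pad(mic_audio) + echo + noise

    # Rescale only if the sum would clip.
    peak = float(np.max(np.abs(mixed)))
    if peak > 1.0:
        mixed = mixed / peak
    return mixed
```

Sweeping `echo_delay_s`, `echo_gain`, and `noise_std` over a grid would give a family of conditions of increasing difficulty from the same clean source clips.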
Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World

Which tokens does a hybrid model predict better?
Cool work, I've been quite excited about AllenAI's take/improved hybrid arch. Question for you though:
The one genuinely matched-data comparison in the paper is the 1B ladder (transformer / hybrid / pure-RNN, identical mix), which you use for the 6 filtered-loss eval, but only as aggregate loss, not the POS/bracket/copy decomposition. Since that's forward-passes-only on released checkpoints, have you run (or can you run) the same tag-stratified analysis on those models? It'd help show whether the content-word / open-close / copy structure survives when data is actually held constant (vs. the ~7B case).
Curious if you've looked at this internally as well.
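For what it's worth, the stratification step itself is cheap once per-token losses exist. A minimal sketch, assuming you already have per-token losses from two checkpoints on the same token sequence plus one tag per token; the helper names `loss_by_tag` and `loss_gap_by_tag` and the tag labels are hypothetical, and this is not meant to reproduce the paper's actual pipeline:

```python
from collections import defaultdict


def loss_by_tag(token_losses, token_tags):
    """Mean loss per tag, given one loss and one tag per token."""
    if len(token_losses) != len(token_tags):
        raise ValueError("token_losses and token_tags must have the same length")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for loss, tag in zip(token_losses, token_tags):
        sums[tag] += loss
        counts[tag] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}


def loss_gap_by_tag(losses_a, losses_b, token_tags):
    """Per-tag mean loss of model A minus model B on the same tokens.

    Positive values mean model B predicts that tag class better.
    """
    mean_a = loss_by_tag(losses_a, token_tags)
    mean_b = loss_by_tag(losses_b, token_tags)
    return {tag: mean_a[tag] - mean_b[tag] for tag in mean_a}


# Toy numbers, purely illustrative (not results from any model).
tags = ["NOUN", "DET", "NOUN", "PUNCT"]
transformer_losses = [2.0, 0.5, 3.0, 0.1]
hybrid_losses = [1.5, 0.5, 2.0, 0.3]
gaps = loss_gap_by_tag(transformer_losses, hybrid_losses, tags)
```

Running the same function over the three 1B checkpoints would show whether the per-tag gaps keep their sign when the data mix is identical.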