Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context Paper • 2606.26493 • Published 9 days ago • 3
Nemotron-Labs-TwoTower Collection Diffusion Language Modeling with Pretrained Autoregressive Nemotron 3 Models • 1 item • Updated 3 days ago • 5
Rethinking the Role of Efficient Attention in Hybrid Architectures Paper • 2606.15378 • Published 21 days ago • 18
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence Paper • 2605.26494 • Published May 26 • 41
Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention Paper • 2605.22791 • Published May 21 • 33
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation Paper • 2604.10098 • Published Apr 11 • 82
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models Paper • 2512.08829 • Published Dec 9, 2025 • 23
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models Paper • 2512.02556 • Published Dec 2, 2025 • 270
Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds Paper • 2511.08892 • Published Nov 12, 2025 • 218
Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer Paper • 2510.25976 • Published Oct 29, 2025 • 16
view article Article Why Did MiniMax M2 End Up as a Full Attention Model? MiniMax-AI • Oct 30, 2025 • 80
Less is More: Recursive Reasoning with Tiny Networks Paper • 2510.04871 • Published Oct 6, 2025 • 517
view article Article There is no such thing as a tokenizer-free lunch catherinearnett • Sep 25, 2025 • 101
RWKV-7 "Goose" with Expressive Dynamic State Evolution Paper • 2503.14456 • Published Mar 18, 2025 • 153