Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning Paper • 2512.19687 • Published 3 days ago • 1
ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement Paper • 2512.13303 • Published 11 days ago • 16
Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance Paper • 2510.24711 • Published Oct 28 • 19
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer Paper • 2509.24695 • Published Sep 29 • 44
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization Paper • 2411.10442 • Published Nov 15, 2024 • 87
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers Paper • 2503.11579 • Published Mar 14 • 21
Qwen2.5 Collection Qwen2.5 language models, pretrained and instruction-tuned, in 7 sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. • 46 items • Updated Jul 21 • 669
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper • 2412.05271 • Published Dec 6, 2024 • 159
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions Paper • 2402.17485 • Published Feb 27, 2024 • 194