VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
Abstract
VGGT-Det enables sensor-geometry-free multi-view indoor 3D object detection by integrating a Visual Geometry Grounded Transformer encoder with attention-guided query generation and query-driven feature aggregation to effectively utilize semantic and geometric priors.
Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where there are no sensor-provided geometric inputs (multi-view poses or depth). Recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; (ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to 'see' what they need, and then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. Ablation study shows that VGGT's internally learned semantic and geometric priors can be effectively leveraged by our AG and QD.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Learning Geometrically-Grounded 3D Visual Representations for View-Generalizable Robotic Manipulation (2026)
- UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception (2026)
- Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning (2026)
- MambaFusion: Adaptive State-Space Fusion for Multimodal 3D Object Detection (2026)
- Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction (2026)
- HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps (2026)
- UniGeo: A Unified 3D Indoor Object Detection Framework Integrating Geometry-Aware Learning and Dynamic Channel Gating (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
arXivLens breakdown of this paper ๐ https://arxivlens.com/PaperView/Details/vggt-det-mining-vggt-internal-priors-for-sensor-geometry-free-multi-view-indoor-3d-object-detection-4464-76a61d76
- Executive Summary
- Detailed Breakdown
- Practical Applications
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper