Learning Image Representations by Completing Damaged Jigsaw Puzzles Paper • 1802.01880 • Published Feb 6, 2018
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles Paper • 1811.09795 • Published Nov 24, 2018
Align-and-Attend Network for Globally and Locally Coherent Video Inpainting Paper • 1905.13066 • Published May 30, 2019
Contrastive Feature Masking Open-Vocabulary Vision Transformer Paper • 2309.00775 • Published Sep 2, 2023 • 10
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models Paper • 2504.03970 • Published Apr 4, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities Paper • 2507.06261 • Published Jul 7, 2025 • 64
Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment Paper • 2508.02762 • Published Aug 3, 2025
EmbeddingGemma: Powerful and Lightweight Text Representations Paper • 2509.20354 • Published Sep 24, 2025 • 41
Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications Paper • 2509.19087 • Published Sep 23, 2025 • 1
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities Paper • 2311.05698 • Published Nov 9, 2023 • 13
Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection Paper • 2310.00161 • Published Sep 29, 2023 • 1
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers Paper • 2305.07011 • Published May 11, 2023 • 5