Seeing Fast and Slow: Learning the Flow of Time in Videos
Abstract
Video speed manipulation and perception models are developed through self-supervised temporal reasoning, enabling speed detection, slow-motion video generation, and temporal super-resolution from in-the-wild sources.
How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed with high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at a specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal detail. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics, and potentially richer world models that understand how events unfold over time.
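To make the self-supervised objective described above concrete, below is a minimal sketch of one way to train a playback-speed classifier from unlabeled video: clips are temporally subsampled at random strides, and the model predicts which stride was used. The speed set, clip length, and the small 3D-conv encoder are illustrative assumptions for the sketch, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SPEEDS = [1, 2, 4, 8]   # candidate playback-speed factors (assumed, not from the paper)
CLIP_LEN = 16           # frames per training clip (assumed)

def sample_speed_clip(video: torch.Tensor):
    """Subsample a raw video tensor (T, C, H, W) at a random stride.

    Subsampling frames by a factor of s is equivalent to playing the clip
    s times faster, so the stride index serves as a free speed label."""
    label = int(torch.randint(len(SPEEDS), (1,)))
    stride = SPEEDS[label]
    max_start = video.shape[0] - stride * CLIP_LEN
    start = int(torch.randint(max(max_start, 1), (1,)))
    clip = video[start : start + stride * CLIP_LEN : stride]
    return clip, label

class SpeedClassifier(nn.Module):
    """Tiny 3D-conv encoder with a linear head over the speed classes."""
    def __init__(self, num_classes: int = len(SPEEDS)):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, C, T, H, W)
        return self.head(self.encoder(clips).flatten(1))

def training_step(model: SpeedClassifier, raw_videos: list) -> torch.Tensor:
    """raw_videos: list of (T, C, H, W) tensors, each long enough for the largest stride."""
    clips, labels = zip(*(sample_speed_clip(v) for v in raw_videos))
    batch = torch.stack(clips).permute(0, 2, 1, 3, 4)  # -> (B, C, T, H, W)
    logits = model(batch)
    return F.cross_entropy(logits, torch.tensor(labels))
```

The same resampling trick extends to slowed-down footage by frame duplication or interpolation, and the classification head can be swapped for a regression head if a continuous speed estimate is preferred.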
Community
The audio-visual self-supervision for speed-change detection is neat; it exploits a real cross-modal cue that doesn't need labels. But I keep circling back to edge cases where the audio is absent or misleading, like muted clips or a mismatched soundtrack. Did you run ablations on silent videos or on corrupted audio to see how much the speed signal leans on audio versus motion cues? The arxivlens breakdown helped me parse the method details, especially the claimed equivariance under temporal rescaling; it would be nice to see a quantitative split of audio-driven vs. visual-driven signals. A fully visual self-supervised route would strengthen the claim that time is a learnable perceptual dimension, maybe via a motion-consistency or frame-interval prediction objective.
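As a rough illustration of the ablation the comment asks about, the sketch below evaluates a trained audio-visual speed model with the soundtrack intact, silenced, and shuffled across the batch, to estimate how much of the speed signal comes from audio versus motion. The `model(frames, audio) -> logits` interface and the loader format are assumptions for illustration, not the paper's actual API.

```python
import torch

@torch.no_grad()
def audio_ablation_accuracy(model, loader, device="cpu"):
    """loader yields (frames, audio, speed_label) batches; returns accuracy per audio condition."""
    model.eval()
    hits = {"full": 0, "silent": 0, "shuffled": 0}
    total = 0
    for frames, audio, labels in loader:
        frames, audio, labels = frames.to(device), audio.to(device), labels.to(device)
        variants = {
            "full": audio,                                        # original soundtrack
            "silent": torch.zeros_like(audio),                    # muted clip
            "shuffled": audio[torch.randperm(audio.shape[0])],    # mismatched soundtrack
        }
        for name, a in variants.items():
            preds = model(frames, a).argmax(dim=-1)
            hits[name] += (preds == labels).sum().item()
        total += labels.numel()
    return {name: h / total for name, h in hits.items()}
```

A large gap between the "full" and "silent" conditions would suggest heavy reliance on audio cues; near-identical accuracy would support the claim that the visual stream alone carries the speed signal.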
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- The Pulse of Motion: Measuring Physical Frame Rate from Visual Dynamics (2026)
- DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data (2026)
- Learning Long-term Motion Embeddings for Efficient Kinematics Generation (2026)
- FC-VFI: Faithful and Consistent Video Frame Interpolation for High-FPS Slow Motion Video Generation (2026)
- Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos (2026)
- TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation (2026)
- SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation (2026)