Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation
Paper
โข
2505.18842
โข
Published
โข
36
Jiwan Chung*โ Junhyeok Kim*โ Siyeol Kimโ Jaeyoung Leeโ Minsoo Kimโ Youngjae Yu
conda create -n v1 python=3.10 -y
conda activate v1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
Highly Recommended as the copy tokens are displayed on image.
python run_gradio.py
python inference.py
The script uses a default image URL and text prompt. To use your own inputs, you can modify the image variable within the messages list and the text field for the user prompt.
If you find our work valuable, please cite:
@misc{chung2025dontlookoncemultimodal,
title={Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation},
author={Jiwan Chung and Junhyeok Kim and Siyeol Kim and Jaeyoung Lee and Min Soo Kim and Youngjae Yu},
year={2025},
eprint={2505.18842},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.18842},
}