Comparison with prior work on high-resolution visual reasoning. The yellow box marks the target object, which becomes indiscernible after input-image downsampling. (a) DeepEyes (Zheng et al., 2025) succeeds when the object remains discernible, but at the cost of a large number of vision tokens. (b) DeepEyes (Zheng et al., 2025) fails when the object is indiscernible at low resolution, where fewer vision tokens are available. (c) Our ERGO performs reasoning-driven perception, correctly answering the question even on low-resolution images.
Existing Large Vision-Language Models (LVLMs) incur substantial computational overhead when processing high-resolution images due to the massive number of vision tokens. We introduce a two-stage “coarse-to-fine” reasoning pipeline: a downsampled image is first analyzed to identify task-relevant regions, then only those regions are cropped at full resolution for subsequent reasoning.
A key challenge is that prior methods rely on perception-driven reasoning—they fail to locate relevant regions once fine-grained visual cues are lost through downsampling. ERGO (Efficient Reasoning & Guided Observation) instead performs reasoning-driven perception, leveraging multimodal context to determine where to focus even when target objects become visually indiscernible. We develop simple yet effective reward components in a reinforcement learning framework that encourage both accurate region selection and efficient vision-token usage, achieving higher accuracy with significantly fewer tokens across multiple benchmarks.
Overview of RL-based training pipeline. The red background highlights the components of the proposed TCE reward. The green background highlights the conventional rewards adopted by most reasoning LVLMs.
ERGO’s training objective is explicitly aligned with vision-processing efficiency in a reinforcement learning (RL) framework. Given an image and a text query, the pipeline operates in two stages: (1) the policy model predicts bounding-box coordinates for the task-relevant region with a thinking trace, and (2) the cropped region at original resolution is fed back for final answer generation.
The region-verification reward evaluates task performance using only the cropped region and the query, without access to the original image. This encourages the policy model to identify informative, self-contained regions that preserve sufficient information for accurate reasoning.
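The two-stage loop described above can be sketched as follows. All names here (`downsample`, `crop`, `coarse_to_fine`, the `policy` and `answerer` callables) are hypothetical placeholders: in ERGO the policy is an LVLM that emits a thinking trace along with bounding-box coordinates, not a simple function.

```python
# Sketch of a coarse-to-fine inference loop, assuming a toy "image"
# represented as a list of pixel rows. Illustrative only.

def downsample(image, factor):
    """Naive stride-based downsampling of a 2D image (list of rows)."""
    return [row[::factor] for row in image[::factor]]

def crop(image, box):
    """Crop the full-resolution image to (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def coarse_to_fine(image, query, policy, answerer, factor=4):
    # Stage 1: reason over the low-resolution view to pick a region.
    coarse = downsample(image, factor)
    box_small = policy(coarse, query)        # box in coarse coordinates
    # Map the predicted box back to original-resolution coordinates.
    box = tuple(v * factor for v in box_small)
    # Stage 2: answer from the full-resolution crop plus the query.
    return answerer(crop(image, box), query)
```

The key design point is that only the small coarse view and the single crop ever reach the model, keeping the vision-token count low.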
The box-adjustment reward is a complementary term that regularizes the size of the selected region: it penalizes overly large crops based on their area ratio, preventing the trivial strategy of selecting the entire image while still allowing flexible region selection.
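A minimal sketch of such an area-ratio penalty is shown below. The linear decay beyond a `max_ratio` threshold is an assumption for illustration; the paper's exact penalty function and threshold are not reproduced here.

```python
def box_reward(box, image_size, max_ratio=0.5):
    """Illustrative box-adjustment reward: full reward for compact crops,
    linearly decaying penalty once the crop's area ratio exceeds max_ratio
    (assumed form, not the paper's exact function)."""
    x1, y1, x2, y2 = box
    w, h = image_size
    ratio = ((x2 - x1) * (y2 - y1)) / (w * h)
    if ratio <= max_ratio:
        return 1.0
    # Selecting the whole image (ratio = 1.0) earns zero reward.
    return max(0.0, 1.0 - (ratio - max_ratio) / (1.0 - max_ratio))
```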
The TCE reward combines the region-verification and box-adjustment terms, r_TCE = α · r_region + β · r_box, enabling the policy model to learn robust and efficient region-selection strategies for vision-grounded reasoning.
The overall reward is a linear combination of three components, R = r_TCE + r_acc + r_format, where the accuracy reward bridges the training-test mismatch and the format reward enforces well-structured outputs.
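The reward combination above can be sketched directly from the two formulas; the weights α, β and the individual reward values below are placeholder defaults, not the paper's actual settings.

```python
def tce_reward(r_region, r_box, alpha=1.0, beta=1.0):
    """r_TCE = alpha * r_region + beta * r_box (weights are assumed defaults)."""
    return alpha * r_region + beta * r_box

def total_reward(r_region, r_box, r_acc, r_format, alpha=1.0, beta=1.0):
    """Overall RL reward: R = r_TCE + r_acc + r_format."""
    return tce_reward(r_region, r_box, alpha, beta) + r_acc + r_format
```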
Performance comparison under efficiency-considered scenarios with pixel constraints. ERGO outperforms the original model and post-training methods across all benchmarks. † denotes reproduction with their code using our data, while ‡ denotes inference with their original pipeline.
Performance-efficiency trade-off on the V* benchmark. The total number of vision tokens is the sum of the tokens from the downsampled original image and those from the high-resolution cropped image.
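The token accounting in this trade-off can be approximated as follows, assuming one vision token per 28×28-pixel merged patch (a Qwen2.5-VL-style assumption; the model's actual tokenization may differ).

```python
import math

def vision_tokens(width, height, patch=28):
    """Approximate vision-token count for an image, assuming one token
    per 28x28-pixel merged patch (illustrative assumption)."""
    return math.ceil(width / patch) * math.ceil(height / patch)

def total_tokens(down_size, crop_size, patch=28):
    """Total = downsampled-image tokens + high-resolution crop tokens."""
    return (vision_tokens(*down_size, patch)
            + vision_tokens(*crop_size, patch))
```

For example, a 448×448 downsampled view plus a 280×280 full-resolution crop costs far fewer tokens than processing the original high-resolution image directly.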
Comparison of vision token counts in coarse-to-fine reasoning.
Latency comparison with models that leverage multiple tool calls on V* using the vLLM engine. Latency represents the average duration to produce a final answer for each image–query pair.
Evaluation of model robustness under target-object masking. Models can only succeed by leveraging contextual information when the object is completely masked. ERGO achieves the most robust performance in the masked condition.
Results on conventional vision–language benchmarks. ERGO maintains or improves the capabilities of the base Qwen2.5-VL-7B model.
Ablation analysis. Average performance is measured over six benchmarks. (a) Reward design, (b) TCE reward weight, (c) Parameter size, (d) Reward model, (e) Box adjustment reward.
ERGO utilizes coarse cues (“the region where the bottle is located”) to provide the answer.
ERGO can also exploit clear visual cues (the purple umbrella and the orange luggage) when the object is still discernible.
@misc{lee2025ergoefficienthighresolutionvisual,
  title={ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models},
  author={Jewon Lee and Wooksu Shin and Seungmin Yang and Ki-Ung Song and DongUk Lim and Jaeyeon Kim and Tae-Ho Kim and Bo-Kyeong Kim},
  year={2025},
  eprint={2509.21991},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.21991},
}