[ICCV 2025] ✂️ VisPruner

Beyond Text-Visual Attention: Exploiting Visual Cues
for Effective Token Pruning in VLMs

1 School of Computer Science, Peking University
2 ByteDance 3 CUHK MMLab

Abstract

Large vision-language models (LVLMs) generally contain significantly more visual tokens than their textual counterparts, resulting in a considerable computational burden. Recent efforts have been made to tackle this issue by pruning visual tokens early within the language model. Most existing works use attention scores between text and visual tokens to assess the importance of visual tokens. However, in this study, we first analyze the text-visual attention in the language model and find that this score is not an ideal indicator for token pruning. Based on this analysis, we propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in LVLMs. Specifically, we first use visual attention to select a limited number of significant tokens. Then, we remove duplicate tokens from the remaining ones based on their similarity. By retaining diverse tokens alongside the initially selected important tokens, we maximally preserve the visual information of the input image. Experimental results demonstrate that our VisPruner sustains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing methods based on text-visual attention. Notably, without any training, VisPruner can reduce the FLOPs of LLaVA-1.5-7B by 91% and inference latency by 75%, while maintaining comparable performance. Our code is available at https://github.com/Theia-4869/VisPruner.

Text-Visual Attention Investigation

Most existing works use attention scores between text and visual tokens to assess the importance of visual tokens. However, in this study, we first analyze the text-visual attention in the language model and find that this score is not an ideal indicator for token pruning.

We conduct an in-depth analysis of text-visual attention inside the LLM and make two findings termed attention shift and attention dispersion, corresponding to the position and intensity of text-visual attention respectively. These phenomena are absent in the visual encoder of VLMs, which motivates our use of visual cues as a more reliable indicator of visual token importance.

Text-Visual Attention Shift

There is a clear positive correlation between the selection frequency and the attention received, along with a noticeable shift in text-visual attention. Text tokens tend to focus on visual tokens with higher indices, owing to the long-term decay property of rotary position embedding: since text tokens are placed after visual tokens in the language model's input sequence, they attend more strongly to later-positioned visual tokens. This positional bias appears from the first layer and is more pronounced in the shallower layers.

A natural question arises: in the presence of this inherent positional bias, do higher attention scores necessarily represent richer visual information? In the shallow layers of the model, visual tokens located in the central region yield the highest performance, rather than the later-positioned tokens that receive the most text attention. The difference between the two diminishes as the positional bias weakens after the 10th layer.
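As a concrete illustration, the sketch below measures the average attention each visual token receives from text tokens at every layer, assuming a HuggingFace-style forward pass with output_attentions=True; the function name and the vis_start / vis_end index arguments are illustrative placeholders, not part of the released code.

import torch

def text_to_visual_attention_profile(attentions, vis_start, vis_end):
    # attentions: per-layer attention maps from a forward pass with
    # output_attentions=True, each of shape (batch, heads, seq, seq).
    # vis_start / vis_end: index range of the visual tokens; text tokens are
    # assumed to follow the visual tokens in the input sequence.
    profiles = []
    for layer_attn in attentions:
        attn = layer_attn.mean(dim=1)                       # average over heads
        text_to_vis = attn[:, vis_end:, vis_start:vis_end]  # text rows, visual columns
        profiles.append(text_to_vis.mean(dim=(0, 1)))       # one score per visual token
    return torch.stack(profiles)                            # (num_layers, num_visual_tokens)

A profile that rises toward higher token indices in the shallow layers reflects the positional bias described above.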

Text-Visual Attention Dispersion

Given the negative impact of attention shift caused by rotary position embedding on pruning, an intuitive idea is to use text-visual attention devoid of position embedding decay as a basis for pruning. Unfortunately, although this practice eliminates positional bias, we find another phenomenon present both before and after the removal of position embedding, which we call attention dispersion.

Different from the extreme peak in [CLS] attention, the density of last attention (the attention from the last text token to visual tokens), both before and after removing position embedding, shows high entropy and a low peak, resembling a near-uniform distribution. Highly concentrated attention indicates that the model identifies important tokens with high certainty, whereas a near-uniform distribution makes it harder to select important tokens from attention scores alone.
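This contrast can be quantified with the entropy and peak of the normalized attention distribution over visual tokens. The short sketch below is our own assumption-level illustration, not code from the paper; it applies equally to [CLS] attention and last attention.

import torch

def attention_entropy_and_peak(scores):
    # scores: 1-D tensor of non-negative attention weights over visual tokens,
    # e.g. [CLS] attention from the visual encoder or last attention from the
    # LLM (with or without position embedding).
    p = scores / scores.sum()
    entropy = -(p * torch.log(p.clamp_min(1e-12))).sum()
    return entropy.item(), p.max().item()  # high entropy + low peak = dispersed attention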

After eliminating the long-term decay introduced by position embedding, pruning results improve across all benchmarks, further demonstrating the adverse impact of positional bias. However, pruning based on last attention still lags behind random pruning on some benchmarks. We attribute this gap partly to attention dispersion, which makes it difficult to identify important tokens with rich visual information from attention scores. Pruning based on [CLS] attention consistently achieves the best results, motivating us to exploit visual cues for more effective token pruning.

Exploiting Visual Cues for Token Pruning

Based on the above analysis of attention in VLMs, we propose VisPruner, which exploits visual cues for more effective token pruning. We first select a small portion of important tokens with rich information, based on the [CLS] attention from the visual encoder. For the remaining tokens, we progressively remove duplicates based on their similarity, ultimately retaining another set of diverse tokens as a complement. Together, these two parts preserve the visual information of the input image, so the model maintains comparable performance even after a significant reduction of visual tokens, as sketched below.
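The sketch below is a simplified illustration under our own assumptions (the important_ratio split and the pairwise-similarity loop are placeholders); the official implementation in the repository linked above may differ in its details.

import torch
import torch.nn.functional as F

def visprune(features, cls_attn, num_keep, important_ratio=0.5):
    # features: (N, D) visual token features from the visual encoder.
    # cls_attn: (N,) attention each visual token receives from the [CLS] token.
    # num_keep: total visual token budget after pruning.
    # important_ratio: fraction of the budget filled by high-attention tokens
    # (an illustrative hyperparameter, not necessarily the paper's setting).
    n_imp = int(num_keep * important_ratio)
    n_div = num_keep - n_imp

    # Step 1: important tokens, ranked by [CLS] attention from the visual encoder.
    imp_idx = cls_attn.topk(n_imp).indices

    # Step 2: diverse tokens, obtained by progressively removing duplicates
    # (one token of the most similar remaining pair) until the budget is met.
    mask = torch.ones(features.size(0), dtype=torch.bool)
    mask[imp_idx] = False
    rest_idx = mask.nonzero(as_tuple=True)[0]

    feats = F.normalize(features[rest_idx], dim=-1)
    keep = torch.ones(rest_idx.size(0), dtype=torch.bool)
    while keep.sum() > n_div:
        sub = feats[keep]
        sim = sub @ sub.T
        sim.fill_diagonal_(-1.0)
        row = torch.div(sim.argmax(), sim.size(1), rounding_mode="floor")
        keep[keep.nonzero(as_tuple=True)[0][row]] = False  # drop one duplicate

    kept = torch.cat([imp_idx, rest_idx[keep]]).sort().values  # keep original order
    return kept

Both steps rely only on outputs of the visual encoder, so pruning happens before the visual tokens ever enter the language model.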

Performance Comparison

We validate VisPruner against multiple existing methods across various VLM architectures on comprehensive multi-modal benchmarks, including high-resolution image and video understanding tasks.

LLaVA-1.5-7B

LLaVA-NeXT-7B

Video-LLaVA

Other VLM Architectures

Efficiency Comparison

We compare the computational efficiency of FastV and our VisPruner on LLaVA-NeXT-7B. Unlike FastV, which prunes visual tokens within the LLM, VisPruner prunes tokens before the LLM, enabling compatibility with FlashAttention; this design results in significantly higher efficiency. Note that the original SDPA implementation also dispatches to FlashAttention kernels, so its computational efficiency is comparable to that of FlashAttention2, with only slight differences. All efficiency analyses are performed on a single NVIDIA A100-80GB GPU, evaluated on the POPE benchmark.
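As a usage-level sketch (the checkpoint name and arguments are illustrative, not a prescribed setup), pruning before the LLM means the language model can simply be loaded with FlashAttention-2 enabled, whereas methods that rank tokens by in-LLM attention typically need intermediate attention maps and fall back to slower attention kernels:

import torch
from transformers import LlavaForConditionalGeneration

# Because visual tokens are pruned before they enter the LLM, no attention
# maps need to be extracted inside the language model, so FlashAttention-2
# can be used as-is.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",              # illustrative checkpoint
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")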

Contact

If you have any questions, please feel free to contact us:

BibTeX


@article{zhang2025vispruner,
  title={Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs}, 
  author={Zhang, Qizhe and Cheng, Aosong and Lu, Ming and Zhang, Renrui and Zhuo, Zhiyong and Cao, Jiajun and Guo, Shaobo and She, Qi and Zhang, Shanghang},
  journal={arXiv preprint arXiv:2412.01818},
  year={2025},
}