⚡ FasterVLM

[CLS] Attention is All You Need for Training-Free Visual Token Pruning:
Make VLM Inference Faster

Qizhe Zhang1,2, Aosong Cheng1, Ming Lu3, Zhiyong Zhuo1, Minqi Wang1,
Jiajun Cao1, Shaobo Guo2, Qi She2, Shanghang Zhang1,✉
✉ Corresponding Author
1 School of Computer Science, Peking University
2 ByteDance AI Lab 3 Intel Labs China

Abstract

Large vision-language models (VLMs) often rely on a substantial number of visual tokens when interacting with large language models (LLMs), which has proven to be inefficient. Recent efforts have aimed to accelerate VLM inference by pruning visual tokens. Most existing methods assess the importance of visual tokens based on the text-visual cross-attentions in LLMs. In this study, we find that the cross-attentions between text and visual tokens in LLMs are inaccurate. Pruning tokens based on these inaccurate attentions leads to significant performance degradation, especially at high reduction ratios. To this end, we introduce FasterVLM, a simple yet effective training-free visual token pruning method that evaluates the importance of visual tokens more accurately by utilizing attentions between the [CLS] token and image tokens from the visual encoder. FasterVLM eliminates redundant visual tokens immediately after the visual encoder, ensuring they never interact with the LLM and thus enabling faster VLM inference. Benefiting from the accuracy of [CLS] cross-attentions, FasterVLM can prune 95% of visual tokens while maintaining 90% of the performance of LLaVA-1.5-7B. We apply FasterVLM to various VLMs, including LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA, to demonstrate its effectiveness. Experimental results show that our FasterVLM maintains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing text-visual attention-based methods. Our code is available at https://github.com/Theia-4869/FasterVLM.

Inaccurate Text-Visual Attention in VLMs

Text-visual attention from the LLM decoder in VLMs is often used to evaluate the importance of visual tokens for pruning. However, we find that this attention does not align with the actual importance of visual tokens.

We describe two phenomena observed in the visual attention of the LLM decoder, termed attention shift and attention dispersion, corresponding to the position and intensity of attention, respectively. These phenomena are absent in the visual encoder of VLMs, which motivates our use of [CLS] attention as a more reliable indicator of visual token importance.

Text-Visual Attention Shift

A clear trend is visible across all attention distributions within the LLM decoder (image, text, and last), where attention scores increase with larger token indices. This suggests that if visual tokens were pruned based on text-visual attention from the LLM decoder, most retained visual tokens would be located in the lower half of the input image, potentially leading to a serious loss of important visual information.

We attribute this phenomenon to the unidirectional nature of attention in the LLM, as this trend does not appear in the visual encoder, which employs global attention. Although this causal attention is the core of the next-token prediction paradigm, it is not well-suited for assessing the importance of visual tokens.
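
This positional bias can be checked directly from the decoder's attention maps. The following sketch (not part of the FasterVLM pipeline) assumes the attentions have been exported, e.g. with output_attentions=True in Hugging Face transformers, and that the index range of the visual tokens in the input sequence is known; the function and variable names are illustrative.

    import torch

    def mean_attention_per_visual_token(attn, vis_start, vis_end):
        # attn: (num_heads, seq_len, seq_len) attention map from one decoder layer,
        #       e.g. obtained with output_attentions=True in Hugging Face transformers.
        # vis_start, vis_end: index range of the visual tokens in the input sequence.
        # Average the attention that the tokens after the visual span (the text prompt)
        # pay to each visual token, over heads and query positions.
        text_to_visual = attn[:, vis_end:, vis_start:vis_end]  # (heads, text_len, num_visual)
        return text_to_visual.mean(dim=(0, 1))                 # (num_visual,)

    # With attention shift, the scores grow with the token index, i.e. patches in the
    # lower half of the image receive systematically more attention. A quick check:
    # per_token = mean_attention_per_visual_token(layer_attn, vis_start, vis_end).float()
    # position = torch.arange(per_token.numel(), dtype=torch.float)
    # print(torch.corrcoef(torch.stack([position, per_token]))[0, 1])  # strongly positive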

Text-Visual Attention Dispersion

It is apparent that [CLS] attention is highly focused, with only a few tokens receiving significant attention. In contrast, the last output token attention (as well as that of other text tokens) is more dispersed, with multiple regions across the image receiving high attention, making it challenging to select important visual tokens during pruning.

Most visual tokens receive minimal [CLS] attention, while last-token attention is more uniformly distributed, which means it contains more noise. The absence of dispersion in [CLS] attention suggests it is a more suitable signal for guiding visual token pruning.
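
One simple way to quantify this difference is to compare the entropy of the two attention distributions over the visual tokens: the more dispersed the attention, the higher the entropy, and the noisier it is as a pruning signal. The sketch below is only an illustrative measurement, assuming both signals have already been extracted as 1-D tensors over the visual tokens.

    import torch

    def attention_entropy(p, eps=1e-12):
        # p: 1-D tensor of non-negative attention weights over the visual tokens.
        # Normalized to a probability distribution; lower entropy = more concentrated.
        p = p / p.sum().clamp_min(eps)
        return -(p * (p + eps).log()).sum()

    # cls_attn:  [CLS]-to-patch attention from the visual encoder, shape (num_visual,)
    # last_attn: last-output-token-to-visual attention from the LLM decoder, same shape
    # In line with the observation above, attention_entropy(cls_attn) comes out much
    # lower than attention_entropy(last_attn): [CLS] attention concentrates on a few
    # informative patches, while last-token attention spreads across the image.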

[CLS] Attention for Visual Token Pruning

Based on the above analysis of attention in VLMs, we propose FasterVLM, which uses [CLS] attention from the visual encoder as a more accurate indicator for visual token pruning. By removing redundant visual tokens before the LLM decoder, our approach makes VLM inference faster than methods that prune tokens within the LLM.

We first re-rank image tokens by their [CLS] attention from the visual encoder and prune the last R%. The remaining image tokens, after passing through the multi-modal projector, are combined with language instructions as input to the language model for response generation. Since redundant image tokens are removed before the language model, FasterVLM makes inference of the entire VLM even faster than pruning within the language model.
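
The core operation is a top-k selection over the [CLS]-to-patch attention. Below is a minimal, training-free sketch of this step, assuming the visual encoder exposes its [CLS] attention (e.g., averaged over heads in the layer the patch features are taken from) and that features are shaped (batch, num_patches, dim); the function and argument names are ours for illustration, not the repository's API.

    import torch

    def prune_visual_tokens(visual_feats, cls_attn, reduction_ratio):
        # visual_feats: (B, N, D) patch features from the visual encoder, before the projector.
        # cls_attn:     (B, N) attention from the [CLS] token to each patch token.
        # reduction_ratio: fraction R of visual tokens to prune (0 <= R < 1).
        B, N, D = visual_feats.shape
        num_keep = max(1, int(round(N * (1.0 - reduction_ratio))))

        # Keep the most-attended patches; restore their original order so that
        # positional information stays consistent for the projector and the LLM.
        keep_idx = cls_attn.topk(num_keep, dim=1).indices.sort(dim=1).values  # (B, num_keep)
        pruned = torch.gather(visual_feats, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        return pruned, keep_idx

    # The pruned features then go through the multi-modal projector and are concatenated
    # with the text embeddings, so the language model never sees the removed tokens.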

Performance Comparison

We validate our FasterVLM against multiple existing methods across various VLM architectures on comprehensive multi-modal benchmarks, including high-resolution image and video understanding tasks.

Result tables are provided for LLaVA-1.5-7B, LLaVA-1.5-13B, LLaVA-NeXT-7B, and Video-LLaVA.

Efficiency Comparison

We compare the computational efficiency of FastV and our FasterVLM on LLaVA-NeXT-7B. Unlike FastV, which prunes visual tokens within the LLM, FasterVLM prunes tokens before the LLM, enabling compatibility with FlashAttention. This design results in significantly higher efficiency. Note that PyTorch's SDPA implementation can also dispatch to FlashAttention kernels, so its computational efficiency is comparable to that of FlashAttention-2, with only slight differences. All efficiency analyses are performed on a single NVIDIA A100-80GB GPU, evaluated on the POPE benchmark.
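
As a back-of-the-envelope illustration of why pruning before the LLM pays off, the sketch below estimates the decoder's prefill FLOPs with the standard per-layer transformer approximation, using LLaMA-7B-like dimensions and an assumed LLaVA-NeXT-style prompt length; it is an order-of-magnitude estimate, not a substitute for the measured results above.

    def decoder_flops(num_tokens, hidden=4096, ffn=11008, layers=32):
        # Rough forward FLOPs for a decoder-only LLM on a prompt of num_tokens.
        # Per layer: 4*n*d^2 (QKV/output projections) + 2*n^2*d (attention scores
        # and values) + 2*n*d*m (FFN up/down projections, ignoring the SwiGLU gate).
        # Default dimensions are LLaMA-7B-like, for illustration only.
        n, d, m = num_tokens, hidden, ffn
        per_layer = 4 * n * d * d + 2 * n * n * d + 2 * n * d * m
        return layers * per_layer

    # Example: an assumed high-resolution prompt with ~2880 visual tokens and ~60 text
    # tokens, versus the same prompt with 90% of the visual tokens pruned beforehand.
    full = decoder_flops(2880 + 60)
    pruned = decoder_flops(288 + 60)
    print(f"relative prefill cost after pruning: {pruned / full:.2%}")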

Contact

If you have any questions, please feel free to contact us:

BibTeX


        @article{zhang2024fastervlm,
          title={[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster}, 
          author={Zhang, Qizhe and Cheng, Aosong and Lu, Ming and Zhuo, Zhiyong and Wang, MinQi and Cao, Jiajun and Guo, Shaobo and She, Qi and Zhang, Shanghang},
          journal={arXiv preprint arXiv:2412.01818},
          year={2024}
        }