Qizhe Zhang

PKU Ph.D. candidate

About Me

I am currently a Ph.D. candidate at HMI Lab, NERCV²T, School of Computer Science, Peking University, supervised by Prof. Shanghang Zhang. Before that, I received my Bachelor's degree in Artificial Intelligence (Turing Honor Degree) from PKU, where I also obtained a Bachelor's degree in Economics.

Research Interests

My research interests lie in computer vision and multimodal learning, including visual foundation models, vision language models, visual complex reasoning, visual token compression, visual continual learning, and embodied artificial intelligence. The overall goal of my research is to develop a large-scale efficient visual perception system with human-like expression, adaptation, and generalization, equipped with powerful abilities including fundamental perception, cognitive reasoning, and autonomous creativity.

More specifically, my research interests include:

Visual Foundation Models (VFMs)
Vision Language Models (VLMs / MLLMs / dVLMs)
Visual Complex Reasoning (Visual CoT / Thinking with images)
Visual Token Compression (Token Pruning / KV Cache Compression)
Visual Continual Learning (Memory Mechanism / Slow-Fast System)
Embodied Artificial Intelligence (Robotics / Autonomous Driving)

Education

Ph.D. Candidate in Visual Information Processing and Brain-inspired Intelligence
Sep. 2023 -- Jun. 2028 (expected)
Peking University, Beijing, China

Bachelor of Intelligence Science and Technology & Economics (Dual Degree)
Sep. 2019 -- Jun. 2023
Peking University, Beijing, China

News

06/2025: One paper was accepted to ICCV 2025. (VisPruner)
05/2025: One paper was accepted to ICML 2025. (SAN)
02/2025: One paper was accepted to CVPR 2025. (MoVE-KD)
02/2024: Two papers were accepted to CVPR 2024. (GPS, ADMA)
01/2024: One paper was accepted to ICRA 2024. (BiCross)
12/2023: One paper was accepted to AAAI 2024. (SVDP)

Experience

Intern at AI Lab (Model Efficiency for MLLM)
Mar. 2024 -- Present
ByteDance, Beijing, China

Intern in AGI (Memory Mechanism for MLLM)
Jul. 2023 -- Sep. 2023
BAAI, Beijing, China

Intern in Computer Vision (Autonomous Driving)
Sep. 2022 -- Feb. 2023
OPPO, Beijing, China

Intern at GCV Lab (Multi-Modal Learning)
Oct. 2021 -- Feb. 2022
BIGAI, Beijing, China

Publications

Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang†
arXiv 2025 [Paper] [Code] [Website]
" We propose CDPruner, a training-free and model-agnostic visual token pruning method that accelerates MLLM inference by maximizing the conditional diversity of retained tokens. "

Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang†
ICCV 2025 [Paper] [Code] [Website]
" We propose VisPruner as a plug-and-play method that utilizes visual cues for more effective token pruning in large vision language models. "

Gradient-based Parameter Selection for Efficient Fine-Tuning
Zhi Zhang*, Qizhe Zhang*, Zijun Gao, Renrui Zhang, Ekaterina Shutova, Shiji Zhou, Shanghang Zhang†
CVPR 2024 [Paper] [Code]
" We propose a novel gradient-based parameter selection (GPS) method for efficient fine-tuning. GPS introduces no additional storage or computational cost during either training or inference. Moreover, it is model-agnostic and task-adaptive, and achieves strong performance. "

Unsupervised Spike Depth Estimation via Cross-modality Cross-domain Knowledge Transfer
Jiaming Liu*, Qizhe Zhang*, Jianing Li, Ming Lu, Tiejun Huang, Shanghang Zhang†
ICRA 2024 [Paper] [Code]
" We propose a novel cross-modality cross-domain (BiCross) framework for unsupervised spike depth estimation. Notably, we are the first to exploit open-source RGB datasets to aid unsupervised learning for spike depth estimation. "

SAN: Hypothesizing Long-Term Synaptic Development and Neural Engram Mechanism in Scalable Model's Parameter-Efficient Fine-Tuning
Gaole Dai, Yiming Tang, Chunkai Fan, Qizhe Zhang, Zhi Zhang, Yulu Gan, Chengching Tseng, Shanghang Zhang†, Tiejun Huang
ICML 2025 [Paper] [Code]
" We propose Synapse and Neuron (SAN), which decomposes and propagates scaling components from anterior feature adjusting vectors towards posterior weight matrices. SAN is theoretically grounded in Long-Term Potentiation/Depression phenomena, which govern synapse development through neurotransmitter release modulation. "

MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders
Jiajun Cao, Yuan Zhang, Tao Huang, Ming Lu, Qizhe Zhang, Ruichuan An, Ningning Ma, Shanghang Zhang†
CVPR 2025 [Paper] [Code]
" We propose Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. "

Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation
Jiaming Liu*, Ran Xu*, Senqiao Yang*, Renrui Zhang†, Qizhe Zhang, Zehui Chen, Yandong Guo, Shanghang Zhang‡
CVPR 2024 [Paper] [Code] [Website]
" We propose Adaptive Distribution Masked Autoencoders (ADMA) as a novel continual self-supervised method. ADMA enhances the extraction of target domain knowledge while mitigating the accumulation of distribution shifts. "

Exploring Sparse Visual Prompt for Cross-domain Semantic Segmentation
Senqiao Yang*, Jiarui Wu*, Jiaming Liu*, Xiaoqi Li, Qizhe Zhang, Mingjie Pan, Shanghang Zhang†
AAAI 2024 [Paper] [Code] [Website]
" We propose a novel Sparse Visual Domain Prompts (SVDP) approach for dense prediction TTA tasks, which keeps the number of trainable parameters in the image-level prompt minimal and preserves more spatial information of the input. "

Plan

Visual Token Pruning Evaluation Framework
KV Cache Compression for LVLM/dVLM
Visual CoT Length Reduction
Thinking with Auxiliary Lines for Plane Geometry

Contact

theia@pku.edu.cn theia4869@gmail.com
+86 · 18810920885 +86 · 18700432951
Theia-4869