I am currently a special research fellow at Xi'an Jiaotong University and a member of the AI-SEC Research Lab at Xi'an Jiaotong University, led by Prof. Xiaohong Guan, an Academician of the Chinese Academy of Sciences (CAS), and Prof. Chao Shen. My research mainly focuses on AI security and computer vision, in particular Trustworthy AI, Large Vision-Language Models (LVLMs), Dynamic Neural Networks, Efficient Learning/Inference, and Video Understanding.
🔥 Our group is looking for self-motivated Ph.D./Master's candidates and undergraduate interns for ongoing research. Please drop me an email with your resume if you are interested.
Ziwei Zheng*, Michael Yang*, Jack Hong*, Chenxiao Zhao*, Guohai Xu✉, Le Yang✉, Chao Shen, Xing Yu
arXiv preprint, arxiv.org/abs/2505.14362, 2025
In this paper, we build an LVLM that can "think with images," similar to OpenAI's o3 model. We explore the interleaved multimodal reasoning paradigm and introduce DeepEyes, whose "thinking with images" capabilities are incentivized through end-to-end reinforcement learning without the need for cold-start SFT. Notably, this ability emerges natively within the model itself, leveraging its inherent grounding ability as a tool rather than depending on separate specialized models. Specifically, we propose a tool-use-oriented data selection mechanism and a reward strategy that encourages successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks and also demonstrates improvements in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe a distinct evolution of tool-calling behavior, from initial exploration to efficient and accurate exploitation, as well as diverse thinking patterns that closely mirror human visual reasoning processes.
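To make the reward strategy concrete, here is a minimal sketch of an outcome-based reward with a tool-use bonus; the function name, arguments, and weights are my own illustrative assumptions, not the exact formulation from the paper:

```python
# Hypothetical sketch of a tool-use-aware trajectory reward for RL rollouts.
# The signature and the 0.5 bonus are illustrative assumptions, not the
# exact reward used in DeepEyes.

def trajectory_reward(answer_correct: bool, used_tool: bool) -> float:
    """Reward a rollout: correctness dominates, and a small bonus is
    granted only when a correct answer was reached via tool calls,
    encouraging successful tool-assisted reasoning trajectories."""
    reward = 1.0 if answer_correct else 0.0
    if answer_correct and used_tool:
        reward += 0.5  # tool-use bonus, only on successful trajectories
    return reward
```

Rewarding tool use only on correct answers avoids incentivizing gratuitous tool calls, which is consistent with the exploration-to-exploitation shift described above.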
Le Yang*, Ziwei Zheng*, Boxun Chen, Zhengyu Zhao, Chenhao Lin, Chao Shen✉.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
The paper proposes Nullu, a method to reduce object hallucinations (OH) in LVLMs by projecting input features into the null space of an unsafe subspace called HalluSpace. The HalluSpace is identified by contrasting hallucinated and truthful representations, isolating the hallucinated components while removing the truthful ones. By suppressing the LLM priors that cause OH, Nullu enhances contextual accuracy without extra inference cost, showing strong performance across LVLM families.
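A minimal sketch of the core null-space projection, assuming an orthonormal HalluSpace basis has already been extracted (variable names are illustrative, not the official code):

```python
import torch

# Minimal sketch of the null-space projection idea, assuming an
# orthonormal HalluSpace basis H of shape (d, k) has already been
# extracted, e.g. from differences between hallucinated and truthful
# representations. Names are illustrative assumptions.

def nullspace_project(feats: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """Project features (..., d) onto the orthogonal complement of
    HalluSpace: f' = f - (f H) H^T, removing hallucination directions."""
    return feats - (feats @ H) @ H.T

# Toy usage: batch of 2 features, d = 4096, k = 8 hallucination directions
H, _ = torch.linalg.qr(torch.randn(4096, 8))  # random orthonormal basis
feats = torch.randn(2, 4096)
clean = nullspace_project(feats, H)
```

Because the edit is a fixed linear projection, it can be folded into existing weights offline, which is why no extra inference cost is incurred.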
Ziwei Zheng, Zechuan Zhang, Yulin Wang, Shiji Song, Gao Huang, Le Yang✉.
ACM Multimedia (ACM MM), 2024
In this paper, we experimentally reexamine the architecture of generic event boundary detection (GEBD) models and uncover several surprising findings, demonstrating that some sophisticated designs are unnecessary for building GEBD models. We also show that GEBD models using image-domain backbones, which conduct spatiotemporal learning in a spatial-then-temporal greedy manner, can suffer from a distraction issue, which might be the hidden culprit behind inefficient GEBD.
Le Yang*✉, Ziwei Zheng*, Yizeng Han, Hao Cheng, Shiji Song, Gao Huang, Fan Li.
European Conference on Computer Vision (ECCV), 2024
Inspired by the success of dynamic neural networks, in this paper we build a novel dynamic feature aggregation (DFA) module that can simultaneously adapt kernel weights and receptive fields at different timestamps. Based on DFA, the proposed dynamic encoder layer aggregates the temporal features within the action time ranges and guarantees the discriminability of the extracted representations. Moreover, DFA helps to develop a dynamic temporal action detection (TAD) head (DyHead), which adaptively aggregates the multi-scale features with adjusted parameters and learned receptive fields to better detect action instances with diverse durations in videos; see the sketch below.
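As a simplified illustration of per-timestamp dynamic aggregation (a stand-in for the full DFA module, which also adapts receptive fields; class and layer names are my own), a lightweight branch can predict one aggregation kernel per timestamp:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified sketch of dynamic temporal aggregation: a small branch
# predicts per-timestamp kernel weights, which then aggregate each
# timestamp's local temporal neighborhood. Not the exact DFA module.

class DynamicTemporalAggregation(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        # predicts one k-tap kernel per timestamp from the features
        self.kernel_gen = nn.Conv1d(channels, kernel_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) temporal feature sequence
        B, C, T = x.shape
        kernels = torch.softmax(self.kernel_gen(x), dim=1)      # (B, k, T)
        patches = F.unfold(x.unsqueeze(-1), (self.k, 1),
                           padding=(self.k // 2, 0))            # (B, C*k, T)
        patches = patches.view(B, C, self.k, T)
        # weight each timestamp's neighborhood by its own predicted kernel
        return (patches * kernels.unsqueeze(1)).sum(dim=2)      # (B, C, T)
```

Unlike a static temporal convolution, the kernel here is a function of the input, so aggregation can differ between, e.g., boundary frames and interior frames of an action.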
Ziwei Zheng, Le Yang✉, Yulin Wang, Miao Zhang, Lijun He, Gao Huang, Fan Li.
IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), 2023
We propose the first dynamic spatial-focus video recognition model for compressed video (e.g., MPEG-4 and HEVC).
Le Yang*, Haojun Jiang*, Ruojin Cai, Yulin Wang, Shiji Song, Gao Huang✉, Qi Tian.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
We propose a new feature-reuse method for deep networks with dense connectivity, which can simultaneously learn to 1) selectively reuse a set of the most important features from preceding layers; and 2) actively update a set of preceding features to increase their utility for later layers; a conceptual sketch follows.
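Here is a conceptual toy of the reuse-and-update idea (the actual method learns which features to reuse and how to update them; module and layer names here are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Conceptual toy of "reuse + update" dense connectivity: each layer both
# consumes the accumulated earlier features and emits an additive update
# to them, so stale features stay useful for later layers. This is a
# simplified stand-in, not the paper's exact implementation.

class ReuseUpdateLayer(nn.Module):
    def __init__(self, in_channels: int, growth: int):
        super().__init__()
        self.new_feat = nn.Conv2d(in_channels, growth, 3, padding=1)
        self.update = nn.Conv2d(growth, in_channels, 1)  # refresh reused features

    def forward(self, prev: torch.Tensor) -> torch.Tensor:
        # prev: concatenation of all preceding features (B, in_channels, H, W)
        new = torch.relu(self.new_feat(prev))
        refreshed = prev + self.update(new)  # actively update preceding features
        return torch.cat([refreshed, new], dim=1)  # dense connectivity grows
```

The key contrast with plain DenseNet-style connectivity is the `refreshed` path: earlier features are revised as the network deepens instead of being passed along unchanged.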
Le Yang*, Yizeng Han*, Xi Chen*, Shiji Song, Jifeng Dai, Gao Huang✉
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
The proposed Resolution Adaptive Network (RANet) is the first to exploit spatial redundancy in images for adaptive inference. RANet is inspired by the intuition that low-resolution representations are sufficient for classifying "easy" inputs containing large objects with prototypical features, while only some "hard" samples need spatially detailed information.
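A conceptual sketch of this resolution-adaptive, early-exit inference (the real RANet also propagates features between sub-networks; the resolutions, threshold, and names below are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Conceptual sketch of resolution-adaptive inference: classify at low
# resolution first and exit early if confident; otherwise escalate to
# higher-resolution sub-networks. Simplified relative to RANet, which
# also reuses features across sub-networks.

@torch.no_grad()
def adaptive_inference(x: torch.Tensor, subnets: list,
                       resolutions=(96, 160, 224), threshold=0.9):
    """x: a single image batch of size 1, (1, 3, H, W)."""
    for net, res in zip(subnets, resolutions):
        logits = net(F.interpolate(x, size=(res, res), mode='bilinear',
                                   align_corners=False))
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        if conf.item() >= threshold:   # "easy" input: exit at low resolution
            return pred, res
    return pred, res                   # "hard" input: used full resolution
```

Easy samples thus pay only the cost of the small low-resolution sub-network, which is where the computational savings come from.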