Vision Language Model Guided Zero-shot Classification

Authors

Yao, Haodong

Abstract

Among the core tasks in Computer Vision, 2D image and 3D object classification are fundamental problems that serve as the foundation for numerous applications, including scene understanding, robotics, and autonomous navigation. Vision-Language Models (VLMs) are deep learning architectures designed to process and understand visual and textual information simultaneously. This thesis takes a close look at Vision-Language Models in classification tasks, with a particular emphasis on zero-shot settings in both 2D and 3D scenarios. We provide a comprehensive overview of Vision-Language Models, covering their pretraining datasets, architectural components, learning strategies, and representative models. Comparing against supervised 2D approaches, including shell learning, as well as conventional 3D classification methods, we conduct in-depth experiments and analyses from several perspectives: classification performance, semantic clustering, and computational efficiency.
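
The abstract does not specify an implementation, but a minimal sketch illustrates the zero-shot classification setup it describes: class names are converted into text prompts, and an image is assigned to the class whose text embedding is most similar to the image embedding. The sketch below assumes a CLIP-style model via the Hugging Face transformers library; the checkpoint name, label set, and image path are illustrative, not taken from the thesis.

    # Minimal sketch of CLIP-style zero-shot image classification.
    # Assumes: pip install torch transformers pillow
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    class_names = ["airplane", "car", "chair"]            # hypothetical label set
    prompts = [f"a photo of a {c}" for c in class_names]  # prompt construction step

    image = Image.open("example.jpg")                     # placeholder input image
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)

    # logits_per_image holds image-text similarity scores; a softmax over the
    # class prompts yields zero-shot class probabilities without any training.
    logits = model(**inputs).logits_per_image
    probs = logits.softmax(dim=-1)
    prediction = class_names[probs.argmax(dim=-1).item()]
    print(prediction, probs.tolist())

No labeled training data for the target classes is used; all supervision comes from the model's vision-language pretraining, which is what distinguishes this setting from the supervised 2D and conventional 3D baselines the thesis compares against.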
