Vision Language Model Guided Zero-shot Classification

Authors

Yao, Haodong

Abstract

Among the core tasks in Computer Vision, 2D image and 3D object classification are fundamental problems that serve as the foundation for numerous applications, including scene understanding, robotics, and autonomous navigation. Vision-Language Models (VLMs) are deep learning architectures designed to process and understand visual and textual information simultaneously. This thesis takes a close look at Vision-Language Models in classification tasks, with a particular emphasis on zero-shot settings in both 2D and 3D scenarios. We provide a comprehensive overview of Vision-Language Models, covering their pretraining datasets, architectural components, learning strategies, and representative models. Comparing against supervised 2D approaches, including shell learning, as well as conventional 3D classification methods, we conduct in-depth experiments and analyses from several perspectives: classification performance, semantic clustering, and computational efficiency.
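
The abstract does not specify an implementation, but a minimal sketch illustrates the zero-shot classification setup it describes: class names are converted into text prompts, and an image is assigned to the class whose text embedding is most similar to the image embedding. The sketch below assumes a CLIP-style model via the Hugging Face transformers library; the checkpoint name, label set, and image path are illustrative, not taken from the thesis.

    # Minimal sketch of CLIP-style zero-shot image classification.
    # Assumes: pip install torch transformers pillow
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    class_names = ["airplane", "car", "chair"]            # hypothetical label set
    prompts = [f"a photo of a {c}" for c in class_names]  # prompt construction step

    image = Image.open("example.jpg")                     # placeholder input image
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)

    # logits_per_image holds image-text similarity scores; a softmax over the
    # class prompts yields zero-shot class probabilities without any training.
    logits = model(**inputs).logits_per_image
    probs = logits.softmax(dim=-1)
    prediction = class_names[probs.argmax(dim=-1).item()]
    print(prediction, probs.tolist())

No labeled training data for the target classes is used; all supervision comes from the model's vision-language pretraining, which is what distinguishes this setting from the supervised 2D and conventional 3D baselines the thesis compares against.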
