Vision Language Model Guided Zero-shot Classification

dc.contributor.author: Yao, Haodong
dc.date.accessioned: 2026-01-19T08:34:35Z
dc.date.available: 2026-01-19T08:34:35Z
dc.date.issued: 2026
dc.description.abstract: Among the core tasks in computer vision, 2D image and 3D object classification are fundamental tasks that serve as the foundation for numerous applications, including scene understanding, robotics, and autonomous navigation. Vision-Language Models (VLMs) are deep learning architectures designed to process and understand visual and textual information simultaneously. This thesis takes a close look at Vision-Language Models in classification tasks, with a particular emphasis on zero-shot settings in both 2D and 3D scenarios. We provide a comprehensive overview of Vision-Language Models, focusing on their pretraining datasets, architectural components, learning strategies, and representative models. By comparing them with supervised 2D approaches, including shell learning, along with conventional 3D classification methods, we conduct in-depth experiments and analysis from various perspectives, including classification performance, semantic clustering, and computational efficiency.
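The zero-shot classification setting the abstract describes can be sketched as follows: a VLM embeds the input (an image or a 3D object rendering) and a text prompt for each candidate class into a shared space, and the class whose text embedding is most similar to the input embedding is predicted. The sketch below is a minimal illustration of that matching step only, assuming the embeddings have already been produced by some VLM; the function names and the toy embedding vectors are hypothetical, not from the thesis.

```python
import math

def normalize(v):
    # Scale a vector to unit length so cosine similarity reduces to a dot product.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    # Dot product of two unit vectors = cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def zero_shot_classify(image_emb, class_embs):
    """Pick the class whose text embedding best matches the image embedding.

    image_emb: embedding of the input image (hypothetical, from some VLM).
    class_embs: dict mapping class name -> embedding of a prompt such as
                "a photo of a {class}".
    """
    image_emb = normalize(image_emb)
    sims = {c: cosine(image_emb, normalize(e)) for c, e in class_embs.items()}
    return max(sims, key=sims.get), sims

# Toy, made-up embeddings purely to show the mechanics:
label, sims = zero_shot_classify(
    [0.9, 0.1, 0.0],
    {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]},
)
print(label)  # the class with the highest cosine similarity
```

No training on the target classes is needed: adding a new class only requires embedding one more text prompt, which is what makes the setting "zero-shot".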
dc.identifier.uri: https://hdl.handle.net/1885/733804755
dc.language.iso: en_AU
dc.title: Vision Language Model Guided Zero-shot Classification
dc.type: Thesis (MPhil)
local.contributor.affiliation: College of Systems and Society, The Australian National University
local.contributor.supervisor: Zhang, Jing
local.identifier.doi: 10.25911/NFSA-MX05
local.identifier.proquest: Yes
local.identifier.researcherID: NUP-4759-2025
local.mintdoi: mint
local.thesisANUonly.author: 61463b2e-b861-4352-91e4-67ea683a29f8
local.thesisANUonly.key: 4e05d905-918d-5731-e3a8-54d386c5bc1c
local.thesisANUonly.title: 000000033240_TC_1

Downloads

Original bundle

Name: Haodong_Yao_MPhil_Thesis_Revised.pdf
Size: 11.15 MB
Format: Adobe Portable Document Format
Description: Thesis Material