3D Point Cloud Shapes Latent Representation Learning and Applications

Authors

Li, Shidi

Abstract

Due to the rapid advancement of machine learning, 3D object modeling has become a central focus in computer vision. A significant challenge lies in acquiring a latent representation for diverse 3D objects that is information-rich, efficient, and ubiquitous. This thesis tackles this challenge on three fronts: 1) how to learn information-rich, disentangled latent representations in an unsupervised manner; 2) how to organize the learned disentangled latent representation efficiently in an unsupervised manner; and 3) how to develop a ubiquitous, general-purpose latent representation that can be readily applied to various downstream tasks with minimal additional training. This thesis first proposes a novel framework, namely \editvae, for unsupervised learning of disentangled latent representations of 3D objects. \editvae\ enables direct generation of individual object parts while being trained only on whole-object data. In particular, the learnt latent representation can be decomposed into a disentangled representation for each part of the shape. These parts are further decomposed into a shape primitive and a point cloud representation, along with a standardising transformation to a canonical coordinate system. Importantly, the dependencies between these transformations preserve the spatial relationships between parts, facilitating meaningful parts-aware point cloud generation and shape editing. Going beyond this information-rich latent representation, the thesis then proposes \spavae, which adds an unsupervised similar-parts assignment module to organize the explicit disentangled representation. Specifically, \spavae\ infers a set of latent canonical candidate shapes for an object, along with a set of rigid-body transformations mapping each candidate shape to one or more locations within the assembled object.
In this way, noisy samples on the surface of, say, each leg of a 3D point cloud table are effectively organized and combined to estimate a single leg prototype. When parts-based self-similarity exists in the raw data, sharing data among parts in this way confers several advantages: improved modeling accuracy, appropriately self-similar generative outputs, precise in-filling of occlusions, and model parsimony. Finally, this thesis proposes a unified latent representation that facilitates application to diverse downstream tasks. It introduces a novel sparse transformer, the \textit{sampled transformer}, with $O(n)$ complexity, which can efficiently process point set elements with little additional inductive bias. This transformer is then integrated into a Masked Auto-Encoder (MAE) pre-training framework to learn a ubiquitous latent representation. The success of this representation in fine-tuning tasks such as classification, transfer learning, few-shot learning, and generation demonstrates its effectiveness in capturing rich object information.
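The "candidate shapes plus rigid transforms" idea from the abstract can be illustrated with a small sketch. This is a generic NumPy illustration of assembling an object from a shared canonical part prototype, not the \spavae\ model itself; the function names (`rotation_z`, `assemble`) and the table-leg example are illustrative assumptions.

```python
import numpy as np

def rotation_z(theta):
    """Rigid rotation about the z-axis (a minimal stand-in for a full SO(3) rotation)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def assemble(candidates, placements):
    """Assemble an object point cloud by placing canonical candidate shapes.

    candidates: list of (n_i, 3) arrays, each a canonical part shape.
    placements: list of (part_index, R, t) rigid transforms, one per placed part.
    """
    parts = [candidates[i] @ R.T + t for i, R, t in placements]
    return np.concatenate(parts, axis=0)

# One canonical "leg" prototype, reused at four table corners -- the kind of
# parts-based self-similarity the abstract describes.
leg = np.random.default_rng(0).normal(size=(64, 3)) * [0.05, 0.05, 0.5]
corners = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
placements = [(0, rotation_z(0.0), np.array([x, y, 0.0])) for x, y in corners]
table_legs = assemble([leg], placements)
print(table_legs.shape)  # (256, 3)
```

Because every placed leg shares the same canonical prototype, noisy samples from all four legs could in principle be pooled to estimate that single prototype, which is the data-sharing advantage the abstract highlights.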
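The $O(n)$ sparse-attention idea can likewise be sketched: if each query attends to a fixed number $k$ of sampled keys rather than all $n$, cost drops from $O(n^2)$ to $O(nk)$. This is a generic illustration of sampled sparse attention under that assumption, not the thesis's exact sampled transformer; the function name `sampled_attention` and the uniform sampling scheme are assumptions.

```python
import numpy as np

def sampled_attention(x, k=8, seed=0):
    """Sparse self-attention sketch: each of the n tokens attends to k sampled
    tokens, so the score matrix is (n, k) rather than (n, n)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    idx = rng.integers(0, n, size=(n, k))        # k sampled key indices per query
    q, keys, vals = x, x[idx], x[idx]            # (n, d), (n, k, d), (n, k, d)
    scores = np.einsum("nd,nkd->nk", q, keys) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)            # softmax over the k samples
    return np.einsum("nk,nkd->nd", w, vals)

x = np.random.default_rng(1).normal(size=(128, 16))
y = sampled_attention(x)
print(y.shape)  # (128, 16)
```

Because point sets are unordered, sampling key indices uniformly adds little positional inductive bias, which matches the abstract's motivation for a permutation-friendly sparse transformer.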
