Data-centric Designs for Reliable Model Training and Deployment

Liu, Yuchi

Date

2025

Authors

Liu, Yuchi

Abstract

AI models are integral to real-world applications such as security, healthcare, and autonomous driving, making their reliability and robustness crucial. Like two sides of a coin, model and data jointly shape system performance. The model-centric paradigm focuses on architectures and training algorithms, while the data-centric paradigm improves training data quality and designs methods that leverage data attributes for deployment. Existing work largely favors the former and overlooks data-side issues at both stages: data scarcity, domain gaps, and noisy labels in training, and zero-shot adaptation, performance estimation without ground truth, and confidence calibration in deployment. This thesis develops a series of data-centric methods to enhance model reliability, generalization, and trustworthiness by improving data quality and exploiting nuanced data attributes. At the training stage, three approaches address data scarcity, domain mismatch, and label noise. Chapter 2 introduces MiE-X, a synthetic dataset designed for micro-expression recognition, a task where training data scarcity is severe. MiE-X synthesizes subtle facial expressions by using generative networks to combine facial muscle movements with in-the-wild faces, reducing dependency on costly human labeling and significantly improving the generalization of the trained models. Chapter 3 presents MOTX, an engine built on the Unity 3D simulation environment that synthesizes multi-object tracking data, addressing the biases and limited motion scenarios of real-world tracking datasets. MOTX enables precise, systematic evaluation under various motion scenarios and robust training of identity association algorithms without real-world annotation costs or domain adaptation.
Chapter 4 focuses on the widespread issue of label noise in face recognition, proposing a robust semi-supervised framework composed of two complementary mechanisms: GroupNet, an ensemble-based label filtering method, and NRoLL, a confidence-driven pseudo-label refinement strategy. Together, these methods stabilize training performance even under severe annotation noise. At the deployment stage, we target efficient adaptation of large pretrained models, performance estimation without labels, and confidence calibration. To adapt large language models (LLMs) to novel scenarios without explicit fine-tuning, Chapter 5 introduces HMAW, a hierarchical multi-agent workflow for prompting. Because extensive fine-tuning and manual prompt engineering are often impractical, HMAW employs cooperative exploration by multiple language-model agents to systematically generate and refine effective zero-shot prompts. This structured, data-driven prompting significantly enhances LLMs' ability to generalize across diverse unseen tasks. Chapter 6 introduces Vicinal Risk Proxy (VRP), a plug-in that aggregates risk from neighboring samples in a tailored vicinal distribution to adjust existing risk estimators, delivering reliable performance estimates without labels. Chapter 7 further refines model deployment by proposing a correctness-aware confidence calibration approach. This strategy explicitly aligns model confidence with empirical correctness by leveraging transformed or augmented inputs, significantly enhancing the reliability of the confidence signals presented to end users. Overall, this thesis advances data-centric strategies across training and deployment. For training, we improve data quality through transfer, controllable synthesis, and multi-expert pseudo-labeling. For deployment, we develop multi-agent prompt engineering for zero-shot adaptation, vicinal-consistency-based performance estimation, and its application to confidence calibration.
These findings highlight the often-overlooked yet critical role of data in shaping reliable AI, calling for increased attention to data-centric strategies in both research and practice.