Learning from Weak Supervision: Theory, Methods, and Applications
The growing demand for high-quality labeled data for training machine learning models has driven widespread adoption of weak supervision and synthetic-data methods, which use automated models instead of humans for annotation. Large language models (LLMs) have accelerated this trend: their zero- and few-shot classification performance lets them serve as effective "synthetic annotators" for a wide range of tasks. In practice, the labels these weak annotators generate are imperfect, yet they often suffice to train strong models. However, the theoretical understanding of why training one model on the outputs of another yields strong performance remains limited, especially when the annotator model performs poorly on the target task. I develop a theoretical framework for learning from weak supervision that captures the key aspects of the problem better than existing approaches from the crowdsourcing and learning-with-noisy-labels literature. This framework establishes structural conditions that explain when and why weak supervision can reliably train strong models. Building on these theoretical results, I introduce methods that improve how models learn from weak supervision and apply them to low-labeled-data settings.
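To make the pipeline concrete, below is a minimal sketch of weak supervision on synthetic data: a weak annotator produces noisy labels for unlabeled examples, and a "student" model is trained only on those labels. Everything in it is an illustrative assumption, not the framework or methods described above; in particular, the annotator is simulated by flipping 25% of the gold labels at random rather than by querying an LLM.

```python
# A minimal weak-supervision sketch (illustrative, not the author's method):
# a noisy annotator labels training data, and a student model learns from it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic task: features X with gold labels y. At training time we pretend
# the gold labels on the training split are unavailable.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak annotator (stand-in for an LLM used as a synthetic annotator):
# simulated here by flipping 25% of the gold training labels at random.
flip = rng.random(len(y_train)) < 0.25
weak_labels = np.where(flip, 1 - y_train, y_train)

# Student model trained only on the imperfect weak labels.
student = LogisticRegression(max_iter=1000).fit(X_train, weak_labels)

print("weak annotator accuracy:", accuracy_score(y_train, weak_labels))
print("student accuracy on held-out gold labels:",
      accuracy_score(y_test, student.predict(X_test)))
```

Under this kind of uniform label noise, the student typically generalizes better on held-out data than its own noisy training labels would suggest, which is exactly the empirical phenomenon the abstract says still lacks a full theoretical explanation.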