Semi-supervised learning is a machine learning technique that trains a model using a small amount of labeled data together with a large amount of unlabeled data. It bridges the gap between supervised learning (which uses only labeled data) and unsupervised learning (which uses only unlabeled data). [1, 2, 3]
How It Works
Rather than requiring a manual annotation for every data point, one common approach (often called self-training or pseudo-labeling) proceeds in a few steps:
- Initial Training: The model is first trained on the small, available dataset of human-labeled data.
- Pseudo-Labeling: The partially trained model then predicts labels for the large pool of unlabeled data and keeps only its highest-confidence predictions as "pseudo-labels" for those examples.
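- Refinement: The model is retrained on the combined set of human-labeled and pseudo-labeled data, and the cycle is usually repeated over several rounds. Accuracy often improves, although confidently wrong pseudo-labels can reinforce the model's mistakes.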
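The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the nearest-centroid classifier, the softmax-over-distances confidence score, the 0.9 threshold, and the names `fit_centroids`, `predict_with_confidence`, and `self_train` are all assumptions chosen to keep the example self-contained.

```python
import math


def fit_centroids(points, labels):
    """Train a nearest-centroid classifier: one mean point per class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}


def predict_with_confidence(centroids, point):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {c: math.exp(-math.dist(point, centre)) for c, centre in centroids.items()}
    label = max(weights, key=weights.get)
    return label, weights[label] / sum(weights.values())


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=10):
    """Self-training loop; `threshold` and `max_rounds` are illustrative defaults."""
    points = [p for p, _ in labeled]
    labels = [lab for _, lab in labeled]
    pool = list(unlabeled)

    # Step 1 (initial training): fit on the human-labeled data only.
    centroids = fit_centroids(points, labels)

    for _ in range(max_rounds):
        # Step 2 (pseudo-labeling): keep only high-confidence predictions.
        confident, remaining = [], []
        for p in pool:
            label, conf = predict_with_confidence(centroids, p)
            if conf >= threshold:
                confident.append((p, label))
            else:
                remaining.append(p)
        if not confident:
            break  # nothing new to learn from; stop early

        # Step 3 (refinement): retrain on human labels plus pseudo-labels.
        for p, label in confident:
            points.append(p)
            labels.append(label)
        pool = remaining
        centroids = fit_centroids(points, labels)

    return centroids, pool


# Toy data: two labeled points and five unlabeled ones (made up for illustration).
labeled = [((0.0, 0.0), "a"), ((5.0, 5.0), "b")]
unlabeled = [(0.5, -0.3), (-0.4, 0.2), (4.6, 5.3), (5.2, 4.7), (2.5, 2.5)]
centroids, still_unlabeled = self_train(labeled, unlabeled)
print(centroids)
print(still_unlabeled)
```

In this toy run, the point that lies midway between the two classes never reaches the confidence threshold, so it stays in the unlabeled pool instead of receiving a possibly wrong pseudo-label. Lowering the threshold would label more data per round at the cost of admitting less reliable pseudo-labels.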