When training an ML model, we can clean the data beforehand. But in real deployment, how do we cope with noisy data? Key strategies for handling noisy data in production:
1. Preprocessing Pipeline in Production
Build a real-time data preprocessing pipeline similar to the one used during training. This may include:
- Normalization/standardization
- Missing value imputation
- Outlier filtering
- Text/token cleanup (e.g., lowercasing, removing symbols)
Use the same logic and codebase (or serialize transformers such as sklearn's scalers, spaCy pipelines, or TensorFlow preprocessing layers).
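As a minimal sketch of the serialization idea, assuming scikit-learn and joblib are available (the filename `preproc.joblib` is illustrative): fit the preprocessing pipeline once at training time, persist it, and reload the exact same transformers at serving time so training and production never drift apart.

```python
import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Training time: imputation + normalization fit on training data.
preproc = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # missing value imputation
    ("scale", StandardScaler()),                   # normalization
])

X_train = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
preproc.fit(X_train)
joblib.dump(preproc, "preproc.joblib")

# Serving time: load and apply identical logic to raw, possibly noisy input.
served = joblib.load("preproc.joblib")
clean = served.transform(np.array([[np.nan, 4.0]]))
print(clean.shape)  # (1, 2)
```

Because the fitted transformer is shipped as an artifact, the production service cannot accidentally reimplement the cleaning logic differently from training.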
2. Robust Model Training
Train your model to be robust against noise:
- Add noise (augmentation) during training to simulate real-world scenarios (e.g., drop words, add typos, jitter numeric features).
- Use regularization techniques (e.g., dropout, L2) to prevent overfitting on overly clean data.
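The augmentation bullet can be sketched with two tiny helpers, one for numeric jitter and one for word dropout; the noise scale and drop probability are assumptions you would tune for your data.

```python
import random

def jitter(features, scale=0.05, rng=random.Random(0)):
    """Add small Gaussian noise to each numeric feature."""
    return [x + rng.gauss(0.0, scale) for x in features]

def drop_words(text, p=0.1, rng=random.Random(0)):
    """Randomly drop each word with probability p, keeping at least one."""
    kept = [w for w in text.split() if rng.random() >= p]
    return " ".join(kept) if kept else text

noisy = jitter([1.0, 2.0, 3.0])
print(drop_words("the quick brown fox jumps over the lazy dog"))
```

Applying such transforms only to training batches (never at evaluation) lets the model see deployment-like corruption without changing the test distribution.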
3. Confidence Thresholds
Use prediction confidence scores (from softmax or probability outputs) to:
- Reject uncertain predictions
- Flag them for human review or fallback systems
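A minimal routing sketch for this: take the softmax output, accept the top class only when its probability clears a threshold, and otherwise send the input to review (the 0.7 threshold is an illustrative assumption).

```python
def route(probs, threshold=0.7):
    """Return (label, 'model') if confident, else (None, 'human_review')."""
    top = max(range(len(probs)), key=probs.__getitem__)
    if probs[top] >= threshold:
        return top, "model"
    return None, "human_review"

print(route([0.05, 0.9, 0.05]))   # -> (1, 'model')
print(route([0.4, 0.35, 0.25]))   # -> (None, 'human_review')
```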
4. Input Validation & Sanity Checks
Before feeding data to the model, validate input:
- Reject ill-formed entries (e.g., empty strings, NaNs)
- Ensure values fall within expected ranges or categories
- Log or alert on anomalies
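The three checks above can be combined into one guard function; the field names (`age`, `text`) and the 0–120 range are hypothetical, standing in for your schema.

```python
import logging
import math

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("input_check")

def validate(record):
    """Return True if the record is safe to feed to the model."""
    age = record.get("age")
    text = record.get("text")
    if not isinstance(text, str) or not text.strip():
        log.warning("empty or missing text: %r", record)   # ill-formed entry
        return False
    if age is None or (isinstance(age, float) and math.isnan(age)):
        log.warning("missing/NaN age: %r", record)
        return False
    if not (0 <= age <= 120):                              # range check
        log.warning("age out of range: %r", record)
        return False
    return True

print(validate({"age": 34, "text": "hello"}))         # True
print(validate({"age": float("nan"), "text": "hi"}))  # False
```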
5. Ensemble Models or Rule-based Fallback
- Use ensemble models (e.g., majority vote, stacking) that tend to be more stable with noisy input.
- Add rule-based systems as a fallback for edge cases (e.g., if input is incomplete or invalid).
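A compact sketch combining both bullets: a majority vote over several models, with a rule-based guard that short-circuits when the input is incomplete. The stand-in lambda "models" and the `text` field are assumptions for illustration.

```python
from collections import Counter

def predict_with_fallback(models, record, default="unknown"):
    """Majority-vote ensemble with a rule-based fallback for bad input."""
    if not record.get("text"):                  # rule: incomplete input
        return default
    votes = [m(record) for m in models]
    return Counter(votes).most_common(1)[0][0]  # majority vote

models = [lambda r: "spam", lambda r: "spam", lambda r: "ham"]
print(predict_with_fallback(models, {"text": "buy now"}))  # 'spam'
print(predict_with_fallback(models, {"text": ""}))         # 'unknown'
```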
6. Online Monitoring & Feedback Loop
- Monitor input quality and model performance in production
- Detect concept drift or changes in noise patterns
- Use logs to retrain or update the model periodically
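One simple form of drift detection, sketched here under the assumption that you know a feature's training-time mean and standard deviation: track a rolling window of production values and alert when the recent mean moves more than k standard deviations from the baseline.

```python
from collections import deque

class DriftMonitor:
    """Flags drift when the rolling mean strays k sigmas from the baseline."""

    def __init__(self, baseline_mean, baseline_std, window=100, k=3.0):
        self.mean, self.std, self.k = baseline_mean, baseline_std, k
        self.window = deque(maxlen=window)

    def observe(self, value):
        """Record a production value; return True if drift is suspected."""
        self.window.append(value)
        current = sum(self.window) / len(self.window)
        return abs(current - self.mean) > self.k * self.std

mon = DriftMonitor(baseline_mean=0.0, baseline_std=1.0, window=10)
drifted = [mon.observe(v) for v in [0.1, -0.2, 5.0, 6.0, 7.0]]
print(drifted[-1])  # True: recent values have shifted away from baseline
```

In practice the alert would feed the retraining loop mentioned above rather than just printing.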
7. Data Denoising Models
In high-noise environments, deploy a denoising model or filter before the main model:
- For images: autoencoders, image filters
- For text: typo correction, spell checking
- For time series: smoothing, Kalman filters
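As one concrete instance of the time-series case, here is an exponential moving average used as a denoising pre-filter before the main model; the smoothing factor alpha is an assumed tuning knob (smaller alpha smooths harder).

```python
def ema(values, alpha=0.3):
    """Exponentially weighted moving average over a sequence of floats."""
    smoothed = []
    for x in values:
        prev = smoothed[-1] if smoothed else x  # seed with first value
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed

noisy = [1.0, 9.0, 1.2, 1.1, 8.8, 1.0]
print(ema(noisy))  # spikes at 9.0 and 8.8 are strongly damped
```

A Kalman filter plays the same role with a principled noise model, at the cost of specifying process and measurement variances.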