Thursday, May 22, 2025

Preprocessing vs. Removing Noisy Features

When training an ML model, we can cleanse the data offline. But during real deployment, how do we cope with noisy data? Key strategies for handling noisy data in production:


1. Preprocessing Pipeline in Production

Build a real-time data preprocessing pipeline similar to the one used during training. This may include:

  • Normalization/standardization

  • Missing value imputation

  • Outlier filtering

  • Text/token cleanup (e.g., lowercasing, removing symbols)
Use the same logic and codebase in both places, or serialize the fitted transformers (e.g., sklearn's scalers, spaCy pipelines, or TensorFlow preprocessing layers) so training and serving stay consistent; a minimal sketch follows.
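
For example, a minimal sketch in Python, assuming scikit-learn and joblib are available; the file name "preprocessor.joblib" and the toy data are illustrative:

    import joblib
    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Training time: fit the preprocessing once, then persist the fitted object.
    preprocessor = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # missing-value imputation
        ("scale", StandardScaler()),                   # normalization/standardization
    ])
    X_train = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 180.0]])
    preprocessor.fit(X_train)
    joblib.dump(preprocessor, "preprocessor.joblib")

    # Serving time: load the *same* fitted transformers so production inputs
    # go through exactly the preprocessing the model saw during training.
    serving_preprocessor = joblib.load("preprocessor.joblib")
    X_clean = serving_preprocessor.transform(np.array([[2.5, np.nan]]))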


2. Robust Model Training

Train your model to be robust against noise:

  • Add noise (augmentation) during training to simulate real-world scenarios (e.g., drop words, add typos, jitter numeric features), as sketched after this list.

  • Use regularization techniques (e.g., dropout, L2) to prevent overfitting on overly clean data.
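
A minimal sketch of noise augmentation for numeric features, using plain NumPy; the noise scale and drop probability are assumptions you would tune:

    import numpy as np

    rng = np.random.default_rng(seed=0)

    def augment_batch(X, jitter_std=0.05, feature_drop_prob=0.1):
        """Simulate real-world noise on a numeric training batch."""
        X_noisy = X + rng.normal(0.0, jitter_std, size=X.shape)  # jitter features
        drop_mask = rng.random(X.shape) < feature_drop_prob      # randomly zero out features
        X_noisy[drop_mask] = 0.0
        return X_noisy

    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    X_aug = augment_batch(X)  # feed X_aug (not X) to the training step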


3. Confidence Thresholds

Use prediction confidence scores (from softmax or probability outputs) to:

  • Reject uncertain predictions

  • Flag them for human review or fallback systems (see the sketch below)
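
A minimal sketch of thresholding on softmax outputs; the 0.8 cutoff is an illustrative value you would tune on validation data:

    import numpy as np

    CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff, tuned on validation data

    def predict_with_fallback(probs):
        """Return (label, needs_review) for a softmax probability vector."""
        label = int(np.argmax(probs))
        if float(probs[label]) < CONFIDENCE_THRESHOLD:
            return None, True  # reject the uncertain prediction, flag for review
        return label, False

    label, needs_review = predict_with_fallback(np.array([0.55, 0.45]))
    # label is None and needs_review is True, so route to a human or fallback system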


4. Input Validation & Sanity Checks

Before feeding data to the model, validate the input (a minimal sketch follows the list):

  • Reject ill-formed entries (e.g., empty strings, NaNs)

  • Ensure values fall within expected ranges or categories

  • Log or alert on anomalies
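
A minimal validation sketch; the field names, allowed categories, and numeric range are hypothetical and stand in for your real schema:

    import math

    EXPECTED_CATEGORIES = {"cat", "dog", "bird"}  # illustrative allowed values
    AGE_RANGE = (0, 120)                          # illustrative numeric bounds

    def validate_input(record):
        """Return a list of problems; an empty list means the record is usable."""
        problems = []
        text = record.get("text")
        if not isinstance(text, str) or not text.strip():
            problems.append("empty or ill-formed text")
        age = record.get("age")
        if age is None or (isinstance(age, float) and math.isnan(age)):
            problems.append("age is missing or NaN")
        elif not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
            problems.append(f"age {age} outside expected range {AGE_RANGE}")
        if record.get("category") not in EXPECTED_CATEGORIES:
            problems.append("unknown category")
        return problems

    for issue in validate_input({"text": "", "age": float("nan"), "category": "fish"}):
        print("input rejected:", issue)  # log or alert on the anomaly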


5. Ensemble Models or Rule-based Fallback

  • Use ensemble models (e.g., majority vote, stacking) that tend to be more stable with noisy input.

  • Add rule-based systems as a fallback for edge cases (e.g., if input is incomplete or invalid); a sketch of both ideas follows.
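
A minimal sketch combining majority voting with a rule-based fallback; the stand-in models and the fallback rule are hypothetical:

    from collections import Counter

    def rule_based_fallback(record):
        """Hypothetical rule for edge cases (incomplete or invalid input)."""
        return "unknown"

    def ensemble_predict(models, record):
        """Majority vote over several models, with a rule-based fallback."""
        if not record.get("text"):          # incomplete input: don't trust the models
            return rule_based_fallback(record)
        votes = [model(record) for model in models]
        label, count = Counter(votes).most_common(1)[0]
        if count <= len(models) // 2:       # no clear majority: fall back to rules
            return rule_based_fallback(record)
        return label

    # Illustrative stand-ins for real trained models.
    models = [lambda r: "spam", lambda r: "spam", lambda r: "ham"]
    print(ensemble_predict(models, {"text": "free money!!!"}))  # -> "spam"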


6. Online Monitoring & Feedback Loop

  • Monitor input quality and model performance in production

  • Detect concept drift or changes in noise patterns (see the drift-check sketch below)

  • Use logs to retrain/update the model periodically
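
As one simple drift check, you might compare each production batch against baseline statistics saved from training; the baseline values and alert threshold below are illustrative assumptions:

    import numpy as np

    TRAIN_MEAN, TRAIN_STD = 0.0, 1.0  # baseline captured from the training data
    Z_ALERT = 3.0                     # alert when the batch mean drifts this far

    def check_drift(batch):
        """Flag a batch whose mean has drifted from the training baseline."""
        batch_mean = float(np.mean(batch))
        z = abs(batch_mean - TRAIN_MEAN) / (TRAIN_STD / np.sqrt(len(batch)))
        if z > Z_ALERT:
            print(f"drift alert: batch mean {batch_mean:.3f}, z-score {z:.1f}")
            return True
        return False

    check_drift(np.random.default_rng(0).normal(0.5, 1.0, size=200))  # alerts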


7. Data Denoising Models

In high-noise environments, deploy a denoising model or filter before the main model:

  • For images: autoencoders, image filters

  • For text: typo correction, spell checking

  • For time series: smoothing, Kalman filters (a simple smoothing sketch follows)
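
For the time-series case, a minimal moving-average smoothing sketch; the window size is an illustrative choice (larger windows smooth more aggressively):

    import numpy as np

    def moving_average(series, window=5):
        """Denoise a 1-D series with a simple moving average."""
        kernel = np.ones(window) / window
        # mode="valid" avoids edge artifacts; output is shorter by window - 1
        return np.convolve(series, kernel, mode="valid")

    rng = np.random.default_rng(1)
    noisy = np.sin(np.linspace(0, 6, 100)) + rng.normal(0, 0.3, size=100)
    smoothed = moving_average(noisy)  # feed 'smoothed' to the main model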

So, preprocessing can tame noise at inference time, but if a feature is noisy and contributes no useful signal, it is often better to remove it altogether.
--ChatGPT