
🟠 ML: Handling Imbalanced Data

If 99% of transactions are legit and 1% are fraud, a model saying "always legit" gets 99% accuracy — and catches zero fraud.
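To see the trap concretely, here is a minimal sketch, assuming scikit-learn, with random synthetic labels standing in for real transactions: a baseline that always predicts "legit."

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 10,000 transactions: ~1% fraud (label 1), ~99% legit (label 0)
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)
X = rng.normal(size=(10_000, 5))  # features are irrelevant to this baseline

# "Always legit": predict the majority class for every row
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(f"accuracy: {accuracy_score(y, pred):.3f}")  # ~0.99
print(f"recall:   {recall_score(y, pred):.3f}")    # 0.000, zero fraud caught
```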

Solutions (in priority order)

  1. Use the right metric. Ditch accuracy. Use F1, PR-AUC, or recall. (Sketch below.)

  2. Adjust class weights. class_weight='balanced' in sklearn penalizes minority-class mistakes more heavily in the loss. Easy, and it never touches the data. (Sketch below.)

  3. Adjust the classification threshold. The default is 0.5. Lower it (e.g., to 0.3) to catch more positives at the cost of more false positives; tune it to business needs. (Sketch below.)

  4. SMOTE (Synthetic Minority Over-sampling Technique). Creates synthetic minority examples by interpolating between existing ones. 🚨 CRITICAL: Apply it ONLY to training data, NEVER to validation/test; otherwise you get data leakage. (Sketch below.)

  5. Undersample the majority class. Simple but loses information. (Sketch below.)
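A minimal sketch of point 1, assuming scikit-learn; make_classification is a synthetic stand-in for a real fraud dataset. The same model can look excellent on accuracy and mediocre on the metrics that actually track fraud caught.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for fraud data: 1% positives
X, y = make_classification(n_samples=20_000, weights=[0.99], flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]  # P(fraud), needed for PR-AUC

print(f"accuracy: {accuracy_score(y_te, pred):.3f}")            # flattering
print(f"F1:       {f1_score(y_te, pred):.3f}")
print(f"recall:   {recall_score(y_te, pred):.3f}")
print(f"PR-AUC:   {average_precision_score(y_te, proba):.3f}")  # uses scores, not labels
```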
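Point 2, sketched on the same kind of synthetic data: class_weight='balanced' is just a constructor argument, so nothing about the data changes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99], flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 'balanced' sets each class weight to n_samples / (n_classes * class_count),
# so a mistake on the 1% class costs ~99x a mistake on the 99% class
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

print(f"recall, default:  {recall_score(y_te, plain.predict(X_te)):.3f}")
print(f"recall, balanced: {recall_score(y_te, weighted.predict(X_te)):.3f}")  # typically much higher
```

You can also pass explicit costs, e.g. class_weight={0: 1, 1: 50}, when the business can price a missed fraud against a false alarm.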
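Point 3 as a sketch: sweep the cutoff applied to predict_proba scores and read off the precision/recall trade. The threshold values here are illustrative, not recommendations; the right one comes from the relative cost of a missed fraud versus a false alarm.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99], flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]  # P(fraud) per transaction

for threshold in (0.5, 0.3, 0.1):
    pred = (proba >= threshold).astype(int)  # flag everything above the cutoff
    p = precision_score(y_te, pred, zero_division=0)
    r = recall_score(y_te, pred)
    print(f"threshold={threshold}: precision={p:.3f}, recall={r:.3f}")
# Lowering the threshold catches more fraud (recall up) at the cost of
# flagging more legit transactions (precision down)
```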
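Point 4, sketched assuming the imbalanced-learn package (pip install imbalanced-learn). The order of operations is the whole point: split first, then resample only the training rows.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99], flip_y=0, random_state=0)

# Split FIRST, so synthetic points are interpolated from training rows only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
print(f"positives: {y_tr.mean():.2%} before, {y_res.mean():.2%} after")  # ~1% -> 50%

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
score = clf.score(X_te, y_te)  # test set keeps its real-world imbalance
```

Under cross-validation, imblearn.pipeline.Pipeline accepts samplers as steps and resamples inside each training fold, which is the easiest way to avoid the leakage described above.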
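Point 5, again via imbalanced-learn; the training-only rule from point 4 applies here too.

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99], flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Randomly drop majority-class rows until the classes are balanced
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X_tr, y_tr)
print(len(y_tr), "->", len(y_res))  # most of the 99% class is thrown away
```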

Practice Questions

Q: Your interviewer asks: "Why not just oversample the minority class by duplicating rows?"

A: Exact duplicates add no new information; the model sees the same minority points with more weight, which encourages memorizing them (similar in effect to class weights, but at a higher compute cost). Worse, if you duplicate before splitting, copies of the same row can land in both train and validation, leaking data. SMOTE interpolates between neighbors instead, so the synthetic points at least vary.