🟠ML: Handling Imbalanced Data
If 99% of transactions are legit and 1% are fraud, a model saying "always legit" gets 99% accuracy — and catches zero fraud.
Solutions (in priority order)
- Use the right metric. Ditch accuracy; use F1, PR-AUC, or recall.
- Adjust class weights. class_weight='balanced' in sklearn penalizes misclassifying the minority class more. Easy, no data modification.
- Adjust the classification threshold. The default is 0.5; lower it (e.g., 0.3) to catch more positives at the cost of more false positives. Tune it to business needs.
- SMOTE (Synthetic Minority Oversampling). Creates synthetic minority examples by interpolating between existing ones. 🚨 CRITICAL: apply ONLY to training data, NEVER to validation/test — otherwise you get data leakage.
- Undersample the majority class. Simple, but loses information.

Minimal code sketches for each of these options follow below.
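Sketch 1 — metrics. A minimal sketch of reporting F1, recall, and PR-AUC instead of accuracy. The dataset here is a synthetic 99/1 split from make_classification (an illustrative assumption, not part of the original note):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic 99% / 1% data (assumption for illustration)
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]

print("accuracy:", accuracy_score(y_te, pred))            # looks great, says little
print("recall:  ", recall_score(y_te, pred))              # fraction of fraud caught
print("F1:      ", f1_score(y_te, pred))                  # precision/recall balance
print("PR-AUC:  ", average_precision_score(y_te, proba))  # threshold-free summary
```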
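Sketch 2 — class weights. A minimal sketch comparing an unweighted model against class_weight='balanced', which reweights the loss inversely to class frequency. Same assumed synthetic data as above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain    = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_tr, y_tr)

# Weighted model typically trades some precision for much better minority recall.
print(classification_report(y_te, plain.predict(X_te), digits=3))
print(classification_report(y_te, weighted.predict(X_te), digits=3))
```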
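Sketch 3 — threshold tuning. A minimal sketch of moving the decision threshold on predicted probabilities; the model stays the same, only the cutoff changes. The thresholds here (0.5, 0.3, 0.1) are illustrative, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Lower threshold -> more predicted positives -> higher recall, lower precision.
for threshold in (0.5, 0.3, 0.1):
    pred = (proba >= threshold).astype(int)
    print(f"t={threshold}: "
          f"precision={precision_score(y_te, pred, zero_division=0):.3f} "
          f"recall={recall_score(y_te, pred):.3f}")
```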
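Sketch 4 — SMOTE on training data only. A minimal sketch using the imbalanced-learn package (an assumption; the note doesn't name a library). Resampling happens after the split and touches only the training fold, so the test set stays untouched and no leakage occurs:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the TRAINING data only; X_te / y_te are never resampled.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```

If you cross-validate, put SMOTE inside an imblearn.pipeline.Pipeline so the resampling is refit on each training fold rather than on the full dataset.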
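Sketch 5 — random undersampling. A minimal sketch, again assuming imbalanced-learn: the majority class is discarded down to the minority class size, which is why information is lost:

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Drop majority-class rows from the training set until classes are balanced.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X_tr, y_tr)
print("before:", int(y_tr.sum()), "positives out of", len(y_tr))
print("after: ", int(y_res.sum()), "positives out of", len(y_res))
```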