
🟠 ML: Cross-Validation

A single train/test split is noisy — results depend on WHICH data ended up where. Cross-validation averages over multiple splits.

K-fold CV: Split the data into k folds. Train on k-1 folds, validate on the remaining one. Repeat k times so each fold serves as the validation set exactly once. Average the k scores. Typical k = 5 or 10.
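A minimal sketch of 5-fold CV with scikit-learn — the dataset, model, and parameters here are illustrative, not prescribed by these notes:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy classification dataset (sizes are illustrative).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5-fold CV: each score comes from training on 4 folds
# and validating on the held-out fifth.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores)          # one accuracy per fold
print(scores.mean())   # the cross-validated estimate
```

Reporting the standard deviation alongside the mean (`scores.std()`) shows how noisy the estimate still is.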

Stratified k-fold: Each fold has the same class distribution as the full dataset. Critical for imbalanced data — a purely random fold might otherwise contain few or no minority-class samples.
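A quick sketch showing the stratification property — the 90/10 class ratio is an assumed example, chosen so each fold's counts come out exact:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced labels: 90% class 0, 10% class 1 (illustrative).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features don't affect the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold preserves the 90/10 ratio: 18 vs 2 samples.
    print(fold, np.bincount(y[val_idx]))
```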

When NOT to use standard k-fold

Time series: Random splitting leaks the future — the model trains on later data to predict earlier data, inflating scores. Use walk-forward validation: train on months 1-6, test on 7; train on 1-7, test on 8; etc.
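The walk-forward pattern above maps to scikit-learn's `TimeSeriesSplit`; the 8 "months" and split counts below are illustrative:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 8 time steps of data, in chronological order (illustrative).
X = np.arange(8).reshape(-1, 1)

# Expanding training window, one step tested at a time:
# train [0..4] test [5]; train [0..5] test [6]; train [0..6] test [7].
tscv = TimeSeriesSplit(n_splits=3, test_size=1)
for train_idx, test_idx in tscv.split(X):
    # The training window always ends before the test point begins.
    print(train_idx, test_idx)
```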

Grouped data: Multiple measurements per patient → all measurements from one patient must stay in the same fold, otherwise the model partly memorizes patients it will be "tested" on. Use GroupKFold.
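A sketch of `GroupKFold` keeping each patient's measurements together — the patient IDs and three-measurements-per-patient layout are assumed for illustration:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Three measurements per patient; IDs are illustrative.
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
X = np.zeros((12, 1))
y = np.zeros(12)

gkf = GroupKFold(n_splits=4)
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    # No patient appears in both train and validation.
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
    print(sorted(set(groups[val_idx])))
```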

Practice Questions

Q: You're predicting daily stock prices. Can you use standard 5-fold CV?