Chapter 5: Supervised Learning Basics / Lesson 24

Cross-Validation

What is Cross-Validation?

Cross-validation is a technique for assessing how well your model will generalize to new data. Instead of relying on a single train-test split, cross-validation divides your data into multiple folds, then repeatedly trains on all but one fold and tests on the held-out fold.

This gives you a more reliable estimate of model performance and helps you use your limited data more effectively. It's especially important when you have small datasets.
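
To make this concrete, here is a minimal sketch using scikit-learn's cross_val_score. The synthetic dataset, the LogisticRegression model, and the filename are illustrative assumptions, not part of the lesson's exercise:

cv_sketch.py
# Minimal cross-validation sketch
# (synthetic dataset and model choice are illustrative assumptions)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small synthetic dataset: 100 samples, 5 features
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: one accuracy score per fold
scores = cross_val_score(model, X, y, cv=5)
print(f"Fold scores: {scores}")
print(f"Mean: {scores.mean():.2%}, Std: {scores.std():.2%}")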

Why Use Cross-Validation?

A simple train-test split has several limitations:

why_cv.py
# Problems with Simple Train-Test Split
print("Problems with Single Train-Test Split:")
print("  1. Performance depends on random split")
print("     - Different splits give different results")
print("     - Unreliable performance estimate")
print("\n  2. Wastes data")
print("     - Test set not used for training")
print("     - With small datasets, this is wasteful")
print("\n  3. Single evaluation point")
print("     - One test score doesn't show variance")
print("     - Can't assess model stability")
print("\nCross-Validation Benefits:")
print("  ✓ More reliable performance estimate")
print("  ✓ Uses all data for both training and testing")
print("  ✓ Shows performance variance across folds")
print("  ✓ Better for small datasets")

K-Fold Cross-Validation

The most common type is k-fold cross-validation, where data is split into k folds:

kfold.py
# K-Fold Cross-Validation Process
# Example: 5-fold cross-validation with 20 data points
data_size = 20
k = 5
fold_size = data_size // k  # 4 points per fold

print("5-Fold Cross-Validation Example:")
print(f"  Total data points: {data_size}")
print(f"  Number of folds: {k}")
print(f"  Points per fold: {fold_size}")
print("\nProcess:")
print("  Fold 1: Test on points 0-3,   Train on 4-19")
print("  Fold 2: Test on points 4-7,   Train on 0-3, 8-19")
print("  Fold 3: Test on points 8-11,  Train on 0-7, 12-19")
print("  Fold 4: Test on points 12-15, Train on 0-11, 16-19")
print("  Fold 5: Test on points 16-19, Train on 0-15")
print("\nResult:")
print("  - Each point used for testing exactly once")
print("  - Each point used for training (k-1) times")
print("  - Get k performance scores, average them")

# Simulate scores from 5 folds
fold_scores = [0.85, 0.82, 0.88, 0.84, 0.86]
avg_score = sum(fold_scores) / len(fold_scores)
std_score = (sum((s - avg_score) ** 2 for s in fold_scores) / len(fold_scores)) ** 0.5
print(f"\n  Fold scores: {[f'{s:.2%}' for s in fold_scores]}")
print(f"  Average score: {avg_score:.2%}")
print(f"  Std deviation: {std_score:.2%}")
print("  (Lower std = more stable model)")
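
To check the fold structure described above against a real splitter, here is a short sketch with scikit-learn's KFold. The 20-point array and the filename are assumptions that mirror the example; shuffle is disabled so the folds come out as the contiguous blocks listed:

kfold_demo.py
# Verifying the fold structure with sklearn's KFold
# (20-point toy dataset mirrors the example above; shuffle=False
# keeps the folds as contiguous blocks)
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # 20 data points, one feature each

kf = KFold(n_splits=5, shuffle=False)
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {i}: test={test_idx.tolist()}")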

Stratified K-Fold

For classification, use stratified k-fold to maintain class distribution:

stratified.py
# Stratified K-Fold Cross-Validation
# Example: binary classification with imbalanced classes
# Total: 100 samples, 80 class 0, 20 class 1
print("Problem with Regular K-Fold:")
print("  Dataset: 80% class 0, 20% class 1")
print("  Regular k-fold might put all class 1 in one fold")
print("  The remaining folds would have no class 1 samples!")
print("\nStratified K-Fold Solution:")
print("  Maintains class distribution in each fold")
print("  Each fold: 80% class 0, 20% class 1")
print("  More reliable for imbalanced datasets")
print("\nExample (5-fold, 100 samples):")
print("  Each fold: 16 class 0, 4 class 1")
print("  All folds have the same class distribution")
print("\nUse stratified k-fold for:")
print("  - Classification problems")
print("  - Imbalanced datasets")
print("  - When class distribution matters")
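
A quick sketch, assuming scikit-learn and the 80/20 toy labels from the example above, confirms that each fold keeps the class balance (the filename and label array are illustrative):

stratified_demo.py
# Checking that StratifiedKFold preserves class proportions
# (the 80/20 label array mirrors the imbalanced example above)
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((100, 1))             # features don't matter for this check
y = np.array([0] * 80 + [1] * 20)  # 80% class 0, 20% class 1

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for i, (_, test_idx) in enumerate(skf.split(X, y), start=1):
    counts = np.bincount(y[test_idx])
    print(f"Fold {i}: class 0 = {counts[0]}, class 1 = {counts[1]}")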

Implementing Cross-Validation

Here's how to implement cross-validation in Python:

implement_cv.py
# Implementing Cross-Validation
print("Using sklearn for Cross-Validation:")
print("  from sklearn.model_selection import cross_val_score, KFold")
print("  from sklearn.linear_model import LogisticRegression")
print("\nBasic K-Fold:")
print("  kf = KFold(n_splits=5, shuffle=True, random_state=42)")
print("  scores = cross_val_score(model, X, y, cv=kf)")
print("  print(f'Mean score: {scores.mean():.2%}')")
print("  print(f'Std: {scores.std():.2%}')")
print("\nStratified K-Fold:")
print("  from sklearn.model_selection import StratifiedKFold")
print("  skf = StratifiedKFold(n_splits=5, shuffle=True)")
print("  scores = cross_val_score(model, X, y, cv=skf)")
print("\nManual Implementation:")
print("  scores = []")
print("  for train_idx, test_idx in kf.split(X):")
print("      X_train, X_test = X[train_idx], X[test_idx]")
print("      y_train, y_test = y[train_idx], y[test_idx]")
print("      model.fit(X_train, y_train)")
print("      score = model.score(X_test, y_test)")
print("      scores.append(score)")
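
Because the block above only prints the sklearn calls as text, here is the same logic as a runnable sketch. The synthetic dataset and filename are assumptions; the KFold and cross_val_score calls are the ones shown above:

runnable_cv.py
# Runnable version of the snippets printed above
# (synthetic classification data is an illustrative assumption)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000)

# Basic k-fold via cross_val_score
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)
print(f"Mean score: {scores.mean():.2%}")
print(f"Std: {scores.std():.2%}")

# Manual implementation of the same loop
manual_scores = []
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    model.fit(X_train, y_train)
    manual_scores.append(model.score(X_test, y_test))
print(f"Manual mean score: {sum(manual_scores) / len(manual_scores):.2%}")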

Exercise: Implement Cross-Validation

Complete the following exercise:

  • Task 1: Split data into k folds manually
  • Task 2: Calculate performance for each fold
  • Task 3: Calculate mean and standard deviation of scores
  • Task 4: Compare single split vs cross-validation results

Write your code to implement cross-validation and analyze the results!

💡 Learning Tip

Practice is essential. Try modifying the code examples, experiment with different parameters, and see how changes affect the results. Hands-on experience is the best teacher!

🎉

Lesson Complete!

Great work! Continue to the next lesson.
