What is Cross-Validation?
Cross-validation is a technique for assessing how well your model will generalize to new data. Instead of using a single train-test split, cross-validation divides your data into multiple folds and repeatedly trains on all but one fold while testing on the held-out fold.
This gives you a more reliable estimate of model performance and helps you use your limited data more effectively. It's especially important when you have small datasets.
Why Use Cross-Validation?
A single train-test split has several limitations:
print("Problems with Single Train-Test Split:")
print(" 1. Performance depends on random split")
print(" - Different splits give different results")
print(" - Unreliable performance estimate")
print("\n 2. Wastes data")
print(" - Test set not used for training")
print(" - With small datasets, this is wasteful")
print("\n 3. Single evaluation point")
print(" - One test score doesn't show variance")
print(" - Can't assess model stability")
print("\nCross-Validation Benefits:")
print(" ✓ More reliable performance estimate")
print(" ✓ Uses all data for both training and testing")
print(" ✓ Shows performance variance across folds")
print(" ✓ Better for small datasets")
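The first limitation is easy to see with a tiny simulation. This is a hypothetical toy setup (the labels and the trivial "majority class" model are invented purely for illustration): the same data, split five different ways, gives five different scores.

```python
import random

random.seed(0)
labels = [0] * 12 + [1] * 8  # toy dataset: 20 samples, 60/40 class mix

def majority_model_accuracy(train, test):
    # "Train" by picking the majority class, then measure test accuracy.
    majority = max(set(train), key=train.count)
    return sum(1 for y in test if y == majority) / len(test)

single_split_scores = []
for trial in range(5):
    shuffled = labels[:]
    random.shuffle(shuffled)             # a different random split each time
    train, test = shuffled[:15], shuffled[15:]  # 75/25 split
    single_split_scores.append(majority_model_accuracy(train, test))

print("Single-split scores across 5 random splits:")
print(" ", [f"{s:.0%}" for s in single_split_scores])
```

Even with a model this simple, the score depends on which points happen to land in the test set — exactly the instability that averaging over folds smooths out.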
K-Fold Cross-Validation
The most common type is k-fold cross-validation, where the data is split into k equally sized folds:
data_size = 20
k = 5
fold_size = data_size // k
print("5-Fold Cross-Validation Example:")
print(f" Total data points: {data_size}")
print(f" Number of folds: {k}")
print(f" Points per fold: {fold_size}")
print("\nProcess:")
print(" Fold 1: Test on points 0-3, Train on 4-19")
print(" Fold 2: Test on points 4-7, Train on 0-3, 8-19")
print(" Fold 3: Test on points 8-11, Train on 0-7, 12-19")
print(" Fold 4: Test on points 12-15, Train on 0-11, 16-19")
print(" Fold 5: Test on points 16-19, Train on 0-15")
print("\nResult:")
print(" - Each point used for testing exactly once")
print(" - Each point used for training (k-1) times")
print(" - Get k performance scores, average them")
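The splitting scheme above can be sketched directly in plain Python — indices only, no model yet — to confirm each point is tested exactly once:

```python
# 5-fold split of 20 data points: 5 folds of 4 test points each.
data_size, k = 20, 5
fold_size = data_size // k
indices = list(range(data_size))

folds = []
for i in range(k):
    # Fold i's points are the test set; everything else is training.
    test_idx = indices[i * fold_size:(i + 1) * fold_size]
    train_idx = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
    folds.append((train_idx, test_idx))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: test on points {test_idx[0]}-{test_idx[-1]}, "
          f"train on the remaining {len(train_idx)} points")
```

This reproduces the fold layout listed above: every index appears in exactly one test set and in k-1 = 4 training sets.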
fold_scores = [0.85, 0.82, 0.88, 0.84, 0.86]
avg_score = sum(fold_scores) / len(fold_scores)
std_score = (sum((s - avg_score) ** 2 for s in fold_scores) / len(fold_scores)) ** 0.5
print(f"\n Fold scores: {[f'{s:.2%}' for s in fold_scores]}")
print(f" Average score: {avg_score:.2%}")
print(f" Std deviation: {std_score:.2%}")
print(" (Lower std = more stable model)")
Stratified K-Fold
For classification, use stratified k-fold to maintain class distribution:
print("Problem with Regular K-Fold:")
print(" Dataset: 80% class 0, 20% class 1")
print(" Regular k-fold might put all class 1 samples in one fold")
print(" The remaining folds would then contain no class 1 at all!")
print("\nStratified K-Fold Solution:")
print(" Maintains class distribution in each fold")
print(" Each fold: 80% class 0, 20% class 1")
print(" More reliable for imbalanced datasets")
print("\nExample (5-fold, 100 samples):")
print(" Each fold: 16 class 0, 4 class 1")
print(" All folds have same class distribution")
print("\nUse stratified k-fold for:")
print(" - Classification problems")
print(" - Imbalanced datasets")
print(" - When class distribution matters")
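Stratification can also be sketched by hand. This is a minimal illustration of the 100-sample, 80/20 example above (in practice you would use sklearn's `StratifiedKFold`): split each class's indices into k groups, then combine one group per class to form each fold.

```python
k = 5
class0 = list(range(80))       # indices of the 80 class-0 samples
class1 = list(range(80, 100))  # indices of the 20 class-1 samples

folds = []
for i in range(k):
    # Take every k-th index from each class so every fold
    # keeps the original 80/20 class distribution.
    fold = class0[i::k] + class1[i::k]
    folds.append(fold)

for i, fold in enumerate(folds, start=1):
    n1 = sum(1 for idx in fold if idx >= 80)
    print(f"Fold {i}: {len(fold) - n1} class 0, {n1} class 1")
```

Each fold comes out with 16 class-0 and 4 class-1 samples — the same 80/20 ratio as the full dataset.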
Implementing Cross-Validation
Here's how to implement cross-validation in Python:
print("Using sklearn for Cross-Validation:")
print(" from sklearn.model_selection import cross_val_score, KFold")
print(" from sklearn.linear_model import LogisticRegression")
print("\nBasic K-Fold:")
print(" kf = KFold(n_splits=5, shuffle=True, random_state=42)")
print(" scores = cross_val_score(model, X, y, cv=kf)")
print(" print(f'Mean score: {scores.mean():.2%}')")
print(" print(f'Std: {scores.std():.2%}')")
print("\nStratified K-Fold:")
print(" from sklearn.model_selection import StratifiedKFold")
print(" skf = StratifiedKFold(n_splits=5, shuffle=True)")
print(" scores = cross_val_score(model, X, y, cv=skf)")
print("\nManual Implementation:")
print(" scores = []")
print(" for train_idx, test_idx in kf.split(X):")
print(" X_train, X_test = X[train_idx], X[test_idx]")
print(" y_train, y_test = y[train_idx], y[test_idx]")
print(" model.fit(X_train, y_train)")
print(" score = model.score(X_test, y_test)")
print(" scores.append(score)")
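Putting the manual loop together, here is a runnable end-to-end sketch. It uses plain Python lists and a hypothetical one-feature threshold "model" (invented for this example so it needs no dependencies); real code would use the sklearn estimators shown above.

```python
# Toy one-feature dataset: low values are class 0, high values class 1,
# with one deliberately ambiguous point (0.55, labeled 0).
X = [0.1, 0.3, 0.2, 0.4, 0.9, 0.8, 0.7, 0.6, 0.55, 0.85]
y = [0,   0,   0,   0,   1,   1,   1,   1,   0,    1]

class ThresholdModel:
    def fit(self, X_train, y_train):
        # Set the threshold halfway between the two class means.
        mean0 = sum(x for x, t in zip(X_train, y_train) if t == 0) / y_train.count(0)
        mean1 = sum(x for x, t in zip(X_train, y_train) if t == 1) / y_train.count(1)
        self.threshold = (mean0 + mean1) / 2

    def score(self, X_test, y_test):
        preds = [1 if x > self.threshold else 0 for x in X_test]
        return sum(p == t for p, t in zip(preds, y_test)) / len(y_test)

k = 5
fold_size = len(X) // k
scores = []
for i in range(k):
    test_ids = range(i * fold_size, (i + 1) * fold_size)
    X_test = [X[j] for j in test_ids]
    y_test = [y[j] for j in test_ids]
    X_train = [X[j] for j in range(len(X)) if j not in test_ids]
    y_train = [y[j] for j in range(len(X)) if j not in test_ids]
    model = ThresholdModel()
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print(f"Fold scores: {scores}")
print(f"Mean: {sum(scores) / len(scores):.2%}")
```

Because the ambiguous point lands in the last fold, that fold scores lower than the others — a small concrete example of the per-fold variance that cross-validation exposes and a single split would hide.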
Exercise: Implement Cross-Validation
Complete the exercise on the right side:
- Task 1: Split data into k folds manually
- Task 2: Calculate performance for each fold
- Task 3: Calculate mean and standard deviation of scores
- Task 4: Compare single split vs cross-validation results
Write your code to implement cross-validation and analyze the results!
💡 Learning Tip
Practice is essential. Try modifying the code examples, experiment with different parameters, and see how changes affect the results. Hands-on experience is the best teacher!
🎉
Lesson Complete!
Great work! Continue to the next lesson.