Understanding Overfitting and Underfitting
Overfitting and underfitting are two fundamental problems in machine learning that affect how well your model generalizes to new data. Understanding these concepts is crucial for building effective ML models.
Overfitting occurs when a model learns the training data too well, including noise and irrelevant patterns. Underfitting occurs when a model is too simple to capture the underlying patterns in the data.
What is Underfitting?
Underfitting happens when your model is too simple. It fails to capture the underlying patterns in the data:
x_train = [1, 2, 3, 4, 5, 6, 7, 8]
y_train = [2, 8, 18, 32, 50, 72, 98, 128]
print("Training Data (Non-linear: y = 2*x²):")
for x, y in zip(x_train, y_train):
    print(f" x={x}, y={y}")
print("\nUnderfitting Example:")
print(" Model: Linear (y = a*x + b)")
print(" Problem: Too simple, can't capture quadratic pattern")
print(" Result: High error on both training and test data")
linear_predictions = [15 * x - 10 for x in x_train]
errors = [abs(y_true - y_pred) for y_true, y_pred in zip(y_train, linear_predictions)]
avg_error = sum(errors) / len(errors)
print(f"\n Average prediction error: {avg_error:.1f}")
print(" This is underfitting - model is too simple!")
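The fix for underfitting is a model whose form matches the data. On the same points, swapping the line for a quadratic model drives the error to zero — a minimal sketch reusing the data above:

```python
# Same training data as above: y = 2 * x**2
x_train = [1, 2, 3, 4, 5, 6, 7, 8]
y_train = [2, 8, 18, 32, 50, 72, 98, 128]

# A model with the right shape (quadratic) instead of a straight line
quadratic_predictions = [2 * x ** 2 for x in x_train]
errors = [abs(y_true - y_pred)
          for y_true, y_pred in zip(y_train, quadratic_predictions)]
avg_error = sum(errors) / len(errors)

# Error drops to 0.0 because the model family matches how the data was made
print(f" Average prediction error (quadratic model): {avg_error:.1f}")
```

In practice a perfect zero only happens on noise-free toy data like this; the point is that matching model complexity to data complexity removes the underfitting error.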
What is Overfitting?
Overfitting happens when your model is too complex. It memorizes the training data instead of learning general patterns:
x_train = [1, 2, 3, 4, 5]
y_train = [10, 20, 31, 40, 50]
x_test = [1.5, 2.5, 3.5, 4.5]
y_test = [15, 25, 35, 45]
print("Training Data:")
for x, y in zip(x_train, y_train):
    print(f" x={x}, y={y}")
print("\nOverfitting Example:")
print(" Model: Very complex (memorizes every training point)")
print(" Training accuracy: Very high (memorized the data)")
print(" Test accuracy: Poor (can't generalize)")
train_error = 0  # the complex model passes through every training point exactly
# Hypothetical test-set predictions from the memorizing model: between the
# training points it wiggles away from the simple trend, so it misses y_test
overfit_predictions = [11, 29, 32, 48]
test_errors = [abs(y_pred - y_true) for y_pred, y_true in zip(overfit_predictions, y_test)]
test_error = sum(test_errors) / len(test_errors)
print(f"\n Training error: {train_error:.1f} (perfect!)")
print(f" Test error: {test_error:.1f} (poor generalization)")
print(" This is overfitting - model memorized training data!")
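For contrast, a simple model that treats the odd training point (x=3, y=31) as noise instead of memorizing it generalizes well. A sketch on the same data, using a hypothetical `avg_abs_error` helper:

```python
# Same data as the overfitting example above
x_train = [1, 2, 3, 4, 5]
y_train = [10, 20, 31, 40, 50]
x_test = [1.5, 2.5, 3.5, 4.5]
y_test = [15, 25, 35, 45]

def avg_abs_error(xs, ys, model):
    """Mean absolute error of model(x) against the true ys."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)

def linear(x):
    # Treats the 31 at x=3 as noise around the trend y = 10*x
    return 10 * x

print(f" Training error: {avg_abs_error(x_train, y_train, linear):.1f}")
print(f" Test error:     {avg_abs_error(x_test, y_test, linear):.1f}")
```

The simple model has a small but nonzero training error, yet matches the test points exactly — the opposite pattern to the memorizing model above.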
The Bias-Variance Tradeoff
Overfitting and underfitting are related to the bias-variance tradeoff:
print("Understanding Bias and Variance:")
print("\nUnderfitting (High Bias, Low Variance):")
print(" - Model is too simple")
print(" - High bias: Can't capture true pattern")
print(" - Low variance: Consistent predictions")
print(" - High error on both training and test")
print("\nOverfitting (Low Bias, High Variance):")
print(" - Model is too complex")
print(" - Low bias: Can fit training data well")
print(" - High variance: Predictions vary a lot")
print(" - Low error on training, high error on test")
print("\nGood Fit (Balanced):")
print(" - Model complexity matches data complexity")
print(" - Moderate bias and variance")
print(" - Good performance on both training and test")
print("\nGoal: Find the sweet spot between bias and variance!")
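The variance half of the tradeoff can be made concrete by resampling noisy datasets and watching how much each model's prediction moves. A sketch under assumed Gaussian noise, with two deliberately extreme model families (a constant model and one that memorizes a noisy point):

```python
import random

random.seed(0)

def sample_dataset():
    """Noisy samples from the true function y = 2 * x**2 (noise is an assumption)."""
    xs = [1, 2, 3, 4, 5]
    return xs, [2 * x ** 2 + random.gauss(0, 5) for x in xs]

# Predict y at x = 3 with both models, across many resampled datasets
simple_preds, complex_preds = [], []
for _ in range(200):
    xs, ys = sample_dataset()
    simple_preds.append(sum(ys) / len(ys))  # constant model: high bias, averages out noise
    complex_preds.append(ys[xs.index(3)])   # memorizes the noisy point: high variance

def variance(vals):
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

print(f" Constant model   prediction variance: {variance(simple_preds):.1f}")
print(f" Memorizing model prediction variance: {variance(complex_preds):.1f}")
```

The memorizing model's prediction jumps around with every resampled dataset, while the averaging model stays stable (but is badly biased away from the true value 18).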
Detecting Overfitting and Underfitting
You can detect these problems by comparing training and test performance:
print("Performance Comparison:")
print("=" * 50)
print("\nUnderfitting Model:")
train_acc_under = 0.60
test_acc_under = 0.58
print(f" Training accuracy: {train_acc_under:.0%}")
print(f" Test accuracy: {test_acc_under:.0%}")
print(f" Gap: {abs(train_acc_under - test_acc_under):.0%} (small gap)")
print(" Sign: Both accuracies are low - model too simple!")
print("\nOverfitting Model:")
train_acc_over = 0.98
test_acc_over = 0.70
print(f" Training accuracy: {train_acc_over:.0%}")
print(f" Test accuracy: {test_acc_over:.0%}")
print(f" Gap: {abs(train_acc_over - test_acc_over):.0%} (large gap!)")
print(" Sign: Training much better than test - model memorized!")
print("\nWell-Fitted Model:")
train_acc_good = 0.85
test_acc_good = 0.83
print(f" Training accuracy: {train_acc_good:.0%}")
print(f" Test accuracy: {test_acc_good:.0%}")
print(f" Gap: {abs(train_acc_good - test_acc_good):.0%} (small gap)")
print(" Sign: Both accuracies are good and similar - good fit!")
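The three checks above can be bundled into a small diagnostic helper. The `low` and `gap` thresholds are illustrative rules of thumb, not standard values — reasonable cutoffs depend on the task:

```python
def diagnose(train_acc, test_acc, low=0.70, gap=0.10):
    """Rough fitting diagnosis from train/test accuracy (thresholds are illustrative)."""
    if train_acc < low and test_acc < low:
        return "underfitting"   # both accuracies low: model too simple
    if train_acc - test_acc > gap:
        return "overfitting"    # large train-test gap: model memorized
    return "good fit"           # accuracies decent and close together

# The three models from above
print(diagnose(0.60, 0.58))  # underfitting
print(diagnose(0.98, 0.70))  # overfitting
print(diagnose(0.85, 0.83))  # good fit
```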
Solutions to Overfitting and Underfitting
Here are strategies to address these problems:
print("Solutions for Underfitting:")
print(" 1. Increase model complexity")
print(" - Add more features")
print(" - Use a more complex algorithm")
print(" - Reduce regularization")
print(" 2. Train longer")
print(" - More training iterations")
print(" - Better optimization")
print("\nSolutions for Overfitting:")
print(" 1. Reduce model complexity")
print(" - Use simpler model")
print(" - Reduce number of features")
print(" - Increase regularization")
print(" 2. Get more training data")
print(" - More data helps model generalize")
print(" 3. Use cross-validation")
print(" - Better estimate of true performance")
print(" 4. Early stopping")
print(" - Stop training before overfitting")
print("\nKey Principle:")
print(" Balance model complexity with data complexity!")
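Early stopping (solution 4 above) can be sketched as a patience loop over a validation-loss curve. The loss values below are made up to show the typical improve-then-degrade shape:

```python
# Illustrative validation losses: improve, then rise as overfitting begins
val_losses = [0.90, 0.70, 0.55, 0.48, 0.45, 0.46, 0.50, 0.58]

patience = 2            # stop after this many epochs without improvement
best_loss = float("inf")
best_epoch = 0
bad_epochs = 0

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch  # new best: remember it
        bad_epochs = 0
    else:
        bad_epochs += 1                      # no improvement this epoch
        if bad_epochs >= patience:
            break                            # stop before overfitting worsens

print(f" Stopped at epoch {epoch}; best validation loss {best_loss:.2f} at epoch {best_epoch}")
```

In a real training loop you would also restore the model weights saved at `best_epoch` rather than keeping the final, partly overfitted ones.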
Exercise: Identify Overfitting and Underfitting
Complete the exercise on the right side:
- Task 1: Calculate training and test accuracy for a model
- Task 2: Determine if the model is underfitting, overfitting, or well-fitted
- Task 3: Suggest solutions based on the problem type
- Task 4: Calculate the performance gap between training and test
Write your code to analyze model performance and identify fitting problems!
💡 Learning Tip
Practice is essential. Try modifying the code examples, experiment with different parameters, and see how changes affect the results. Hands-on experience is the best teacher!