Chapter 5: Supervised Learning Basics / Lesson 21

Train-Test Split

Train-test split is one of the most important concepts in machine learning. You must separate your data into training and testing sets to properly evaluate your model. The training set teaches your model, while the test set evaluates how well it learned.

Why split? If you test on the same data you trained on, you'll get overly optimistic results. The model might just be memorizing the training data (overfitting) rather than learning general patterns.

The Problem Without Splitting

If you train and test on the same data, your model might memorize answers rather than learn patterns:

no_split_problem.py
# BAD: Training and testing on same data
from sklearn.linear_model import LinearRegression
import numpy as np

# All data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

# Train on ALL data
model = LinearRegression()
model.fit(X, y)

# Test on SAME data (BAD!)
predictions = model.predict(X)
print("Predictions on training data:", predictions)
print("This gives misleadingly perfect results!")
print("We need to test on NEW, unseen data.")

Proper Train-Test Split

Use scikit-learn's train_test_split to properly separate your data:

train_test_split.py
# Proper train-test split
from sklearn.model_selection import train_test_split
import numpy as np

# All your data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])

# Split into train and test (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set size:", len(X_train))
print("Test set size:", len(X_test))
print("\nTraining data:")
print("X_train:", X_train.flatten())
print("y_train:", y_train)
print("\nTest data:")
print("X_test:", X_test.flatten())
print("y_test:", y_test)

Using the Split

Now train on the training set and evaluate on the test set:

using_split.py
# Train on training set, test on test set
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Prepare data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([2, 4, 6, 8, 10, 12, 14, 16])

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train ONLY on training data
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on test data (unseen during training)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Model trained on training set")
print("Evaluated on test set (unseen data)")
print(f"Test MSE: {mse:.2f}")
print("\nThis gives an honest performance estimate!")

Choosing the Split Ratio

Common split ratios depend on your dataset size:

split_ratios.py
# Different split ratios
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[i] for i in range(100)])
y = np.array([i * 2 for i in range(100)])

# 80-20 split (most common)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"80-20 split: {len(X_train)} train, {len(X_test)} test")

# 70-30 split (smaller datasets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(f"70-30 split: {len(X_train)} train, {len(X_test)} test")

print("\nCommon ratios:")
print("- Large datasets: 80-20 or 90-10")
print("- Small datasets: 70-30 or 60-40")

Stratified Split for Classification

For classification problems, use a stratified split to preserve the class distribution in both sets:

stratified_split.py
# Stratified split maintains class proportions
from sklearn.model_selection import train_test_split
import numpy as np

# Classification data with two classes (5 samples each)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # 50% each class

print("Original class distribution:")
print(f"Class 0: {np.sum(y == 0)}, Class 1: {np.sum(y == 1)}")

# Stratified split maintains proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print("\nTraining set class distribution:")
print(f"Class 0: {np.sum(y_train == 0)}, Class 1: {np.sum(y_train == 1)}")
print("\nTest set class distribution:")
print(f"Class 0: {np.sum(y_test == 0)}, Class 1: {np.sum(y_test == 1)}")

Important Rules

Follow these rules when splitting data:

  • Never look at test data: Don't use test set for any decisions about your model
  • Split before preprocessing: Fit preprocessing steps (scalers, encoders) on the training set only, then apply the same fitted transformation to both sets
  • Use random_state: Makes results reproducible
  • Shuffle by default: Unless you have time-series data
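
The "split before preprocessing" rule above can be sketched with scikit-learn's StandardScaler (a minimal sketch; the toy data is illustrative). The scaler learns its mean and standard deviation from the training set only, and those same statistics are reused on the test set:

preprocessing_after_split.py
```python
# Fit preprocessing on the training set only, then transform both sets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[float(i)] for i in range(20)])
y = np.arange(20)

# Split FIRST, before any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

print("Mean learned from training set:", scaler.mean_)
print("Scaled training mean (close to 0):", X_train_scaled.mean().round(6))
```

Fitting the scaler on the full dataset before splitting would let test-set statistics leak into training, which is exactly the data leakage that splitting is meant to prevent.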

⚠️ Critical Rule

The test set should simulate real-world data your model will encounter. Once you've looked at test results and modified your model, you've "used up" the test set. For further tuning, use a validation set or cross-validation!
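
A validation set can be carved out with two calls to train_test_split (a sketch; the 60/20/20 proportions shown here are just one common choice):

validation_split.py
```python
# Three-way split: train / validation / test
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[i] for i in range(100)])
y = np.arange(100)

# First carve off 20% as the final, untouched test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remaining 80% into train and validation
# (0.25 of the remaining 80% = 20% of the original data)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(f"Train: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}")
# Train: 60, Validation: 20, Test: 20
```

Tune hyperparameters against the validation set, and touch the test set only once, at the very end.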

💡 Best Practice

Always split your data before doing anything else. This prevents data leakage and ensures honest evaluation. A common mistake is exploring the test set—resist the temptation!

🎉 Lesson Complete!

Great work! Continue to the next lesson.
