Train-Test Split
Train-test split is one of the most important techniques in machine learning: to evaluate a model honestly, you must separate your data into a training set and a test set. The training set teaches the model; the test set measures how well what it learned generalizes to data it has never seen.
Why split? If you test on the same data you trained on, you'll get overly optimistic results. The model might just be memorizing the training data (overfitting) rather than learning general patterns.
The Problem Without Splitting
If you train and test on the same data, your model might memorize answers rather than learn patterns:
from sklearn.linear_model import LinearRegression
import numpy as np

# Toy data with a perfectly linear relationship: y = 2x
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

model = LinearRegression()
model.fit(X, y)

# Predicting on the very same data the model was fit on
predictions = model.predict(X)
print("Predictions on training data:", predictions)
print("This gives misleadingly perfect results!")
print("We need to test on NEW, unseen data.")
Proper Train-Test Split
Use scikit-learn's train_test_split to properly separate your data:
from sklearn.model_selection import train_test_split
import numpy as np
# Toy data: y = 2x
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])

# Hold out 20% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Training set size:", len(X_train))
print("Test set size:", len(X_test))
print("\nTraining data:")
print("X_train:", X_train.flatten())
print("y_train:", y_train)
print("\nTest data:")
print("X_test:", X_test.flatten())
print("y_test:", y_test)
Using the Split
Now train on the training set and evaluate on the test set:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([2, 4, 6, 8, 10, 12, 14, 16])

# Hold out 25% of the samples (2 of 8) as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on the training set only
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Model trained on training set")
print("Evaluated on test set (unseen data)")
print(f"Test MSE: {mse:.2f}")
print("\nThis gives honest performance estimate!")
Choosing the Split Ratio
Common split ratios depend on your dataset size:
from sklearn.model_selection import train_test_split
import numpy as np
X = np.array([[i] for i in range(100)])
y = np.array([i * 2 for i in range(100)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"80-20 split: {len(X_train)} train, {len(X_test)} test")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(f"70-30 split: {len(X_train)} train, {len(X_test)} test")
print("\nCommon ratios:")
print("- Large datasets: 80-20 or 90-10")
print("- Small datasets: 70-30 or 60-40")
Stratified Split for Classification
For classification problems, use a stratified split to maintain the class distribution:
from sklearn.model_selection import train_test_split
import numpy as np
# Balanced toy data: 5 samples of class 0 and 5 of class 1
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

print("Original class distribution:")
print(f"Class 0: {np.sum(y == 0)}, Class 1: {np.sum(y == 1)}")

# stratify=y keeps the class proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print("\nTraining set class distribution:")
print(f"Class 0: {np.sum(y_train == 0)}, Class 1: {np.sum(y_train == 1)}")
print("\nTest set class distribution:")
print(f"Class 0: {np.sum(y_test == 0)}, Class 1: {np.sum(y_test == 1)}")
Important Rules
Follow these rules when splitting data:
- Never look at test data: don't use the test set for any decisions about your model
- Split before preprocessing: fit scalers, encoders, and other transforms on the training set only, then apply the fitted transforms to both sets (see the sketch after this list)
- Use random_state: a fixed seed makes your splits, and therefore your results, reproducible
- Shuffle by default: train_test_split shuffles before splitting, which is what you want unless you have time-series data; in that case pass shuffle=False or use a time-based split
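To make the preprocessing rule concrete, here is a minimal sketch using StandardScaler as an example transform: the scaler is fit on the training set only, then the already-fitted scaler is applied to both sets, so no statistics from the test data leak into training.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[i] for i in range(20)], dtype=float)
y = np.array([i * 2 for i in range(20)], dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ...then apply the already-fitted scaler to the test data
X_test_scaled = scaler.transform(X_test)

print("Scaler mean learned from training data only:", scaler.mean_)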
⚠️ Critical Rule
The test set should simulate real-world data your model will encounter. Once you've looked at test results and modified your model, you've "used up" the test set. For further tuning, use a validation set or cross-validation!
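Two common ways to do that are sketched below (the split sizes are arbitrary choices for illustration): split twice to carve out a separate validation set, or run cross-validation on the non-test portion only.

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[i] for i in range(100)], dtype=float)
y = np.array([i * 2 for i in range(100)], dtype=float)

# Split off the final test set first, then carve a validation set out of the remainder
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 25% of the remaining 80% = 20% overall
)
print(f"Train: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}")

# Alternative: 5-fold cross-validation on the non-test portion only
scores = cross_val_score(
    LinearRegression(), X_temp, y_temp, cv=5, scoring="neg_mean_squared_error"
)
print(f"Cross-validated MSE: {-scores.mean():.2f}")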