Train-Test Split
Train-test split is one of the most important techniques in machine learning: to evaluate a model honestly, you must separate your data into a training set and a test set. The training set teaches the model; the test set measures how well what it learned generalizes to data it has never seen.
Why split? If you test on the same data you trained on, you'll get overly optimistic results. The model might just be memorizing the training data (overfitting) rather than learning general patterns.
The Problem Without Splitting
If you train and test on the same data, your model might memorize answers rather than learn patterns:
from sklearn.linear_model import LinearRegression
import numpy as np

# Toy data with a perfectly linear relationship: y = 2x
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

model = LinearRegression()
model.fit(X, y)

# Predicting on the very same data the model was fit on
predictions = model.predict(X)
print("Predictions on training data:", predictions)
print("This gives misleadingly perfect results!")
print("We need to test on NEW, unseen data.")
Proper Train-Test Split
Use scikit-learn's train_test_split to properly separate your data:
from sklearn.model_selection import train_test_split
import numpy as np
# Toy data: y = 2x
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])

# Hold out 20% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Training set size:", len(X_train))
print("Test set size:", len(X_test))
print("\nTraining data:")
print("X_train:", X_train.flatten())
print("y_train:", y_train)
print("\nTest data:")
print("X_test:", X_test.flatten())
print("y_test:", y_test)
Using the Split
Now train on the training set and evaluate on the test set:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([2, 4, 6, 8, 10, 12, 14, 16])

# Hold out 25% of the samples (2 of 8) as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on the training set only
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Model trained on training set")
print("Evaluated on test set (unseen data)")
print(f"Test MSE: {mse:.2f}")
print("\nThis gives honest performance estimate!")
Choosing the Split Ratio
Common split ratios depend on your dataset size:
from sklearn.model_selection import train_test_split
import numpy as np
X = np.array([[i] for i in range(100)])
y = np.array([i * 2 for i in range(100)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"80-20 split: {len(X_train)} train, {len(X_test)} test")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(f"70-30 split: {len(X_train)} train, {len(X_test)} test")
print("\nCommon ratios:")
print("- Large datasets: 80-20 or 90-10")
print("- Small datasets: 70-30 or 60-40")
Stratified Split for Classification
For classification problems, use a stratified split to maintain the class distribution:
from sklearn.model_selection import train_test_split
import numpy as np
# Balanced toy data: 5 samples of class 0 and 5 of class 1
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

print("Original class distribution:")
print(f"Class 0: {np.sum(y == 0)}, Class 1: {np.sum(y == 1)}")

# stratify=y keeps the class proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print("\nTraining set class distribution:")
print(f"Class 0: {np.sum(y_train == 0)}, Class 1: {np.sum(y_train == 1)}")
print("\nTest set class distribution:")
print(f"Class 0: {np.sum(y_test == 0)}, Class 1: {np.sum(y_test == 1)}")
Important Rules
Follow these rules when splitting data:
- Never look at test data: don't use the test set for any decisions about your model
- Split before preprocessing: fit scalers, encoders, and other transforms on the training set only, then apply the fitted transforms to both sets (see the sketch after this list)
- Use random_state: a fixed seed makes your splits, and therefore your results, reproducible
- Shuffle by default: train_test_split shuffles before splitting, which is what you want unless you have time-series data; in that case pass shuffle=False or use a time-based split
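To make the preprocessing rule concrete, here is a minimal sketch using StandardScaler as an example transform: the scaler is fit on the training set only, then the already-fitted scaler is applied to both sets, so no statistics from the test data leak into training.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[i] for i in range(20)], dtype=float)
y = np.array([i * 2 for i in range(20)], dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ...then apply the already-fitted scaler to the test data
X_test_scaled = scaler.transform(X_test)

print("Scaler mean learned from training data only:", scaler.mean_)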
⚠️ Critical Rule
The test set should simulate real-world data your model will encounter. Once you've looked at test results and modified your model, you've "used up" the test set. For further tuning, use a validation set or cross-validation!
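Two common ways to do that are sketched below (the split sizes are arbitrary choices for illustration): split twice to carve out a separate validation set, or run cross-validation on the non-test portion only.

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[i] for i in range(100)], dtype=float)
y = np.array([i * 2 for i in range(100)], dtype=float)

# Split off the final test set first, then carve a validation set out of the remainder
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 25% of the remaining 80% = 20% overall
)
print(f"Train: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}")

# Alternative: 5-fold cross-validation on the non-test portion only
scores = cross_val_score(
    LinearRegression(), X_temp, y_temp, cv=5, scoring="neg_mean_squared_error"
)
print(f"Cross-validated MSE: {-scores.mean():.2f}")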