Chapter 7: Classification Models / Lesson 33

Random Forest

Understanding Random Forest

Random Forest is an ensemble learning method that combines multiple decision trees to create a more robust and accurate model. The "random" comes from two sources of randomness: random sampling of data points (bootstrap) and random selection of features at each split.

Think of it like asking multiple experts (trees) for their opinion and then taking a vote. The final prediction is the majority vote (for classification) or average (for regression) of all the trees.

How Random Forest Works

Random Forest builds many decision trees, each trained on a different subset of the data:

  • Bootstrap Sampling: Each tree is trained on a random sample (with replacement) of the training data
  • Feature Randomness: At each split, only a random subset of features is considered
  • Voting/Averaging: All trees vote, and the majority wins (classification) or average is taken (regression)

This randomness reduces overfitting and makes the model more generalizable than a single decision tree.
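To make the two sources of randomness concrete, here is a toy sketch that builds a tiny "forest" by hand with NumPy and scikit-learn decision trees: it bootstraps rows, samples a feature subset for each tree, and takes a majority vote. This is a simplification for illustration only (a real Random Forest re-samples features at every split inside each tree, not once per tree), and the data and variable names below are invented for this example.

manual_forest_sketch.py
# Toy illustration of bootstrap sampling, feature sampling, and voting
# (simplified: a real Random Forest re-samples features at every split)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))             # 100 samples, 4 synthetic features
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # label depends on features 0 and 2

n_trees, n_features = 5, 2                # tiny forest, 2 features per tree
trees, feature_sets = [], []

for _ in range(n_trees):
    rows = rng.integers(0, len(X), size=len(X))                    # bootstrap: rows with replacement
    cols = rng.choice(X.shape[1], size=n_features, replace=False)  # random feature subset
    tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
    trees.append(tree)
    feature_sets.append(cols)

# Majority vote across the toy forest
votes = np.array([t.predict(X[:, cols]) for t, cols in zip(trees, feature_sets)])
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print(f"Toy forest training accuracy: {(majority == y).mean():.2f}")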

Basic Random Forest Example

Let's see how to use Random Forest for classification:

random_forest_basic.py
# Random Forest classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Training data: [feature1, feature2] -> class
X = np.array([
    [1, 1], [2, 1], [3, 2], [4, 2],
    [5, 1], [6, 2], [7, 1], [8, 2]
])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Create Random Forest with 10 trees
rf = RandomForestClassifier(n_estimators=10, random_state=42)
rf.fit(X, y)

print("Random Forest trained with 10 trees")
print(f"Number of trees: {rf.n_estimators}")

# Make predictions
new_samples = np.array([[2.5, 1.5], [5.5, 1.5]])
predictions = rf.predict(new_samples)

print("\nPredictions:")
for sample, pred in zip(new_samples, predictions):
    print(f"Features {sample}: Class {pred}")
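Because each tree casts a vote, the forest can also report class probabilities as the fraction of trees voting for each class, via `predict_proba`. Here is a minimal sketch reusing the same toy data as above (the filename is just an example label):

predict_proba_example.py
# Class probabilities = fraction of trees voting for each class
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X = np.array([[1, 1], [2, 1], [3, 2], [4, 2], [5, 1], [6, 2], [7, 1], [8, 2]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

rf = RandomForestClassifier(n_estimators=10, random_state=42)
rf.fit(X, y)

new_samples = np.array([[2.5, 1.5], [5.5, 1.5]])
probabilities = rf.predict_proba(new_samples)

for sample, probs in zip(new_samples, probabilities):
    print(f"Features {sample}: P(class 0)={probs[0]:.2f}, P(class 1)={probs[1]:.2f}")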

Why Random Forest is Better

Random Forest addresses the main weaknesses of decision trees:

comparing_trees.py
# Comparing single tree vs Random Forest
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Single decision tree
single_tree = DecisionTreeClassifier()
single_tree.fit(X, y)

# Random Forest (multiple trees)
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y)

print("Single Tree:")
print(f"  Depth: {single_tree.get_depth()}")
print(f"  Leaves: {single_tree.get_n_leaves()}")

print("\nRandom Forest:")
print(f"  Number of trees: {rf.n_estimators}")
print(f"  Average depth: {np.mean([tree.get_depth() for tree in rf.estimators_]):.1f}")

print("\nRandom Forest is more stable and less prone to overfitting!")
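The comparison above only looks at tree structure on a tiny dataset. To compare predictive performance, a common approach is cross-validation on a larger dataset. The sketch below uses a synthetic dataset from `make_classification` (not part of this lesson's data) purely for illustration; on data like this the forest usually scores higher than a single tree, though the exact numbers depend on the random seed.

cross_validation_comparison.py
# Comparing generalization with cross-validation on synthetic data
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset: 500 samples, 20 features, only some of them informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)

single_tree = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

tree_scores = cross_val_score(single_tree, X, y, cv=5)
rf_scores = cross_val_score(rf, X, y, cv=5)

print(f"Single tree CV accuracy:   {tree_scores.mean():.3f}")
print(f"Random Forest CV accuracy: {rf_scores.mean():.3f}")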

Key Parameters

Random Forest has several important parameters to tune:

parameters.py
# Understanding Random Forest parameters
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# n_estimators: Number of trees (more = better, but slower)
rf1 = RandomForestClassifier(n_estimators=10)
rf1.fit(X, y)

# max_depth: Maximum depth of each tree
rf2 = RandomForestClassifier(n_estimators=10, max_depth=3)
rf2.fit(X, y)

# min_samples_split: Minimum samples to split a node
rf3 = RandomForestClassifier(n_estimators=10, min_samples_split=2)
rf3.fit(X, y)

print("Different parameter configurations:")
print("  n_estimators: Number of trees in the forest")
print("  max_depth: Limits tree depth to prevent overfitting")
print("  min_samples_split: Controls when to stop splitting")
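Rather than picking these values by hand, they are often tuned with a cross-validated grid search. The sketch below uses `GridSearchCV` on synthetic data; the parameter grid is just an illustrative example, not a recommended setting.

grid_search_example.py
# Tuning Random Forest parameters with GridSearchCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 3, 5],
    "min_samples_split": [2, 5],
}

# Try every combination in the grid with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3)
search.fit(X, y)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.3f}")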

Feature Importance

Random Forest can show which features are most important across all trees:

feature_importance.py
# Feature importance in Random Forest
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Dataset with multiple features
X = np.array([
    [1, 10, 100],
    [2, 20, 200],
    [3, 30, 300],
    [4, 40, 400],
    [5, 50, 500]
])
y = np.array([0, 0, 1, 1, 1])

# Train Random Forest
rf = RandomForestClassifier(n_estimators=50)
rf.fit(X, y)

# Get feature importance
importance = rf.feature_importances_
feature_names = ['Feature 1', 'Feature 2', 'Feature 3']

print("Feature Importance:")
for name, imp in zip(feature_names, importance):
    print(f"  {name}: {imp:.3f}")

print("\nHigher values indicate more important features.")
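Note that `feature_importances_` is computed from how much each feature reduces impurity inside the trees, which can favor features with many distinct values. A common complement is permutation importance, which measures how much shuffling a feature degrades accuracy on held-out data. The sketch below uses scikit-learn's `permutation_importance` on synthetic data invented here for illustration.

permutation_importance_example.py
# Permutation importance as a complement to impurity-based importance
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 3 of the 6 features carry signal, the rest are noise
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature on the test set and measure the accuracy drop
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42)

for i, (mean, std) in enumerate(zip(result.importances_mean,
                                    result.importances_std)):
    print(f"Feature {i}: {mean:.3f} +/- {std:.3f}")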

Advantages of Random Forest

Random Forest has many advantages over single decision trees:

  • Reduces Overfitting: Multiple trees average out errors
  • Handles Missing Values: Some implementations can work with incomplete data directly (recent scikit-learn versions support this; older versions require imputation first)
  • Feature Importance: Shows which features matter most
  • Works with Non-linear Data: Doesn't assume linear relationships
  • No Feature Scaling Needed: Works with raw data
  • Handles Large Datasets: Scales to many features, and the independent trees can be trained in parallel

When to Use Random Forest

Random Forest is a great choice when:

  • You need good performance without much tuning
  • You want to understand feature importance
  • You have non-linear relationships in your data
  • You need a robust model that handles outliers well
  • You want better performance than a single decision tree

It's often used as a baseline model because it performs well out of the box with minimal tuning.

💡 Pro Tip

Random Forest is one of the most popular ML algorithms because it's easy to use, performs well, and requires little preprocessing. Start with 100-200 trees and adjust based on your needs!

🎉

Lesson Complete!

Great work! Continue to the next lesson.
