Chapter 7: Classification Models / Lesson 33

Random Forest

Understanding Random Forest

Random Forest is an ensemble learning method that combines multiple decision trees to create a more robust and accurate model. The "random" comes from two sources of randomness: random sampling of data points (bootstrap) and random selection of features at each split.

Think of it like asking multiple experts (trees) for their opinion and then taking a vote. The final prediction is the majority vote (for classification) or average (for regression) of all the trees.

How Random Forest Works

Random Forest builds many decision trees, each trained on a different subset of the data:

  • Bootstrap Sampling: Each tree is trained on a random sample (with replacement) of the training data
  • Feature Randomness: At each split, only a random subset of features is considered
  • Voting/Averaging: All trees vote, and the majority wins (classification) or average is taken (regression)

This randomness reduces overfitting and makes the model more generalizable than a single decision tree.
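To make the two sources of randomness concrete, here is a toy sketch that builds a tiny "forest" by hand with NumPy and scikit-learn decision trees: it bootstraps rows, samples a feature subset for each tree, and takes a majority vote. This is a simplification for illustration only (a real Random Forest re-samples features at every split inside each tree, not once per tree), and the data and variable names below are invented for this example.

manual_forest_sketch.py
# Toy illustration of bootstrap sampling, feature sampling, and voting
# (simplified: a real Random Forest re-samples features at every split)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))             # 100 samples, 4 synthetic features
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # label depends on features 0 and 2

n_trees, n_features = 5, 2                # tiny forest, 2 features per tree
trees, feature_sets = [], []

for _ in range(n_trees):
    rows = rng.integers(0, len(X), size=len(X))                    # bootstrap: rows with replacement
    cols = rng.choice(X.shape[1], size=n_features, replace=False)  # random feature subset
    tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
    trees.append(tree)
    feature_sets.append(cols)

# Majority vote across the toy forest
votes = np.array([t.predict(X[:, cols]) for t, cols in zip(trees, feature_sets)])
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print(f"Toy forest training accuracy: {(majority == y).mean():.2f}")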

Basic Random Forest Example

Let's see how to use Random Forest for classification:

random_forest_basic.py
# Random Forest classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Training data: [feature1, feature2] -> class
X = np.array([
    [1, 1], [2, 1], [3, 2], [4, 2],
    [5, 1], [6, 2], [7, 1], [8, 2]
])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Create Random Forest with 10 trees
rf = RandomForestClassifier(n_estimators=10, random_state=42)
rf.fit(X, y)

print("Random Forest trained with 10 trees")
print(f"Number of trees: {rf.n_estimators}")

# Make predictions
new_samples = np.array([[2.5, 1.5], [5.5, 1.5]])
predictions = rf.predict(new_samples)

print("\nPredictions:")
for sample, pred in zip(new_samples, predictions):
    print(f"Features {sample}: Class {pred}")
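Because each tree casts a vote, the forest can also report class probabilities as the fraction of trees voting for each class, via `predict_proba`. Here is a minimal sketch reusing the same toy data as above (the filename is just an example label):

predict_proba_example.py
# Class probabilities = fraction of trees voting for each class
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X = np.array([[1, 1], [2, 1], [3, 2], [4, 2], [5, 1], [6, 2], [7, 1], [8, 2]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

rf = RandomForestClassifier(n_estimators=10, random_state=42)
rf.fit(X, y)

new_samples = np.array([[2.5, 1.5], [5.5, 1.5]])
probabilities = rf.predict_proba(new_samples)

for sample, probs in zip(new_samples, probabilities):
    print(f"Features {sample}: P(class 0)={probs[0]:.2f}, P(class 1)={probs[1]:.2f}")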

Why Random Forest is Better

Random Forest addresses the main weaknesses of decision trees:

comparing_trees.py
# Comparing single tree vs Random Forest
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Single decision tree
single_tree = DecisionTreeClassifier()
single_tree.fit(X, y)

# Random Forest (multiple trees)
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y)

print("Single Tree:")
print(f"  Depth: {single_tree.get_depth()}")
print(f"  Leaves: {single_tree.get_n_leaves()}")

print("\nRandom Forest:")
print(f"  Number of trees: {rf.n_estimators}")
print(f"  Average depth: {np.mean([tree.get_depth() for tree in rf.estimators_]):.1f}")

print("\nRandom Forest is more stable and less prone to overfitting!")
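The comparison above only looks at tree structure on a tiny dataset. To compare predictive performance, a common approach is cross-validation on a larger dataset. The sketch below uses a synthetic dataset from `make_classification` (not part of this lesson's data) purely for illustration; on data like this the forest usually scores higher than a single tree, though the exact numbers depend on the random seed.

cross_validation_comparison.py
# Comparing generalization with cross-validation on synthetic data
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset: 500 samples, 20 features, only some of them informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)

single_tree = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

tree_scores = cross_val_score(single_tree, X, y, cv=5)
rf_scores = cross_val_score(rf, X, y, cv=5)

print(f"Single tree CV accuracy:   {tree_scores.mean():.3f}")
print(f"Random Forest CV accuracy: {rf_scores.mean():.3f}")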

Key Parameters

Random Forest has several important parameters to tune:

parameters.py
# Understanding Random Forest parameters
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# n_estimators: Number of trees (more = better, but slower)
rf1 = RandomForestClassifier(n_estimators=10)
rf1.fit(X, y)

# max_depth: Maximum depth of each tree
rf2 = RandomForestClassifier(n_estimators=10, max_depth=3)
rf2.fit(X, y)

# min_samples_split: Minimum samples to split a node
rf3 = RandomForestClassifier(n_estimators=10, min_samples_split=2)
rf3.fit(X, y)

print("Different parameter configurations:")
print("  n_estimators: Number of trees in the forest")
print("  max_depth: Limits tree depth to prevent overfitting")
print("  min_samples_split: Controls when to stop splitting")
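Rather than picking these values by hand, they are often tuned with a cross-validated grid search. The sketch below uses `GridSearchCV` on synthetic data; the parameter grid is just an illustrative example, not a recommended setting.

grid_search_example.py
# Tuning Random Forest parameters with GridSearchCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 3, 5],
    "min_samples_split": [2, 5],
}

# Try every combination in the grid with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3)
search.fit(X, y)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.3f}")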

Feature Importance

Random Forest can show which features are most important across all trees:

feature_importance.py
# Feature importance in Random Forest
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Dataset with multiple features
X = np.array([
    [1, 10, 100],
    [2, 20, 200],
    [3, 30, 300],
    [4, 40, 400],
    [5, 50, 500]
])
y = np.array([0, 0, 1, 1, 1])

# Train Random Forest
rf = RandomForestClassifier(n_estimators=50)
rf.fit(X, y)

# Get feature importance
importance = rf.feature_importances_
feature_names = ['Feature 1', 'Feature 2', 'Feature 3']

print("Feature Importance:")
for name, imp in zip(feature_names, importance):
    print(f"  {name}: {imp:.3f}")

print("\nHigher values indicate more important features.")
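Note that `feature_importances_` is computed from how much each feature reduces impurity inside the trees, which can favor features with many distinct values. A common complement is permutation importance, which measures how much shuffling a feature degrades accuracy on held-out data. The sketch below uses scikit-learn's `permutation_importance` on synthetic data invented here for illustration.

permutation_importance_example.py
# Permutation importance as a complement to impurity-based importance
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 3 of the 6 features carry signal, the rest are noise
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature on the test set and measure the accuracy drop
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42)

for i, (mean, std) in enumerate(zip(result.importances_mean,
                                    result.importances_std)):
    print(f"Feature {i}: {mean:.3f} +/- {std:.3f}")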

Advantages of Random Forest

Random Forest has many advantages over single decision trees:

  • Reduces Overfitting: Multiple trees average out errors
  • Handles Missing Values: Some implementations can work with incomplete data directly (recent scikit-learn versions support this; older versions require imputation first)
  • Feature Importance: Shows which features matter most
  • Works with Non-linear Data: Doesn't assume linear relationships
  • No Feature Scaling Needed: Works with raw data
  • Handles Large Datasets: Scales to many features, and the independent trees can be trained in parallel

When to Use Random Forest

Random Forest is a great choice when:

  • You need good performance without much tuning
  • You want to understand feature importance
  • You have non-linear relationships in your data
  • You need a robust model that handles outliers well
  • You want better performance than a single decision tree

It's often used as a baseline model because it performs well out of the box with minimal tuning.

💡 Pro Tip

Random Forest is one of the most popular ML algorithms because it's easy to use, performs well, and requires little preprocessing. Start with 100-200 trees and adjust based on your needs!

🎉

Lesson Complete!

Great work! Continue to the next lesson.
