Chapter 7: Classification Models / Lesson 32

Decision Trees

Understanding Decision Trees

Decision trees are a powerful, intuitive family of machine learning models that make decisions by asking a series of yes/no questions. Think of one as a flowchart or a game of 20 questions: each question splits the data into smaller groups until you reach a prediction.

Decision trees are popular because they're easy to understand, can handle both classification and regression tasks, and don't require feature scaling. However, they can overfit if not properly tuned.

How Decision Trees Work

A decision tree builds a model by:

  • Starting at the root: The tree begins with all data at the top
  • Splitting on features: At each node, it asks a question about a feature
  • Creating branches: Each answer creates a new branch
  • Reaching leaves: Eventually, you reach a leaf node with a prediction

The algorithm chooses splits that best separate the data, typically using metrics like Gini impurity or entropy for classification.
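
To make "best separates the data" concrete, here is a minimal, hand-rolled sketch of both metrics. The helper functions and toy labels below are illustrative, not part of scikit-learn (which computes these internally); a split is scored by how much it lowers the weighted impurity of the resulting child nodes.

gini_entropy.py
# Computing Gini impurity and entropy by hand for a set of class labels
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum of p * log2(p) over class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

mixed = np.array([0, 0, 1, 1])   # perfectly mixed node (worst case for 2 classes)
skewed = np.array([0, 1, 1, 1])  # mostly one class (purer, lower impurity)

print(f"Mixed node  -> Gini: {gini(mixed):.3f}, Entropy: {entropy(mixed):.3f}")
print(f"Skewed node -> Gini: {gini(skewed):.3f}, Entropy: {entropy(skewed):.3f}")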

Simple Classification Example

Let's see how a decision tree classifies data. We'll use a simple example with two features:

simple_tree.py
# Simple decision tree example
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Training data: [feature1, feature2] -> class
X = np.array([
    [1, 1],  # Small, Red
    [2, 1],  # Medium, Red
    [3, 2],  # Large, Blue
    [4, 2]   # Extra Large, Blue
])
y = np.array([0, 0, 1, 1])  # 0 = Class A, 1 = Class B

# Create and train the tree
tree = DecisionTreeClassifier()
tree.fit(X, y)

print("Training data:")
for i, (features, label) in enumerate(zip(X, y)):
    print(f"Sample {i+1}: Features={features}, Class={label}")

# Make a prediction
new_sample = np.array([[2.5, 1.5]])
prediction = tree.predict(new_sample)
print(f"\nPrediction for {new_sample[0]}: Class {prediction[0]}")

Visualizing the Tree Structure

Decision trees can be visualized to understand how they make decisions. Each node shows the question being asked:

visualize_tree.py
# Visualizing decision tree structure
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
import numpy as np

# Simple dataset
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Train tree
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

print("Tree structure:")
print("- Root node: Asks about feature value")
print("- Internal nodes: Continue asking questions")
print("- Leaf nodes: Make final predictions")
print(f"\nTree depth: {tree.get_depth()}")
print(f"Number of leaves: {tree.get_n_leaves()}")

# Draw the tree: each box shows the split, its impurity, and sample counts
plt.figure(figsize=(8, 5))
plot_tree(tree, feature_names=["feature_0"], class_names=["Class 0", "Class 1"], filled=True)
plt.show()
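
If a plotting window isn't convenient, scikit-learn's export_text gives a plain-text view of the same splits. A minimal sketch (the file name and feature name below are just illustrative):

tree_text.py
# Text-based view of a trained decision tree
from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

# Each indented line is a decision rule; leaves show the predicted class
print(export_text(tree, feature_names=["feature_0"]))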

Controlling Tree Complexity

Decision trees can easily overfit. Use these parameters to control complexity:

tree_parameters.py
# Controlling decision tree complexity
from sklearn.tree import DecisionTreeClassifier
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Shallow tree (less complex, less overfitting)
shallow_tree = DecisionTreeClassifier(max_depth=2, min_samples_split=2)
shallow_tree.fit(X, y)

# Deep tree (more complex, may overfit)
# Note: min_samples_split must be an integer >= 2 (or a fraction of the data)
deep_tree = DecisionTreeClassifier(max_depth=10, min_samples_split=2)
deep_tree.fit(X, y)

print("Shallow tree depth:", shallow_tree.get_depth())
print("Deep tree depth:", deep_tree.get_depth())

print("\nKey parameters:")
print("- max_depth: Maximum depth of the tree")
print("- min_samples_split: Minimum samples to split a node")
print("- min_samples_leaf: Minimum samples in a leaf node")
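
The tiny dataset above is perfectly separable, so both trees stop after a single split. To actually see overfitting, the sketch below (the synthetic data and split sizes are made up for illustration) compares training accuracy and held-out accuracy for a shallow tree and an unrestricted one:

overfitting_demo.py
# Comparing a shallow tree and an unrestricted tree on held-out data
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = (X[:, 0] > 5).astype(int)    # true rule depends on feature 0 only
flip = rng.random(200) < 0.10    # add roughly 10% label noise
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

for depth in (2, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train accuracy={tree.score(X_train, y_train):.2f}, "
          f"test accuracy={tree.score(X_test, y_test):.2f}")

The unrestricted tree memorizes the noisy training labels (near-perfect training accuracy) while scoring noticeably worse on the held-out split; the shallow tree gives up a little training accuracy but generalizes better.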

Feature Importance

Decision trees can show which features are most important for making predictions:

feature_importance.py
# Understanding feature importance
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Dataset with multiple features
X = np.array([
    [1, 10, 100],
    [2, 20, 200],
    [3, 30, 300],
    [4, 40, 400]
])
y = np.array([0, 0, 1, 1])

# Train tree
tree = DecisionTreeClassifier()
tree.fit(X, y)

# Get feature importance
importance = tree.feature_importances_

print("Feature importance:")
for i, imp in enumerate(importance):
    print(f"Feature {i}: {imp:.3f}")

print("\nHigher values mean the feature is more important for predictions.")
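
In the toy data above, all three columns rise in lockstep, so the tree only needs one of them and the rest get an importance of zero. A sketch on scikit-learn's bundled Iris dataset (using the standard load_iris loader) gives a more varied picture:

iris_importance.py
# Feature importance on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Pair each feature name with its learned importance score
for name, imp in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")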

Advantages and Disadvantages

Decision trees have several advantages:

  • Easy to understand: The tree structure is interpretable
  • No feature scaling needed: Works with raw data
  • Handles non-linear relationships: Can capture complex patterns (see the sketch after this list)
  • Feature importance: Shows which features matter most
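
As a quick illustration of the non-linearity point, a decision tree can fit an XOR-style pattern that no single straight-line boundary can separate. A minimal sketch with a hand-made four-point dataset:

xor_tree.py
# A decision tree fitting an XOR-style (non-linear) pattern
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Class 1 when exactly one of the two features is "on"
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

print("Predictions:", tree.predict(X))   # matches y exactly
print("Tree depth:", tree.get_depth())   # needs two levels of splits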

But they also have disadvantages:

  • Overfitting: Can memorize training data
  • Instability: Small data changes can create very different trees
  • Bias toward features with more levels: May favor categorical features with many categories

These limitations are often addressed by using ensemble methods like Random Forests, which we'll cover in the next lesson.

💡 When to Use Decision Trees

Use decision trees when you need interpretability, have non-linear relationships, or want to understand feature importance. For better performance, consider Random Forests or Gradient Boosting, which combine multiple trees.

🎉

Lesson Complete!

Great work! Continue to the next lesson.
