Chapter 7: Classification Models / Lesson 32

Decision Trees

Understanding Decision Trees

Decision trees are a powerful, intuitive family of machine learning models that make decisions by asking a series of yes/no questions. Think of one as a flowchart or a game of 20 questions: each question splits the data into smaller groups until you reach a prediction.

Decision trees are popular because they're easy to understand, can handle both classification and regression tasks, and don't require feature scaling. However, they can overfit if not properly tuned.

How Decision Trees Work

A decision tree builds a model by:

  • Starting at the root: The tree begins with all data at the top
  • Splitting on features: At each node, it asks a question about a feature
  • Creating branches: Each answer creates a new branch
  • Reaching leaves: Eventually, you reach a leaf node with a prediction

The algorithm chooses splits that best separate the data, typically using metrics like Gini impurity or entropy for classification.
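
To make "best separates the data" concrete, here is a minimal, hand-rolled sketch of both metrics. The helper functions and toy labels below are illustrative, not part of scikit-learn (which computes these internally); a split is scored by how much it lowers the weighted impurity of the resulting child nodes.

gini_entropy.py
# Computing Gini impurity and entropy by hand for a set of class labels
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum of p * log2(p) over class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

mixed = np.array([0, 0, 1, 1])   # perfectly mixed node (worst case for 2 classes)
skewed = np.array([0, 1, 1, 1])  # mostly one class (purer, lower impurity)

print(f"Mixed node  -> Gini: {gini(mixed):.3f}, Entropy: {entropy(mixed):.3f}")
print(f"Skewed node -> Gini: {gini(skewed):.3f}, Entropy: {entropy(skewed):.3f}")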

Simple Classification Example

Let's see how a decision tree classifies data. We'll use a simple example with two features:

simple_tree.py
# Simple decision tree example
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Training data: [feature1, feature2] -> class
X = np.array([
    [1, 1],  # Small, Red
    [2, 1],  # Medium, Red
    [3, 2],  # Large, Blue
    [4, 2]   # Extra Large, Blue
])
y = np.array([0, 0, 1, 1])  # 0 = Class A, 1 = Class B

# Create and train the tree
tree = DecisionTreeClassifier()
tree.fit(X, y)

print("Training data:")
for i, (features, label) in enumerate(zip(X, y)):
    print(f"Sample {i+1}: Features={features}, Class={label}")

# Make a prediction
new_sample = np.array([[2.5, 1.5]])
prediction = tree.predict(new_sample)
print(f"\nPrediction for {new_sample[0]}: Class {prediction[0]}")

Visualizing the Tree Structure

Decision trees can be visualized to understand how they make decisions. Each node shows the question being asked:

visualize_tree.py
# Visualizing decision tree structure
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
import numpy as np

# Simple dataset
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Train tree
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

print("Tree structure:")
print("- Root node: Asks about feature value")
print("- Internal nodes: Continue asking questions")
print("- Leaf nodes: Make final predictions")
print(f"\nTree depth: {tree.get_depth()}")
print(f"Number of leaves: {tree.get_n_leaves()}")

# Draw the tree: each box shows the split, its impurity, and sample counts
plt.figure(figsize=(8, 5))
plot_tree(tree, feature_names=["feature_0"], class_names=["Class 0", "Class 1"], filled=True)
plt.show()
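
If a plotting window isn't convenient, scikit-learn's export_text gives a plain-text view of the same splits. A minimal sketch (the file name and feature name below are just illustrative):

tree_text.py
# Text-based view of a trained decision tree
from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

# Each indented line is a decision rule; leaves show the predicted class
print(export_text(tree, feature_names=["feature_0"]))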

Controlling Tree Complexity

Decision trees can easily overfit. Use these parameters to control complexity:

tree_parameters.py
# Controlling decision tree complexity
from sklearn.tree import DecisionTreeClassifier
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Shallow tree (less complex, less overfitting)
shallow_tree = DecisionTreeClassifier(max_depth=2, min_samples_split=2)
shallow_tree.fit(X, y)

# Deep tree (more complex, may overfit)
# Note: min_samples_split must be an integer >= 2 (or a fraction of the data)
deep_tree = DecisionTreeClassifier(max_depth=10, min_samples_split=2)
deep_tree.fit(X, y)

print("Shallow tree depth:", shallow_tree.get_depth())
print("Deep tree depth:", deep_tree.get_depth())

print("\nKey parameters:")
print("- max_depth: Maximum depth of the tree")
print("- min_samples_split: Minimum samples to split a node")
print("- min_samples_leaf: Minimum samples in a leaf node")
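
The tiny dataset above is perfectly separable, so both trees stop after a single split. To actually see overfitting, the sketch below (the synthetic data and split sizes are made up for illustration) compares training accuracy and held-out accuracy for a shallow tree and an unrestricted one:

overfitting_demo.py
# Comparing a shallow tree and an unrestricted tree on held-out data
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = (X[:, 0] > 5).astype(int)    # true rule depends on feature 0 only
flip = rng.random(200) < 0.10    # add roughly 10% label noise
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

for depth in (2, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train accuracy={tree.score(X_train, y_train):.2f}, "
          f"test accuracy={tree.score(X_test, y_test):.2f}")

The unrestricted tree memorizes the noisy training labels (near-perfect training accuracy) while scoring noticeably worse on the held-out split; the shallow tree gives up a little training accuracy but generalizes better.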

Feature Importance

Decision trees can show which features are most important for making predictions:

feature_importance.py
# Understanding feature importance
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Dataset with multiple features
X = np.array([
    [1, 10, 100],
    [2, 20, 200],
    [3, 30, 300],
    [4, 40, 400]
])
y = np.array([0, 0, 1, 1])

# Train tree
tree = DecisionTreeClassifier()
tree.fit(X, y)

# Get feature importance
importance = tree.feature_importances_

print("Feature importance:")
for i, imp in enumerate(importance):
    print(f"Feature {i}: {imp:.3f}")

print("\nHigher values mean the feature is more important for predictions.")
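
In the toy data above, all three columns rise in lockstep, so the tree only needs one of them and the rest get an importance of zero. A sketch on scikit-learn's bundled Iris dataset (using the standard load_iris loader) gives a more varied picture:

iris_importance.py
# Feature importance on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Pair each feature name with its learned importance score
for name, imp in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")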

Advantages and Disadvantages

Decision trees have several advantages:

  • Easy to understand: The tree structure is interpretable
  • No feature scaling needed: Works with raw data
  • Handles non-linear relationships: Can capture complex patterns (see the sketch after this list)
  • Feature importance: Shows which features matter most
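
As a quick illustration of the non-linearity point, a decision tree can fit an XOR-style pattern that no single straight-line boundary can separate. A minimal sketch with a hand-made four-point dataset:

xor_tree.py
# A decision tree fitting an XOR-style (non-linear) pattern
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Class 1 when exactly one of the two features is "on"
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

print("Predictions:", tree.predict(X))   # matches y exactly
print("Tree depth:", tree.get_depth())   # needs two levels of splits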

But they also have disadvantages:

  • Overfitting: Can memorize training data
  • Instability: Small data changes can create very different trees
  • Bias toward features with more levels: May favor categorical features with many categories

These limitations are often addressed by using ensemble methods like Random Forests, which we'll cover in the next lesson.

💡 When to Use Decision Trees

Use decision trees when you need interpretability, have non-linear relationships, or want to understand feature importance. For better performance, consider Random Forests or Gradient Boosting, which combine multiple trees.

🎉

Lesson Complete!

Great work! Continue to the next lesson.
