Chapter 2: Data Fundamentals / Lesson 9

Feature Engineering

What is Feature Engineering?

Feature engineering is the process of creating new features from existing data to improve ML model performance. It is often said to be more art than science: it takes domain knowledge and creativity to craft features that help your model learn better patterns.

Good features can make a simple model perform well, while poor features can make even the most sophisticated model fail. This is why feature engineering is considered one of the most important skills in ML.

Creating Derived Features

You can create new features by combining or transforming existing ones. Let's look at a few examples:

derived_features.py
# Creating derived features

# Original features
length = [10, 15, 20, 12, 18]
width = [5, 8, 10, 6, 9]
print("Original features:")
print("  Length:", length)
print("  Width:", width)

# Derived feature 1: Area (length * width)
area = [l * w for l, w in zip(length, width)]
print("\nDerived feature - Area:", area)
print("  More meaningful than separate length/width")

# Derived feature 2: Aspect ratio (length / width)
aspect_ratio = [l / w for l, w in zip(length, width)]
print("\nDerived feature - Aspect Ratio:", [round(x, 2) for x in aspect_ratio])
print("  Describes shape (square vs rectangle)")

# Derived feature 3: Perimeter (2 * (length + width))
perimeter = [2 * (l + w) for l, w in zip(length, width)]
print("\nDerived feature - Perimeter:", perimeter)

print("\nWhy derived features help:")
print("  - Capture relationships between features")
print("  - More informative than raw features")
print("  - Help model learn better patterns")
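Derived features aren't limited to arithmetic on numeric columns. A common real-world case is deriving calendar features from raw timestamps. Here is a minimal sketch using Python's standard datetime module, with made-up dates:

date_features.py
from datetime import datetime

# Hypothetical raw timestamps (e.g., order dates)
timestamps = ["2024-01-15", "2024-06-30", "2024-12-25"]

for ts in timestamps:
    dt = datetime.strptime(ts, "%Y-%m-%d")
    # Calendar features are often more predictive than the raw date string
    print(ts, "-> month:", dt.month,
          "| day_of_week:", dt.strftime("%A"),
          "| is_weekend:", dt.weekday() >= 5)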

Binning and Discretization

Converting continuous features into categorical bins can help capture non-linear relationships:

binning.py
# Binning continuous features

# Continuous age data
ages = [18, 25, 35, 45, 55, 65, 75, 22, 30, 40, 50, 60, 70]
print("Original ages:", ages)

# Create age groups (bins)
def age_group(age):
    if age < 25:
        return "Young"
    elif age < 40:
        return "Adult"
    elif age < 60:
        return "Middle-aged"
    else:
        return "Senior"

age_groups = [age_group(age) for age in ages]
print("\nAge groups (binned):", age_groups)
print("\nBinning benefits:")
print("  - Captures non-linear relationships")
print("  - Reduces impact of outliers")
print("  - Easier for some algorithms to learn")

# Income binning example
incomes = [25000, 45000, 75000, 120000, 200000, 50000, 90000]
print("\nOriginal incomes:", incomes)

def income_category(income):
    if income < 50000:
        return "Low"
    elif income < 100000:
        return "Medium"
    else:
        return "High"

income_categories = [income_category(inc) for inc in incomes]
print("Income categories:", income_categories)
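In real projects you rarely write binning functions by hand; libraries do it in one call. Here is a minimal sketch using pandas, assuming it is installed (the file name binning_pandas.py is just illustrative). The bin edges mirror the age_group function above:

binning_pandas.py
import pandas as pd

ages = [18, 25, 35, 45, 55, 65, 75, 22, 30, 40, 50, 60, 70]

# right=False makes bins left-closed: [0, 25), [25, 40), [40, 60), [60, 120)
age_groups = pd.cut(
    ages,
    bins=[0, 25, 40, 60, 120],
    labels=["Young", "Adult", "Middle-aged", "Senior"],
    right=False,
)
print(list(age_groups))

pandas also offers pd.qcut, the equal-frequency counterpart, which splits on quantiles instead of fixed edges.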

Encoding Categorical Features

Most ML algorithms need numerical input, so categorical features must be encoded:

encoding.py
# Encoding categorical features

# Categorical data
colors = ["red", "blue", "green", "red", "blue"]
print("Original categorical data:", colors)

# Method 1: Label Encoding (assign numbers)
unique_colors = sorted(set(colors))  # sorted for a reproducible mapping
color_to_num = {color: i for i, color in enumerate(unique_colors)}
label_encoded = [color_to_num[c] for c in colors]
print("\nLabel Encoding:", label_encoded)
print("  Mapping:", color_to_num)
print("  Caution: implies an arbitrary order; fine for tree-based models")

# Method 2: One-Hot Encoding (binary columns)
print("\nOne-Hot Encoding (concept):")
print("  red blue green")
for color in colors:
    one_hot = [
        1 if color == "red" else 0,
        1 if color == "blue" else 0,
        1 if color == "green" else 0,
    ]
    print(f"  {one_hot}")
print("  Use when: Categories are nominal (no order)")

# Method 3: Ordinal Encoding (for ordered categories)
sizes = ["small", "medium", "large", "small", "large"]
size_order = {"small": 0, "medium": 1, "large": 2}
ordinal_encoded = [size_order[s] for s in sizes]
print("\nOrdinal Encoding (preserves order):")
print("  Original:", sizes)
print("  Encoded:", ordinal_encoded)
print("  Use when: Categories have meaningful order")
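The hand-rolled versions above show the concepts; in practice a library call is less error-prone. A minimal one-hot sketch with pandas, assuming it is installed:

encoding_pandas.py
import pandas as pd

colors = ["red", "blue", "green", "red", "blue"]

# One binary column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(pd.Series(colors), dtype=int)
print(one_hot)

scikit-learn offers OneHotEncoder and OrdinalEncoder for the same jobs, with the advantage that they remember the fitted mapping and can apply it consistently to new data.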

Feature Scaling and Transformation

Transforming features can help models learn better patterns:

transformation.py
# Feature transformation
import math

# Original feature (skewed data)
data = [1, 2, 3, 4, 5, 10, 20, 30, 50, 100]
print("Original data:", data)

# Log transformation (for skewed data)
log_data = [math.log(x) for x in data]
print("\nLog transformation:", [round(x, 2) for x in log_data])
print("  Reduces impact of large values")
print("  Makes data more normally distributed")

# Square root transformation
sqrt_data = [math.sqrt(x) for x in data]
print("\nSquare root transformation:", [round(x, 2) for x in sqrt_data])
print("  Less aggressive than log, good for count data")

# Polynomial features (powers of an existing feature)
x = [1, 2, 3, 4, 5]
x_squared = [xi ** 2 for xi in x]
x_cubed = [xi ** 3 for xi in x]
print("\nPolynomial features:")
print("  x:", x)
print("  x²:", x_squared)
print("  x³:", x_cubed)
print("  Captures non-linear relationships")
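One caution about the transforms above: math.log raises a ValueError at 0 (and math.sqrt at negative values), and real data often contains zeros. A common workaround is log(1 + x), available in the standard library as math.log1p. A minimal sketch:

safe_log.py
import math

counts = [0, 1, 5, 100]  # hypothetical count data containing a zero

# math.log(0) raises ValueError; log1p(x) = log(1 + x) is safe at zero
log1p_counts = [round(math.log1p(x), 2) for x in counts]
print("log1p:", log1p_counts)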

Feature Engineering Best Practices

Here are key principles for effective feature engineering:

  • Domain Knowledge: Understanding your problem helps create meaningful features
  • Start Simple: Begin with basic features, then add complexity
  • Feature Selection: Not all features are useful; remove redundant or irrelevant ones
  • Validation: Test whether new features actually improve model performance (see the sketch below)
  • Avoid Leakage: Don't use future information to create features

Remember: Good features are often more important than complex algorithms!
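To make the Validation point concrete, here is a minimal, stdlib-only sketch with made-up data (statistics.correlation requires Python 3.10+). It checks whether a derived feature tracks the target more closely than the raw features do; a real project would confirm the gain on a held-out validation set:

validate_feature.py
import statistics

# Hypothetical data: rectangle dimensions and a price target
length = [10, 20, 10, 20, 15]
width = [10, 10, 20, 20, 15]
price = [102, 198, 205, 395, 230]

# Candidate derived feature
area = [l * w for l, w in zip(length, width)]

# Pearson correlation with the target (Python 3.10+)
print("corr(length, price):", round(statistics.correlation(length, price), 3))
print("corr(width, price): ", round(statistics.correlation(width, price), 3))
print("corr(area, price):  ", round(statistics.correlation(area, price), 3))
# area should correlate far more strongly than either raw feature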

Exercise: Create New Features

Complete the following tasks:

  • Task 1: Create a derived feature: area = length × width
  • Task 2: Create age groups by binning ages into categories
  • Task 3: Encode categorical data (cities) using label encoding
  • Task 4: Print all the new features you created

Write your code to engineer these new features!

💡 Learning Tip

Practice is essential. Try modifying the code examples, experiment with different parameters, and see how changes affect the results. Hands-on experience is the best teacher!

🎉

Lesson Complete!

Great work! Continue to the next lesson.
