Chapter 2: Data Fundamentals / Lesson 9

Feature Engineering

What is Feature Engineering?

Feature engineering is the process of creating new features from existing data to improve ML model performance. It is often said to be more art than science: it takes domain knowledge and creativity to craft features that help your model learn better patterns.

Good features can make a simple model perform well, while poor features can make even the most sophisticated model fail. This is why feature engineering is considered one of the most important skills in ML.

Creating Derived Features

You can create new features by combining or transforming existing ones. Let's look at a few examples:

derived_features.py
# Creating derived features

# Original features
length = [10, 15, 20, 12, 18]
width = [5, 8, 10, 6, 9]
print("Original features:")
print("  Length:", length)
print("  Width:", width)

# Derived feature 1: Area (length * width)
area = [l * w for l, w in zip(length, width)]
print("\nDerived feature - Area:", area)
print("  More meaningful than separate length/width")

# Derived feature 2: Aspect ratio (length / width)
aspect_ratio = [l / w for l, w in zip(length, width)]
print("\nDerived feature - Aspect Ratio:", [round(x, 2) for x in aspect_ratio])
print("  Describes shape (square vs rectangle)")

# Derived feature 3: Perimeter (2 * (length + width))
perimeter = [2 * (l + w) for l, w in zip(length, width)]
print("\nDerived feature - Perimeter:", perimeter)

print("\nWhy derived features help:")
print("  - Capture relationships between features")
print("  - More informative than raw features")
print("  - Help model learn better patterns")
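Derived features aren't limited to arithmetic on numeric columns. A common real-world case is deriving calendar features from raw timestamps. Here is a minimal sketch using Python's standard datetime module, with made-up dates:

date_features.py
from datetime import datetime

# Hypothetical raw timestamps (e.g., order dates)
timestamps = ["2024-01-15", "2024-06-30", "2024-12-25"]

for ts in timestamps:
    dt = datetime.strptime(ts, "%Y-%m-%d")
    # Calendar features are often more predictive than the raw date string
    print(ts, "-> month:", dt.month,
          "| day_of_week:", dt.strftime("%A"),
          "| is_weekend:", dt.weekday() >= 5)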

Binning and Discretization

Converting continuous features into categorical bins can help capture non-linear relationships:

binning.py
# Binning continuous features

# Continuous age data
ages = [18, 25, 35, 45, 55, 65, 75, 22, 30, 40, 50, 60, 70]
print("Original ages:", ages)

# Create age groups (bins)
def age_group(age):
    if age < 25:
        return "Young"
    elif age < 40:
        return "Adult"
    elif age < 60:
        return "Middle-aged"
    else:
        return "Senior"

age_groups = [age_group(age) for age in ages]
print("\nAge groups (binned):", age_groups)
print("\nBinning benefits:")
print("  - Captures non-linear relationships")
print("  - Reduces impact of outliers")
print("  - Easier for some algorithms to learn")

# Income binning example
incomes = [25000, 45000, 75000, 120000, 200000, 50000, 90000]
print("\nOriginal incomes:", incomes)

def income_category(income):
    if income < 50000:
        return "Low"
    elif income < 100000:
        return "Medium"
    else:
        return "High"

income_categories = [income_category(inc) for inc in incomes]
print("Income categories:", income_categories)
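In real projects you rarely write binning functions by hand; libraries do it in one call. Here is a minimal sketch using pandas, assuming it is installed (the file name binning_pandas.py is just illustrative). The bin edges mirror the age_group function above:

binning_pandas.py
import pandas as pd

ages = [18, 25, 35, 45, 55, 65, 75, 22, 30, 40, 50, 60, 70]

# right=False makes bins left-closed: [0, 25), [25, 40), [40, 60), [60, 120)
age_groups = pd.cut(
    ages,
    bins=[0, 25, 40, 60, 120],
    labels=["Young", "Adult", "Middle-aged", "Senior"],
    right=False,
)
print(list(age_groups))

pandas also offers pd.qcut, the equal-frequency counterpart, which splits on quantiles instead of fixed edges.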

Encoding Categorical Features

Most ML algorithms need numerical input, so categorical features must be encoded:

encoding.py
# Encoding categorical features

# Categorical data
colors = ["red", "blue", "green", "red", "blue"]
print("Original categorical data:", colors)

# Method 1: Label Encoding (assign numbers)
unique_colors = sorted(set(colors))  # sorted for a reproducible mapping
color_to_num = {color: i for i, color in enumerate(unique_colors)}
label_encoded = [color_to_num[c] for c in colors]
print("\nLabel Encoding:", label_encoded)
print("  Mapping:", color_to_num)
print("  Caution: implies an arbitrary order; fine for tree-based models")

# Method 2: One-Hot Encoding (binary columns)
print("\nOne-Hot Encoding (concept):")
print("  red blue green")
for color in colors:
    one_hot = [
        1 if color == "red" else 0,
        1 if color == "blue" else 0,
        1 if color == "green" else 0,
    ]
    print(f"  {one_hot}")
print("  Use when: Categories are nominal (no order)")

# Method 3: Ordinal Encoding (for ordered categories)
sizes = ["small", "medium", "large", "small", "large"]
size_order = {"small": 0, "medium": 1, "large": 2}
ordinal_encoded = [size_order[s] for s in sizes]
print("\nOrdinal Encoding (preserves order):")
print("  Original:", sizes)
print("  Encoded:", ordinal_encoded)
print("  Use when: Categories have meaningful order")
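The hand-rolled versions above show the concepts; in practice a library call is less error-prone. A minimal one-hot sketch with pandas, assuming it is installed:

encoding_pandas.py
import pandas as pd

colors = ["red", "blue", "green", "red", "blue"]

# One binary column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(pd.Series(colors), dtype=int)
print(one_hot)

scikit-learn offers OneHotEncoder and OrdinalEncoder for the same jobs, with the advantage that they remember the fitted mapping and can apply it consistently to new data.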

Feature Scaling and Transformation

Transforming features can help models learn better patterns:

transformation.py
# Feature transformation
import math

# Original feature (skewed data)
data = [1, 2, 3, 4, 5, 10, 20, 30, 50, 100]
print("Original data:", data)

# Log transformation (for skewed data)
log_data = [math.log(x) for x in data]
print("\nLog transformation:", [round(x, 2) for x in log_data])
print("  Reduces impact of large values")
print("  Makes data more normally distributed")

# Square root transformation
sqrt_data = [math.sqrt(x) for x in data]
print("\nSquare root transformation:", [round(x, 2) for x in sqrt_data])
print("  Less aggressive than log, good for count data")

# Polynomial features (powers of an existing feature)
x = [1, 2, 3, 4, 5]
x_squared = [xi ** 2 for xi in x]
x_cubed = [xi ** 3 for xi in x]
print("\nPolynomial features:")
print("  x:", x)
print("  x²:", x_squared)
print("  x³:", x_cubed)
print("  Captures non-linear relationships")
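One caution about the transforms above: math.log raises a ValueError at 0 (and math.sqrt at negative values), and real data often contains zeros. A common workaround is log(1 + x), available in the standard library as math.log1p. A minimal sketch:

safe_log.py
import math

counts = [0, 1, 5, 100]  # hypothetical count data containing a zero

# math.log(0) raises ValueError; log1p(x) = log(1 + x) is safe at zero
log1p_counts = [round(math.log1p(x), 2) for x in counts]
print("log1p:", log1p_counts)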

Feature Engineering Best Practices

Here are key principles for effective feature engineering:

  • Domain Knowledge: Understanding your problem helps create meaningful features
  • Start Simple: Begin with basic features, then add complexity
  • Feature Selection: Not all features are useful; remove redundant or irrelevant ones
  • Validation: Test whether new features actually improve model performance (see the sketch below)
  • Avoid Leakage: Don't use future information to create features

Remember: Good features are often more important than complex algorithms!
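To make the Validation point concrete, here is a minimal, stdlib-only sketch with made-up data (statistics.correlation requires Python 3.10+). It checks whether a derived feature tracks the target more closely than the raw features do; a real project would confirm the gain on a held-out validation set:

validate_feature.py
import statistics

# Hypothetical data: rectangle dimensions and a price target
length = [10, 20, 10, 20, 15]
width = [10, 10, 20, 20, 15]
price = [102, 198, 205, 395, 230]

# Candidate derived feature
area = [l * w for l, w in zip(length, width)]

# Pearson correlation with the target (Python 3.10+)
print("corr(length, price):", round(statistics.correlation(length, price), 3))
print("corr(width, price): ", round(statistics.correlation(width, price), 3))
print("corr(area, price):  ", round(statistics.correlation(area, price), 3))
# area should correlate far more strongly than either raw feature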

Exercise: Create New Features

Complete the following tasks:

  • Task 1: Create a derived feature: area = length × width
  • Task 2: Create age groups by binning ages into categories
  • Task 3: Encode categorical data (cities) using label encoding
  • Task 4: Print all the new features you created

Write your code to engineer these new features!

💡 Learning Tip

Practice is essential. Try modifying the code examples, experiment with different parameters, and see how changes affect the results. Hands-on experience is the best teacher!

🎉

Lesson Complete!

Great work! Continue to the next lesson.
