What is Feature Engineering?
Feature engineering is the process of creating new features from existing data to improve machine learning model performance. It is often said that feature engineering is more art than science: it takes domain knowledge and creativity to build features that help your model learn better patterns.
Good features can make a simple model perform well, while poor features can make even the most sophisticated model fail. This is why feature engineering is considered one of the most important skills in ML.
Creating Derived Features
You can create new features by combining or transforming existing ones. Let's see examples:
length = [10, 15, 20, 12, 18]
width = [5, 8, 10, 6, 9]
print("Original features:")
print(" Length:", length)
print(" Width:", width)
area = [l * w for l, w in zip(length, width)]
print("\nDerived feature - Area:", area)
print(" More meaningful than separate length/width")
aspect_ratio = [l / w for l, w in zip(length, width)]
print("\nDerived feature - Aspect Ratio:", [round(x, 2) for x in aspect_ratio])
print(" Describes shape (square vs rectangle)")
perimeter = [2 * (l + w) for l, w in zip(length, width)]
print("\nDerived feature - Perimeter:", perimeter)
print("\nWhy derived features help:")
print(" - Capture relationships between features")
print(" - More informative than raw features")
print(" - Help model learn better patterns")
Binning and Discretization
Converting continuous features into categorical bins can help capture non-linear relationships:
ages = [18, 25, 35, 45, 55, 65, 75, 22, 30, 40, 50, 60, 70]
print("Original ages:", ages)
def age_group(age):
    if age < 25:
        return "Young"
    elif age < 40:
        return "Adult"
    elif age < 60:
        return "Middle-aged"
    else:
        return "Senior"
age_groups = [age_group(age) for age in ages]
print("\nAge groups (binned):", age_groups)
print("\nBinning benefits:")
print(" - Captures non-linear relationships")
print(" - Reduces impact of outliers")
print(" - Easier for some algorithms to learn")
incomes = [25000, 45000, 75000, 120000, 200000, 50000, 90000]
print("\nOriginal incomes:", incomes)
def income_category(income):
    if income < 50000:
        return "Low"
    elif income < 100000:
        return "Medium"
    else:
        return "High"
income_categories = [income_category(inc) for inc in incomes]
print("Income categories:", income_categories)
Encoding Categorical Features
Most ML algorithms require numerical input, so categorical features must be encoded:
colors = ["red", "blue", "green", "red", "blue"]
print("Original categorical data:", colors)
unique_colors = sorted(set(colors))  # sorted so the mapping is deterministic across runs
color_to_num = {color: i for i, color in enumerate(unique_colors)}
label_encoded = [color_to_num[c] for c in colors]
print("\nLabel Encoding:", label_encoded)
print(" Mapping:", color_to_num)
print(" Use with care: the codes imply an arbitrary order (fine for tree-based models)")
print("\nOne-Hot Encoding (concept):")
print(" red blue green")
for color in colors:
    one_hot = [
        1 if color == "red" else 0,
        1 if color == "blue" else 0,
        1 if color == "green" else 0
    ]
    print(f" {one_hot}")
print(" Use when: Categories are nominal (no order)")
sizes = ["small", "medium", "large", "small", "large"]
size_order = {"small": 0, "medium": 1, "large": 2}
ordinal_encoded = [size_order[s] for s in sizes]
print("\nOrdinal Encoding (preserves order):")
print(" Original:", sizes)
print(" Encoded:", ordinal_encoded)
print(" Use when: Categories have meaningful order")
Feature Scaling and Transformation
Transforming features can help models learn better patterns:
import math
data = [1, 2, 3, 4, 5, 10, 20, 30, 50, 100]
print("Original data:", data)
log_data = [math.log(x) for x in data]
print("\nLog transformation:", [round(x, 2) for x in log_data])
print(" Reduces impact of large values")
print(" Makes data more normally distributed")
sqrt_data = [math.sqrt(x) for x in data]
print("\nSquare root transformation:", [round(x, 2) for x in sqrt_data])
print(" Less aggressive than log, good for count data")
x = [1, 2, 3, 4, 5]
x_squared = [xi ** 2 for xi in x]
x_cubed = [xi ** 3 for xi in x]
print("\nPolynomial features:")
print(" x:", x)
print(" x²:", x_squared)
print(" x³:", x_cubed)
print(" Captures non-linear relationships")
Feature Engineering Best Practices
Here are key principles for effective feature engineering:
- Domain Knowledge: Understanding your problem helps create meaningful features
- Start Simple: Begin with basic features, then add complexity
- Feature Selection: Not all features are useful - remove redundant ones
- Validation: Test if new features actually improve model performance
- Avoid Leakage: Don't use future information to create features (see the sketch below)
Remember: Good features are often more important than complex algorithms!
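To make the leakage warning concrete, here is a small illustrative sketch (the sales numbers are made up): a feature that peeks at the next day's value looks great in training but is impossible to compute at prediction time.
sales = [10, 12, 11, 15, 14, 18, 20]
# LEAKY: the feature for day i uses day i + 1, which isn't known yet at prediction time
leaky = [(sales[i] + sales[i + 1]) / 2 for i in range(len(sales) - 1)]
# SAFE: the feature for day i only uses days before i
safe = [round(sum(sales[:i]) / i, 2) if i > 0 else None for i in range(len(sales))]
print("Leaky feature:", leaky)
print("Safe feature:", safe)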
Exercise: Create New Features
Complete the following exercise:
- Task 1: Create a derived feature: area = length × width
- Task 2: Create age groups by binning ages into categories
- Task 3: Encode categorical data (cities) using label encoding
- Task 4: Print all the new features you created
Write your code to engineer these new features!
💡 Learning Tip
Practice is essential. Try modifying the code examples, experiment with different parameters, and see how changes affect the results. Hands-on experience is the best teacher!