🎯 Project: Data Preprocessing Pipeline
This project will help you apply everything you've learned about data preparation. You'll build a complete data preprocessing pipeline that handles missing values, outliers, encoding, and normalization.
Projects are where theory meets practice. This is your chance to integrate multiple concepts and create something real.
Project Overview
We'll work with a sample dataset containing house information. Our goal is to clean and preprocess this data for machine learning:
```python
raw_data = [
    {"size": 1200, "bedrooms": 2, "age": 5, "location": "urban", "price": 250000},
    {"size": 1500, "bedrooms": None, "age": 10, "location": "suburban", "price": 320000},
    {"size": 2000, "bedrooms": 4, "age": 2, "location": "urban", "price": 450000},
    {"size": None, "bedrooms": 3, "age": 15, "location": "rural", "price": 280000},
    {"size": 1600, "bedrooms": 3, "age": 8, "location": "suburban", "price": 350000},
]

print("Raw Data (with issues):")
print(" - Missing values (None)")
print(" - Categorical data (location)")
print(" - Different scales (size vs age)")
print(" - Need preprocessing for ML!")
```
Step 1: Handle Missing Values
First, we need to identify and handle missing values:
```python
data = [
    {"size": 1200, "bedrooms": 2, "age": 5},
    {"size": 1500, "bedrooms": None, "age": 10},
    {"size": 2000, "bedrooms": 4, "age": 2},
    {"size": None, "bedrooms": 3, "age": 15},
    {"size": 1600, "bedrooms": 3, "age": 8},
]

# Compute means from the non-missing values only
sizes = [d["size"] for d in data if d["size"] is not None]
bedrooms = [d["bedrooms"] for d in data if d["bedrooms"] is not None]
mean_size = sum(sizes) / len(sizes)
mean_bedrooms = sum(bedrooms) / len(bedrooms)

print("Missing Value Handling:")
print(f" Mean size: {mean_size:.0f}")
print(f" Mean bedrooms: {mean_bedrooms:.1f}")

# Fill each missing value with the corresponding mean
for record in data:
    if record["size"] is None:
        record["size"] = mean_size
    if record["bedrooms"] is None:
        record["bedrooms"] = round(mean_bedrooms)

print("\nData after filling missing values:")
for i, record in enumerate(data, 1):
    print(f" {i}. {record}")
```
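Mean imputation works well here, but the mean is pulled around by extreme values. A common alternative is the median, which is more robust to outliers. Here is a minimal sketch (the helper name `fill_missing_with_median` is my own, not part of the lesson's code):

```python
def fill_missing_with_median(records, key):
    """Replace None values for `key` with the median of the present values."""
    values = sorted(r[key] for r in records if r[key] is not None)
    n = len(values)
    mid = n // 2
    median = values[mid] if n % 2 == 1 else (values[mid - 1] + values[mid]) / 2
    for r in records:
        if r[key] is None:
            r[key] = median
    return records

data = [{"size": 1200}, {"size": None}, {"size": 2000}, {"size": 1600}]
fill_missing_with_median(data, "size")
print([r["size"] for r in data])  # the missing size becomes the median, 1600
```

For the small dataset above the mean and median are close, but with a single mansion-sized outlier in the data they would diverge sharply.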
Step 2: Encode Categorical Data
Convert categorical features to numerical format:
```python
data = [
    {"location": "urban"},
    {"location": "suburban"},
    {"location": "urban"},
    {"location": "rural"},
    {"location": "suburban"},
]

# Sort the unique categories so the mapping is deterministic across runs
# (iterating a bare set can produce a different order each time)
unique_locations = sorted(set(d["location"] for d in data))
location_to_num = {loc: i for i, loc in enumerate(unique_locations)}

print("Categorical Encoding:")
print(" Mapping:", location_to_num)

for record in data:
    record["location_encoded"] = location_to_num[record["location"]]

print("\nEncoded Data:")
for i, record in enumerate(data, 1):
    print(f" {i}. {record['location']} → {record['location_encoded']}")
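One caveat with integer labels: they imply an ordering (rural < suburban < urban) that locations don't actually have, which can mislead some models. One-hot encoding avoids this by giving each category its own 0/1 column. A minimal sketch (the `one_hot` helper is my own illustration):

```python
locations = ["urban", "suburban", "urban", "rural", "suburban"]
categories = sorted(set(locations))  # ['rural', 'suburban', 'urban']

def one_hot(value, categories):
    """Return a 0/1 vector with a 1 in the position of `value`."""
    return [1 if value == c else 0 for c in categories]

for loc in locations:
    print(loc, one_hot(loc, categories))  # e.g. urban -> [0, 0, 1]
```

The trade-off is dimensionality: one-hot encoding adds one column per category, so it is best suited to features with a modest number of distinct values.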
Step 3: Normalize Features
Scale features to similar ranges for better model performance:
```python
sizes = [1200, 1500, 2000, 1575, 1600]  # 1575 is the mean-filled value from Step 1
ages = [5, 10, 2, 15, 8]

print("Original features (different scales):")
print(" Sizes:", sizes)
print(" Ages:", ages)

def normalize(values):
    """Min-max scale values into the range [0, 1]."""
    min_val = min(values)
    max_val = max(values)
    return [(v - min_val) / (max_val - min_val) for v in values]

sizes_normalized = normalize(sizes)
ages_normalized = normalize(ages)

print("\nNormalized features (0-1 scale):")
print(" Sizes normalized:", [round(x, 3) for x in sizes_normalized])
print(" Ages normalized:", [round(x, 3) for x in ages_normalized])
print("\nNow both features are on the same scale!")
```
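Two practical notes on the `normalize` function above: it divides by zero when every value is identical, and min-max scaling is only one option; z-score standardization (mean 0, standard deviation 1) is a common alternative that is less sensitive to extreme values. A sketch of both, with the edge case guarded (the function names are my own):

```python
def normalize_safe(values):
    """Min-max normalize, guarding against a constant feature."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values identical: no spread to scale
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: subtract the mean, divide by the std dev."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]

ages = [5, 10, 2, 15, 8]
print([round(x, 3) for x in normalize_safe(ages)])
print([round(x, 3) for x in standardize(ages)])
```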
Step 4: Complete Preprocessing Pipeline
Putting it all together - a complete preprocessing workflow:
```python
raw_data = [
    {"size": 1200, "bedrooms": 2, "location": "urban"},
    {"size": 1500, "bedrooms": None, "location": "suburban"},
    {"size": 2000, "bedrooms": 4, "location": "urban"},
]

print("Step 1: Handle missing values")
mean_bedrooms = 3.0  # mean of the known bedroom counts (2 and 4)
for record in raw_data:
    if record["bedrooms"] is None:
        record["bedrooms"] = mean_bedrooms

print("Step 2: Encode categorical data")
location_map = {"urban": 0, "suburban": 1, "rural": 2}
for record in raw_data:
    record["location_encoded"] = location_map[record["location"]]

print("Step 3: Normalize numerical features")
sizes = [r["size"] for r in raw_data]
min_size, max_size = min(sizes), max(sizes)
for record in raw_data:
    record["size_normalized"] = (record["size"] - min_size) / (max_size - min_size)

print("\nPreprocessed Data (ready for ML):")
for i, record in enumerate(raw_data, 1):
    print(f" {i}. Size: {record['size_normalized']:.3f}, "
          f"Bedrooms: {record['bedrooms']}, Location: {record['location_encoded']}")
```
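In a real project you would typically wrap these steps in a single reusable function so the same transformations can be applied to any batch of records. One possible sketch (the `preprocess` function and its structure are my own, not a fixed API):

```python
def preprocess(records, location_map):
    """Run the three pipeline steps on a list of house records (sketch)."""
    # Step 1: fill missing bedrooms with the mean of the known values
    known = [r["bedrooms"] for r in records if r["bedrooms"] is not None]
    mean_bedrooms = sum(known) / len(known)
    for r in records:
        if r["bedrooms"] is None:
            r["bedrooms"] = mean_bedrooms
    # Step 2: encode the location category as an integer
    for r in records:
        r["location_encoded"] = location_map[r["location"]]
    # Step 3: min-max normalize size
    sizes = [r["size"] for r in records]
    lo, hi = min(sizes), max(sizes)
    for r in records:
        r["size_normalized"] = (r["size"] - lo) / (hi - lo)
    return records

raw = [
    {"size": 1200, "bedrooms": 2, "location": "urban"},
    {"size": 1500, "bedrooms": None, "location": "suburban"},
    {"size": 2000, "bedrooms": 4, "location": "urban"},
]
result = preprocess(raw, {"urban": 0, "suburban": 1, "rural": 2})
for r in result:
    print(r["size_normalized"], r["bedrooms"], r["location_encoded"])
```

Note that the means and min/max used for scaling should, in a real workflow, be computed on training data only and then reused for new data, so the same record is always transformed the same way.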
Project Checklist
Complete these steps for your project:
- ✅ Handle Missing Values: Identify and fill or remove missing data
- ✅ Remove Duplicates: Eliminate duplicate records
- ✅ Encode Categorical: Convert text categories to numbers
- ✅ Normalize Features: Scale features to similar ranges
- ✅ Handle Outliers: Detect and treat extreme values
- ✅ Validate Data: Check data quality and consistency
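Two checklist items, duplicate removal and outlier handling, aren't covered by the worked examples above. A minimal sketch of each, assuming exact-match duplicates and the common 1.5 × IQR rule with simple index-based quartiles (both helper names are my own):

```python
def remove_duplicates(records):
    """Drop exact duplicate dicts, keeping the first occurrence."""
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def iqr_outliers(values):
    """Flag values outside 1.5 * IQR beyond the quartiles (simple quartile picks)."""
    s = sorted(values)
    q1 = s[len(s) // 4]
    q3 = s[(3 * len(s)) // 4]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

rows = [{"size": 1200}, {"size": 1200}, {"size": 1500}]
print(remove_duplicates(rows))          # duplicate 1200 row dropped
print(iqr_outliers([5, 10, 2, 15, 8, 100]))  # 100 flagged as an outlier
```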
Once preprocessing is complete, your data is ready for machine learning models!
Exercise: Complete Preprocessing Pipeline
Complete the exercise on the right side to build a full preprocessing pipeline:
- Step 1: Fill missing values in 'bedrooms' with the mean
- Step 2: Encode the 'location' categorical feature
- Step 3: Normalize the 'size' feature using min-max normalization
- Step 4: Print the final preprocessed dataset
Write your code to complete all preprocessing steps!
💡 Project Tips
Break the project into smaller tasks. Complete and test each part before moving to the next. Don't try to do everything at once: iterative development leads to better results!
🎉 Lesson Complete!
Great work! Continue to the next lesson.