Data Preprocessing Project

This project will help you apply everything you've learned about data preparation. You'll build a complete data preprocessing pipeline that handles missing values, outliers, encoding, and normalization.

Projects are where theory meets practice. This is your chance to integrate multiple concepts and create something real.

Project Overview

We'll work with a sample dataset containing house information. Our goal is to clean and preprocess this data for machine learning:

project_overview.py
# Data Preprocessing Project - Overview

# Sample dataset (with issues to fix)
raw_data = [
    {"size": 1200, "bedrooms": 2, "age": 5, "location": "urban", "price": 250000},
    {"size": 1500, "bedrooms": None, "age": 10, "location": "suburban", "price": 320000},
    {"size": 2000, "bedrooms": 4, "age": 2, "location": "urban", "price": 450000},
    {"size": None, "bedrooms": 3, "age": 15, "location": "rural", "price": 280000},
    {"size": 1600, "bedrooms": 3, "age": 8, "location": "suburban", "price": 350000},
]

print("Raw Data (with issues):")
print(" - Missing values (None)")
print(" - Categorical data (location)")
print(" - Different scales (size vs age)")
print(" - Need preprocessing for ML!")

Step 1: Handle Missing Values

First, we need to identify and handle missing values:

step1_missing_values.py
# Step 1: Handle Missing Values

data = [
    {"size": 1200, "bedrooms": 2, "age": 5},
    {"size": 1500, "bedrooms": None, "age": 10},
    {"size": 2000, "bedrooms": 4, "age": 2},
    {"size": None, "bedrooms": 3, "age": 15},
    {"size": 1600, "bedrooms": 3, "age": 8},
]

# Calculate the mean of each numerical feature, skipping missing entries
sizes = [d["size"] for d in data if d["size"] is not None]
bedrooms = [d["bedrooms"] for d in data if d["bedrooms"] is not None]
mean_size = sum(sizes) / len(sizes)
mean_bedrooms = sum(bedrooms) / len(bedrooms)

print("Missing Value Handling:")
print(f" Mean size: {mean_size:.0f}")
print(f" Mean bedrooms: {mean_bedrooms:.1f}")

# Fill missing values
for record in data:
    if record["size"] is None:
        record["size"] = mean_size
    if record["bedrooms"] is None:
        # A bedroom count should be a whole number, so round the mean
        record["bedrooms"] = round(mean_bedrooms)

print("\nData after filling missing values:")
for i, record in enumerate(data, 1):
    print(f" {i}. {record}")
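Before filling anything, it can help to see where the gaps actually are. The following is a minimal sketch of that "identify" step, not part of the lesson's own code: the helper name count_missing is hypothetical, and it assumes missing entries are stored as None, as they are in the sample data.

```python
def count_missing(records):
    """Return {field: number of records in which that field is None}."""
    counts = {}
    for record in records:
        for field, value in record.items():
            counts.setdefault(field, 0)  # make sure every field is listed
            if value is None:
                counts[field] += 1
    return counts


data = [
    {"size": 1200, "bedrooms": 2, "age": 5},
    {"size": 1500, "bedrooms": None, "age": 10},
    {"size": 2000, "bedrooms": 4, "age": 2},
    {"size": None, "bedrooms": 3, "age": 15},
    {"size": 1600, "bedrooms": 3, "age": 8},
]

missing = count_missing(data)
print("Missing values per field:")
for field, n in missing.items():
    print(f" {field}: {n}")
```

A field with a count of zero, such as age here, needs no filling at all.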

Step 2: Encode Categorical Data

Convert categorical features to numerical format:

step2_encoding.py
# Step 2: Encode Categorical Data

data = [
    {"location": "urban"},
    {"location": "suburban"},
    {"location": "urban"},
    {"location": "rural"},
    {"location": "suburban"},
]

# Create encoding mapping (sorted so the codes are the same on every run)
unique_locations = sorted(set(d["location"] for d in data))
location_to_num = {loc: i for i, loc in enumerate(unique_locations)}

print("Categorical Encoding:")
print(" Mapping:", location_to_num)

# Encode locations
for record in data:
    record["location_encoded"] = location_to_num[record["location"]]

print("\nEncoded Data:")
for i, record in enumerate(data, 1):
    # Single quotes inside the f-string keep this valid before Python 3.12
    print(f" {i}. {record['location']} → {record['location_encoded']}")
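Integer codes such as 0, 1, 2 suggest an order among locations that does not really exist, and some models may treat that order as meaningful. One common alternative is one-hot encoding, which gives each category its own 0/1 column. This is only a sketch of that idea, assuming the same list-of-dicts layout; the helper name one_hot and the generated column names are hypothetical.

```python
def one_hot(records, field):
    """Return one dict per record with `field` expanded into 0/1 columns."""
    # Sort the categories so the column order is stable between runs
    categories = sorted({r[field] for r in records})
    encoded = []
    for r in records:
        row = {f"{field}_{c}": int(r[field] == c) for c in categories}
        encoded.append(row)
    return encoded


data = [
    {"location": "urban"},
    {"location": "suburban"},
    {"location": "urban"},
    {"location": "rural"},
    {"location": "suburban"},
]

encoded = one_hot(data, "location")
print("One-hot encoded locations:")
for original, row in zip(data, encoded):
    print(f" {original['location']}: {row}")
```

Each row contains exactly one 1, so no category is ranked above another. The trade-off is one extra column per category.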

Step 3: Normalize Features

Scale features to similar ranges for better model performance:

step3_normalization.py
# Step 3: Normalize Features

# Features with different scales
sizes = [1200, 1500, 2000, 1575, 1600]  # Large values
ages = [5, 10, 2, 15, 8]  # Small values

print("Original features (different scales):")
print(" Sizes:", sizes)
print(" Ages:", ages)

# Min-Max Normalization
def normalize(values):
    min_val = min(values)
    max_val = max(values)
    if max_val == min_val:
        # All values are identical: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - min_val) / (max_val - min_val) for v in values]

sizes_normalized = normalize(sizes)
ages_normalized = normalize(ages)

print("\nNormalized features (0-1 scale):")
print(" Sizes normalized:", [round(x, 3) for x in sizes_normalized])
print(" Ages normalized:", [round(x, 3) for x in ages_normalized])
print("\n Now both features are on the same scale!")
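Min-max normalization maps each value x to (x - min) / (max - min). When a new record arrives later, it is usually scaled with the min and max learned from the original data rather than recomputed ones, so that old and new values stay comparable. The sketch below illustrates this under stated assumptions: the helper names fit_min_max and apply_min_max are hypothetical, and the new house size of 1800 is invented for the example.

```python
def fit_min_max(values):
    """Learn the scaling parameters from the original data."""
    return min(values), max(values)


def apply_min_max(value, min_val, max_val):
    """Scale one value with previously learned parameters."""
    if max_val == min_val:
        return 0.0  # constant feature: avoid dividing by zero
    return (value - min_val) / (max_val - min_val)


sizes = [1200, 1500, 2000, 1575, 1600]
min_size, max_size = fit_min_max(sizes)

new_size = 1800  # hypothetical new house, not in the lesson's dataset
scaled = apply_min_max(new_size, min_size, max_size)
print(f"Size {new_size} scaled with min={min_size}, max={max_size}: {scaled:.3f}")
```

A new value outside the original range would land below 0 or above 1. Whether to clip it is a design choice this sketch leaves open.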

Step 4: Complete Preprocessing Pipeline

Putting it all together - a complete preprocessing workflow:

complete_pipeline.py
# Complete Data Preprocessing Pipeline

# Load raw data
raw_data = [
    {"size": 1200, "bedrooms": 2, "location": "urban"},
    {"size": 1500, "bedrooms": None, "location": "suburban"},
    {"size": 2000, "bedrooms": 4, "location": "urban"},
]

print("Step 1: Handle missing values")
# Fill missing bedrooms with the rounded mean of the known values
known_bedrooms = [r["bedrooms"] for r in raw_data if r["bedrooms"] is not None]
mean_bedrooms = sum(known_bedrooms) / len(known_bedrooms)
for record in raw_data:
    if record["bedrooms"] is None:
        record["bedrooms"] = round(mean_bedrooms)

print("Step 2: Encode categorical data")
location_map = {"urban": 0, "suburban": 1, "rural": 2}
for record in raw_data:
    record["location_encoded"] = location_map[record["location"]]

print("Step 3: Normalize numerical features")
sizes = [r["size"] for r in raw_data]
min_size, max_size = min(sizes), max(sizes)
for record in raw_data:
    record["size_normalized"] = (record["size"] - min_size) / (max_size - min_size)

print("\nPreprocessed Data (ready for ML):")
for i, record in enumerate(raw_data, 1):
    # Single quotes inside the f-string keep this valid before Python 3.12
    print(f" {i}. Size: {record['size_normalized']:.3f}, "
          f"Bedrooms: {record['bedrooms']}, "
          f"Location: {record['location_encoded']}")

Project Checklist

Complete these steps for your project:

  • ✓ Handle Missing Values: Identify and fill or remove missing data
  • ✓ Remove Duplicates: Eliminate duplicate records
  • ✓ Encode Categorical: Convert text categories to numbers
  • ✓ Normalize Features: Scale features to similar ranges
  • ✓ Handle Outliers: Detect and treat extreme values
  • ✓ Validate Data: Check data quality and consistency

Once preprocessing is complete, your data is ready for machine learning models!
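The checklist also names duplicates and outliers, which the worked steps do not demonstrate. The following is a rough sketch of both, under several assumptions: a duplicate means a record identical in every field, outliers are flagged with the 1.5 × IQR rule (a common convention, not something this lesson prescribes), the helper names are hypothetical, and the repeated record and the 2,500,000 price were added to the sample purely for illustration.

```python
import statistics


def remove_duplicates(records):
    """Keep the first occurrence of each fully identical record."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits; values outside them count as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


records = [
    {"size": 1200, "price": 250000},
    {"size": 1500, "price": 320000},
    {"size": 1500, "price": 320000},   # illustrative duplicate
    {"size": 2000, "price": 450000},
    {"size": 1600, "price": 350000},
    {"size": 1400, "price": 2500000},  # illustrative extreme value
]

unique = remove_duplicates(records)
lower, upper = iqr_bounds([r["price"] for r in unique])
outliers = [r for r in unique if not lower <= r["price"] <= upper]

print(f"Records after removing duplicates: {len(unique)}")
print(f"Price limits: {lower:.0f} to {upper:.0f}")
print("Flagged as outliers:", outliers)
```

Flagging is only detection. Whether to drop, cap, or keep a flagged record depends on whether the value is an error or a genuine extreme.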

Exercise: Complete Preprocessing Pipeline

Build a full preprocessing pipeline that completes the following steps:

  • Step 1: Fill missing values in 'bedrooms' with the mean
  • Step 2: Encode the 'location' categorical feature
  • Step 3: Normalize the 'size' feature using min-max normalization
  • Step 4: Print the final preprocessed dataset

Write your code to complete all preprocessing steps!

💡 Project Tips

Break the project into smaller tasks. Complete and test each part before moving to the next. Don't try to do everything at once; iterative development leads to better results!
