Data Manipulation Project

🎯 Project: Data Manipulation with NumPy and Pandas

This project applies what you've learned about NumPy and Pandas. You'll work with small sample datasets, filter and summarize them, handle missing values, and derive new columns, which are the same steps used to prepare data for machine learning.

Data manipulation is a crucial skill in ML. You'll use NumPy for numerical operations and Pandas for structured data handling.

Working with NumPy Arrays

NumPy arrays are the foundation for numerical computations. Let's see how to manipulate them:

numpy_operations.py
# NumPy Array Manipulation
import numpy as np

# Create sample data
sales = np.array([100, 150, 200, 180, 220, 250, 300])
months = np.array(["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"])

print("Sales Data:", sales)
print("Mean sales:", np.mean(sales))
print("Max sales:", np.max(sales))
print("Min sales:", np.min(sales))

# Filter data (sales > 200) and look up the matching months
mask = sales > 200
high_sales = sales[mask]
print("\nHigh sales (>200):", high_sales)
print("High sales months:", months[mask])

# Calculate percentage change between consecutive months
pct_change = np.diff(sales) / sales[:-1] * 100
print("\nMonth-over-month change (%):", pct_change)
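
The percentage-change step is worth a closer look: np.diff returns one fewer element than its input, so each change belongs to a pair of consecutive months. The following is a minimal sketch that reuses the sales and months arrays from the example; the printed "Jan -> Feb" labels are an illustrative addition, not part of the lesson code.

```python
import numpy as np

sales = np.array([100, 150, 200, 180, 220, 250, 300])
months = np.array(["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"])

# np.diff(sales)[i] is sales[i + 1] - sales[i], so it has len(sales) - 1 elements.
# Dividing by sales[:-1] expresses each difference relative to the earlier month.
pct_change = np.diff(sales) / sales[:-1] * 100

# Pair each change with the two months it connects
for prev, curr, change in zip(months[:-1], months[1:], pct_change):
    print(f"{prev} -> {curr}: {change:+.1f}%")
```

For instance, the first value is (150 - 100) / 100 * 100 = 50.0, and the drop from 200 to 180 gives -10.0.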

Pandas DataFrame Operations

Pandas makes it easy to work with structured data. Here are common operations:

pandas_operations.py
# Pandas DataFrame Manipulation
import pandas as pd

# Create DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 75000, 90000, 65000],
    'department': ['Sales', 'IT', 'IT', 'Sales']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Filter rows
it_employees = df[df['department'] == 'IT']
print("\nIT Department:")
print(it_employees)

# Calculate statistics
print("\nSalary Statistics:")
print(" Mean:", df['salary'].mean())
print(" Median:", df['salary'].median())
print(" Max:", df['salary'].max())

# Group by department
dept_stats = df.groupby('department')['salary'].mean()
print("\nAverage salary by department:")
print(dept_stats)
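
Two small extensions of the filtering and grouping steps often come up in practice. This sketch rebuilds the same employee DataFrame and shows one way to combine filter conditions and to compute several statistics per group; the variable names senior_it and dept_summary and the age threshold of 30 are illustrative choices, not part of the lesson.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "salary": [50000, 75000, 90000, 65000],
    "department": ["Sales", "IT", "IT", "Sales"],
})

# Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses because & binds tighter than ==.
senior_it = df[(df["department"] == "IT") & (df["age"] > 30)]
print(senior_it)

# Compute several statistics per group in a single call
dept_summary = df.groupby("department")["salary"].agg(["mean", "max", "count"])
print(dept_summary)
```

Using Python's `and` instead of `&` here raises an error, because a boolean Series has no single truth value.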

Data Cleaning and Transformation

Real data often needs cleaning. Here's how to handle common issues:

data_cleaning.py
# Data Cleaning with Pandas
import pandas as pd
import numpy as np

# Data with issues
data = {
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [10.5, 20.0, None, 30.5, 15.0],
    'quantity': [100, 50, 75, None, 200]
}
df = pd.DataFrame(data)
print("Data with missing values:")
print(df)

# Fill missing values
df['price'] = df['price'].fillna(df['price'].mean())
df['quantity'] = df['quantity'].fillna(df['quantity'].median())
print("\nAfter filling missing values:")
print(df)

# Create new column (total revenue)
df['revenue'] = df['price'] * df['quantity']
print("\nWith revenue column:")
print(df)
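
Before filling anything, it usually helps to see how much is missing. The sketch below uses the same product data to count missing values per column and, as an alternative to filling, to drop incomplete rows; which strategy fits depends on the dataset, and the names missing_counts and complete_rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E"],
    "price": [10.5, 20.0, None, 30.5, 15.0],
    "quantity": [100, 50, 75, None, 200],
})

# Count missing values per column before choosing a strategy
missing_counts = df.isna().sum()
print(missing_counts)

# Alternative to filling: drop every row that has at least one missing value
complete_rows = df.dropna()
print(complete_rows)
```

Dropping rows discards two of the five products here (C and D), which is one reason filling with the mean or median, as in the example above, can be preferable on small datasets.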

Exercise: Complete Data Manipulation Project

Complete the following tasks:

  • Task 1: Create a NumPy array with sales data and calculate statistics
  • Task 2: Create a Pandas DataFrame with employee data
  • Task 3: Filter data based on conditions (salary > 60000)
  • Task 4: Calculate average salary by department
  • Task 5: Create a new column (bonus = salary * 0.1)

Write your code to complete all data manipulation tasks!
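
If you are unsure where to begin, the following is one possible starting sketch, not a reference solution. It assumes the sample sales and employee values from the earlier examples; your own data, variable names, and output formatting may differ.

```python
import numpy as np
import pandas as pd

# Task 1: NumPy array with sales data and basic statistics
sales = np.array([100, 150, 200, 180, 220, 250, 300])
print("Mean:", sales.mean(), "Max:", sales.max(), "Min:", sales.min())

# Task 2: DataFrame with employee data (sample values from the lesson)
employees = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "salary": [50000, 75000, 90000, 65000],
    "department": ["Sales", "IT", "IT", "Sales"],
})

# Task 3: keep only rows where salary > 60000
high_earners = employees[employees["salary"] > 60000]
print(high_earners)

# Task 4: average salary by department
avg_by_dept = employees.groupby("department")["salary"].mean()
print(avg_by_dept)

# Task 5: new column, bonus = salary * 0.1
employees["bonus"] = employees["salary"] * 0.1
print(employees)
```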

💡 Project Tips

Break the project into smaller tasks. Complete and test each part before moving to the next. Don't try to do everything at once—iterative development leads to better results!
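
One way to put this tip into practice is to wrap a step in a small function and check it on a tiny input before building on it. The sketch below does this for the bonus step; add_bonus is a hypothetical helper name introduced for illustration, and the two-row sample is made up for the check.

```python
import pandas as pd

def add_bonus(df, rate=0.1):
    """Return a copy of df with a bonus column equal to salary * rate."""
    result = df.copy()
    result["bonus"] = result["salary"] * rate
    return result

# Check the step on a tiny input before moving to the next task
sample = pd.DataFrame({"salary": [50000, 75000]})
with_bonus = add_bonus(sample)

assert "bonus" in with_bonus.columns
assert "bonus" not in sample.columns  # the input DataFrame is left unchanged
assert abs(with_bonus["bonus"].iloc[0] - 5000) < 1e-6
print("add_bonus check passed")
```

The comparison uses a small tolerance rather than exact equality because multiplying by 0.1 is a floating-point operation.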
