Data Manipulation Project

🎯 Project: Data Manipulation with NumPy and Pandas

This project will help you apply what you've learned about NumPy and Pandas. You'll work with small sample datasets, manipulate them, and prepare them for machine learning.

Data manipulation is a crucial skill in ML. You'll use NumPy for numerical operations and Pandas for structured data handling.

Working with NumPy Arrays

NumPy arrays are the foundation for numerical computations. Let's see how to manipulate them:

numpy_operations.py
# NumPy Array Manipulation
import numpy as np

# Create sample data
sales = np.array([100, 150, 200, 180, 220, 250, 300])
months = np.array(["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"])

print("Sales Data:", sales)
print("Mean sales:", np.mean(sales))
print("Max sales:", np.max(sales))
print("Min sales:", np.min(sales))

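# Filter data (sales > 200) with a boolean mask
mask = sales > 200
high_sales = sales[mask]
high_months = months[mask]  # same mask selects the matching month names
print("\nHigh sales months (>200):", high_months)
print("High sales values:", high_sales)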

# Calculate percentage change: np.diff gives sales[i+1] - sales[i],
# which is divided by the earlier month's sales (sales[:-1])
pct_change = np.diff(sales) / sales[:-1] * 100
print("\nMonth-over-month change (%):", pct_change)
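
Because `months` and `sales` line up position by position, an index computed on one array can be used to look up the other. The following is a minimal sketch that reuses the sample arrays above; the variable names `best_month` and `biggest_jump_month` are illustrative and not part of the original lesson.

```python
import numpy as np

sales = np.array([100, 150, 200, 180, 220, 250, 300])
months = np.array(["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"])

# Position of the highest sales value, used to look up the month name
best_idx = np.argmax(sales)
best_month = months[best_idx]

# Month-over-month change has one fewer element than sales,
# so each change is labelled with the month it ends in (months[1:])
pct_change = np.diff(sales) / sales[:-1] * 100
biggest_jump_month = months[1:][np.argmax(pct_change)]

print("Best month:", best_month)
print("Largest month-over-month gain ends in:", biggest_jump_month)
```

Note the `months[1:]` slice: the first month has no previous month to compare against, so it has no percentage change.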

Pandas DataFrame Operations

Pandas makes it easy to work with structured data. Here are common operations:

pandas_operations.py
# Pandas DataFrame Manipulation
import pandas as pd

# Create DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 75000, 90000, 65000],
    'department': ['Sales', 'IT', 'IT', 'Sales']
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Filter rows
it_employees = df[df['department'] == 'IT']
print("\nIT Department:")
print(it_employees)

# Calculate statistics
print("\nSalary Statistics:")
print("  Mean:", df['salary'].mean())
print("  Median:", df['salary'].median())
print("  Max:", df['salary'].max())

# Group by department
dept_stats = df.groupby('department')['salary'].mean()
print("\nAverage salary by department:")
print(dept_stats)
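
If more than one statistic per group is needed, `groupby` can be combined with `agg`. This is a small sketch using the same sample employee data as above; the name `dept_summary` is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 75000, 90000, 65000],
    'department': ['Sales', 'IT', 'IT', 'Sales']
})

# Several salary statistics per department in one call
dept_summary = df.groupby('department')['salary'].agg(['mean', 'min', 'max', 'count'])
print(dept_summary)
```

The result is a DataFrame indexed by department, with one column per statistic.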

Data Cleaning and Transformation

Real data often has missing values. Here's how to fill them in and derive a new column from the cleaned data:

data_cleaning.py
# Data Cleaning with Pandas
import pandas as pd
import numpy as np

# Data with issues
data = {
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [10.5, 20.0, None, 30.5, 15.0],
    'quantity': [100, 50, 75, None, 200]
}
df = pd.DataFrame(data)

print("Data with missing values:")
print(df)

# Fill missing values: mean for price, median for quantity
df['price'] = df['price'].fillna(df['price'].mean())
df['quantity'] = df['quantity'].fillna(df['quantity'].median())

print("\nAfter filling missing values:")
print(df)

# Create new column (total revenue)
df['revenue'] = df['price'] * df['quantity']
print("\nWith revenue column:")
print(df)
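
Before choosing how to fill gaps, it can help to count how many values are missing in each column. The sketch below uses the same sample product data and also shows `dropna` as one possible alternative to filling; which approach is appropriate depends on the dataset, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [10.5, 20.0, None, 30.5, 15.0],
    'quantity': [100, 50, 75, None, 200]
})

# Count missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Alternative to filling: keep only rows with no missing values
complete_rows = df.dropna()
print(complete_rows)
```

Dropping rows discards the products with a missing price or quantity entirely, so it trades data volume for not having to estimate values.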

Exercise: Complete Data Manipulation Project

Complete the following tasks:

Task 1: Create a NumPy array with sales data and calculate statistics
Task 2: Create a Pandas DataFrame with employee data
Task 3: Filter data based on conditions (salary > 60000)
Task 4: Calculate average salary by department
Task 5: Create a new column (bonus = salary * 0.1)

Write your code to complete all data manipulation tasks!
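
As a starting point, Tasks 3 and 5 could be sketched as follows. This assumes the sample employee data from the Pandas section; your own DataFrame and column names may differ, and `high_earners` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 75000, 90000, 65000],
    'department': ['Sales', 'IT', 'IT', 'Sales']
})

# Task 3: keep only rows where salary is above 60000
high_earners = df[df['salary'] > 60000]
print(high_earners)

# Task 5: add a bonus column equal to 10% of salary
df['bonus'] = df['salary'] * 0.1
print(df)
```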

💡 Project Tips

Break the project into smaller tasks. Complete and test each part before moving to the next instead of trying to do everything at once; small, verified steps make errors easier to trace.
