🎯 Project: Data Manipulation with NumPy and Pandas
This project lets you apply what you've learned about NumPy and Pandas. You'll work with small sample datasets, perform common manipulations, and prepare the data for machine learning.
Data manipulation is a crucial skill in ML. You'll use NumPy for numerical operations and Pandas for structured data handling.
Working with NumPy Arrays
NumPy arrays are the foundation for numerical computations. Let's see how to manipulate them:
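import numpy as np

# Monthly sales figures and their matching month labels
sales = np.array([100, 150, 200, 180, 220, 250, 300])
months = np.array(["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"])
print("Sales Data:", sales)

# Summary statistics over the whole array
print("Mean sales:", np.mean(sales))
print("Max sales:", np.max(sales))
print("Min sales:", np.min(sales))

# Boolean mask: True where sales exceed 200; reuse it to pick the month names
high_mask = sales > 200
high_sales = sales[high_mask]
print("\nHigh sales months (>200):", months[high_mask], high_sales)

# Percentage change relative to the previous month (one value fewer than sales)
pct_change = np.diff(sales) / sales[:-1] * 100
print("\nMonth-over-month change (%):", pct_change)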
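To make the month-over-month calculation concrete, the sketch below breaks it into steps and pairs each percentage with the month it describes. The names `diffs`, `change_months`, and `best_month` are illustrative and not part of the example above.

```python
import numpy as np

sales = np.array([100, 150, 200, 180, 220, 250, 300])
months = np.array(["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"])

# np.diff returns sales[i + 1] - sales[i], so it has one element fewer than sales
diffs = np.diff(sales)

# Divide each difference by the previous month's value to get a percentage
pct_change = diffs / sales[:-1] * 100

# The first change describes Feb (relative to Jan), so skip the first label
change_months = months[1:]
for month, pct in zip(change_months, pct_change):
    print(f"{month}: {pct:+.1f}%")

# Month with the largest relative gain
best_month = change_months[np.argmax(pct_change)]
print("Largest gain:", best_month)
```

Jan has no entry in the output because there is no earlier month to compare it against.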
Pandas DataFrame Operations
Pandas makes it easy to work with structured data. Here are common operations:
import pandas as pd

# Small employee dataset: each key becomes a column
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 75000, 90000, 65000],
    'department': ['Sales', 'IT', 'IT', 'Sales']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Boolean filter: keep only the rows where department is 'IT'
it_employees = df[df['department'] == 'IT']
print("\nIT Department:")
print(it_employees)

# Summary statistics for a single column
print("\nSalary Statistics:")
print(" Mean:", df['salary'].mean())
print(" Median:", df['salary'].median())
print(" Max:", df['salary'].max())

# Group rows by department, then average the salary within each group
dept_stats = df.groupby('department')['salary'].mean()
print("\nAverage salary by department:")
print(dept_stats)
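Filtering and grouping often need more than one condition or one statistic. The following is a minimal sketch on the same employee data; the variable names `senior_it` and `dept_summary` and the age threshold of 30 are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 75000, 90000, 65000],
    'department': ['Sales', 'IT', 'IT', 'Sales']
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
senior_it = df[(df['department'] == 'IT') & (df['age'] > 30)]
print(senior_it)

# agg accepts a list of function names and returns one column per statistic
dept_summary = df.groupby('department')['salary'].agg(['mean', 'min', 'max', 'count'])
print(dept_summary)
```

The parentheses matter because `&` binds more tightly than `==` and `>` in Python, so leaving them out changes how the expression is evaluated.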
Data Cleaning and Transformation
Real data often needs cleaning. Here's how to handle common issues:
import pandas as pd

# Product data with gaps; None becomes NaN, so both numeric columns are stored as floats
data = {
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [10.5, 20.0, None, 30.5, 15.0],
    'quantity': [100, 50, 75, None, 200]
}
df = pd.DataFrame(data)
print("Data with missing values:")
print(df)

# Fill the missing price with the column mean and the missing quantity with the median
df['price'] = df['price'].fillna(df['price'].mean())
df['quantity'] = df['quantity'].fillna(df['quantity'].median())
print("\nAfter filling missing values:")
print(df)

# Derive a new column from two existing ones (element-wise multiplication)
df['revenue'] = df['price'] * df['quantity']
print("\nWith revenue column:")
print(df)
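Before choosing a fill strategy, it can help to count the missing values per column and to compare filling with simply dropping incomplete rows. The sketch below does both on the same product data; `missing_counts`, `complete_rows`, and `filled` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [10.5, 20.0, None, 30.5, 15.0],
    'quantity': [100, 50, 75, None, 200]
})

# Count missing values in each column before changing anything
missing_counts = df.isna().sum()
print(missing_counts)

# Alternative to filling: drop every row that has at least one missing value
complete_rows = df.dropna()
print(complete_rows)

# Filling keeps all rows; fillna returns a new DataFrame and leaves df unchanged
filled = df.fillna({'price': df['price'].mean(), 'quantity': df['quantity'].median()})
print(filled.isna().sum().sum())
```

Dropping rows is simpler but discards data (two of the five products here), while filling keeps every row at the cost of introducing estimated values.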
Exercise: Complete Data Manipulation Project
Complete each of the following tasks:
- Task 1: Create a NumPy array with sales data and calculate statistics
- Task 2: Create a Pandas DataFrame with employee data
- Task 3: Filter data based on conditions (salary > 60000)
- Task 4: Calculate average salary by department
- Task 5: Create a new column (bonus = salary * 0.1)
Write your code to complete all data manipulation tasks!