Introduction to Pandas

Pandas is one of the most widely used Python libraries for data manipulation and analysis. It provides two core data structures, the DataFrame and the Series, that make working with structured data intuitive and efficient.

While NumPy works with arrays, pandas works with labeled data: rows and columns that carry names, much like a spreadsheet in Python. This makes it well suited to cleaning, transforming, and analyzing real-world datasets before feeding them into machine learning models.

Why Pandas for Machine Learning?

Pandas is essential for ML because:

Data Cleaning: Handle missing values, remove duplicates, fix inconsistencies
Data Exploration: Quickly understand your data with summary statistics
Data Transformation: Reshape, filter, and transform data easily
Integration: Works seamlessly with NumPy and scikit-learn
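
As a rough illustration of the integration point, the sketch below converts DataFrame columns into NumPy arrays, the form that libraries such as scikit-learn typically accept. The column values and the variable names X and y are assumptions made for this example, not part of the lesson's datasets.

```python
import numpy as np
import pandas as pd

# Small illustrative dataset (values are made up for this example)
df = pd.DataFrame({
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000]
})

# Double brackets keep a DataFrame, so the result is a 2D feature array
X = df[['age']].to_numpy()

# Single brackets give a Series, so the result is a 1D target array
y = df['salary'].to_numpy()

print(type(X), X.shape)
print(type(y), y.shape)
```

Selecting with double brackets matters here because many estimators expect features as a 2D array, even when there is only one feature column.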

Creating DataFrames

A DataFrame is a 2D labeled data structure with columns of potentially different types. You can create one from dictionaries, lists, or CSV files:

creating_dataframes.py
import pandas as pd

# Create DataFrame from dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'Chicago', 'NYC'],
    'salary': [50000, 60000, 70000, 55000]
}

df = pd.DataFrame(data)
print("DataFrame:")
print(df)
print("\nShape:", df.shape)  # (rows, columns)
print("Columns:", df.columns.tolist())
print("Data types:")
print(df.dtypes)
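
The example above covers the dictionary case. A minimal sketch of the other two sources might look like the following. The records and the CSV text are invented for illustration, and io.StringIO stands in for a file on disk so the snippet runs without one; in practice pd.read_csv would usually receive a file path.

```python
import io

import pandas as pd

# From a list of dictionaries: each dictionary becomes one row
records = [
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30},
]
df_records = pd.DataFrame(records)
print(df_records)

# From CSV data: read_csv accepts a path or any file-like object.
# The CSV text below is made up for this example.
csv_text = "name,age,city\nCharlie,35,Chicago\nDiana,28,NYC\n"
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
print(df_csv.dtypes)
```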

Basic DataFrame Operations

Pandas provides many convenient methods for exploring and manipulating data:

dataframe_operations.py
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})

# View first few rows
print("First 2 rows:")
print(df.head(2))

# Get basic statistics
print("\nSummary statistics:")
print(df.describe())

# Access columns
print("\nAge column:")
print(df['age'])

# Access a row by integer position (iloc is position-based, not label-based)
print("\nFirst row:")
print(df.iloc[0])
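
Row access comes in two flavors that are easy to confuse: iloc selects by integer position, while loc selects by index label. The sketch below shows the difference. It assumes a DataFrame with string row labels, which are hypothetical and chosen only to make the contrast visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [25, 30, 35], 'salary': [50000, 60000, 70000]},
    index=['alice', 'bob', 'charlie'],  # hypothetical row labels
)

# iloc: integer position, so 1 means the second row
by_position = df.iloc[1]

# loc: index label, so 'bob' means the row labeled 'bob'
by_label = df.loc['bob']

# Both expressions return the same row here
print(by_position.equals(by_label))

# loc also accepts a column label to pick out a single cell
bob_salary = df.loc['bob', 'salary']
print(bob_salary)
```

With the default index of 0, 1, 2, and so on, the two accessors often appear to behave the same for single rows, which is why the distinction tends to surface only after filtering or sorting changes the index.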

Filtering and Selecting Data

Pandas makes it easy to filter and select specific rows and columns:

filtering.py
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000]
})

# Filter rows where age > 28
older = df[df['age'] > 28]
print("People older than 28:")
print(older)

# Filter with multiple conditions
high_salary = df[(df['age'] > 28) & (df['salary'] > 55000)]
print("\nAge > 28 AND salary > 55000:")
print(high_salary)

# Select specific columns
print("\nName and salary columns:")
print(df[['name', 'salary']])
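
A few related patterns build on the same boolean-mask idea. The following sketch reuses the DataFrame from the example above; the variable names are illustrative. Note that each comparison needs its own parentheses, because & and | bind more tightly than > and < in Python.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000]
})

# OR condition: use | and wrap each comparison in parentheses
young_or_top = df[(df['age'] < 28) | (df['salary'] >= 70000)]
print("Age < 28 OR salary >= 70000:")
print(young_or_top)

# Membership test: keep rows whose name is in a given list
selected = df[df['name'].isin(['Bob', 'Diana'])]
print("\nBob and Diana:")
print(selected)

# Filter rows and select columns in one step with .loc
names_over_28 = df.loc[df['age'] > 28, ['name', 'age']]
print("\nName and age where age > 28:")
print(names_over_28)
```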

Handling Missing Data

Real-world data often has missing values. Pandas provides tools to handle them:

missing_data.py
import pandas as pd
import numpy as np

# DataFrame with missing values
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, np.nan, 35],
    'salary': [50000, 60000, np.nan]
})

print("DataFrame with missing values:")
print(df)

# Check for missing values
print("\nMissing values:")
print(df.isnull())

# Fill missing values
df_filled = df.fillna(0)  # Fill with 0
print("\nAfter filling with 0:")
print(df_filled)

# Drop rows with missing values
df_dropped = df.dropna()
print("\nAfter dropping rows with missing values:")
print(df_dropped)
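
Filling with 0 is simple, but for columns like age or salary a zero can distort statistics and mislead a model. One common alternative, sketched below under the assumption that median imputation suits the data, is to count the missing values per column and then fill each numeric column with its own median. The names numeric_cols and df_imputed are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, np.nan, 35],
    'salary': [50000, 60000, np.nan]
})

# Count missing values per column
missing_counts = df.isnull().sum()
print("Missing values per column:")
print(missing_counts)

# Fill each numeric column with that column's median.
# Passing a Series to fillna matches its index against column names.
numeric_cols = ['age', 'salary']
medians = df[numeric_cols].median()
df_imputed = df.copy()
df_imputed[numeric_cols] = df_imputed[numeric_cols].fillna(medians)
print("\nAfter median imputation:")
print(df_imputed)
```

Whether to fill, drop, or flag missing values depends on the dataset and the model, so this should be read as one option rather than a default.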

Grouping and Aggregating

Pandas makes it easy to group data and calculate statistics:

grouping.py
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'LA', 'NYC'],
    'sales': [100, 150, 120, 180, 90]
})

print("Original data:")
print(df)

# Group by city and calculate mean
grouped = df.groupby('city')['sales'].mean()
print("\nAverage sales by city:")
print(grouped)

# Multiple aggregations
agg_stats = df.groupby('city')['sales'].agg(['mean', 'sum', 'count'])
print("\nMultiple statistics:")
print(agg_stats)
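
When the grouped result feeds into later steps, it can help to give the output columns explicit names and turn the group key back into a regular column. The sketch below uses pandas named aggregation for this; the output column names avg_sales and total_sales are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'LA', 'NYC'],
    'sales': [100, 150, 120, 180, 90]
})

# Named aggregation: output_name=(input_column, function)
summary = (
    df.groupby('city')
      .agg(avg_sales=('sales', 'mean'), total_sales=('sales', 'sum'))
      .reset_index()  # move 'city' from the index back to a column
)
print(summary)
```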

💡 Key Takeaway

Pandas is your go-to tool for data preparation before machine learning. In many ML projects, a large share of the work is cleaning and exploring data, so fluency with pandas pays off quickly.
