Introduction to Pandas

Pandas is a widely used Python library for data manipulation and analysis. Its two core data structures, the one-dimensional Series and the two-dimensional DataFrame, make working with structured data intuitive and efficient.

While NumPy works with arrays indexed by position, pandas works with labeled data: rows and columns have names, much like a spreadsheet in Python. This makes it well suited to cleaning, transforming, and analyzing real-world datasets before feeding them into machine learning models.
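To make the contrast concrete, here is a minimal sketch comparing positional access in a NumPy array with label-based access in a pandas Series. The temperature values and city labels are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures for three cities (illustrative values only)
temps_array = np.array([22.5, 30.1, 18.4])
temps_series = pd.Series([22.5, 30.1, 18.4], index=['NYC', 'LA', 'Chicago'])

# NumPy: elements are addressed by position
print(temps_array[1])

# pandas: the same value can be addressed by its label
print(temps_series['LA'])

# The underlying values are still available as a NumPy array
print(temps_series.to_numpy())
```

The Series carries the same numbers as the array, but the labels travel with the data through filtering and sorting.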

Why Pandas for Machine Learning?

Pandas is essential for ML because:

  • Data Cleaning: Handle missing values, remove duplicates, fix inconsistencies
  • Data Exploration: Quickly understand your data with summary statistics
  • Data Transformation: Reshape, filter, and transform data easily
  • Integration: Works seamlessly with NumPy and scikit-learn
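The four points above can be sketched in a few lines. The tiny dataset below is invented for illustration, and the column names (including the derived `salary_k`) are assumptions rather than part of any real dataset.

```python
import numpy as np
import pandas as pd

# Invented data with one duplicate row and one missing value
raw = pd.DataFrame({
    'age': [25, 30, 30, np.nan],
    'salary': [50000, 60000, 60000, 55000],
})

# Data cleaning: drop the duplicate row, then fill the missing age with the median
clean = raw.drop_duplicates()
clean = clean.fillna({'age': clean['age'].median()})

# Data exploration: summary statistics for every numeric column
print(clean.describe())

# Data transformation: add a derived column
clean['salary_k'] = clean['salary'] / 1000

# Integration: hand the values to NumPy (and from there to an ML library)
X = clean[['age', 'salary_k']].to_numpy()
print(X.shape)
```

Libraries such as scikit-learn generally accept either the DataFrame itself or the NumPy array produced by `to_numpy()`.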

Creating DataFrames

A DataFrame is a 2D labeled data structure with columns of potentially different types. You can create one from dictionaries, lists, or CSV files:

creating_dataframes.py
```python
import pandas as pd

# Create DataFrame from dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'Chicago', 'NYC'],
    'salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

print("DataFrame:")
print(df)
print("\nShape:", df.shape)  # (rows, columns)
print("Columns:", df.columns.tolist())
print("Data types:")
print(df.dtypes)
```
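The example above builds a DataFrame from a dictionary. As a rough sketch of the other two routes mentioned, the snippet below builds one from lists and one from CSV text. The CSV content is made up, and `io.StringIO` stands in for a file on disk so the example stays self-contained.

```python
import io
import pandas as pd

# From a list of dictionaries (one dictionary per row)
records = [
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30},
]
df_records = pd.DataFrame(records)

# From a list of lists, with column names supplied separately
rows = [['Alice', 25], ['Bob', 30]]
df_rows = pd.DataFrame(rows, columns=['name', 'age'])

# From CSV text; io.StringIO stands in for a file on disk here.
# With a real file you would pass its path to pd.read_csv instead.
csv_text = "name,age\nAlice,25\nBob,30\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_records.equals(df_rows))
print(df_csv)
```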

Basic DataFrame Operations

Pandas provides many convenient methods for exploring and manipulating data:

dataframe_operations.py
```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})

# View first few rows
print("First 2 rows:")
print(df.head(2))

# Get basic statistics (numeric columns only by default)
print("\nSummary statistics:")
print(df.describe())

# Access a column by name (returns a Series)
print("\nAge column:")
print(df['age'])

# Access a row by integer position
print("\nFirst row:")
print(df.iloc[0])
```
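`iloc` selects by integer position. Its label-based counterpart is `loc`, and the difference is easiest to see when the rows have meaningful labels. The sketch below assumes hypothetical row labels (`'alice'`, `'bob'`, `'charlie'`) that are not part of the example above.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [25, 30, 35], 'salary': [50000, 60000, 70000]},
    index=['alice', 'bob', 'charlie'],  # hypothetical row labels
)

# iloc: select by integer position (second row)
by_position = df.iloc[1]

# loc: select by row label
by_label = df.loc['bob']

# loc also accepts a row label and a column label together
bob_salary = df.loc['bob', 'salary']

print(by_position.equals(by_label))
print(bob_salary)
```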

Filtering and Selecting Data

Pandas makes it easy to filter and select specific rows and columns:

filtering.py
```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000]
})

# Filter rows where age > 28
older = df[df['age'] > 28]
print("People older than 28:")
print(older)

# Filter with multiple conditions
high_salary = df[(df['age'] > 28) & (df['salary'] > 55000)]
print("\nAge > 28 AND salary > 55000:")
print(high_salary)

# Select specific columns
print("\nName and salary columns:")
print(df[['name', 'salary']])
```
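Beyond `&`, a few other filtering forms come up often. The following is a small sketch, not an exhaustive list, using an illustrative `city` column. Note that boolean masks are combined with `&`, `|`, and `~` rather than Python's `and`, `or`, and `not`.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'Chicago', 'NYC'],
})

# OR: each condition needs its own parentheses
young_or_la = df[(df['age'] < 28) | (df['city'] == 'LA')]

# NOT: invert a boolean mask with ~
not_nyc = df[~(df['city'] == 'NYC')]

# isin: match against a list of allowed values
coastal = df[df['city'].isin(['NYC', 'LA'])]

# query: the same kind of filter written as a string expression
over_28 = df.query('age > 28')

print(young_or_la['name'].tolist())
print(not_nyc['name'].tolist())
print(coastal['name'].tolist())
print(over_28['name'].tolist())
```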

Handling Missing Data

Real-world data often has missing values. Pandas provides tools to handle them:

missing_data.py
```python
import pandas as pd
import numpy as np

# DataFrame with missing values
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, np.nan, 35],
    'salary': [50000, 60000, np.nan]
})
print("DataFrame with missing values:")
print(df)

# Check for missing values
print("\nMissing values:")
print(df.isnull())

# Fill missing values
df_filled = df.fillna(0)  # Fill with 0
print("\nAfter filling with 0:")
print(df_filled)

# Drop rows with missing values
df_dropped = df.dropna()
print("\nAfter dropping rows with missing values:")
print(df_dropped)
```
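Filling with 0 is simple, but a value such as an age of 0 can distort later statistics. One common alternative, sketched below under the assumption that a column mean is a reasonable stand-in for your data, is to fill each numeric column with its own mean and to drop rows only when a specific column is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, np.nan, 35],
    'salary': [50000, 60000, np.nan],
})

# Count missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)

# Fill each numeric column with its own mean (computed ignoring NaN)
df_mean_filled = df.fillna({
    'age': df['age'].mean(),
    'salary': df['salary'].mean(),
})
print(df_mean_filled)

# Drop a row only if a specific column is missing
df_has_age = df.dropna(subset=['age'])
print(df_has_age)
```

Whether to fill or drop depends on the dataset; the mean is used here only as an example strategy.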

Grouping and Aggregating

Pandas makes it easy to group data and calculate statistics:

grouping.py
```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'LA', 'NYC'],
    'sales': [100, 150, 120, 180, 90]
})
print("Original data:")
print(df)

# Group by city and calculate mean
grouped = df.groupby('city')['sales'].mean()
print("\nAverage sales by city:")
print(grouped)

# Multiple aggregations
agg_stats = df.groupby('city')['sales'].agg(['mean', 'sum', 'count'])
print("\nMultiple statistics:")
print(agg_stats)
```
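When several columns need different statistics, named aggregation gives each result column an explicit name. The sketch below adds a hypothetical `units` column to the sales data; the output column names (`avg_sales`, `total_units`, `n_rows`) are also chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'LA', 'NYC'],
    'sales': [100, 150, 120, 180, 90],
    'units': [10, 12, 11, 15, 8],  # hypothetical second column
})

# Named aggregation: result_name=(source_column, function)
summary = df.groupby('city').agg(
    avg_sales=('sales', 'mean'),
    total_units=('units', 'sum'),
    n_rows=('sales', 'count'),
)
print(summary)

# reset_index turns the group labels back into a regular column
flat = summary.reset_index()
print(flat.columns.tolist())
```

A flat table like `flat` is often easier to merge back into other DataFrames or pass on as model features.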

💡 Key Takeaway

Pandas is your go-to tool for data preparation before machine learning. Cleaning and exploring data often takes up a large share of the time in an ML project, so mastering pandas pays off quickly.
