Introduction to Pandas
Pandas is one of the most widely used Python libraries for data manipulation and analysis. Its two core data structures, the DataFrame (a table) and the Series (a single column), make working with structured data intuitive and efficient.
While NumPy works with arrays, pandas works with labeled data: rows and columns have names, much like a spreadsheet in Excel. This makes it well suited to cleaning, transforming, and analyzing real-world datasets before feeding them into machine learning models.
Why Pandas for Machine Learning?
Pandas is essential for ML because:
- Data Cleaning: Handle missing values, remove duplicates, fix inconsistencies
- Data Exploration: Quickly understand your data with summary statistics
- Data Transformation: Reshape, filter, and transform data easily
- Integration: Works seamlessly with NumPy and scikit-learn
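To make the integration point concrete, here is a minimal sketch of handing pandas data to NumPy. The toy data and the `bought` target column are made up for illustration; they are not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000],
    'bought': [0, 1, 1, 0],
})

# Feature matrix (2D) and target vector (1D) as NumPy arrays.
X = df[['age', 'salary']].to_numpy()
y = df['bought'].to_numpy()

print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```

scikit-learn estimators generally accept arrays like `X` and `y`, and in most cases the DataFrame columns themselves, so the conversion is often optional.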
Creating DataFrames
A DataFrame is a 2D labeled data structure with columns of potentially different types. You can create one from dictionaries, lists, or CSV files:
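import pandas as pd
# Each dictionary key becomes a column name; each list holds that column's values.
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'Chicago', 'NYC'],
    'salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
print("DataFrame:")
print(df)
# shape is a (rows, columns) tuple
print("\nShape:", df.shape)
print("Columns:", df.columns.tolist())
# dtypes lists the data type pandas inferred for each column
print("Data types:")
print(df.dtypes)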
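The example above covers the dictionary case. As a rough sketch of the other two sources mentioned, the block below builds a DataFrame from a list of row dictionaries and from CSV text. The `io.StringIO` object is only a stand-in for a real file, and the sample rows are made up.

```python
import io
import pandas as pd

# From a list of dictionaries (one dictionary per row).
records = [
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30},
]
df_records = pd.DataFrame(records)

# From CSV text. io.StringIO stands in for a file on disk here;
# with a real file you would pass its path to pd.read_csv instead.
csv_text = "name,age\nCharlie,35\nDiana,28\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_records)
print(df_csv)
```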
Basic DataFrame Operations
Pandas provides many convenient methods for exploring and manipulating data:
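import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})
# head(n) returns the first n rows (5 if n is omitted)
print("First 2 rows:")
print(df.head(2))
# describe() reports count, mean, std, min, quartiles, and max for numeric columns
print("\nSummary statistics:")
print(df.describe())
# Selecting one column by name returns a Series
print("\nAge column:")
print(df['age'])
# iloc selects by integer position, so iloc[0] is the first row
print("\nFirst row:")
print(df.iloc[0])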
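Position-based selection with `iloc` has a label-based counterpart, `loc`. The sketch below contrasts the two; the string row labels are an assumption added for illustration, since the lesson's DataFrames use the default integer index.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [25, 30, 35], 'salary': [50000, 60000, 70000]},
    index=['alice', 'bob', 'charlie'],  # hypothetical row labels
)

# iloc selects by integer position; loc selects by label.
by_position = df.iloc[1]
by_label = df.loc['bob']
print(by_position.equals(by_label))

# A single cell: row label first, then column label.
print(df.loc['charlie', 'salary'])

# info() prints column dtypes and non-null counts.
df.info()
```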
Filtering and Selecting Data
Pandas makes it easy to filter and select specific rows and columns:
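import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000]
})
# df['age'] > 28 is a boolean Series; indexing with it keeps only the True rows
older = df[df['age'] > 28]
print("People older than 28:")
print(older)
# Combine conditions with & and wrap each one in parentheses
high_salary = df[(df['age'] > 28) & (df['salary'] > 55000)]
print("\nAge > 28 AND salary > 55000:")
print(high_salary)
# A list of column names selects several columns and returns a DataFrame
print("\nName and salary columns:")
print(df[['name', 'salary']])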
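A few related filtering patterns, sketched on the same sample data: an OR condition, matching against a list of values, and filtering rows while picking columns in one step. The thresholds and names chosen are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000]
})

# OR uses |; each condition still needs its own parentheses.
young_or_top = df[(df['age'] < 28) | (df['salary'] >= 70000)]
print(young_or_top['name'].tolist())

# isin() matches against a list of allowed values.
selected = df[df['name'].isin(['Bob', 'Diana'])]
print(selected['name'].tolist())

# loc filters rows and picks columns in one step.
names_over_28 = df.loc[df['age'] > 28, ['name', 'salary']]
print(names_over_28)
```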
Handling Missing Data
Real-world data often has missing values. Pandas provides tools to detect, fill, or drop them:
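import pandas as pd
import numpy as np
# np.nan marks a missing value
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, np.nan, 35],
    'salary': [50000, 60000, np.nan]
})
print("DataFrame with missing values:")
print(df)
# isnull() returns a same-shaped DataFrame of True/False flags
print("\nMissing values:")
print(df.isnull())
# fillna(0) replaces every missing value with 0 (simple, but 0 is rarely a realistic age or salary)
df_filled = df.fillna(0)
print("\nAfter filling with 0:")
print(df_filled)
# dropna() removes every row that has at least one missing value
df_dropped = df.dropna()
print("\nAfter dropping rows with missing values:")
print(df_dropped)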
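Filling with 0 and dropping rows are the two simplest options. Another common choice is to count the gaps first and then fill each numeric column with its own mean. The sketch below shows one way to do that; whether the mean is appropriate depends on the dataset, so treat it as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, np.nan, 35],
    'salary': [50000, 60000, np.nan]
})

# Count missing values per column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Fill each numeric column with its own mean (mean() skips NaN by default).
numeric_cols = ['age', 'salary']
df_imputed = df.copy()
df_imputed[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
print(df_imputed)
```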
Grouping and Aggregating
Pandas makes it easy to group data and calculate statistics:
import pandas as pd
df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'LA', 'NYC'],
    'sales': [100, 150, 120, 180, 90]
})
print("Original data:")
print(df)
# Split the rows by city, then average the sales within each group
grouped = df.groupby('city')['sales'].mean()
print("\nAverage sales by city:")
print(grouped)
# agg() with a list of function names computes several statistics per group
agg_stats = df.groupby('city')['sales'].agg(['mean', 'sum', 'count'])
print("\nMultiple statistics:")
print(agg_stats)
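When the result feeds into further processing, it can help to name the output columns and turn the group key back into a regular column. The sketch below uses named aggregation for that; the output names `total_sales` and `n_orders` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'LA', 'NYC'],
    'sales': [100, 150, 120, 180, 90]
})

# Named aggregation: output_column=(input_column, function).
summary = (
    df.groupby('city')
      .agg(total_sales=('sales', 'sum'), n_orders=('sales', 'count'))
      .reset_index()  # turn the 'city' index back into a regular column
)
print(summary)
```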
💡 Key Takeaway
Pandas is your go-to tool for data preparation before machine learning. Much of the time in a typical ML project goes into cleaning and exploring data, so getting comfortable with pandas pays off quickly.