DataFrames and Series
Pandas has two main data structures: Series (one-dimensional) and DataFrame (two-dimensional). A Series is like a single labeled column, while a DataFrame is like a table made of several columns that share one row index. Understanding both is essential for data manipulation in ML.
Series are useful for single-variable operations, while DataFrames hold tabular data with many variables, which is what you'll work with in most ML projects.
Understanding Series
A Series is a one-dimensional labeled array. Think of it as a single column from a spreadsheet:
import pandas as pd
ages = pd.Series([25, 30, 35, 28, 32])
print("Series:")
print(ages)
print("\nData type:", ages.dtype)
print("Mean:", ages.mean())
print("Max:", ages.max())
names = pd.Series(['Alice', 'Bob', 'Charlie'], index=['a', 'b', 'c'])
print("\nSeries with custom index:")
print(names)
print("Access by index:", names['a'])
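Because a Series carries labels, there are two ways to look up a value: by label or by position. The sketch below uses made-up data (the names and ages are assumptions for illustration) to show `.loc` for labels and `.iloc` for positions:

```python
import pandas as pd

# Hypothetical data: the dict keys become the index labels
ages = pd.Series({'alice': 25, 'bob': 30, 'charlie': 35})

by_label = ages.loc['bob']               # label-based lookup
by_position = ages.iloc[0]               # position-based lookup (first element)
subset = ages.loc[['alice', 'charlie']]  # several labels at once returns a Series

print("By label:", by_label)
print("By position:", by_position)
print("Subset:")
print(subset)
```

Using `.loc` and `.iloc` explicitly keeps the intent clear when the index itself contains integers.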
Working with DataFrames
DataFrames are 2D structures with rows and columns. They're the primary tool for working with structured data:
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000]
})
print("DataFrame:")
print(df)
print("\nAge column (Series):")
print(df['age'])
print("Type:", type(df['age']))
print("\nFirst row:")
print(df.iloc[0])
print("\nName and salary:")
print(df[['name', 'salary']])
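Selecting rows by condition is a common next step after selecting columns. The following is a minimal sketch that reuses the same example table; the age threshold of 28 is an arbitrary choice for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000]
})

# A comparison on a column yields one True/False value per row
over_28 = df[df['age'] > 28]

# .loc takes a row selector first, then a column selector
names_over_28 = df.loc[df['age'] > 28, 'name']

# Single brackets return a Series, double brackets return a DataFrame
age_series = df['age']
age_frame = df[['age']]

print(over_28)
print(names_over_28.tolist())
print(type(age_series).__name__, type(age_frame).__name__)
```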
Series Operations
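Like NumPy arrays, Series support vectorized arithmetic, aggregation, and boolean filtering. Arithmetic between two Series is element-wise and matches values by index label: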
import pandas as pd
s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([10, 20, 30, 40, 50])
print("Series 1:", s1)
print("Series 2:", s2)
print("\nOperations:")
print("Addition:", s1 + s2)
print("Multiplication:", s1 * 2)
print("Sum:", s1.sum())
print("Mean:", s1.mean())
print("\nValues > 3:", s1[s1 > 3])
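One difference from NumPy is worth seeing directly: Series arithmetic pairs values by index label rather than by position. The sketch below uses two small Series with made-up labels that only partly overlap:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Labels present in only one Series ('x' and 'w') produce NaN
total = a + b

# add() with fill_value treats a missing partner as 0 instead
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```

If both Series use the default integer index, as in the example above, label alignment and positional alignment give the same result.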
DataFrame Methods
DataFrames have many useful methods for data exploration and manipulation:
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})
print("DataFrame info:")
df.info()  # info() prints its report directly and returns None
print("\nSummary statistics:")
print(df.describe())
print("\nFirst 2 rows:")
print(df.head(2))
print("\nShape (rows, columns):", df.shape)
print("Column names:", df.columns.tolist())
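The result of `describe()` is itself a DataFrame, so individual statistics can be pulled out with the same indexing tools. A short sketch using the same example table:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})

# By default describe() summarizes only the numeric columns
stats = df.describe()

# Rows are statistic names, columns are the original column names
mean_age = stats.loc['mean', 'age']
max_salary = stats.loc['max', 'salary']

print("Summarized columns:", stats.columns.tolist())
print("Mean age:", mean_age)
print("Max salary:", max_salary)
```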
Converting Between Series and DataFrame
Use to_frame() to turn a Series into a one-column DataFrame, and select a single column to get a Series back:
import pandas as pd
s = pd.Series([1, 2, 3, 4], name='values')
df_from_series = s.to_frame()
print("Series converted to DataFrame:")
print(df_from_series)
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
series_from_df = df['col1']
print("\nDataFrame column as Series:")
print(series_from_df)
print("Type:", type(series_from_df))
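Two further conversions can be handy, shown here as a sketch: `to_frame()` accepts a `name` argument to set the column name, and `squeeze()` collapses a one-column DataFrame into a Series. The column name `score` is a hypothetical choice for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], name='values')

# Override the column name during conversion ('score' is hypothetical)
renamed = s.to_frame(name='score')

# A one-column DataFrame can be squeezed back into a Series
one_col = pd.DataFrame({'col1': [1, 2, 3]})
back_to_series = one_col.squeeze('columns')

print(renamed.columns.tolist())
print(type(back_to_series).__name__, back_to_series.name)
```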
💡 Key Insight
In pandas, a DataFrame is essentially a collection of Series (columns) that share the same index. Selecting a column returns a Series, so every Series operation also works on a DataFrame column. Understanding this relationship helps you work more effectively with pandas.
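To make this relationship concrete, the sketch below builds a DataFrame from two separately created Series and then walks back over its columns. The labels and values are made up for illustration:

```python
import pandas as pd

# Two Series with the same labels listed in a different order
age = pd.Series([25, 30], index=['alice', 'bob'])
salary = pd.Series([60000, 50000], index=['bob', 'alice'])

# The constructor lines the rows up by label, not by position
df = pd.DataFrame({'age': age, 'salary': salary})
print(df)

# Iterating with items() yields each column name with its Series
for col_name, column in df.items():
    print(col_name, type(column).__name__)
```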