Chapter 2: Data Fundamentals / Lesson 6

Understanding Data

What is Data in Machine Learning?

Data is the foundation of machine learning. Without data, there's nothing for algorithms to learn from. In ML, data comes in many forms: numbers, text, images, audio, and more. Understanding your data is the first and most crucial step in any ML project.

Data can be structured (like spreadsheets with rows and columns) or unstructured (like images, text documents, or videos). The type of data you have determines which ML techniques you can use.
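To make the distinction concrete, here is a minimal sketch contrasting a structured record with unstructured text (the record fields and listing text are hypothetical examples, not part of any real dataset):

```python
# A structured record: fixed, named fields, like one row of a spreadsheet
structured_record = {"size_sqft": 1200, "bedrooms": 2, "price": 250000}

# Unstructured data: raw text with no predefined fields
unstructured_text = "Charming 2-bedroom home, 1200 sq ft, close to downtown."

# Structured data can be queried by field name directly...
print("Bedrooms:", structured_record["bedrooms"])

# ...while unstructured data needs processing (e.g. tokenization) first
tokens = unstructured_text.lower().split()
print("Token count:", len(tokens))
```

Notice that the "number of bedrooms" is one dictionary lookup in the structured case, but extracting it from the listing text would require parsing; that extra processing step is what makes unstructured data harder to use.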

Types of Data

Let's explore the different types of data you'll encounter in ML:

data_types.py
# Understanding different data types in ML

# 1. Numerical Data (Continuous)
# Examples: height, weight, temperature, price
numerical_data = [175.5, 68.2, 22.3, 99.99]
print("Numerical Data:", numerical_data)
print("Type: Continuous - can take any value in a range")

# 2. Categorical Data (Discrete)
# Examples: color, city, product category
categorical_data = ["red", "blue", "green", "red"]
print("\nCategorical Data:", categorical_data)
print("Type: Discrete - limited set of values")

# 3. Ordinal Data (Ordered categories)
# Examples: rating (1-5), size (S, M, L, XL)
ordinal_data = ["small", "medium", "large", "small"]
print("\nOrdinal Data:", ordinal_data)
print("Type: Ordered categories with meaningful order")

# 4. Binary Data (Yes/No, True/False)
binary_data = [True, False, True, True]
print("\nBinary Data:", binary_data)
print("Type: Two possible values")

# 5. Text Data
text_data = ["Hello world", "Machine learning", "Data science"]
print("\nText Data:", text_data)
print("Type: Unstructured text that needs special processing")

Understanding Data Structure

Data in ML is typically organized in a structured format. Let's see how data is represented:

data_structure.py
# Understanding data structure in ML

# Example: house price prediction dataset.
# Each row is an example (instance), each column is a feature.

# Features (X) - input variables
features = {
    "size_sqft": [1200, 1500, 2000, 1800, 1600],
    "bedrooms": [2, 3, 4, 3, 3],
    "age_years": [5, 10, 2, 15, 8],
    "location": ["urban", "suburban", "urban", "rural", "suburban"]
}

# Target (y) - what we want to predict
target = [250000, 320000, 450000, 280000, 350000]

print("Features (Input):")
for feature, values in features.items():
    print(f"  {feature}: {values}")

print("\nTarget (Output to predict):")
print("  Price:", target)

print("\nData Structure:")
print("  - 5 examples (rows)")
print("  - 4 features (columns)")
print("  - 1 target variable")

Data Quality Characteristics

Good data for ML should have certain characteristics. Let's examine them:

data_quality.py
# Understanding data quality characteristics

# 1. Completeness - no missing values
complete_data = [1, 2, 3, 4, 5]
incomplete_data = [1, None, 3, 4, None]
print("Complete data:", complete_data)
print("Incomplete data:", incomplete_data)
print("Missing values need to be handled!")

# 2. Consistency - same format and units
consistent_temperatures = [20, 25, 30, 22, 28]          # all in Celsius
inconsistent_temperatures = [20, "77F", 30, 22, "82F"]  # mixed units
print("\nConsistent:", consistent_temperatures)
print("Inconsistent:", inconsistent_temperatures)

# 3. Accuracy - correct and reliable
accurate_ages = [25, 30, 35, 28, 32]     # plausible ages
inaccurate_ages = [25, 300, 35, -5, 32]  # impossible values
print("\nAccurate ages:", accurate_ages)
print("Inaccurate ages:", inaccurate_ages)
print("Outliers and errors need detection!")

# 4. Relevance - data relates to the problem
print("\nRelevance:")
print("  For house price prediction:")
print("  ✓ Relevant: size, location, age")
print("  ✗ Not relevant: owner's favorite color")

Data Size and Scale

Understanding the scale of your data is important for choosing the right ML approach:

data_scale.py
# Understanding data size and scale

# Small dataset
small_dataset = [1, 2, 3, 4, 5]
print("Small dataset:", len(small_dataset), "examples")
print("  Use case: Simple problems, quick prototyping")

# Medium dataset (simulated)
medium_dataset_size = 1000
print(f"\nMedium dataset: {medium_dataset_size} examples")
print("  Use case: Most ML projects, good for learning")

# Large dataset (simulated)
large_dataset_size = 100000
print(f"\nLarge dataset: {large_dataset_size} examples")
print("  Use case: Production systems, deep learning")

print("\nData Scale Considerations:")
print("  - More data usually = better model")
print("  - But requires more processing time")
print("  - Need more computational resources")

# Feature count
print("\nFeature Count:")
print("  Low (1-10): Simple models work well")
print("  Medium (10-100): Standard ML algorithms")
print("  High (100+): Feature selection may be needed")

Why Understanding Data Matters

Before building any ML model, you must understand your data. This helps you:

  • Choose the right algorithm: Different data types require different approaches
  • Identify problems early: Missing values, outliers, and inconsistencies
  • Engineer better features: Understanding data helps create meaningful features
  • Set realistic expectations: Data quality determines model performance limits

Remember: Garbage in, garbage out. No ML algorithm can perform well with poor quality data.
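The quality checks described above can be automated. Here is a minimal sketch that counts missing values and flags implausible entries in a hypothetical "age" column (the data and the 0–120 plausibility range are illustrative assumptions):

```python
# A hypothetical feature column; None marks a missing value
ages = [25, None, 35, 300, 32, None, -5]

# Completeness check: count missing entries
missing = sum(1 for v in ages if v is None)

# Accuracy check: flag values outside a plausible range for human ages
outliers = [v for v in ages if v is not None and not (0 <= v <= 120)]

print(f"Missing values: {missing}")
print(f"Suspicious values: {outliers}")
```

Running checks like these before training surfaces problems early, when they are cheap to fix, rather than after a model has silently learned from bad values.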

Exercise: Analyze Your Data

Complete the exercise in the code editor:

  • Task 1: Identify the data types in the sample dataset (numerical, categorical, etc.)
  • Task 2: Count how many examples and features are in the dataset
  • Task 3: Check for missing values and print how many are missing
  • Task 4: Calculate basic statistics (mean, min, max) for numerical features

Write your code to complete all tasks and run it to see the results!
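If you get stuck, the tasks might be approached along these lines. Note that the dataset below is a made-up stand-in with two features; your exercise dataset will differ, so adapt the field names accordingly:

```python
# Hypothetical sample dataset: one numerical and one categorical feature
dataset = {
    "age": [25, 30, None, 28, 32],                      # numerical
    "city": ["Paris", "Lyon", "Paris", None, "Nice"],   # categorical
}

# Task 2: count examples (rows) and features (columns)
n_examples = len(next(iter(dataset.values())))
n_features = len(dataset)
print(f"{n_examples} examples, {n_features} features")

# Task 3: count missing values per feature
for name, values in dataset.items():
    print(f"{name} missing: {sum(v is None for v in values)}")

# Task 4: basic statistics for the numerical feature (skip missing values)
ages = [v for v in dataset["age"] if v is not None]
print("mean:", sum(ages) / len(ages), "min:", min(ages), "max:", max(ages))
```

Task 1 is done by inspection: "age" is numerical (continuous) and "city" is categorical.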

💡 Learning Tip

Practice is essential. Try modifying the code examples, experiment with different parameters, and see how changes affect the results. Hands-on experience is the best teacher!

🎉

Lesson Complete!

Great work! Continue to the next lesson.
