Chapter 2: Data Fundamentals / Lesson 6

Understanding Data

What is Data in Machine Learning?

Data is the foundation of machine learning. Without data, there's nothing for algorithms to learn from. In ML, data comes in many forms: numbers, text, images, audio, and more. Understanding your data is the first and most crucial step in any ML project.

Data can be structured (like spreadsheets with rows and columns) or unstructured (like images, text documents, or videos). The type of data you have determines which ML techniques you can use.
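To make the distinction concrete, here is a minimal sketch contrasting a structured record with unstructured text (the record fields and listing text are hypothetical examples, not part of any real dataset):

```python
# A structured record: fixed, named fields, like one row of a spreadsheet
structured_record = {"size_sqft": 1200, "bedrooms": 2, "price": 250000}

# Unstructured data: raw text with no predefined fields
unstructured_text = "Charming 2-bedroom home, 1200 sq ft, close to downtown."

# Structured data can be queried by field name directly...
print("Bedrooms:", structured_record["bedrooms"])

# ...while unstructured data needs processing (e.g. tokenization) first
tokens = unstructured_text.lower().split()
print("Token count:", len(tokens))
```

Notice that the "number of bedrooms" is one dictionary lookup in the structured case, but extracting it from the listing text would require parsing; that extra processing step is what makes unstructured data harder to use.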

Types of Data

Let's explore the different types of data you'll encounter in ML:

data_types.py
# Understanding different data types in ML

# 1. Numerical Data (Continuous)
# Examples: height, weight, temperature, price
numerical_data = [175.5, 68.2, 22.3, 99.99]
print("Numerical Data:", numerical_data)
print("Type: Continuous - can take any value in a range")

# 2. Categorical Data (Discrete)
# Examples: color, city, product category
categorical_data = ["red", "blue", "green", "red"]
print("\nCategorical Data:", categorical_data)
print("Type: Discrete - limited set of values")

# 3. Ordinal Data (Ordered categories)
# Examples: rating (1-5), size (S, M, L, XL)
ordinal_data = ["small", "medium", "large", "small"]
print("\nOrdinal Data:", ordinal_data)
print("Type: Ordered categories with meaningful order")

# 4. Binary Data (Yes/No, True/False)
binary_data = [True, False, True, True]
print("\nBinary Data:", binary_data)
print("Type: Two possible values")

# 5. Text Data
text_data = ["Hello world", "Machine learning", "Data science"]
print("\nText Data:", text_data)
print("Type: Unstructured text that needs special processing")

Understanding Data Structure

Data in ML is typically organized in a structured format. Let's see how data is represented:

data_structure.py
# Understanding data structure in ML

# Example: house price prediction dataset.
# Each row is an example (instance), each column is a feature.

# Features (X) - input variables
features = {
    "size_sqft": [1200, 1500, 2000, 1800, 1600],
    "bedrooms": [2, 3, 4, 3, 3],
    "age_years": [5, 10, 2, 15, 8],
    "location": ["urban", "suburban", "urban", "rural", "suburban"]
}

# Target (y) - what we want to predict
target = [250000, 320000, 450000, 280000, 350000]

print("Features (Input):")
for feature, values in features.items():
    print(f"  {feature}: {values}")

print("\nTarget (Output to predict):")
print("  Price:", target)

print("\nData Structure:")
print("  - 5 examples (rows)")
print("  - 4 features (columns)")
print("  - 1 target variable")

Data Quality Characteristics

Good data for ML should have certain characteristics. Let's examine them:

data_quality.py
# Understanding data quality characteristics

# 1. Completeness - no missing values
complete_data = [1, 2, 3, 4, 5]
incomplete_data = [1, None, 3, 4, None]
print("Complete data:", complete_data)
print("Incomplete data:", incomplete_data)
print("Missing values need to be handled!")

# 2. Consistency - same format and units
consistent_temperatures = [20, 25, 30, 22, 28]          # all in Celsius
inconsistent_temperatures = [20, "77F", 30, 22, "82F"]  # mixed units
print("\nConsistent:", consistent_temperatures)
print("Inconsistent:", inconsistent_temperatures)

# 3. Accuracy - correct and reliable
accurate_ages = [25, 30, 35, 28, 32]     # plausible ages
inaccurate_ages = [25, 300, 35, -5, 32]  # impossible values
print("\nAccurate ages:", accurate_ages)
print("Inaccurate ages:", inaccurate_ages)
print("Outliers and errors need detection!")

# 4. Relevance - data relates to the problem
print("\nRelevance:")
print("  For house price prediction:")
print("  ✓ Relevant: size, location, age")
print("  ✗ Not relevant: owner's favorite color")

Data Size and Scale

Understanding the scale of your data is important for choosing the right ML approach:

data_scale.py
# Understanding data size and scale

# Small dataset
small_dataset = [1, 2, 3, 4, 5]
print("Small dataset:", len(small_dataset), "examples")
print("  Use case: Simple problems, quick prototyping")

# Medium dataset (simulated)
medium_dataset_size = 1000
print(f"\nMedium dataset: {medium_dataset_size} examples")
print("  Use case: Most ML projects, good for learning")

# Large dataset (simulated)
large_dataset_size = 100000
print(f"\nLarge dataset: {large_dataset_size} examples")
print("  Use case: Production systems, deep learning")

print("\nData Scale Considerations:")
print("  - More data usually = better model")
print("  - But requires more processing time")
print("  - Need more computational resources")

# Feature count
print("\nFeature Count:")
print("  Low (1-10): Simple models work well")
print("  Medium (10-100): Standard ML algorithms")
print("  High (100+): Feature selection may be needed")

Why Understanding Data Matters

Before building any ML model, you must understand your data. This helps you:

  • Choose the right algorithm: Different data types require different approaches
  • Identify problems early: Missing values, outliers, and inconsistencies
  • Engineer better features: Understanding data helps create meaningful features
  • Set realistic expectations: Data quality determines model performance limits

Remember: Garbage in, garbage out. No ML algorithm can perform well with poor quality data.
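The quality checks described above can be automated. Here is a minimal sketch that counts missing values and flags implausible entries in a hypothetical "age" column (the data and the 0–120 plausibility range are illustrative assumptions):

```python
# A hypothetical feature column; None marks a missing value
ages = [25, None, 35, 300, 32, None, -5]

# Completeness check: count missing entries
missing = sum(1 for v in ages if v is None)

# Accuracy check: flag values outside a plausible range for human ages
outliers = [v for v in ages if v is not None and not (0 <= v <= 120)]

print(f"Missing values: {missing}")
print(f"Suspicious values: {outliers}")
```

Running checks like these before training surfaces problems early, when they are cheap to fix, rather than after a model has silently learned from bad values.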

Exercise: Analyze Your Data

Complete the exercise in the code editor:

  • Task 1: Identify the data types in the sample dataset (numerical, categorical, etc.)
  • Task 2: Count how many examples and features are in the dataset
  • Task 3: Check for missing values and print how many are missing
  • Task 4: Calculate basic statistics (mean, min, max) for numerical features

Write your code to complete all tasks and run it to see the results!
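If you get stuck, the tasks might be approached along these lines. Note that the dataset below is a made-up stand-in with two features; your exercise dataset will differ, so adapt the field names accordingly:

```python
# Hypothetical sample dataset: one numerical and one categorical feature
dataset = {
    "age": [25, 30, None, 28, 32],                      # numerical
    "city": ["Paris", "Lyon", "Paris", None, "Nice"],   # categorical
}

# Task 2: count examples (rows) and features (columns)
n_examples = len(next(iter(dataset.values())))
n_features = len(dataset)
print(f"{n_examples} examples, {n_features} features")

# Task 3: count missing values per feature
for name, values in dataset.items():
    print(f"{name} missing: {sum(v is None for v in values)}")

# Task 4: basic statistics for the numerical feature (skip missing values)
ages = [v for v in dataset["age"] if v is not None]
print("mean:", sum(ages) / len(ages), "min:", min(ages), "max:", max(ages))
```

Task 1 is done by inspection: "age" is numerical (continuous) and "city" is categorical.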

💡 Learning Tip

Practice is essential. Try modifying the code examples, experiment with different parameters, and see how changes affect the results. Hands-on experience is the best teacher!

🎉

Lesson Complete!

Great work! Continue to the next lesson.
