What is Data in Machine Learning?
Data is the foundation of machine learning. Without data, there's nothing for algorithms to learn from. In ML, data comes in many forms: numbers, text, images, audio, and more. Understanding your data is the first and most crucial step in any ML project.
Data can be structured (like spreadsheets with rows and columns) or unstructured (like images, text documents, or videos). The type of data you have determines which ML techniques you can use.
Types of Data
Let's explore the different types of data you'll encounter in ML:
numerical_data = [175.5, 68.2, 22.3, 99.99]
print("Numerical Data:", numerical_data)
print("Type: Continuous - can take any value in a range")
categorical_data = ["red", "blue", "green", "red"]
print("\nCategorical Data:", categorical_data)
print("Type: Nominal - limited set of values with no inherent order")
ordinal_data = ["small", "medium", "large", "small"]
print("\nOrdinal Data:", ordinal_data)
print("Type: Ordered categories with meaningful order")
binary_data = [True, False, True, True]
print("\nBinary Data:", binary_data)
print("Type: Two possible values")
text_data = ["Hello world", "Machine learning", "Data science"]
print("\nText Data:", text_data)
print("Type: Unstructured text that needs special processing")
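One reason these distinctions matter is that each type is usually converted to numbers in a different way before a model can use it. The sketch below is illustrative only: the helper names `one_hot_encode` and `ordinal_encode` are hypothetical, and the size order passed to `ordinal_encode` is an assumption about what the categories mean.

```python
colors = ["red", "blue", "green", "red"]        # nominal: no natural order
sizes = ["small", "medium", "large", "small"]   # ordinal: ordered categories


def one_hot_encode(values):
    """Turn each nominal value into a 0/1 vector with one slot per category."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, vectors


def ordinal_encode(values, order):
    """Replace each ordered category with its position in `order`."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


color_categories, color_vectors = one_hot_encode(colors)
# Assumed order: small < medium < large
size_ranks = ordinal_encode(sizes, order=["small", "medium", "large"])

print("Color categories:", color_categories)
print("One-hot colors:", color_vectors)
print("Ordinal sizes:", size_ranks)
```

One-hot vectors avoid implying that "red" is greater than "blue", while the ordinal ranks keep the ordering of the sizes.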
Understanding Data Structure
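For most classical ML algorithms, data is organized as a table: each row is one example, each column is a feature, and a separate target holds the value to predict. Let's see how this is represented: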
features = {
    "size_sqft": [1200, 1500, 2000, 1800, 1600],
    "bedrooms": [2, 3, 4, 3, 3],
    "age_years": [5, 10, 2, 15, 8],
    "location": ["urban", "suburban", "urban", "rural", "suburban"]
}
target = [250000, 320000, 450000, 280000, 350000]
print("Features (Input):")
for feature, values in features.items():
    print(f" {feature}: {values}")
print("\nTarget (Output to predict):")
print(" Price: ", target)
print("\nData Structure:")
print(" - 5 examples (rows)")
print(" - 4 features (columns)")
print(" - 1 target variable")
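The counts above are typed in by hand. As a minimal sketch, the same numbers can be derived from the data itself, and the column-oriented dictionary can be turned into one record per example. The variable names `rows`, `n_examples`, and `n_features` are chosen here for illustration.

```python
features = {
    "size_sqft": [1200, 1500, 2000, 1800, 1600],
    "bedrooms": [2, 3, 4, 3, 3],
    "age_years": [5, 10, 2, 15, 8],
    "location": ["urban", "suburban", "urban", "rural", "suburban"],
}
target = [250000, 320000, 450000, 280000, 350000]

feature_names = list(features)
n_features = len(feature_names)
n_examples = len(target)

# Every column must hold exactly one value per example.
assert all(len(column) == n_examples for column in features.values())

# zip(*columns) regroups the columns into one tuple of values per example.
rows = [dict(zip(feature_names, values)) for values in zip(*features.values())]

print(f"{n_examples} examples (rows), {n_features} features (columns)")
for row, price in zip(rows, target):
    print(row, "->", price)
```

Deriving the shape this way keeps the summary correct if examples or features are added later.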
Data Quality Characteristics
Good data for ML should have certain characteristics. Let's examine them:
complete_data = [1, 2, 3, 4, 5]
incomplete_data = [1, None, 3, 4, None]
print("Complete data:", complete_data)
print("Incomplete data:", incomplete_data)
print("Missing values need to be handled!")
consistent_temperatures = [20, 25, 30, 22, 28]
inconsistent_temperatures = [20, "77F", 30, 22, "82F"]
print("\nConsistent:", consistent_temperatures)
print("Inconsistent:", inconsistent_temperatures)
print("Mixed units and types need to be standardized!")
accurate_ages = [25, 30, 35, 28, 32]
inaccurate_ages = [25, 300, 35, -5, 32]
print("\nAccurate ages:", accurate_ages)
print("Inaccurate ages:", inaccurate_ages)
print("Outliers and errors need detection!")
print("\nRelevance:")
print(" For house price prediction:")
print(" ✓ Relevant: size, location, age")
print(" ✗ Not relevant: owner's favorite color")
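The examples above show what quality problems look like. The sketch below shows one possible way to handle each of them. It rests on assumptions that are not part of the lesson: missing values are filled with the mean of the present values, bare numbers are treated as Celsius while strings ending in "F" are treated as Fahrenheit, and ages outside 0 to 120 are considered suspicious. Other choices may suit your data better.

```python
incomplete_data = [1, None, 3, 4, None]
inconsistent_temperatures = [20, "77F", 30, 22, "82F"]
inaccurate_ages = [25, 300, 35, -5, 32]

# Completeness: count the missing values, then fill them with the mean
# of the values that are present (one simple strategy among many).
present = [v for v in incomplete_data if v is not None]
missing_count = len(incomplete_data) - len(present)
mean_value = sum(present) / len(present)
filled_data = [mean_value if v is None else v for v in incomplete_data]


# Consistency: convert every reading to Celsius.
# Assumption: bare numbers are Celsius, "<number>F" strings are Fahrenheit.
def to_celsius(reading):
    if isinstance(reading, str) and reading.endswith("F"):
        return round((float(reading[:-1]) - 32) * 5 / 9, 1)
    return float(reading)


standardized_temperatures = [to_celsius(t) for t in inconsistent_temperatures]

# Accuracy: flag ages outside an assumed plausible range.
MIN_AGE, MAX_AGE = 0, 120
suspicious_ages = [a for a in inaccurate_ages if not MIN_AGE <= a <= MAX_AGE]

print("Missing values:", missing_count)
print("Filled data:", filled_data)
print("Standardized temperatures (C):", standardized_temperatures)
print("Suspicious ages:", suspicious_ages)
```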
Data Size and Scale
Understanding the scale of your data is important for choosing the right ML approach:
small_dataset = [1, 2, 3, 4, 5]
print("Small dataset:", len(small_dataset), "examples")
print(" Use case: Simple problems, quick prototyping")
medium_dataset_size = 1000
print(f"\nMedium dataset: {medium_dataset_size} examples")
print(" Use case: Most ML projects, good for learning")
large_dataset_size = 100000
print(f"\nLarge dataset: {large_dataset_size} examples")
print(" Use case: Production systems, deep learning")
print("\nData Scale Considerations:")
print(" - More data usually helps, provided it is representative and accurate")
print(" - But requires more processing time")
print(" - Need more computational resources")
print("\nFeature Count:")
print(" Low (1-10): Simple models work well")
print(" Medium (11-99): Standard ML algorithms")
print(" High (100+): Feature selection may be needed")
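To make these ranges concrete, here is a small sketch that buckets a dataset by its feature count and reports how many examples it has per feature. The cutoffs follow the lesson's rough ranges, with 10 counted as low and 100 as high, which is an assumption about the boundaries. The feature counts in `dataset_shapes` are made up for illustration, and examples per feature is only a quick sanity check, not a rule.

```python
def describe_feature_count(n_features):
    """Bucket a feature count using the lesson's rough ranges."""
    if n_features <= 10:
        return "low"
    if n_features < 100:
        return "medium"
    return "high"


# (examples, features) pairs: the example counts echo the lesson,
# the feature counts are hypothetical.
dataset_shapes = {
    "small": (5, 4),
    "medium": (1000, 25),
    "large": (100000, 300),
}

summary = {}
for name, (n_examples, n_features) in dataset_shapes.items():
    summary[name] = {
        "feature_bucket": describe_feature_count(n_features),
        "examples_per_feature": n_examples / n_features,
    }
    print(
        f"{name}: {n_examples} examples, {n_features} features "
        f"({summary[name]['feature_bucket']}), "
        f"{summary[name]['examples_per_feature']:.1f} examples per feature"
    )
```

A very low number of examples per feature, as in the small dataset here, is a hint that a model may memorize the data instead of learning general patterns.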
Why Understanding Data Matters
Before building any ML model, you must understand your data. This helps you:
- Choose the right algorithm: Different data types require different approaches
- Identify problems early: Missing values, outliers, and inconsistencies
- Engineer better features: Understanding data helps create meaningful features
- Set realistic expectations: Data quality determines model performance limits
Remember: garbage in, garbage out. No ML algorithm can perform well on poor-quality data.
Exercise: Analyze Your Data
Complete the following tasks:
- Task 1: Identify the data types in the sample dataset (numerical, categorical, etc.)
- Task 2: Count how many examples and features are in the dataset
- Task 3: Check for missing values and print how many are missing
- Task 4: Calculate basic statistics (mean, min, max) for numerical features
Write your code to complete all tasks and run it to see the results!
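If you are unsure where to begin, the sketch below shows one possible approach to the four tasks. The `dataset` here is hypothetical and stands in for the exercise's sample dataset, which may differ. The helper `infer_type` is also an illustrative name, and its rules (booleans are binary, numbers are numerical, everything else is categorical) are a simplification.

```python
import statistics

# Hypothetical sample dataset; None marks a missing value.
dataset = {
    "size_sqft": [1200, 1500, None, 1800, 1600],
    "bedrooms": [2, 3, 4, None, 3],
    "location": ["urban", "suburban", "urban", "rural", None],
}


def infer_type(values):
    """Guess a column's data type from its non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    # Check bool first: in Python, bool is a subclass of int.
    if all(isinstance(v, bool) for v in present):
        return "binary"
    if all(isinstance(v, (int, float)) for v in present):
        return "numerical"
    return "categorical"


# Task 1: data type of each column
column_types = {name: infer_type(values) for name, values in dataset.items()}

# Task 2: number of examples and features
n_features = len(dataset)
n_examples = len(next(iter(dataset.values())))

# Task 3: missing values per column
missing_counts = {
    name: sum(v is None for v in values) for name, values in dataset.items()
}

# Task 4: basic statistics for numerical columns, ignoring missing values
stats = {}
for name, values in dataset.items():
    if column_types[name] == "numerical":
        present = [v for v in values if v is not None]
        stats[name] = {
            "mean": statistics.mean(present),
            "min": min(present),
            "max": max(present),
        }

print("Types:", column_types)
print(f"{n_examples} examples, {n_features} features")
print("Missing:", missing_counts)
print("Statistics:", stats)
```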
💡 Learning Tip
Practice is essential. Try modifying the code examples, experiment with different parameters, and see how changes affect the results. Hands-on experience is the best teacher!