1. Definition
Classification is a supervised learning task where the goal is to assign input data into predefined categories (classes/labels).
- Example: Spam detection → Email classified as Spam or Not Spam.
👉 It answers the question: “Which class does this data point belong to?”
2. Key Characteristics
- Input: Features (X)
- Output: Discrete labels (Y)
- Training: Uses labeled dataset (input + correct output).
- Goal: Learn a decision boundary that separates different classes.
3. Types of Classification
- Binary Classification → Two classes
- Example: Yes/No, Spam/Not Spam, Fraud/Not Fraud.
- Multiclass Classification → More than two classes
- Example: Handwriting digit recognition (0–9).
- Multilabel Classification → Each input can belong to multiple classes simultaneously
- Example: A movie tagged as Action + Adventure + Sci-Fi.
4. Common Algorithms
- Logistic Regression → Simple but effective for binary classification.
- Decision Trees & Random Forests → Tree-based methods, handle complex patterns.
- Naive Bayes → Probabilistic model, works well with text data.
- k-Nearest Neighbors (kNN) → Classifies based on closest data points.
- Support Vector Machines (SVM) → Finds optimal separating boundary.
- Neural Networks / Deep Learning → Handles complex, high-dimensional data.
5. Performance Metrics
- Accuracy → % of correct predictions.
- Precision → Correct positive predictions / Total predicted positives.
- Recall (Sensitivity) → Correct positive predictions / Actual positives.
- F1-Score → Balance of precision and recall.
- Confusion Matrix → Visual breakdown of true vs. predicted classes.
- ROC-AUC Curve → Measures classifier performance at different thresholds.
6. Applications
- 📧 Spam Detection (Email classification).
- 👩⚕️ Medical Diagnosis (Healthy vs. Diseased).
- 📱 Image Recognition (Cat vs. Dog).
- 🛒 Customer Segmentation (High-value vs. Low-value customers).
- 🎵 Music Genre Classification.
- 🔐 Fraud Detection (Fraudulent vs. Genuine transactions).
7. Challenges
⚠️ Imbalanced datasets (e.g., 95% normal, 5% fraud).
⚠️ Overfitting with complex models.
⚠️ Choosing the right features.
⚠️ High computational cost for large datasets.
✅ In short: Classification = Teaching a machine to assign labels to data based on patterns learned from past labeled examples.