Tutorial: Introduction to Machine Learning, Learning Paradigms, and PAC Learning
1. What is Machine Learning?
Machine Learning (ML) is a subfield of artificial intelligence (AI) that focuses on building systems that learn from data to make decisions or predictions without being explicitly programmed.
Arthur Samuel’s definition (1959):
"Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."
2. Goals of Machine Learning
- Discover patterns in data
- Make predictions on unseen data
- Improve performance with experience
- Generalize well to new situations
3. Learning Paradigms in ML
Machine learning problems can be broadly classified into the following learning paradigms, based on the kind of supervision available in the training data:
3.1 Supervised Learning
- Input: A labeled dataset (X, Y)
- Goal: Learn a function f : X → Y that maps inputs to outputs (a short code sketch follows this list)
- Examples:
- Classification (e.g., spam detection)
- Regression (e.g., house price prediction)
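To make the supervised paradigm concrete, here is a minimal sketch using scikit-learn; the dataset (Iris) and the choice of logistic regression are illustrative assumptions, not part of the notes above.

```python
# A minimal supervised-learning sketch (assumes scikit-learn is installed).
# We train a classifier f : X -> Y on labeled data and evaluate it on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # labeled dataset (X, Y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)     # hold out unseen data

model = LogisticRegression(max_iter=1000)    # one possible hypothesis class
model.fit(X_train, y_train)                  # learn f : X -> Y from examples

y_pred = model.predict(X_test)               # predictions on unseen data
print("Test accuracy:", accuracy_score(y_test, y_pred))
```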
3.2 Unsupervised Learning
- Input: An unlabeled dataset (X)
- Goal: Discover structure or patterns in the data (a short code sketch follows this list)
- Examples:
- Clustering (e.g., customer segmentation)
- Dimensionality reduction (e.g., PCA)
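A comparable unsupervised sketch, again assuming scikit-learn; k-means and PCA are just two common choices of clustering and dimensionality reduction.

```python
# A minimal unsupervised-learning sketch (assumes scikit-learn is installed).
# No labels are used: we look for structure via clustering and dimensionality reduction.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # the labels are deliberately ignored

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments (first 10 points):", kmeans.labels_[:10])

pca = PCA(n_components=2).fit(X)       # project the data to 2 dimensions
print("Explained variance ratio:", pca.explained_variance_ratio_)
```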
3.3 Reinforcement Learning
- Input: Environment and rewards
- Goal: Learn to make sequences of decisions to maximize reward
- Examples:
- Game playing (e.g., AlphaGo)
- Robotics (e.g., walking, grasping)
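Reinforcement learning is usually run against an environment; the sketch below uses a made-up 5-state corridor and plain tabular Q-learning, so both the environment and the hyperparameters are assumptions for illustration.

```python
# A toy reinforcement-learning sketch: tabular Q-learning on a hypothetical
# 5-state corridor. The agent starts at state 0 and gets reward +1 for reaching state 4.
import random

N_STATES, ACTIONS = 5, [-1, +1]          # actions: move left or right
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount, exploration rate
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(500):
    s = 0
    while s != N_STATES - 1:             # episode ends at the rightmost state
        # epsilon-greedy action selection
        a = random.randrange(2) if random.random() < epsilon else Q[s].index(max(Q[s]))
        s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print("Greedy action per state (0 = left, 1 = right):",
      [q.index(max(q)) for q in Q])
```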
3.4 Semi-Supervised Learning
- Mix of labeled and unlabeled data
- Useful when labeling is expensive
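A small sketch of the semi-supervised setting, assuming scikit-learn's LabelPropagation; the 90% masking rate and the Iris data are arbitrary choices for illustration.

```python
# A minimal semi-supervised sketch (assumes scikit-learn is installed).
# Most labels are hidden (set to -1); label propagation infers them from the
# few labeled points plus the geometry of the unlabeled data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.9] = -1     # pretend ~90% of labels are missing

model = LabelPropagation(kernel="knn", n_neighbors=7).fit(X, y_partial)
hidden = y_partial == -1
print("Accuracy on the originally hidden labels:",
      (model.transduction_[hidden] == y[hidden]).mean())
```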
3.5 Self-Supervised Learning
- Derives the supervision signal from the data itself (e.g., predicting missing words in a sentence)
4. PAC LEARNING
Concept / Term | Definition / Explanation |
---|---|
Instance Space (X) | The set of all possible input examples. In PAC learning, typically X = {0,1}ⁿ, the set of all binary vectors of length n. |
Concept / Target Function (c) | The unknown function we aim to learn. It maps instances to labels: c : X → {0,1}. |
Hypothesis Class (H) | A set of candidate functions (hypotheses) the learning algorithm can choose from; formally H can be viewed as a subset of 2^X, with each hypothesis h ∈ H mapping instances to {0,1}. |
Distribution (D) | An unknown probability distribution over the instance space X. The training data is drawn i.i.d. from this distribution. |
Training Examples | A sequence of labeled examples drawn i.i.d. from D: (x₁, c(x₁)), …, (xₘ, c(xₘ)). |
Learning Algorithm (A) | The algorithm that receives the training data and returns a hypothesis h ∈ H that approximates the target function c. |
PAC Guarantee | The output hypothesis h satisfies Pr_{x ∼ D}[h(x) ≠ c(x)] ≤ ε with probability at least 1 − δ. |
ε (epsilon) | The accuracy parameter: how close the hypothesis must be to the target function. Smaller ε means higher accuracy. |
δ (delta) | The confidence parameter: the algorithm must succeed with probability at least 1 − δ. Smaller δ means higher confidence. |
PAC (Probably Approximately Correct) | "Probably" = with probability ≥ 1 − δ; "Approximately Correct" = error ≤ ε. |
PAC Learnability | A concept class C is PAC-learnable if there exists an algorithm that, for any ε, δ ∈ (0,1) and any distribution D, returns a hypothesis satisfying the PAC guarantee using time and samples polynomial in 1/ε, 1/δ, and the problem size. |
Sample Complexity (m) | The number of training examples needed to achieve the PAC guarantee. With the VC dimension: m = O((1/ε) · (VCdim(H) · log(1/ε) + log(1/δ))). |
VC Dimension (VCdim(H)) | The Vapnik-Chervonenkis dimension: a measure of the capacity/complexity of the hypothesis class H. |
One-sided Error | The hypothesis may err only on one side, e.g., it never makes false positives (it always predicts 0 on negative instances). |
Two-sided Error | Errors are allowed on both positive and negative examples. |
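To make the Sample Complexity row concrete, the sketch below evaluates the standard finite-hypothesis-class bound for the realizable case, m ≥ (1/ε)(ln|H| + ln(1/δ)); the particular |H|, ε, and δ values are illustrative assumptions.

```python
# A small calculator for the finite-hypothesis-class PAC sample bound
# (realizable case, consistent learner):  m >= (1/eps) * (ln|H| + ln(1/delta)).
import math

def pac_sample_bound(h_size: int, eps: float, delta: float) -> int:
    """Sufficient number of i.i.d. examples for a consistent learner over a
    finite hypothesis class to be probably (1 - delta) approximately (eps) correct."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Example: conjunctions over n = 10 Boolean attributes -> |H| = 3**10 + 1
print(pac_sample_bound(h_size=3**10 + 1, eps=0.1, delta=0.05))   # prints 140
```

For infinite hypothesis classes, the VC-dimension bound in the table plays the same role, with VCdim(H) replacing ln|H| up to logarithmic factors.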
Probably Approximately Correct (PAC) learning is a foundational framework in computational learning theory, introduced by Leslie Valiant, that formalizes the concept of learnability from a theoretical, algorithmic perspective. It defines learning as the process of inferring a hypothesis that closely approximates an unknown target concept based on randomly drawn labeled examples, without relying on explicit programming. A concept class is considered PAC-learnable if a learning algorithm can, with high probability, produce a hypothesis that performs well on unseen data, using only a feasible (polynomial) number of training examples and computational steps. The framework assumes the presence of an unknown data distribution and allows the learner to receive examples either passively (as positive instances) or actively through queries to an oracle.

Crucially, PAC learning emphasizes generalization: the learned hypothesis must not just fit the training data but should also perform well on new instances from the same distribution. The theory identifies specific classes of Boolean functions, such as bounded CNF and monotone DNF expressions, that are efficiently learnable, while also highlighting inherent limitations due to computational intractability and cryptographic hardness in learning more complex or unrestricted functions. Extensions to the PAC model address practical concerns like noise in data, real-valued outputs, and domain-specific biases, while the notion of VC dimension helps quantify the capacity of a hypothesis space and determine the number of examples needed for learning. Overall, PAC learning offers a rigorous, probability-based approach to understanding what can be learned, how efficiently it can be learned, and under what conditions learning is possible.
VC Dimension
Definition
The Vapnik-Chervonenkis (VC) Dimension is a fundamental measure of the capacity or expressiveness of a hypothesis class (denoted H). It quantifies how well the class can fit various labelings of data.
Key Concepts
Concept | Description |
---|---|
Shattering | A set of points is shattered by hypothesis class H if H can realize all possible labelings of those points (2ⁿ labelings for n points). |
VC Dimension | The maximum number of points that can be shattered by H. Denoted VC(H). |
Intuition | Measures the ability of H to fit any training data perfectly. Higher VC dimension means higher complexity. |
Zero Error Bound | If H shatters a set of N examples (i.e., some hypothesis in H attains zero error for every possible labeling of them), then N ≤ VC(H). |
Example Interpretation
- If VC(H) = 3, then:
  - There exists some set of 3 points that H can shatter (not necessarily every configuration of 3 points).
  - No set of 4 points can be shattered by H.
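Shattering can be checked by brute force on small examples. The sketch below uses a hypothetical hypothesis class, 1-D threshold functions h_t(x) = 1 iff x ≥ t, whose VC dimension is 1; the class and the test points are assumptions for illustration.

```python
# A brute-force shattering check (illustrative sketch). We test whether the class
# of 1-D threshold classifiers h_t(x) = 1 if x >= t else 0 can shatter a given
# point set, by trying to realize every possible labeling.
from itertools import product

def threshold_hypotheses(points):
    """Candidate thresholds: below all points, between adjacent points, above all points."""
    xs = sorted(points)
    return [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]

def shatters(points):
    for labeling in product([0, 1], repeat=len(points)):
        realized = any(
            all((1 if x >= t else 0) == y for x, y in zip(points, labeling))
            for t in threshold_hypotheses(points)
        )
        if not realized:
            return False
    return True

print(shatters([2.0]))        # True: any single point can be shattered
print(shatters([1.0, 3.0]))   # False: labeling (1, 0) is unrealizable, so VC dim of thresholds is 1
```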
Why It Matters
- A core concept in statistical learning theory.
- Helps balance model complexity vs. overfitting.
- Used to understand generalization in machine learning.
Historical Note
- Introduced by Vladimir Vapnik and Alexey Chervonenkis.
- Applicable to binary classifiers, geometric set families, and more.


Vapnik, V. N., & Chervonenkis, A. Y. (2015). On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity: Festschrift for Alexey Chervonenkis (pp. 11-30). Cham: Springer International Publishing.



FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS
(Ref: Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.)
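As a minimal sketch of FIND-S (not Mitchell's exact code), the function below starts from the most specific hypothesis and generalizes it just enough to cover each positive example, ignoring negatives; the EnjoySport-style training tuples are the usual textbook illustration.

```python
# A minimal sketch of FIND-S for conjunctive hypotheses (after Mitchell, 1997).
def find_s(examples):
    """examples: list of (attribute_tuple, label) pairs; labels are 'Yes'/'No'."""
    h = None                              # most specific hypothesis: matches nothing yet
    for x, label in examples:
        if label != "Yes":                # FIND-S ignores negative examples
            continue
        if h is None:
            h = list(x)                   # first positive example: copy it verbatim
        else:
            # Generalize each attribute just enough to cover this positive example.
            h = [hi if hi == xi else "?" for hi, xi in zip(h, x)]
    return h

training_data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
print(find_s(training_data))
# -> ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```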

A Systematic Approach to Learning with the Candidate Elimination Algorithm

Revisiting the Candidate Elimination Algorithm: Simple Steps
Step 1: Initialization
- S = most specific hypothesis (matches nothing yet)
- G = most general hypothesis (matches everything)
Step 2: Process each training example
When the example is Positive (Yes)
- Adjust S so it becomes just general enough to include this example.
- Remove from G any hypothesis that does not include this example.
When the example is Negative (No)
- Remove from S any hypothesis that still includes this example.
- Specialize each hypothesis in G that includes this example, just enough to exclude it, while still including all positive examples so far.
- Remove from G any hypothesis that is more specific than another hypothesis in G, as well as any duplicates.
Step 3: Completion
- After all examples are processed, the version space is the set of hypotheses between S and G.
- S represents the narrowest possible rule consistent with the data.
- G represents the broadest possible rule consistent with the data.
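The steps above can be turned into a compact sketch for conjunctive hypotheses over the same EnjoySport-style data as the FIND-S example. The helper names, the use of a single hypothesis for S, and the assumption of noise-free, consistent training data are all simplifications for illustration rather than a full implementation.

```python
# A compact sketch of the Candidate Elimination algorithm for conjunctive
# hypotheses (after Mitchell, 1997). '?' means "any value"; the S boundary is
# kept as a single maximally specific hypothesis, as in the conjunctive case.
def covers(h, x):
    return all(hi == "?" or hi == xi for hi, xi in zip(h, x))

def more_general_or_equal(h1, h2):
    """True if h1 covers every instance that h2 covers."""
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, n_attrs):
    S = None                          # most specific boundary (matches nothing yet)
    G = [["?"] * n_attrs]             # most general boundary (matches everything)
    for x, label in examples:
        if label == "Yes":
            G = [g for g in G if covers(g, x)]       # drop g's that exclude the positive
            S = list(x) if S is None else [
                si if si == xi else "?" for si, xi in zip(S, x)]
        else:
            # (For consistent data, S never covers a negative, so that check is omitted.)
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                # Minimally specialize g so it excludes x but stays above S.
                for i in range(n_attrs):
                    if g[i] == "?" and S is not None and S[i] != "?" and S[i] != x[i]:
                        spec = list(g)
                        spec[i] = S[i]
                        new_G.append(spec)
            # Keep only maximally general, distinct hypotheses.
            new_G = [list(t) for t in dict.fromkeys(tuple(g) for g in new_G)]
            G = [g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)]
    return S, G

data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
S, G = candidate_elimination(data, n_attrs=6)
print("S:", S)
print("G:", G)
```

With the four training examples shown, this yields S = [Sunny, Warm, ?, Strong, ?, ?] and G = {[Sunny, ?, ?, ?, ?, ?], [?, Warm, ?, ?, ?, ?]}, matching the textbook version space.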



