2025 Fall UVa CS Machine Learning Lectures Organized by Given Order

---- ----

Introduction

Lecture: S0-Intro
Version: current
Notes to Read: Numpy Tutorial
Video: M2 / M3

1Basic

Lecture 1 Study Guide

Course Logistics and Overview

Course Staff and Communication

Instructor: Prof. Yanjun Qi (Pronounced: /ch ee/)
Communication: Email and Slack (details on Course overview)
TA and Office Hours: Information available on the Course Website

Course Materials

Textbook: None required; multiple reference books shared via Course Website
Official Content: Anything not mentioned in Prof. Qi’s slides is not an official topic
Resources:
- All slides, videos, readings, and homework on Course Website
- Assignments/Project: UVA CourseSite
- Quizzes: Google Forms

Background Needed

Required:
- Calculus
- Basic linear algebra
- Basic probability
- Basic Algorithm
Recommended: Statistics
Programming: Python (required for all assignments)

Assessment Breakdown

Weekly Quizzes: 20% (top 10 scores count, 14 total, 10 minutes each, closed book via Google Form)
Homework Assignments: 65% (HW1-HW5, 10% each)
Final Project: 10% (apply ML to real-world problems)
Weekly Reading Question Submissions: 5%
Final Exam: 15% (December 9th)
See CourseWeb for detailed policy

Course Format (2025F Flipped Classroom - Starting Week 3)

In-Class Activities

Pre-class question submission
Quizzes
Quiz review
Assignment review
In-class solution examples
Short presentations/code demos

Emphasis

Active learning
Understanding core ML concepts
Applying models (Python, scikit-learn, Keras)
Analyzing performance and ethics
Becoming an ML tool builder

Key Objectives

Learn fundamental principles, algorithm design, methods, and applications
Build simple machine learning tools from scratch (not just tool users)
Understand complex machine learning methods at the source code level

Machine Learning History and Concepts

Artificial Intelligence (AI)

Definition: The study of computer systems that attempt to model and apply the intelligence of the human mind
Attributes of Intelligence (for machines):
- Perceive
- Understand
- Interact
- Think/Reason/Learn
- Imagine/Make analogy
Impact: Economic, cultural, social, health disruptions; potential for job automation; existential threat concerns

History of AI/ML (Timeline)

1950: Alan Turing’s “Computing Machinery and Intelligence” (“Can machines think?”)
1956: Dartmouth workshop, John McCarthy coins “artificial intelligence”
1950s-60s: Enthusiasm, early Neural Networks (NNs) popular
70s-80s: Rule-based/Knowledge-based systems, Expert systems
1959: Arthur Samuel popularizes “machine learning” (Samuel Checkers-playing Program)
90s: Classic ML (SVM, RF, BN), Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) invented
2000s: ImageNet, Deep Learning breakthroughs
2010s: Deep Learning breakthroughs continue (2015-2016 CNN/RNN invented, 2017 Deep Learning, Deep Reinforcement Learning, GANs)
2023-Now: ChatGPT / Generative AI / Agentic AI

Reasons for 2010s Deep Learning Breakthroughs

Plenty of good quality (labeled) data (text, visual, audio, user activity, knowledge graphs, genomics, medical imaging)
Advanced computer architecture that fits Deep Learning (e.g., GPUs for faster, more accurate results)
Powerful, well-engineered machine learning libraries (easy to learn, use, extend)

Recent Advances (2023-Now) on Generative AI

Definition: AI systems that create content (text, images, audio, video, code), powered by foundation models (transformers, diffusion, LLMs). Distinct from discriminative AI
Key Applications:
- LLMs (text/chat)
- Diffusion models (images)
- Gen-2/Sora (video)
- GitHub Copilot (code)
- Multi-modal models
Enabling Factors: Scaling law, data, advanced architecture, powerful libraries

Machine Learning Basics

Goal

Build computer systems that learn and adapt from “experience” (data examples)

Traditional Programming vs. Machine Learning

Traditional: Computer + Data + Program → Output
ML (Training Phase): Computer + Input Data (X) + Output (Y) → Program/Model f()

Common Notations

Inputs (X):
- Matrix (bold capital)
- p variables, n observations
- Xⱼ is jth element
- Xⱼ,ᵢ is ith observed value
- Vectors are column vectors
Outputs:
- Quantitative (Y)
- Qualitative/Categorical (C)
Observed Variables: Lowercase

Supervised Learning

Find a function f() to map input space X to output space Y such that the difference between true y and f(x) is small

Example: Linear Binary Classifier

f(x,w,b) = sign(wᵀx + b)

wᵀx + b > 0 (e.g., +1 point)
wᵀx + b < 0 (e.g., -1 point)

Training

Learning parameters w, b by minimizing a loss/cost function L() on the training set (available x and y pairs)

Testing

Evaluating performance on “future” (unseen) points by comparing true y to predicted f(x). Key: testing examples are not in the training set

Basic Concepts

Loss Function: Measures difference between y and f(x) (e.g., hinge loss for binary classification)
Regularization: Additional information added to the loss function to control f
Generalization: The ability to learn a function from past data to “explain,” “predict,” “model,” or “control” new, unseen data examples (inductive reasoning)

Two Modes of ML

Training: Input-output pairs (X, Y) are fed into a model to learn f()
Testing/Production: Unseen input X’ is fed into the learned model f() to produce f(X’) (predicted output)

Three Aimed Features of ML Models

Robustness
Computation
Prediction

General Lessons for Excellence

Good breadth in fundamentals is key
Strength in particular targeted topics helps stand out

Recommended Extra-Curriculum Books

Master Algorithm by Pedro Domingos (explores different “tribes” of ML: Symbolists, Connectionists, Evolutionists, Bayesians, Analogizers)
Homo Deus: A Brief History of Tomorrow by Yuval Noah Harari

Quiz: UVA CS 4774: Machine Learning - Introduction

Instructions: Answer each question in 2-3 sentences.

What is the main distinction between “Traditional Programming” and the “Machine Learning (training phase)” as illustrated in the lecture?
Name three specific reasons cited in the lecture for the significant breakthroughs in Deep Learning during the 2010s.
Define “Generative AI” and explain how it differs from “discriminative AI” as presented in the source material.
In the context of supervised learning, what is the purpose of the “training phase” for a linear binary classifier, and what is being minimized?
What is the concept of “generalization” in machine learning, and why is it crucial for model performance?

Quiz Answer Key

Machine Learning Goal
The primary goal of machine learning is to build computer systems that can learn and adapt from their “experience.” In this context, “experience” refers to the available data examples, also known as instances or samples, which are described with various properties.
Traditional vs ML Programming
Traditional Programming involves a computer executing a pre-defined program on data to produce an output. In contrast, the Machine Learning training phase takes input data (X) and corresponding outputs (Y) to learn or create the program/model f().
Deep Learning Breakthroughs
The significant breakthroughs in Deep Learning during the 2010s were primarily due to the availability of plenty of good quality labeled data, advanced computer architectures suitable for Deep Learning (like GPUs), and powerful, well-engineered machine learning libraries.
Generative AI
Generative AI refers to AI systems that are capable of creating new content, such as text, images, or code, powered by foundation models. This differs from discriminative AI, which primarily focuses on tasks like classification rather than content generation.
Training Phase Purpose
In supervised learning, the training phase for a linear binary classifier aims to learn the parameters w and b. This is achieved by minimizing a loss or cost function L(), which quantifies the difference between the true output y and the model’s prediction f(x) on the available training examples.
Generalization
Generalization in machine learning is the ability of a model to apply the knowledge learned from past, observed data to effectively “explain,” “predict,” or “model” new, unseen data examples. It is crucial because it indicates whether the model has truly learned the underlying patterns rather than just memorizing the training data.

Algebra Review

Lecture: S0-AlgbReview
Version: current
More advanced to Read: S0-linalg-extra.pdf
Notes to Read: Khanacademy Math + CMU study note
Video: Python Data Science Code tutorials

1Basic

Comprehensive Linear Algebra and Matrix Calculus Review

Study Guide

This study guide is designed to reinforce your understanding of fundamental concepts in linear algebra and matrix calculus, as presented in the source material. Mastering these topics is crucial for advanced studies in machine learning and related fields.

I. Fundamental Definitions

Scalar: Understand what a scalar is and how it’s denoted.
Vector: Define a vector, differentiate between row and column vectors, and understand its representation in real space (R^n).
Matrix: Define a matrix, understand its dimensions (m-by-n), how its entries are denoted, and the concept of a square matrix.

II. Special Matrices

Identity Matrix (I): Understand its structure and significance.
Diagonal Matrix: Recognize its form.
Upper/Lower Triangular Matrices: Identify these structures.
Symmetric Matrix: Understand the property that defines a symmetric matrix.
Orthogonal Matrix: Define an orthogonal matrix and its key property related to its inverse.

III. Matrix Operations

Transposition

Concept: “Flipping” rows and columns.
Notation: A^T.
Application: Understand how to transpose a given matrix.

Addition and Subtraction

Condition: Matrices must be of the same size.
Process: Entry-wise operation.
Examples: Be able to perform addition and subtraction.

Multiplication

Notation: AB (pre-multiplying B by A, post-multiplying A by B).
Conformability: Understand the condition for multiplication (number of columns in premultiplier must equal number of rows in postmultiplier).
Dimensions: Predict the dimensions of the resulting matrix (m × n * n × p = m × p).
Non-Commutativity: Recognize that AB ≠ BA, generally.
Associativity: Understand A(BC) = (AB)C.
Properties: (AB)^T = B^T A^T; multiplication with Identity Matrix.

Special Uses

Scalar & Matrix Product: Understand bA = Ab.
Dot (Inner) Product of Vectors: a^T b (result is a scalar).
Outer Product of Vectors: ab^T (result is a matrix).
Sum of Squared Elements of a Vector: a^T a (useful for L2 norm).

Norm (of Vector)

Concept: Measure of “length” of a vector.
Common Norms: L1, L2 (Euclidean), L∞.
L2 Norm (Euclidean): x = √(x^T x).
General Definition: Understand the properties a function must satisfy to be a norm.

Matrix Inversion

Notation: A^(-1) or inv A.
Definition: AA^(-1) = I = A^(-1)A.
Nonsingular vs. Singular: Differentiate between matrices that have an inverse and those that don’t.
Determinant Condition: An inverse exists if and only if A ≠ 0.
Properties: (AB)^(-1) = B^(-1)A^(-1); (A^T)^(-1) = (A^(-1))^T; (A^(-1))^(-1) = A.
Special Cases: Inverse of diagonal matrices, orthogonal matrices (A^(-1) = A^T).
Pseudo-inverse (Extra): Basic understanding of its existence for non-square matrices.

Matrix Rank

Linear Independence: Define linearly independent vectors.
Definition of Rank: Maximal number of linearly independent columns (or rows).
Properties: rank(A) ≤ min(m,n). Full row rank, full column rank.
Determinant Connection: Rank is the dimension of the largest square sub-matrix with a non-zero determinant.
Relationship to Singular Matrices: How rank relates to singularity.
Rank of Product: rank(AB) ≤ min(rank(A), rank(B)).

Matrix Calculus

Review of Derivatives: Understand the basic definition of a derivative for single-variable functions.
Multivariate Calculus Concepts:
- Partial Derivative: Extension of derivative to functions of multiple variables.
- Gradient: Vector of partial derivatives (first derivative to gradient).
- Hessian Matrix: Matrix of second partial derivatives (second derivative to Hessian).
- Denominator Layout: Understand that the gradient’s size is the same as the variable’s size.
- Examples: Be able to calculate simple gradients.

IV. Additional Concepts (Mentioned as Extra or for Future Reference)

Trace(): Sum of diagonal elements (not detailed in this source, but mentioned as “must know”).
Eigenvalue / Eigenvectors: Important for future topics.
Positive Definite Matrix, Gram Matrix, Quadratic Form, Projection: Future topics.
Khan Academy: A recommended resource for review.
Linear Transformation and Determinant: Connection to matrix determinant and Jacobian determinant.
Laplacian of a function: Trace of the Hessian.
Harmonic function: Function whose Laplacian is 0.

Quiz: Short-Answer Questions

What is the primary difference between a scalar and a vector in linear algebra notation?
Define a “column vector” and provide an example of a vector in R^4.
Explain the condition for two matrices to be “conformable” for multiplication. Why is this condition important?
Given matrices A (3×2) and B (2×4), what are the dimensions of the resulting matrix C = AB? What about C = BA?
State two key properties of matrix multiplication that distinguish it from scalar multiplication.
Describe the concept of a “norm” of a vector. What does the L2 norm represent?
What is the definition of the inverse of a matrix A (denoted A^(-1))? Under what crucial condition does a matrix inverse exist?
Differentiate between a “nonsingular” and a “singular” matrix.
Explain the concept of “linear independence” in the context of vectors. How does this relate to the rank of a matrix?
What is the gradient of a multivariate function? How does it relate to partial derivatives?

Quiz Answer Key

Scalar vs. Vector: A scalar is a single number (e.g., 1 or 22), denoted with regular type. A vector is a single row or column of numbers (e.g., [1 2 3]), denoted with bold small letters.
Column Vector and R^4 Example: A column vector is a vector arranged vertically. An example of a vector in R^4 (a column vector) is v = (1,6,3,4)^T.
Conformable Matrices: For two matrices A and B to be conformable for multiplication (AB), the number of columns in the premultiplier (A) must equal the number of rows in the postmultiplier (B). This ensures that the dot product for each entry in the resulting matrix can be calculated.
Matrix Multiplication Dimensions: If A is (3×2) and B is (2×4), then C = AB will have dimensions (3×4). C = BA cannot be performed because the number of columns in B (4) does not equal the number of rows in A (3).
Properties of Matrix Multiplication:
- Matrix multiplication is generally not commutative (AB ≠ BA).
- Matrices must be conformable for multiplication to occur.

Vector Norm: A norm of a vector

is a measure of its “length.” The L2 norm, also known as the Euclidean norm, represents the straight-line distance from the origin to the vector’s endpoint (calculated as the square root of the sum of the squared elements).

Matrix Inverse Definition and Condition: The inverse of an n × n matrix A is the matrix A^(-1) such that AA^(-1) = I = A^(-1)A, where I is the identity matrix. A crucial condition for an inverse to exist is that the determinant of A (

) must not be zero.

Nonsingular vs. Singular Matrix: A “nonsingular” matrix is an n × n matrix that has an inverse. A “singular” matrix is an n × n matrix that does not have an inverse.
Linear Independence and Rank: A set of vectors is linearly independent if none of them can be written as a linear combination of the others. The rank of a matrix is defined as the maximal number of linearly independent columns (or rows) in that matrix.
Gradient of a Multivariate Function: The gradient of a multivariate function is a vector containing its partial derivatives with respect to each variable. It extends the concept of a derivative to functions of multiple variables, indicating the direction of the steepest ascent of the function.

Section 1 - Basics Supervised & On Tabular Input Type

Scikit-learn

Lecture: S0-scikit-learn-self-study
Version: current
Notes to Read: basic tutorial + scikit-learn code examples
Video: How install

1Basic platform

Machine Learning in a Nutshell

Lecture: S1-nutshell
Version: current
Notes to Read: two modes running example
Video: M1 / M2 / M3

1Basic

Machine Learning in a Nutshell: Comprehensive Study Guide

I. Core Concepts of Machine Learning

A. Definition and Goal

Machine Learning (ML): A field of artificial intelligence where systems learn from data to optimize a performance criterion, aiming to generalize to unseen data.
Goal: To enable computers to learn from example data or past experience without being explicitly programmed, and then apply this learned knowledge to make predictions or decisions on new, unseen data.

B. The Machine Learning Pipeline (Nutshell Components)

Data: The raw material for learning, consisting of examples/instances/samples, each with features/attributes and potentially a target/label.
Task: The specific problem ML aims to solve (e.g., classification, regression, clustering).
Representation: The way data and the learning function are structured (e.g., linear function, neural network).
Score Function (Loss/Cost Function): A metric used to evaluate how well the model’s predictions align with the actual values. The goal is often to minimize this function.
Search/Optimization: The algorithms used to find the best model parameters by minimizing the score function (e.g., gradient descent).

Models, Parameters, Hyperparameters, Metrics

Models: The learned functions or structures.
Parameters: Internal variables of the model learned from data (e.g., w, b in a linear classifier).
Hyperparameters: Parameters whose values are set before the learning process begins (e.g., learning rate, regularization strength).
Metrics: Measures used to evaluate model performance (e.g., accuracy, precision, recall, MSE).

C. Two Phases of Machine Learning

Training (Learning Parameters)

Uses a training set of input-output pairs (X, Y).
The objective is to find model parameters (e.g., w, b) that minimize the loss/cost function, representing the difference between predicted f(x) and true y on the training examples.

Testing / Deployment / Inference (Evaluating Performance)

Uses a testing set of examples (x?) that were not included in the training set.
Evaluates the model’s performance by measuring the difference between true y? and predicted f(x?).
Crucial for assessing the model’s ability to generalize to new data.

D. When to Use Machine Learning

Extracting knowledge from data: Discovering hidden relationships and correlations in large datasets.
Learning tasks difficult to formalize: Problems hard to define explicitly with rules (e.g., face recognition).
Creating software that improves over time: Systems that can adapt to new knowledge and data without continuous manual redesign.

II. Data Types in Machine Learning

A. General Terminology

Data/points/instances/examples/samples/records: Rows in a dataset.
Features/attributes/dimensions/independent variables/covariates/predictors/regressors: Columns in a dataset (excluding the target).
Target/outcome/response/label/dependent variable: The special column to be predicted.

B. Main Types of Columns

Continuous: Real numbers (e.g., weight, temperature).
Discrete: Symbols or categories (e.g., “Good”, “Bad”, “Cat”, “Dog”).

C. Examples of Data Types

Tabular Data: Structured data in rows and columns (e.g., patient records with gene quantities).
1D Sequence Data: Data with a sequential order (e.g., language text, genome sequences, audio).
2D Grid Data: Data arranged in a grid (e.g., images).
Graph Data: Data representing relationships between entities (e.g., social networks).

III. Machine Learning Tasks

A. Supervised Learning

Definition: Learning from labeled input-output pairs (X, Y). The goal is to predict Y for unseen X.

Classification: Predicting a discrete target variable (Y)

Binary Classification: Two possible output classes (e.g., positive/negative review, disease present/absent).
Multi-class Classification: More than two possible output classes (e.g., categorizing text into Sports, Business, Education).
Multi-label Classification: Assigning a set of target labels to a single input (e.g., an image being tagged with “Castle” and “Mountains”).
Hierarchical Classification: Classification where classes are organized in a hierarchy.

Regression: Predicting a continuous target variable (Y)

Generating: Creating new data instances based on learned patterns (e.g., Text2Image, Edges2Image)

B. Unsupervised Learning

Definition: Learning from unlabeled data (no Y provided).
Goal: Finding patterns, structures, or relationships within the data.
Example: Clustering – grouping instances into “natural” clusters.

C. Structured Output Learning

Definition: Prediction tasks where output labels have complex structured correlations or constraints (e.g., spatial, temporal, relational dependencies).
Example: Language parsing (predicting a parse tree), sequence labeling (predicting a sequence of labels for a sequence of inputs).

D. Reinforcement Learning (RL)

Definition: An agent interacts with an environment and learns to maximize a scalar reward signal.
Key Characteristics: Not independent and identically distributed (IID) data, sequential decision-making.
Variations: Basic RL (no labels/supervision), imitation learning (supervised).

IV. Representation Types

A. Concept of Representation

How the input data (X) and the function (f) mapping X to Y are structured.

B. Examples of Representations

Linear functions: f(x,w,b) = sign(w^T x + b) for binary classification.
Nonlinear functions: Polynomial expansion, logistic function, trees, multi-layer networks (neural networks).
Prob-density family: Bernoulli, multinomial, Gaussian, mixture of Gaussians.
Vector Space Representation: Representing text as vectors (e.g., Bag of Words).
Representation Learning (Deep Learning/Feature Learning): Automatically learning useful features from raw data, often through multi-layer architectures. This replaces traditional, time-consuming “Feature Engineering.”

V. Loss/Cost Types

A. Purpose of Loss Functions

Quantify the difference between the model’s prediction and the true target value.
Minimizing the loss function guides the learning process.

B. Examples of Loss Functions

Mean Squared Error (MSE): Common for regression tasks.
Hinge Loss: Used in binary classification (e.g., Support Vector Machines).
Log-likelihood / Cross-entropy: Common for classification tasks, especially with probabilistic models.
0-1 Loss: Simple loss for classification, counts misclassifications.
Regularized loss functions (L1, L2): Add penalty terms to the loss function to control model complexity and prevent overfitting.

VI. Optimization and Model Properties

A. Search/Optimization Algorithms

Purpose: To find the optimal model parameters by minimizing the loss/cost function.

Examples:

Gradient Descent (GD): An iterative first-order optimization algorithm that takes steps proportional to the negative of the gradient.
Stochastic Gradient Descent (SGD): A variant of GD that updates parameters using a single training example or a small batch at a time.
Newton’s Method: A second-order optimization algorithm.
Linear Programming, Quadratic Programming: For specific types of optimization problems.
EM (Expectation-Maximization): For models with latent variables.
Backpropagation: Used to train neural networks by efficiently computing gradients.

B. Model Properties / Basic Concepts

Generalization: The ability of a model to perform well on unseen data. This is the ultimate goal of ML.
Overfitting: When a model learns the training data too well, including noise, leading to poor performance on new data.
Underfitting: When a model is too simple to capture the underlying patterns in the data, leading to poor performance on both training and test data.
Bias-Variance Tradeoff: A fundamental concept balancing model simplicity (high bias, low variance) and complexity (low bias, high variance).
- Big gap between train/validation loss suggests overfitting (high variance).
- No gap, both bad, suggests underfitting (high bias).
Regularization: Techniques (e.g., L1, L2) that add a penalty to the loss function to discourage overly complex models and prevent overfitting.

VII. Evaluation and Advanced Considerations

A. Measuring Prediction Accuracy

Evaluate models on test data using appropriate metrics.
For supervised classification, common metrics include accuracy, precision, recall, F1-score.

B. Human-Centric ML/AI Research

Trustworthy ML: Ensuring models are reliable and fair.
Explainable ML: Understanding why a model makes certain predictions.
Robustness: How well a model performs under adversarial conditions (e.g., adversarial examples).
Interacting with/Understanding/Modeling Human Intent: Developing AI that better collaborates with humans.

Quiz: Machine Learning Fundamentals

Instructions: Answer each question in 2-3 sentences.

What is the primary goal of machine learning, and how does it differ from traditional programming?
Explain the concept of “generalization” in machine learning. Why is it important, and what happens if a model fails to generalize?
Describe the two main phases of a machine learning process. What is the key difference in the data used during these phases?
Define a “loss function” and explain its role in training a machine learning model.
What is the difference between “parameters” and “hyperparameters” in a machine learning model? Provide a brief example for each.
List three types of data commonly encountered in machine learning. Give a real-world example for each.
Distinguish between supervised and unsupervised learning tasks. Provide a task example for each.
Briefly explain what “representation learning” is and why it has become more prominent compared to traditional “feature engineering.”
What is the “bias-variance tradeoff”? How might you diagnose a model that is suffering from high variance?
Describe the purpose of “regularization” in machine learning. How does it help in model training?

Answer Key

Primary Goal of ML: The primary goal of ML is to optimize a performance criterion using example data or past experience, aiming to generalize to unseen data. It differs from traditional programming where rules are explicitly coded; instead, ML systems learn patterns from data to make predictions or decisions.
Generalization Concept: Generalization refers to a model’s ability to perform accurately on new, unseen data, not just the data it was trained on. It is crucial because the ultimate purpose of an ML model is to make predictions in real-world scenarios. A model that fails to generalize is said to be overfitting, meaning it has learned the training data too specifically, including noise, and will perform poorly on new examples.
Two Main Phases: The two main phases are training and testing/deployment. During training, the model learns parameters from a labeled “training set.” During testing, the model’s performance is evaluated on a separate “testing set” of examples that were not part of the training data.
Loss Function: A loss function quantifies the discrepancy between a model’s predicted output and the actual true output for a given example. Its role in training is to provide a metric that the optimization algorithm (e.g., gradient descent) attempts to minimize, thereby iteratively adjusting the model’s parameters to improve its accuracy.
Parameters vs Hyperparameters: Parameters are internal variables of the model learned from the data during training, like the weights (w) and bias (b) in a linear classifier. Hyperparameters are external configurations set before the training process begins, such as the learning rate in gradient descent or the regularization strength.
Three Data Types: Three common data types are tabular, 2D grid, and 1D sequence data. Tabular data can be patient records with various medical attributes. 2D grid data commonly refers to images, like photographs for object recognition. 1D sequence data includes text documents for sentiment analysis or genome sequences.
Supervised vs Unsupervised: Supervised learning involves learning from labeled input-output pairs (X, Y) to predict Y for new X; an example is classifying emails as spam or not spam. Unsupervised learning, conversely, deals with unlabeled data to find patterns or structures within it; an example is clustering customer data to identify distinct market segments.
Representation Learning: Representation learning is the process where a machine learning model automatically discovers the representations (features) needed for detection or classification from raw data. It has become prominent because it often yields more effective features and reduces the laborious, time-consuming, and often task-dependent manual effort required in traditional “feature engineering.”
Bias-Variance Tradeoff: The bias-variance tradeoff refers to the dilemma of simultaneously minimizing two sources of error that prevent models from generalizing well: bias (error from overly simplistic assumptions) and variance (error from sensitivity to small fluctuations in the training data). High variance (overfitting) can be diagnosed by observing a large gap between the model’s performance on the training data (very good) and its performance on validation or test data (significantly worse).
Regularization Purpose: Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function during training. This penalty discourages the model from becoming too complex or assigning excessively large weights to features. By controlling model complexity, regularization helps the model generalize better to unseen data.

Linear Regression

Lecture: S1-LinearReg
Version: current
Notes to Read: linear regression coderun
Video: M1 / M2 / M3 / M4

2Regression Linear 1Basic

Notebook Resources: notebook/L3_plot_ols.ipynb

Understanding Linear Regression: A Study Guide

Quiz

Instructions: Answer the following questions in 2-3 sentences each.

What is the primary goal of Machine Learning in a nutshell, as described in the lecture?
Define “regression” in the context of machine learning, specifically mentioning the nature of the target variable.
Explain the purpose of the “score function” (or loss/cost function) in linear regression.
Describe what the training dataset consists of for supervised regression.
In the linear regression equation f(x) = wx + b, what do w and b represent?
How does the concise notation ŷ = f(x) = x^T θ = θ^T x simplify the representation of linear regression with multiple features?
What is the “Sum of Squared Error” (SSE) and why is it used as a loss function in linear regression?
Briefly explain the concept of “Normal Equations” in the context of finding optimal regression coefficients.
Why is it important for a learned model to “Generalize Well” to unseen data?
What are the three main aims for a learned model besides generalizing well, as discussed in the lecture?

Answer Key

Primary Goal of Machine Learning: The primary goal of Machine Learning is to optimize a performance criterion using example data or past experience. The ultimate aim is to generalize this learned knowledge effectively to unseen data.
Regression Definition: In machine learning, regression refers to tasks where the target variable (Y) is continuous. This means the model predicts a numerical value rather than a category or class.
Score Function Purpose: The score function (or loss/cost function), denoted as L(), quantifies the difference between the model’s predictions and the actual target values. The goal of the learning algorithm is to minimize this function, thereby finding the optimal model parameters.
Training Dataset: The training dataset for supervised regression consists of input-output pairs. These pairs include available examples (input features, X) and their corresponding continuous target labels (output, Y), which the model uses to learn.
Linear Regression Parameters: In the linear regression equation f(x) = wx + b (for a single variable), w represents the weight or slope, indicating how much Y changes for a unit change in X. b represents the bias or intercept, which is the value of Y when X is zero.
Concise Notation: The concise notation ŷ = f(x) = x^T θ = θ^T x uses vector/matrix products to represent multiple features and their corresponding weights in a single expression. This compact form simplifies mathematical derivations and computational implementation for multivariate linear regression.
Sum of Squared Error (SSE): The Sum of Squared Error (SSE) measures the sum of the squared differences between the predicted values (f(x)) and the actual target values (y). Squaring the errors ensures that all errors contribute positively to the total loss and penalizes larger errors more heavily, providing a clear objective for minimization.
Normal Equations: The Normal Equations provide a closed-form solution to find the optimal regression coefficients (θ) by setting the derivative (gradient) of the cost function with respect to θ to zero. This method directly calculates the θ that minimizes the SSE for linear regression.
Generalization: Generalizing well means that the model performs accurately not only on the data it was trained on but also on new, unseen data. This indicates that the model has learned underlying patterns rather than just memorizing the training examples, making it useful in real-world applications.
Additional Model Aims: Besides generalizing well, the lecture suggests that a learned model should also be computationally scalable and efficient, and robust, trustworthy, and interpretable. These attributes contribute to the practical utility and reliability of machine learning models.

Linear Regression Basics

1. Machine Learning in a Nutshell

Definition: Machine Learning (ML) is a field that grew out of Artificial Intelligence (AI). Its core purpose is to “Optimize a performance criterion using example data or past experience, aiming to generalize to unseen data.”

Key Components of ML: The process of machine learning can be broken down into several fundamental elements:

Task: What needs to be predicted or achieved (e.g., y in regression).
Representation: How the data and model are structured (e.g., x, f()).
Score Function: A measure of how well the model is performing (e.g., L()).
Search/Optimization: The method used to find the best model parameters (e.g., argmin()).
Models, Parameters: The specific mathematical structure and adjustable values of the model (e.g., w, b for regression coefficients).
Data: The input used for training and testing (e.g., X: Tabular).

2. Regression Fundamentals

Definition: Regression specifically deals with predicting a continuous target variable y.

Linear Regression: In linear regression, the target Y is modeled as a “Weighted linear sum of Xs.”

For a single variable x, the function is f(x) = wx + b. Here, w is the weight (slope) and b is the bias (intercept). A slope of 2 means “every 1-unit change in X yields a 2-unit change in Y.”
For multiple features x₁, x₂, …, the model becomes ŷ = f(x) = θ₀ + θ₁x₁ + θ₂x₂. Here, θ₀ is the intercept, and θ₁, θ₂ are coefficients for x₁, x₂ respectively.

Supervised Learning: Linear Regression is a type of supervised learning. This means the training dataset consists of “input-output pairs” (X, Y), where Y (the target) is known for each input X.

3. Data Representation for Regression

Tabular Dataset Structure

Data/points/instances/examples/samples/records: Rows of the dataset.
Features/attributes/dimensions/independent variables/covariates/predictors/regressors: Columns of the dataset, excluding the target.
Target/outcome/response/label/dependent variable: A special column (usually the last) that is continuous-valued and to be predicted.

Concise Vector/Matrix Notation

To simplify computations, especially with multiple features:

Each data sample x is represented as a column vector.
A pseudo feature x₀=1 (the intercept term) is added to the feature vector, making it x^T = [(x₀=1), x₁, x₂, …, xₚ].
The parameter vector θ is also a column vector [θ₀, θ₁, …, θₚ]^T.
The linear regression function can then be concisely written as ŷ = f(x) = x^T θ = θ^T x.

Training Set in Matrix Form: The entire training set with n samples can be represented as a matrix X (where each row is a feature vector x_i^T) and a target vector y_train = [y₁, y₂, …, yₙ]^T.

4. Training and Optimization

Training Goal: To find the optimal parameters (w, b or θ) by minimizing a Loss/Cost function L(). This function quantifies the “difference between y and f(x) on available examples in the training set.”

Sum of Squared Error (SSE) / Mean Squared Error (MSE)

The most common loss function for linear regression:

J(θ) = 1/2 × Σ (f(xᵢ) - yᵢ)²
In matrix form, J(θ) = 1/2 × (Xθ - y)^T (Xθ - y)

Finding Optimal Parameters (θ)

Closed-form Solution (Normal Equation)

Since J(θ) is a convex function, its minimum can be found by taking the derivative (gradient) with respect to θ and setting it to zero (∇θJ(θ) = 0). This leads to the “normal equations”: X^T Xθ = X^T y.

The solution for θ is then θ* = (X^T X)^(-1) X^T y.

Convexity of J()

The loss function J(θ) is convex. “Intuitively, a convex function (1D case) has a single point at which the derivative goes to zero, and this point is a minimum.” For multivariate functions, if the Hessian matrix is positive definite (which X^T X is, given certain conditions), the function is convex and the point where the gradient is zero is a minimum.

Gradient Descent (GD) / Stochastic Gradient Descent (SGD)

These are iterative optimization algorithms used when a closed-form solution is not feasible or computationally too expensive. (Mentioned in outline, but not detailed in excerpts).

5. Evaluating Model Performance

Generalization: A primary aim is for the “learned model [to] Generalize Well” to unseen data. It should also be “Computational Scalable and Efficient” and “Robust / Trustworthy / Interpretable.”

Testing Mode: After training, the model is tested on “Production Data” (also called test data), which consists of new inputs X? for which the output f(X?) is predicted.

Metrics for Regression

Test MSE Error: A common metric reported for evaluating regression models, calculated on the test set: J_test = 1/m × Σ (x_i^T θ* - y_i)².
Other metrics exist (referencing scikit-learn documentation).

6. Computational Considerations

Why Vector/Matrix Form? It allows for “Concise Notation,” “Closed form solution” for training, and efficient computation for testing (ŷ_test = X_test θ*).

Computational Cost (Naive Matrix Multiply)

A naive matrix multiplication C = C + A×B has a complexity of O(n³) flops.

Optimizing Computational Cost

Data Parallelization

Utilizing CPU SIMD (Single Instruction, Multiple Data), multithreading, and GPU parallelization. SIMD allows “one operation [to produce] multiple results.”

Memory Hierarchy/Locality

Exploiting spatial (accessing nearby data) and temporal (reusing data) locality to improve average memory access speed through caches.

Better Algorithms

Examples include Strassen’s Matrix Multiply, which reduces complexity to O(n^2.81) (asymptotically faster than naive O(n³)).

BLAS (Basic Linear Algebra Subroutines)

Industry-standard interfaces (e.g., numpy wraps BLAS) providing optimized implementations for vector (BLAS1), matrix-vector (BLAS2), and matrix-matrix (BLAS3) operations. BLAS3 is generally preferred for its potential for higher computational intensity.

7. Advanced/Optional Concepts

Probabilistic Interpretation: Linear regression can be viewed probabilistically by assuming the error term ε (representing unmodeled effects or random noise) follows a Gaussian distribution N(0, σ). This leads to the concept of Maximum Likelihood Estimation for finding θ.

Gram Matrix: X^T X is referred to as a Gram Matrix and is always positive semi-definite. If X has full rank, X^T X is positive definite and invertible, ensuring a unique minimum for J(θ).

Regularization: Used when X has less than full column rank to handle non-invertible X^T X (not detailed in these excerpts but mentioned as a solution).

Scalability to Big Data: While polynomial time algorithms are good, O(n) can still be too slow for truly massive datasets, leading to research in “low rank, sparse, hardware, sampling, randomized…” solutions.

GD and SGD for LR

Lecture: S1-LROptimization
Version: current
Notes to Read: SGD Jupyter notebook + numpy linalg +
Extra Notes to Read: more SGD + ELS Ch3.2
Video: M1 + M2

2Regression Optimization

Notebook Resources: notebook/L4_stochastic_gradient_descent.ipynb

Optimization for Linear Regression: A Study Guide

I. Course Overview & Core Concepts

This section of the course focuses on optimization techniques for Linear Regression (LR), specifically for tabular data. It revisits fundamental machine learning concepts within this context.

A. Course Sectioning Context

Section 1: Basic Supervised Regression on Tabular Data
Section 2: Basic Deep Learning on 2D Imaging Data
Section 3: Advanced Supervised Learning on Tabular Data
Section 4: Generative and Deep Learning on 1D Sequence Text Data
Section 5: Unsupervised Learning (Mostly on Tabular Data)

B. Regression Fundamentals

Regression: Deals with predicting a continuous output variable (y).
Linear Regression (LR): Models the output (Y) as a weighted linear sum of input features (Xs).
Sum of Squared Error (Least Squares): The standard cost function used to measure the difference between predicted and actual values in LR.
Metrics, Implementation, Regression Coefficients (w, b): Key aspects to consider in LR.

C. Machine Learning Task Components

Task: Predicting ‘y’.
Representation: Input features ‘x’ and the function ‘f()’.
Score Function: The loss function ‘L()’.
Search/Optimization: The process of finding ‘argmin()’ for the loss function.
Models, Parameters: The chosen model and its adjustable parameters.
Data: ‘X’, typically tabular.
Prediction: ŷ = f(x) = θ^T x (where θ represents the parameters/weights).

D. Optimization Principles

Objective Function: The function to be minimized or maximized (e.g., the cost function).
Variables: The parameters being optimized (e.g., weights ‘w’ or ‘θ’).
Constraints: Conditions that the variables must satisfy.
Goal: Find variable values that optimize the objective function while satisfying constraints.

II. Methods for Optimizing Linear Regression

The lecture covers two main approaches to optimizing Linear Regression: Direct Optimization (Normal Equation) and Iterative Optimization (Gradient Descent variants).

A. Method 1: Directly Optimize (Normal Equation)

Concept: For convex functions (like the Sum of Squared Error in LR), the minimum occurs where the derivative (slope) is zero.

Process:

Write the cost function J(θ) in matrix form.
Take the gradient of J(θ) with respect to θ and set it to zero.
Solve the resulting equation for θ.

Formula: θ = (X^T X)^(-1) X^T y

Pros: A single-shot algorithm, easiest to implement.

Cons:

Requires computing the pseudo-inverse (X^T X)^(-1), which can be computationally expensive (especially for large datasets).
Prone to numerical issues if the matrix is singular or ill-conditioned.

Direct vs. Iterative: A direct method, meaning it finds the solution in a finite number of steps.

B. Method 2: Iteratively Optimize via Gradient Descent

Concept: An iterative algorithm that moves towards the minimum of a function by taking steps proportional to the negative of the gradient at the current point.

Gradient Definition: A vector whose entries contain the partial derivatives of the function with respect to each variable. It points in the direction of the greatest rate of increase of the function.

General Algorithm:

Initialize k=0, choose an initial x₀ (randomly or by prior).
While k < k_max (or until a stopping criterion is met): x_k = x_{k-1} - α∇x F(x{k-1})

Key Factors Influencing GD:

Learning Rate (α): Determines the step size. A critical parameter.
Objective Function: Its shape (e.g., convexity) affects convergence.
Starting Point: Can influence which local minimum is found (though not an issue for LR’s convex cost function).
Stop Criterion: When to stop the iterative process.

Pros:

Works on any objective function as long as its gradient can be evaluated.
Useful for minimizing complex functions.
Guaranteed convergence for convex functions like LR’s cost function.

Cons:

Can be slow converging (especially batch GD).
Can get stuck in local minima for non-convex functions (not an issue for LR).
Learning rate selection is crucial.

C. Variants of Gradient Descent

1. Batch Gradient Descent (GD)

Concept: Calculates the gradient of the cost function with respect to the parameters for all training examples in each iteration.

Update Rule: θ_{t+1} = θ_t - α∇_θ J(θ_t) = θ_t + αX^T (y - Xθ_t)

Characteristics:

Computes the exact gradient in each step.
Smooth convergence path.
Can be computationally expensive for very large datasets, as it needs to process the entire dataset per update.

2. Stochastic Gradient Descent (SGD)

Concept: Calculates the gradient and updates the parameters for each individual training example (or a very small subset) at a time.

Update Rule (for a single training point i): θ_{t+1} = θ_t + α(y_i - x_i^T θ_t)x_i

Characteristics:

“Noisier” updates due to processing one example at a time.
Faster per-step computation compared to batch GD.
Can escape shallow local minima due to the noise (advantage for non-convex problems).
Useful for massive datasets that don’t fit in memory or for online/streaming data.

Epoch: One complete pass over the entire training data when using SGD for offline training.

Pros:

Low per-step cost.
Can be faster converging to an approximate solution.
Less prone to getting stuck in local optima (for non-convex cases).
Efficient for large datasets and online learning.

Cons:

Convergence to the exact optimum is not always guaranteed (it tends to oscillate around the minimum).
More variation in the objective function during training.

3. Mini-Batch Gradient Descent

Concept: A compromise between Batch GD and SGD. Computes the gradient on a small “mini-batch” of samples (e.g., B=32, 64, 128).

Update Rule: θ_{t+1} = θ_t + (1/B)αX_B^T (y_B - X_B θ_t) (where B refers to the mini-batch size and X_B, y_B are the mini-batch data)

Characteristics:

Combines benefits of both: more stable updates than SGD, but computationally more efficient than Batch GD (especially with GPU acceleration).
Commonly used in practice.
Special Cases: B=1 is standard SGD, B=N (total data size) is standard Batch GD.

Pros:

Much faster computationally than single-point SGD (better use of computer architecture like GPUs).
Low per-step cost.
Fast convergence.

D. Stopping Criteria for Gradient Descent

Predetermined Maximum Number of Iterations: A common and simple stopping rule.
Threshold for Improvement: Stop when the improvement in the objective function (e.g., MSE) drops below a certain threshold.
Monitoring Training Error: Plotting Train-MSE per sample to observe its decrease over epochs is helpful for understanding behavior.

Quiz: Optimization for Linear Regression

What is the primary goal of optimization in the context of Linear Regression?
Briefly explain the main idea behind the “Normal Equation” method for Linear Regression.
What are two significant disadvantages of using the Normal Equation, especially for large datasets?
Define “Gradient Descent” as an iterative optimization algorithm.
List three critical factors that influence the behavior and performance of Gradient Descent.
How does Stochastic Gradient Descent (SGD) differ from Batch Gradient Descent (GD) in terms of gradient calculation and parameter updates?
Provide two advantages of using Stochastic Gradient Descent (SGD) compared to Batch Gradient Descent for very large datasets.
What is “Mini-Batch Gradient Descent,” and why is it often preferred in practice?
Explain the concept of an “epoch” in the context of Stochastic Gradient Descent.
Name two common stopping criteria for iterative optimization algorithms like Gradient Descent.

Quiz Answer Key

Primary Goal of Optimization: The primary goal of optimization in Linear Regression is to find the set of model parameters (weights ‘w’ or ‘θ’) that minimize the chosen cost function, typically the Sum of Squared Error (Least Squares). This minimization results in the best-fit line or hyperplane for the given data.
Normal Equation Method: The Normal Equation method directly solves for the optimal Linear Regression parameters by setting the gradient of the cost function to zero. It analytically calculates the parameters in a single step using matrix operations, without needing an iterative process.
Disadvantages of Normal Equation: Two significant disadvantages of the Normal Equation are its high computational cost due to the requirement of computing a matrix inverse, particularly for large numbers of features, and its susceptibility to numerical issues if the (X^T X) matrix is singular or ill-conditioned.
Gradient Descent Definition: Gradient Descent is an iterative optimization algorithm that minimizes an objective function by repeatedly moving in the direction opposite to the gradient of the function. Each step adjusts the parameters by a small amount proportional to the negative of the gradient, gradually approaching a local (or global) minimum.
Critical Factors for GD: Three critical factors influencing Gradient Descent are the learning rate (α), which determines the step size; the shape of the objective function, which dictates the gradient landscape; and the starting point, which can affect the path to the minimum or which local minimum is found.
SGD vs Batch GD: Stochastic Gradient Descent (SGD) updates parameters using the gradient calculated from a single training example at a time, leading to noisy but frequent updates. In contrast, Batch Gradient Descent (GD) calculates the gradient using all training examples before making a single parameter update, resulting in smoother but less frequent updates.
Advantages of SGD: Two advantages of SGD for very large datasets are its low per-step computational cost, as it processes only one sample at a time, making it suitable for data that doesn’t fit in memory, and its ability to converge faster to an approximate solution.
Mini-Batch Gradient Descent: Mini-Batch Gradient Descent is an optimization technique that computes the gradient and updates parameters using a small, randomly selected subset (mini-batch) of the training data. It is often preferred because it balances the stability of Batch GD with the computational efficiency of SGD, leading to faster training and better utilization of hardware like GPUs.
Epoch Concept: In the context of Stochastic Gradient Descent, an “epoch” refers to one complete pass through the entire training dataset. If SGD is used for offline training, it repeatedly cycles through all samples in the dataset, and each such pass constitutes an epoch.
Stopping Criteria: Two common stopping criteria for Gradient Descent algorithms include a predetermined maximum number of iterations (epochs) and stopping when the improvement in the objective function (e.g., Train MSE) between consecutive iterations drops below a specified small threshold.

LR with basis

Lecture: S1-LRSelectExtend
Version: current
More advanced to Read: S1-nonparametric
Notes to Read: L5_Poly_Regression.ipynb
Extra Notes to Read: NonLinearR + ELS Ch5
Video: M1 + M2

2Regression Nonlinear ModelSelection Local

Notebook Resources: notebook/L5-Poly-Regression.ipynb
Notebook Resources: notebook/L5-RBF-regressionLR.ipynb

Understanding Linear Regression with Basis Functions Expansion

Study Guide

This study guide is designed to review your understanding of Linear Regression with Basis Functions Expansion, as presented in the provided lecture material.

I. Core Concepts of Linear Regression

Definition of Regression

What is the primary characteristic of the y variable in a regression task?

Goal of Linear Regression with Basis Expansion

How does Linear Regression (LR) achieve the ability to model non-linear relationships using basis functions?

Key Components of a Machine Learning Task

Identify and briefly describe the four fundamental components mentioned for any machine learning task:

Task: What needs to be predicted
Representation: How data and functions are structured
Score Function: Metric to evaluate model performance
Search/Optimization: Algorithms to find optimal parameters

Parameters vs. Models

Differentiate between “models” and “parameters” in the context of machine learning.

Optimization Methods

List the three primary methods mentioned for optimizing the Sum of Squared Error:

Normal Equation
Gradient Descent (GD)
Stochastic Gradient Descent (SGD)

“KEY” Insight

Explain the fundamental implication of having predefined basis functions on the nature of the learning problem.

II. Basis Functions Expansion

Purpose of Basis Functions

Why are basis functions introduced in linear regression?

Types of Basis Functions

Name at least three different types of basis functions mentioned in the lecture:

Polynomial basis functions
Radial Basis Functions (RBFs)
Trigonometric basis functions

Polynomial Regression

What is the specific form of the basis function φ(x) for polynomial regression up to degree two (d=2)?

For degree 2: φ(x) = [1, x, x²]ᵀ

How is the prediction ŷ formulated using polynomial basis functions?

ŷ = θᵀφ(x) = θ₀ + θ₁x + θ₂x²

Radial Basis Functions (RBFs)

Define what an RBF is in terms of its dependency. RBFs are functions whose output depends only on the distance from a central point.

Describe the characteristic behavior of a Gaussian RBF as distance from its center increases. The output decreases exponentially as the distance from the center increases, creating a bell-shaped curve.

What are the “hyperparameters” that users need to define for RBF basis functions?

Centers (μⱼ): The central points where RBFs are located
Widths (λⱼ): Parameters controlling the spread of the RBF

Provide the general mathematical form of a Gaussian RBF, specifically φⱼ(x).

φⱼ(x) = exp(-λⱼ

x - μⱼ

²)

III. Parametric vs. Non-Parametric Models

Parametric Learning Algorithm

Define a parametric learning algorithm, using (unweighted) linear regression as an example. What is a key characteristic regarding the need for training data after fitting?

Parametric algorithms have a fixed number of parameters that are learned from data. Once trained, the training data can be discarded as predictions are made using only the learned parameters.

Non-Parametric Learning Algorithm

Define a non-parametric learning algorithm, providing examples like K-Nearest Neighbor (KNN) or Locally Weighted Linear Regression. How does the “amount of knowledge” required to represent the hypothesis differ from parametric models?

Non-parametric algorithms require the entire training dataset to be kept for making predictions. The amount of knowledge grows with the size of the training set.

IV. Mathematical Notations and Formulas

General Prediction Equation

Write down the general equation for ŷ (predicted output) using basis functions φ(x) and weights θ.

ŷ = θᵀφ(x)

Normal Equation

Provide the formula for calculating the optimal weight vector θ* using the Normal Equation, given φ(x) and y.

θ* = (ΦᵀΦ)⁻¹Φᵀy

where Φ is the design matrix with rows φ(xᵢ)ᵀ

Quiz: Linear Regression with Basis Functions

Instructions: Answer each question in 2-3 sentences.

How does Linear Regression with basis functions expansion allow for the modeling of non-linear relationships, despite being fundamentally a “linear” model?
What is the significance of the “KEY” insight mentioned in the lecture regarding predefined basis functions?
Describe the primary goal of the “Search/Optimization” component in a machine learning task, as it relates to the score function.
If you are performing polynomial regression with a degree up to three, what would be the form of the basis function vector φ(x)?
Explain why Radial Basis Functions (RBFs) are often described as “bell-shaped” or having an output that decreases with distance from a center.
What role do the “centers” and “widths” play in defining Radial Basis Functions?
Differentiate between the “Task” and “Representation” components of a machine learning problem.
Briefly explain the main difference in how parametric and non-parametric learning algorithms utilize training data after the initial fitting phase.
Which optimization methods, other than the Normal Equation, are mentioned for finding the optimal regression coefficients?
In the context of the general prediction equation ŷ = θᵀφ(x), what do θ and φ(x) represent, respectively?

Quiz Answer Key

Non-linear Modeling: Linear Regression with basis functions models non-linear relationships by transforming the input features x into a higher-dimensional space using non-linear basis functions φ(x). The relationship between the transformed features φ(x) and the output y remains linear, allowing linear regression techniques to be applied.
KEY Insight: The “KEY” insight is that even when non-linear basis functions are used, the problem of learning the parameters from the data is still considered Linear Regression. This is because the model is linear with respect to the parameters θ, even if it’s non-linear with respect to the original input x.
Search/Optimization Goal: The “Search/Optimization” component’s primary goal is to find the set of model parameters that minimize (or maximize, depending on the objective) the score function. This process iteratively adjusts the parameters to achieve the best possible fit to the training data according to the defined score.
Polynomial Basis Vector: For polynomial regression with a degree up to three, the basis function vector φ(x) would be [1, x, x², x³]ᵀ. This expands the single input feature x into a set of features including its powers.
RBF Bell Shape: RBFs are “bell-shaped” because their output typically reaches a maximum at a central point and then decreases symmetrically as the input moves further away from that center. For Gaussian RBFs, this decay is exponential, creating a characteristic bell-like curve.
RBF Centers and Widths: In RBFs, “centers” (μⱼ) define the point in the input space where the basis function’s influence is maximal, while “widths” (λⱼ) determine how rapidly the function’s output decreases as the input moves away from that center, thus controlling the spread or local influence of the RBF.
Task vs Representation: “Task” defines what needs to be predicted (e.g., y is continuous for regression). “Representation” refers to how the input data x and its potential transformations f() are structured for the model, essentially defining the features used for prediction.
Parametric vs Non-parametric Data Usage: Parametric algorithms, once trained, can discard the training data as predictions are made solely using the learned, fixed parameters. Non-parametric algorithms, in contrast, require the entire training dataset to be kept available for making future predictions, as their “knowledge” grows with the data size.
Other Optimization Methods: Besides the Normal Equation, the lecture also mentions Gradient Descent (GD) and Stochastic Gradient Descent (SGD) as optimization methods for finding optimal regression coefficients by minimizing the Sum of Squared Error.
Prediction Equation Components: In ŷ = θᵀφ(x), θ represents the vector of regression coefficients (weights) that the model learns from the data. φ(x) represents the basis function expansion of the input x, which transforms the original features into a potentially higher-dimensional space.

Workflow for model selection

Lecture: S1-ModelSelect
Version: current
Notes to Read: hyperpara select notebook + flow API
Extra Notes to Read: ELS Ch5
Video: M1

2Regression Nonlinear ModelSelection Local

Notebook Resources: notebook/L6-Hyperparameters-and-Model-Validation.ipynb

Model Selection and Validation Study Guide

Quiz: Short Answer Questions

What is the primary goal of model selection in machine learning? The primary goal is to choose the right model type and its hyperparameters to ensure good generalization, meaning the model can accurately predict future data drawn from the same distribution, rather than just fitting the training data well. This involves balancing between underfitting and overfitting.
Explain the difference between overfitting and underfitting. Underfitting occurs when a model is too simple to capture the underlying patterns in the data, leading to high training and test errors (high bias). Overfitting happens when a model is too complex and fits noise in the training data, resulting in low training error but high test error (high variance).
Why is simply choosing the model with the best fit to the training data not a reliable strategy for model selection? Choosing the model with the best fit to the training data is unreliable because it often leads to overfitting. An overfit model memorizes the training data, including noise, and performs poorly on new, unseen data, failing to generalize effectively.
Describe the process of the “Train-Validation” (Hold-out) method for model selection. The train-validation method involves splitting the labeled data into two sets: a training set (e.g., 70%) and a validation set (e.g., 30%). The model is trained on the training set, and its performance is then estimated using the validation set to evaluate its future performance.
What are the main advantages and disadvantages of the “Train-Validation” method? Advantages include its simplicity and ease of implementation. Disadvantages are that it “wastes data” (the validation set is not used for training the final model) and can lead to a high variance in performance estimation if the validation set is small or unrepresentative.
What problem does K-Fold Cross-Validation aim to solve compared to the simple train-validation method? K-Fold Cross-Validation addresses the issue of data wastage and high variance in performance estimation inherent in the train-validation method, especially when data is scarce. It ensures that each data point is used for both training and validation, providing a more robust estimate.
Briefly explain how K-Fold Cross-Validation works. In K-Fold Cross-Validation, the dataset is divided into K equal “folds.” The model is trained K times; in each iteration, one fold is used as the validation set, and the remaining K-1 folds are used as the training set. The scores from each validation step are collected, and their mean is typically reported as the overall performance estimate.
What is Leave-One-Out Cross-Validation (LOOCV), and how does it relate to K-Fold Cross-Validation? Leave-One-Out Cross-Validation (LOOCV) is a specific type of K-Fold Cross-Validation where K is equal to n, the number of data points in the dataset. In LOOCV, each data point individually serves as the validation set, and the model is trained on the remaining n-1 points.
When analyzing a validation curve (score vs. model complexity), what typically happens to the training and validation scores as model complexity increases, and why? As model complexity increases, the training score generally increases (or error decreases) because a more complex model can fit the training data better. The validation score will initially increase alongside the training score but will eventually start to decrease after a certain point due to overfitting, as the model begins to learn noise from the training data.
Define Bias and Variance in the context of machine learning models. Bias refers to the error introduced by approximating a real-world problem, which may be complex, by a simplified model. It represents how much the average model over all training sets differs from the true model. Variance refers to how much the models estimated from different training sets differ from each other, reflecting the model’s sensitivity to small fluctuations in the training data.

Essay Format Questions

Compare and contrast the Train-Validation (Hold-out), K-Fold Cross-Validation, and Leave-One-Out Cross-Validation methods for model selection. Discuss their respective advantages, disadvantages, and typical use cases, particularly considering data availability.
Elaborate on the concepts of overfitting and underfitting. Explain how these phenomena manifest in model performance metrics (training error, test error) and discuss various strategies for mitigating each, referencing the bias-variance tradeoff.
Discuss the “generalization” goal in machine learning. Why is it more important than achieving a perfect fit to training data, and what techniques are employed to ensure a model generalizes well to new, unseen data?
Explain the significance of the bias-variance tradeoff in model selection. Provide examples of how model complexity influences bias and variance, and describe how one might navigate this tradeoff using techniques discussed in the source material.
Imagine you are tasked with selecting the optimal polynomial degree for a regression model on a limited dataset. Describe a step-by-step process you would follow, incorporating at least two different validation techniques, to make an informed decision while addressing potential pitfalls.

Glossary of Key Terms

Model Selection: The process of choosing the right model type and its hyperparameters to achieve the best performance on unseen data.
Hyperparameter: A parameter whose value is used to control the learning process, which is set before the learning process begins (e.g., polynomial degree, number of folds in cross-validation).
Overfitting: A phenomenon where a model learns the training data too well, including its noise and irrelevant details, leading to excellent performance on training data but poor performance on new, unseen data. Characterized by low bias and high variance.
Underfitting: A phenomenon where a model is too simple to capture the underlying patterns in the training data, resulting in poor performance on both training and new data. Characterized by high bias and low variance.
Generalization: The ability of a machine learning model to perform accurately on new, unseen data, reflecting its capacity to “explain,” “predict,” or “model” new examples from the same distribution as the training data.
Training Set: The portion of the labeled dataset used to train the machine learning model.
Validation Set (Hold-out Set): A separate portion of the labeled dataset used to estimate the model’s performance during training and hyperparameter tuning, helping to avoid overfitting to the test set.
Test Set: A completely independent portion of the labeled dataset, kept separate from both training and validation, used to provide an unbiased evaluation of the final chosen model’s performance on new data.
Mean Squared Error (MSE): A common metric for regression models, calculated as the average of the squares of the differences between predicted and actual values. Lower MSE indicates better model performance.
Variance (of performance estimator): Refers to how much an estimate of performance might change if a different validation set or data split were used. High variance implies an unreliable estimate.
K-Fold Cross-Validation: A resampling procedure used to evaluate machine learning models on a limited data sample. The dataset is divided into K folds, and the model is trained and validated K times, with each fold serving as the validation set once.
Leave-One-Out Cross-Validation (LOOCV): A special case of K-Fold Cross-Validation where K equals the number of data points (n). Each data point is used as a validation set, with the remaining n-1 points used for training.
Bias: The error due to inaccurate assumptions or simplifications made by the model. A high bias model typically underfits the data.
Variance: The error due to the model’s sensitivity to small fluctuations in the training data. A high variance model typically overfits the data.
Bias-Variance Tradeoff: The inherent conflict in machine learning where reducing one type of error (bias or variance) tends to increase the other. The goal is to find a balance that minimizes overall prediction error.
Learning Curve: A plot showing the performance of a learning model (e.g., score or error) on both the training set and validation set as a function of the number of training examples or model complexity.
Validation Curve: A plot showing the performance of a learning model (e.g., score or error) on both the training set and validation set as a function of a model hyperparameter (e.g., model complexity).

Linear Prediction with Regularization

Lecture: S1-lrRegularized
Version: current
Extra Notes to Read: More Ridge
Video: M1 + M2 + Extra M3

2Regression Optimization Regularization ModelSelection

Notebook Resources: notebook/L7_regularizedRegression_06_Linear_Regression.ipynb

Study Guide: Regularized Linear Regression

This guide provides a comprehensive review of the concepts surrounding regularized multivariate linear regression, based on the provided source material. It includes a quiz to test understanding, essay questions for deeper exploration, and a glossary of key terminology.

Quiz

Answer the following questions in 2-3 sentences each, based on the provided source material.

Why is standard linear regression often problematic for datasets where the number of features (p) is greater than the number of samples (n)?
What is regularization in the context of machine learning, and what problem does it primarily address?
Describe the penalty term used in Ridge (L2) regression and explain how it affects the model’s coefficients.
Describe the penalty term used in LASSO (L1) regression and explain its key advantage over Ridge regression.
What is the primary purpose of the regularization parameter, λ (lambda), and how is its optimal value typically determined?
Explain the concept of a “regularization path” and what it visualizes.
What is Elastic Net regularization and why was it developed?
From a geometric perspective, why does LASSO (L1) regularization tend to produce sparse models (i.e., set some coefficients to exactly zero)?
What are the three main goals for a trained machine learning model mentioned in the lecture?
What is the “grouping effect” in the context of Elastic Net, and when is it particularly useful?

Quiz Answer Key

When p > n, the matrix X^T X in the normal equation for linear regression has less than full column rank and is therefore not invertible. This prevents the direct calculation of a unique solution for the regression coefficients using Ordinary Least Squares.
Regularization is an additional criterion added to the loss function to ensure the model does not overfit the training data. It primarily addresses the issue of overfitting by keeping the model’s parameters more “normal” or regular, effectively adding a bias that prefers certain types of weights (e.g., smaller ones) over others.
Ridge regression adds an L2 penalty, which is proportional to the sum of the squared coefficient weights (λ Σ βj²). This penalty shrinks the coefficients towards zero, with the effect becoming stronger as λ increases. However, it does not set the coefficients exactly to zero.

**LASSO regression uses an L1 penalty, proportional to the sum of the absolute values of the weights (λ Σ

βj

).** Its key advantage is that it performs both shrinkage and variable selection simultaneously, as the penalty can force some coefficients to be exactly zero, creating a more interpretable and sparse model.

The regularization parameter λ controls the strength of the penalty; λ=0 corresponds to standard least squares, while a large λ pushes coefficients towards zero. The optimal value is chosen to ensure good generalization ability on new data and is typically determined using k-fold cross-validation.
A regularization path is a plot that visualizes how the model’s regression coefficients (βj) change in value as the regularization parameter λ is varied. It illustrates the effect of the penalty on each feature’s weight, from the unregularized solution to a highly constrained one.
Elastic Net is a hybrid regularization method that combines both L1 (LASSO) and L2 (Ridge) penalties. It was developed to address disadvantages of LASSO, such as its limitation of selecting at most n variables in a p>n scenario and its tendency to arbitrarily select only one variable from a group of highly correlated ones.
Geometrically, the L1 constraint region is a diamond shape with sharp corners that lie on the axes. The elliptical contour lines of the least squares error function are more likely to first make contact with this region at one of its corners. This contact point corresponds to a solution where coefficients for other axes are zero, thus producing sparsity.
The three main goals for a trained model are to (1) generalize well to new data, (2) be computationally scalable and efficient, and (3) be trustworthy, meaning it is robust and interpretable.
The “grouping effect” is a property of Elastic Net that encourages the model to select or deselect groups of highly correlated variables together. This is useful when features are naturally grouped, as LASSO by itself tends to arbitrarily select only one variable from such a group.

Essay Questions

The following questions are designed for longer, essay-style responses to encourage a deeper synthesis of the material. No answers are provided.

Compare and contrast Ridge, LASSO, and Elastic Net regularization. Discuss the mathematical formulation of their penalty terms, their impact on model coefficients, and the specific scenarios where one might be preferred over the others.
Explain the problem of overfitting in multivariate linear regression, particularly in the context of “large p, small n” datasets. How does regularization serve as a solution to this problem, and what is the trade-off involved?
Describe the role and importance of the hyperparameter λ in regularized regression. Detail the process of using k-fold cross-validation to select an optimal λ and explain why this process is crucial for building a model that generalizes well.
Using the geometric interpretation of regularized regression, explain why L1 regularization (LASSO) leads to sparse solutions while L2 regularization (Ridge) does not. Use the concepts of constraint regions and objective function contour lines in your explanation.
Discuss the practical application of regularized regression in a real-world problem like text regression for predicting movie revenues. Explain how model interpretability, a key feature of LASSO and Elastic Net, is valuable in such a domain.

Glossary of Key Terms

Term	Definition
Regularization	An additional criterion added to a model’s loss function to prevent overfitting by keeping the parameters “regular” or small.
Overfitting	A state where a model is too complex and learns “noise” from the training data, resulting in low training error but high error on new, unseen data.
Underfitting	A state where a model is too simple, resulting in high errors on both the training and test datasets.
Ridge Regression (L2)	A regularized linear regression method that adds a penalty proportional to the sum of the squared coefficient weights (λ Σ βj²). It performs parameter shrinkage.
LASSO (L1)	Stands for “Least Absolute Shrinkage and Selection Operator.” A regularized regression method that adds a penalty proportional to the sum of the absolute values of the weights (λ Σ \|βj\|). It performs both shrinkage and variable selection.
Elastic Net	A regularized regression method that combines the L1 and L2 penalties. It was developed to handle highly correlated variables and p>n scenarios more effectively than LASSO alone.
Normal Equation	The analytical solution for Ordinary Least Squares regression: β = (XᵀX)⁻¹ Xᵀy. It is not solvable if XᵀX is non-invertible.
p > n problem	A common scenario in datasets, such as in gene expression or text analysis, where the number of features (p) is greater than the number of data samples (n).
λ (Lambda)	The regularization parameter, also known as a hyperparameter, that controls the strength of the penalty term in regularized models. Its value is often chosen via cross-validation.
K-fold Cross Validation	A technique used to evaluate a model’s ability to generalize to new data and to select optimal hyperparameters like λ.
Regularization Path	A plot that shows how the values of the estimated coefficients (βj) change as the regularization parameter λ is varied from zero to a large value.
Shrinkage	The effect of regularization where coefficient values are reduced in magnitude, pulling them closer to zero. This helps to reduce model complexity.
Sparsity	A property of a model where many of its coefficients are exactly zero. This is a primary feature of LASSO, which effectively performs feature selection.
Grouping Effect	A property of Elastic Net where it tends to select groups of highly correlated predictor variables together, rather than arbitrarily choosing one from the group as LASSO might.
Pearson Correlation (r)	A statistical measure of the linear correlation between two variables, yielding a value between -1 (total negative correlation) and +1 (total positive correlation).

KNN and Theory

Lecture: S1-KNN
Version: current
More advanced to Read: S1-KNN-extra
Video: M1 + M2

3Classification 5Theory Local ModelSelection

Notebook Resources: notebook/L8_Knearest.ipynb

K-Nearest Neighbor (KNN) Study Guide

Short-Answer Quiz

Answer each question in 2-3 sentences.

Describe the three main categories of supervised classifiers and provide an example for each.
Explain the concept of “lazy learning” as it applies to the K-Nearest Neighbor algorithm.
What are the three essential components required for a K-Nearest Neighbor (Instance-based) method?
Contrast the process for making a prediction in KNN for a regression task versus a classification task.
What happens during the “Training Mode” of a naïve K-Nearest Neighbor algorithm?
Outline the three steps involved in the “Testing Mode” when using KNN to classify an unknown sample.
Why is feature scaling important for K-Nearest Neighbor algorithms? Provide a brief example.
Discuss the bias-variance tradeoff when selecting the value of k. What is the effect of a k that is too small or too large?
What is the “curse of dimensionality,” and how can it negatively impact the performance of a KNN model?
According to the Cover and Hart theorem (1967), what is the asymptotic relationship between the error rate of 1-nearest-neighbor classification and the Bayes rate?

Quiz Answer Key

The three major types of supervised classifiers are Discriminative, Generative, and Instance-based. Discriminative classifiers, like logistic regression or support vector machines, directly estimate a decision boundary. Generative classifiers, such as Naïve Bayes, build a statistical model of the data. Instance-based classifiers, like KNN, use the training observations directly without building an explicit model.
Instance-based learning is referred to as “lazy learning” because the decision of the output value is delayed until a new instance arrives for prediction. In KNN, no model is built during a training phase; instead, the original training instances themselves represent the knowledge, and computation is deferred until a prediction is requested.
The three essential components for a KNN method are: (1) the set of stored training samples, (2) a distance metric to compute the distance between samples (e.g., Euclidean distance), and (3) the value of k, which specifies the number of nearest neighbors to retrieve.
For a regression task, where the target variable is continuous, KNN makes a prediction by taking the mean value of the k nearest training examples. For a classification task, where the target is discrete, the prediction is determined by a majority vote among the class labels of the k nearest neighbors.
In the naïve version of the K-Nearest Neighbor algorithm, the training mode involves doing nothing other than storing the training samples. The algorithm does not build a model but simply retains the dataset, which will be used directly during the testing or deployment phase.
The three steps in the testing mode are: (1) Compute the distance from the unknown sample to all records in the training set. (2) Identify the k nearest neighbors based on these computed distances. (3) Use the class labels of these nearest neighbors to determine the class label of the unknown record, typically by taking a majority vote.
Feature scaling is important because attributes with larger ranges can dominate distance calculations, skewing the model’s performance. For example, if one feature is a person’s income (e.g., $10K to $1M) and another is height (e.g., 1.5m to 1.8m), the income attribute would have a much larger influence on the Euclidean distance calculation unless the features are scaled to a comparable range.
When selecting k, a small value can lead to a model that is sensitive to noise points, resulting in high variance and an unstable, though potentially accurate, local model. Conversely, a large k can cause the neighborhood to include points from other classes, leading to a model with high bias and an inaccurate estimation, though it will be more stable.
The “curse of dimensionality” refers to the problem where a nearest neighbor algorithm can be easily misled when working with high-dimensional data (i.e., data with many features). In a high-dimensional space, most points are far apart, and the concept of a “close” neighbor becomes less meaningful, especially if many attributes are irrelevant to the target function.
The Cover and Hart theorem states that asymptotically (with an infinite number of training samples), the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate. The Bayes rate is the error rate of an optimal classifier that knows the true data-generating model, representing the lowest achievable error.

Essay Questions

Discuss the core principles of instance-based learning. How does KNN exemplify these principles, and what are the primary advantages and disadvantages of this “lazy” approach compared to “eager” learning models like linear regression?
Explain the role of distance metrics in the KNN algorithm. Compare and contrast at least three different distance metrics mentioned in the source material (e.g., Euclidean, Manhattan, Mahalanobis) and discuss scenarios where one might be preferable over the others.
Elaborate on the different variations of the standard KNN algorithm, such as Distance-Weighted KNN and RBF kernel Weighted KNN. How do these methods attempt to improve upon the naïve majority-vote approach?
Describe the relationship between KNN, non-parametric density estimation, and the Bayes classifier. How can the principles of the Bayes classifier be used to provide a probabilistic interpretation for KNN?
While KNN is conceptually simple, its computational cost can be a significant drawback. Discuss the computational time cost of the naïve KNN search and explain how a data structure like a KD Tree can be used to optimize the search for nearest neighbors.

Glossary of Key Terms

Term	Definition
Asymptotic Analysis	The study of an algorithm’s properties (like error rate) as the amount of data approaches infinity. In KNN, it shows the error rate of 1-NN is less than twice the Bayes rate.
Bayes Classifier	A theoretically optimal classifier that minimizes the probability of classification error. It is achieved when density estimates converge to the true densities with an infinite number of samples.
Bayes Error	The lower bound of the probability of classification error, which is the error achieved by the Bayes classifier.
Curse of Dimensionality	A phenomenon where a nearest neighbor algorithm is easily misled when the input data has a high number of features (dimensions), making the concept of proximity less meaningful.
Discriminative Classifier	A type of supervised classifier that directly estimates a decision rule or boundary. Examples include support vector machines, logistic regression, and neural networks.
Distance Metric	A function used to compute the distance or similarity between two data samples. Examples include Euclidean, L1 norm (Manhattan), L∞ norm, and Mahalanobis distance.
Feature Scaling	The process of scaling attributes to prevent distance measures from being dominated by features with large ranges.
Features	The columns in a tabular dataset used for prediction, also known as attributes, dimensions, independent variables, covariates, predictors, or regressors.
Generative Classifier	A type of supervised classifier that builds a generative statistical model of the data. Examples include Bayesian networks and Naïve Bayes classifiers.
Instance-based Learning	A form of learning where the training instances themselves represent the knowledge. Predictions are made by searching for training instances that most closely resemble a new instance. It is also known as lazy learning.
K-Nearest Neighbor (KNN)	An instance-based algorithm for classification or regression. It predicts the output for a new data point based on the outputs of its k closest neighbors in the training data.
KD Tree	A data structure used to optimize the search for nearest neighbors (NN Search), reducing the computational time cost compared to a naïve search across all data points.
Lazy Learning	A learning method where the generalization of the training data is deferred until a query is made to the system, as opposed to eager learning, where a model is constructed explicitly. KNN is a form of lazy learning.
Local Smoothness	An underlying assumption in KNN that points that are close to each other in the feature space are likely to have similar target values.
Regression	A supervised learning task where the target or outcome variable is a continuous, numerical value.
Supervised Classifier	An algorithm that learns from a labeled dataset (where both features and outcomes are known) to classify new, unseen data.
Target	The special column in a tabular dataset that is to be predicted, also known as the outcome, response, label, or dependent variable.
Voronoi Diagram	A partitioning of a plane into regions based on the distance to points in a specific subset of the plane. It can be used to visualize the decision boundaries of a 1-nearest neighbor classifier.

Lasso and Elastic Net

Lecture: S1-lrRegularized
Version: current
Notes to Read: Elastic paper
Video: Extra M3

2Regression Optimization Regularization ModelSelection

Bias Variance Tradeoff

Lecture: S1-biasVariance
Version: current
More advanced to Read: S1-Tibshirani-modelbasics
Notes to Read: notebook validation and learning curves
Extra Notes to Read: Useful BiasVar + ESL Ch7
Video: M1 + M2

5Theory Local

Notebook Resources: notebook/L6-Hyperparameters-and-Model-Validation.ipynb
Notebook Resources: notebook/L9-LearningCurves.ipynb

Study Guide: The Bias-Variance Tradeoff

This guide is designed to review and reinforce key concepts related to the bias-variance tradeoff in machine learning. It includes a quiz with an answer key to test comprehension, a set of essay questions for deeper exploration, and a glossary of essential terms.

Part I: Short-Answer Quiz

Instructions: Answer the following ten questions in 2-3 sentences each, based on the provided source material.

What is the ultimate goal in machine learning, and why is minimizing training error not always sufficient to achieve it?
Explain the decomposition of Expected Prediction Error (EPE). What does each component represent?
Describe the characteristics of a model that is “underfitting.” How does this relate to bias and variance?
Describe the characteristics of a model that is “overfitting.” How does this relate to bias and variance?
Why are bias and variance not directly computable in a real-world machine learning problem?
Compare the properties of a linear classification model with a 1-Nearest Neighbor model in terms of bias and variance.
According to statistical decision theory, what is the best theoretical estimator for a regression problem under L2 (squared error) loss?
Identify two distinct methods for reducing high model variance (overfitting).
Identify two distinct methods for reducing high model bias (underfitting).
What is the difference between model selection and model assessment?

Part II: Quiz Answer Key

The ultimate goal is generalization, which is the model’s ability to perform well on new, unseen data. Minimizing training error is not always sufficient because a model can become too complex and fit irrelevant noise in the training data (e.g., 1-NN), leading to poor performance on test data.
Expected Prediction Error (EPE) decomposes into three parts: EPE(x) = noise² + bias² + variance. Noise (or Irreducible/Bayes error) is the unavoidable error inherent in the data. Bias is the error from incorrect assumptions made by the model. Variance is the error due to the model’s sensitivity to the randomness of the training samples.
An underfitting model is too “simple” to represent the relevant characteristics of the data. This condition is defined by high bias and low variance, resulting in both high training error and high test error.
An overfitting model is too “complex” and fits irrelevant characteristics or noise in the training data. This condition is defined by low bias and high variance, which leads to low training error but high test error.
Bias and variance cannot be computed directly because their calculation relies on knowing the true underlying distribution of the input vectors (x) and output variables (y), which is unknown in practice.
A linear classification model is a global model that is stable but can be inaccurate, exhibiting high bias and low variance. In contrast, a 1-Nearest Neighbor model is a local model that is accurate on training data but unstable, exhibiting low bias and high variance.

Under L2 loss, the best theoretical estimator to minimize EPE is the conditional mean (also called conditional expectation). This is expressed as f̂(x) = E(Y

X = x).

To reduce high variance, one can choose a simpler classifier, regularize the parameters, get more training data, or use a smaller set of features.
To reduce high bias, one can get additional features to provide more information or try a more complex learning model with more flexibility.
Model selection is the process of estimating the performance of different models to choose the best one. Model assessment is the subsequent process of taking the chosen model and estimating its prediction error on new, unseen data.

Part III: Essay Questions

Instructions: The following questions are designed to encourage a deeper, more synthesized understanding of the topic. Formulate comprehensive answers based on the source material.

Discuss the Bias-Variance Tradeoff in detail. Explain how model complexity influences bias, variance, training error, and expected test error. Use the examples of a highly regularized linear model and a 1-Nearest Neighbor classifier to illustrate the opposing ends of the complexity spectrum.
You are presented with a learning curve where the training error is unacceptably high and there is only a small gap between the training error and the test error. Diagnose the problem, explain the underlying issues in terms of bias and variance, and propose at least two concrete remedies to improve the model.
Explain the concept of Expected Prediction Error (EPE) and its decomposition into noise, bias, and variance. Describe why the noise component is considered “irreducible” and provide the theoretical justification for why the conditional mean is the optimal estimator for EPE under L2 loss.
Compare and contrast the Frequentist and Bayesian interpretations of probability as they relate to model parameters. How does the Bayesian approach handle parameter estimation differently from the Frequentist approach?
Describe the role of cross-validation (CV) in the machine learning pipeline, particularly when data is not abundant. Discuss the practical considerations for choosing the value of K in K-fold CV, including the bias-variance tradeoff inherent in this choice.

Part IV: Glossary of Key Terms

Term	Definition
Bias	The error component resulting from incorrect assumptions or simplifications made by a model. It measures the accuracy or quality of an estimator; low bias means the estimator will, on average, accurately estimate the true parameter.
Variance	The error component resulting from a model’s sensitivity to the specific training data sample. It measures the precision or specificity of an estimator; low variance means the estimator does not change much as the training set varies.
Bias-Variance Tradeoff	The principle that models with few parameters (low complexity) tend to have high bias and low variance, while models with many parameters (high complexity) tend to have low bias and high variance. The goal is to find a balance that minimizes total error.
Underfitting	A state where a model is too “simple” to represent the relevant characteristics of the data. It is characterized by high bias, low variance, high training error, and high test error.
Overfitting	A state where a model is too “complex” and fits irrelevant noise in the training data. It is characterized by low bias, high variance, low training error, and high test error.
Generalization	The ultimate goal of a machine learning model, referring to its performance on new, unseen data.
Expected Prediction Error (EPE)	The expected value of a loss function over the joint probability distribution of inputs and outputs. It decomposes into noise² + bias² + variance.
Irreducible Error (Bayes Error / Noise)	The component of EPE that is unavoidable, resulting from inherent randomness or noise in the data itself.
L2 Loss (Squared Error Loss)	A loss function used in regression that calculates the error as the square of the difference between the true output and the predicted output: (y - f(x))².
Conditional Mean	The optimal estimator for minimizing Expected Prediction Error under L2 loss, mathematically expressed as `E(Y \\| X = x)`.
Model Selection	The process of estimating the performance of different models (or a single model with different hyperparameters) to choose the best one.
Model Assessment	The process of taking a final, chosen model and estimating its prediction error on new data.
Cross-Validation (CV)	An efficient method for reusing samples for model selection and assessment, particularly when there is insufficient data for a simple train/validation/test split.
K-Nearest Neighbors (k-NN)	A local, non-parametric method where k acts as a smoother. A k=1 model has very low bias and high variance, while a model with a larger k (e.g., k=15) is smoother, with higher bias and lower variance.
Frequentist Probability	An interpretation of probability where probabilities are objective properties of the real world, referring to limiting relative frequencies. Model parameters are considered fixed, unknown constants.
Bayesian Probability	An interpretation of probability that describes degrees of belief. Model parameters are treated as hidden random variables, and one can compute their probability distribution given data, `P(θ \\| data)`.

Section 2 - Deep on 2D Grid Type (e.g. Imaging)

ProbReview + MLE

Lecture: S2-MLE
Version: current
More advanced to Read: S2-MLE
Notes to Read: Error Metrics
Extra Notes to Read: MLE / MLE code
Video: M1

1Basic

Study Guide: Maximum Likelihood Estimation

Quiz: Key Concepts Review

Answer the following questions in 2-3 sentences each based on the provided course material.

What is the fundamental goal of Machine Learning as described in the lecture?
Define “Sample Space” and “Event” in the context of probability theory, and provide an example for each.
What is the core principle of Maximum Likelihood Estimation (MLE)?
Why is it often more convenient to work with the log-likelihood function instead of the likelihood function itself?
For a Bernoulli distribution, what is the Maximum Likelihood Estimate for the parameter p (the probability of success)?
How does a probability density function (pdf) for a continuous random variable differ from a probability mass function (pmf) for a discrete random variable?
What do the mean and covariance matrix represent in a multivariate normal (Gaussian) distribution?
What key assumption connects Maximum Likelihood Estimation to the least squares error function in linear regression?
In the probabilistic interpretation of linear regression, what distribution is the error term (ε) assumed to follow?
Define the covariance between two random variables, X and Y.

Essay Questions

The following questions are designed for longer-form, essay-style answers to test a deeper synthesis of the material. Answers are not provided.

Explain the complete process of deriving the Maximum Likelihood Estimate for the parameter p of a Bernoulli distribution. Detail the role of the likelihood and log-likelihood functions, the use of differentiation to find the maximum, and interpret the final result.
Describe the relationship between basic probability concepts (sample space, events, random variables) and the task of machine learning. How do these foundational ideas enable the formulation of a model and the application of an estimation technique like MLE?
Compare and contrast discrete and continuous random variables. Discuss their respective probability functions (pmf vs. pdf), how their means (expectations) are calculated, and provide an example of each as mentioned in the lecture notes.
Discuss the probabilistic interpretation of linear regression. Explain how assuming a Gaussian distribution for the error term allows the problem to be framed as a Maximum Likelihood Estimation task, and show why maximizing the log-likelihood is equivalent to minimizing the sum of squared errors.
What are the key properties of a multivariate Gaussian (Normal) distribution? Describe the roles of the mean vector and the covariance matrix, and explain how they influence the shape and orientation of the probability density function, using the bivariate normal distribution as an illustrative example.

Answer Key

Quiz Answers

The goal of Machine Learning is to optimize a performance criterion using example data or past experience. The primary aim is to develop models that can generalize their performance to new, unseen data.
A Sample Space (O) is the set of all possible outcomes of an experiment, such as {HH, HT, TH, TT} for tossing a coin twice. An Event is a subset of the sample space, such as the event of the first toss being a head, which would be the set {HH, HT}.
The principle of Maximum Likelihood Estimation is to select the set of model parameters (θ) that are most likely to have produced the observed data. This is achieved by assuming a particular model and finding the parameter values that maximize the joint probability (the likelihood) of the observed sample set.

Working with the log-likelihood function is convenient because it converts the joint probability, which is a product of individual probabilities, into a sum. This transformation from Π P(Zi

θ) to Σ log(P(Zi

θ)) simplifies the mathematical process of finding the maximum, particularly when taking derivatives.

The MLE for the parameter p in a Bernoulli distribution is the sample proportion of positive outcomes. It is calculated as p̂ = x/n, where x is the number of successes (e.g., heads) observed in n trials.
A probability mass function (pmf) gives the probability that a discrete random variable is equal to a specific value, P(X = xi). A probability density function (pdf) describes the probability density for a continuous variable, and the actual probability is obtained by taking the integral of the pdf over a given range.
In a multivariate normal distribution, the mean vector (μ) is the point where the PDF reaches its peak value. The covariance matrix (Σ) captures the linear dependencies among the variables and determines the shape and orientation of the elliptical contours of equal probability density.
The key assumption is that the error term, or residual, in the linear regression model is independent and identically distributed (IID) according to a Gaussian (Normal) distribution with a mean of zero. This assumption makes maximizing the data likelihood equivalent to minimizing the residual sum of squares.
In the probabilistic view of linear regression, the error term ε is assumed to follow a Gaussian distribution with a mean of 0 and some variance σ², denoted as N(0, σ).
Covariance measures the joint variability of two random variables. It is defined as the expectation of the product of their deviations from their individual means: Cov(X,Y) = E((X − µx)(Y − µy)).

Glossary of Key Terms

Term	Definition
Sample Space (O)	The set of all possible outcomes of an experiment.
Event	A subset of the sample space O.
Random Variable (RV)	A function that maps outcomes from the sample space to an attribute space, providing a concise way of specifying attributes of outcomes.
Discrete Random Variable	A random variable that may take on only a countable number of distinct values.
Continuous Random Variable	A random variable described by a probability density function (pdf) rather than a probability mass function (pmf).
Probability Density Function (pdf)	A function f(x) that describes the probability density for a continuous random variable in terms of an input variable x. It must be non-negative, and its integral over all possible values is 1.
Probability Mass Function (pmf)	A function that gives the probability that a discrete random variable is exactly equal to some value.
Maximum Likelihood Estimation (MLE)	An estimation technique where one chooses a set of parameters that are most likely to have produced the observed results. It assumes a model with unknown parameters and maximizes the likelihood of the observed data with respect to those parameters.
Likelihood Function	The joint probability of observing a set of data, expressed as a function of the model parameters θ. For independent samples, it is the product of the individual probabilities: `P(Z₁...Zₙ \\| θ) = Π P(Zᵢ \\| θ)`.
Log-Likelihood Function	The natural logarithm of the likelihood function. It converts the product of probabilities into a sum: `log(L(θ)) = Σ log(P(Zᵢ \\| θ))`.
Bernoulli Distribution	A probability distribution for a binary random variable that takes a value of 1 (success) with probability p and a value of 0 (failure) with probability 1-p.
Binomial Distribution	A discrete probability distribution describing the number of successes in a sequence of k independent Bernoulli trials, each with a success probability of p.
Gaussian (Normal) Distribution	A widely used continuous probability distribution for a real-valued random variable, characterized by its mean (μ) and variance (σ²).
Mean (Expectation)	A measure of the central tendency of a random variable. For a discrete RV, it’s Σ vᵢ * P(X=vᵢ); for a continuous RV, it’s ∫ x * f(x)dx.
Variance	A measure of the spread of a random variable around its mean. It is defined as Var(X) = E((X − µ)²).
Covariance	A measure of the joint variability of two random variables, X and Y. It is defined as Cov(X,Y) = E((X − µx)(Y − µy)).
Correlation	A normalized measure of the linear relationship between two random variables, calculated as ρ(X,Y) = Cov(X,Y) / (σx * σy). Its value ranges from -1 to 1.

Logistic and NN

Lecture: S2-LogisticRegression
Version: current
More advanced to Read: S2-LogisticRegression
Notes to Read: compare classifiers
Video: M1 + M2

3Classification Nonlinear Deep Linear Discriminative

Study Guide: Logistic Regression and Bayes Classifiers

Short-Answer Quiz

Instructions: Answer the following questions in 2-3 sentences each, based on the provided source context.

What is the primary goal of a Bayes classifier when presented with a new data sample?
Explain the fundamental difference between discriminative and generative classifiers.
What is the core principle of Maximum Likelihood Estimation (MLE) for determining model parameters?
Describe the concept of “log-odds” or the “logit” function in the context of logistic regression.
How does the Bernoulli distribution serve as a conceptual model for the target variable in logistic regression?
In logistic regression, what is the decision boundary and what is its geometric shape?
Describe the two main stages of logistic regression when viewed as a single “neuron” or block.
What is the Expected Prediction Error (EPE) and how does a 0-1 loss function lead to the Bayes classifier rule?
Why is it often more convenient to work with the log-likelihood function rather than the likelihood function itself during MLE?
Briefly compare Newton’s method to Gradient Descent for optimization.

Answer Key

The primary goal of a Bayes classifier is to predict the class of a new data sample x. It achieves this by finding the class C that maximizes the posterior probability p(C

x), a principle known as the Maximum A Posteriori (MAP) rule.

Discriminative classifiers, such as logistic regression and neural networks, directly estimate a decision rule or boundary to separate classes. In contrast, generative classifiers, like Naïve Bayes, build a statistical model of how the data for each class is generated and use that model to make predictions.
Maximum Likelihood Estimation (MLE) is a method for estimating the unknown parameters of a model. The core principle is to choose the set of parameters that are most likely to have produced the observed data set. This is achieved by maximizing the joint probability, or likelihood, of the observed samples.

**The “log-odds,” or “logit” function, is the natural logarithm of the odds P(y

x) / (1-P(y

x)).** In logistic regression, this logit is modeled as a linear function of the input features (x), forming the core representation of the model: ln[p/(1-p)] = β₀ + β₁x₁ + … + βₚxₚ.

Logistic regression is suitable for target variables coded as 0 or 1. The Bernoulli distribution, which models binary outcomes like a coin flip, is used to conceptually model this target variable. The probability parameter p of the Bernoulli distribution is not fixed but is instead modeled as a function of the input features, p = p(y=1

x).

The decision boundary in logistic regression is the point where the log-odds equation equals zero, which corresponds to a probability of 0.5. Because the log-odds are modeled as a linear function of the features, the resulting decision boundary is also linear.
When viewed as a single block or “neuron,” logistic regression consists of two stages. The first is a summing function that calculates a weighted sum of the inputs (z = wᵀx + b). The second stage is the application of a sigmoid function (y = sigmoid(z)), which squashes the output of the summing function to a value between 0 and 1, representing a probability.
Expected Prediction Error (EPE) is the expected value of a loss function over the joint distribution of inputs and outputs. When using a 0-1 loss function, which penalizes any misclassification, minimizing the EPE is equivalent to choosing the class with the maximum posterior probability, which is precisely the MAP rule used by the Bayes classifier.
Working with the log-likelihood function is convenient because the joint probability of independent samples is a product of individual probabilities. Taking the logarithm transforms this product into a sum (log(Π P) = Σ log(P)), which is mathematically simpler to differentiate and maximize.
Both are optimization algorithms. Gradient Descent (GD) uses a first-order approximation to update parameters. In contrast, Newton’s method uses a second-order (quadratic) Taylor series approximation, incorporating curvature information via the Hessian matrix, which often allows it to find a more direct route to the minimum.

Essay Questions

Instructions: Formulate detailed responses to the following questions, synthesizing information from across the source material.

Synthesize the five different “views” of logistic regression presented: (I) logit as a linear function, (II) Y as a Bernoulli variable, (III) the “S” shape sigmoid function, (IV) a linear classification boundary, and (V) a two-stage neuron model. How do these perspectives complement each other to provide a comprehensive understanding of the model?
Discuss the role of Maximum Likelihood Estimation (MLE) in training a logistic regression model. Trace the process from defining the likelihood for a basic Bernoulli trial to deriving the conditional log-likelihood function for the entire logistic regression training set.
Explain the concept of the MAP (Maximum A Posteriori) rule within the framework of Bayes classifiers. How is this rule justified by the principles of Statistical Decision Theory and the minimization of Expected Prediction Error (EPE) with a 0-1 loss function?
Compare and contrast the three major groups of classifiers: discriminative, generative, and instance-based. Provide examples of each and explain where logistic regression fits within this taxonomy and why.
Describe the optimization challenge in fitting a logistic regression model. Compare the second-order Newton’s method with the first-order Stochastic Gradient Descent (SGD) method, discussing their respective approaches, requirements (e.g., first and second derivatives), and characteristics.

Glossary of Key Terms

Term	Definition
Bayes Classifier	A probabilistic classifier that treats each feature and the class label as random variables. It predicts the class of a sample x by finding the class C that maximizes the posterior probability `p(C \\| x)`.
Bernoulli Distribution	A probability distribution for a binary random variable, such as a coin flip that can be “Head” (with probability p) or “Tail.” In logistic regression, the target variable Y is modeled as a Bernoulli random variable where p is a function of the input features x.
Decision Boundary	In logistic regression, this is the boundary where the log-odds equation equals zero. Because the log-odds are a linear function of the input features, the decision boundary is linear.
Discriminative Classifier	A type of classifier that directly estimates a decision rule or boundary to separate classes. Logistic regression, support vector machines, and neural networks are examples.
Expected Prediction Error (EPE)	A measure of a model’s performance, defined as the expected value of a loss function L(Y, f(X)) over the joint distribution Pr(X,Y). Minimizing EPE is the central goal of statistical decision theory.
Generative Classifier	A type of classifier that builds a generative statistical model of the data for each class. It models `P(X \\| Y)` and `P(Y)` to make predictions.
Hessian Matrix	For a multivariate function, the Hessian is the matrix of second-order partial derivatives. It is the multivariate equivalent of the second derivative and is used in second-order optimization methods like Newton’s method.
Instance-Based Classifier	A type of classifier that uses observations directly for classification without building an explicit model. K-Nearest Neighbors is a key example.
Log-Likelihood	The natural logarithm of the likelihood function. It is often used in MLE because it converts products into sums, which are mathematically easier to differentiate and optimize.
Logistic Regression	A discriminative probabilistic classifier for binary classification tasks. It models the log-odds (logit) of the class probability as a linear function of the input features.
Logit (Log-odds)	The core function in logistic regression, defined as the natural logarithm of the odds: `ln[P(y=1\\|x) / P(y=0\\|x)]`.
Maximum A Posteriori (MAP) Rule	The decision rule used by Bayes classifiers. It assigns a new instance X to the class c* that has the maximum posterior probability `P(c* \\| X)`.
Maximum Likelihood Estimation (MLE)	A method for estimating the parameters of a statistical model by finding the parameter values that maximize the likelihood function of the observed data. It chooses parameters that are “most likely” to have produced the observed results.
Newton’s Method	A second-order optimization algorithm that finds the minimum of a function by making a quadratic Taylor series approximation at each step. It requires both first (gradient) and second (Hessian) derivatives and often converges faster than first-order methods.
Sigmoid Function	An “S”-shaped function used in logistic regression to compress a real-valued number (the output of the linear model) into the range [0, 1], allowing it to be interpreted as a probability. The formula is 1 / (1 + e⁻ᶻ).
Stochastic Gradient Descent (SGD)	A first-order iterative optimization algorithm used for finding the minimum of a function. It is a variant of gradient descent that updates model parameters based on the gradient computed from a single sample or a small batch of samples.
0-1 Loss Function	A loss function used for classification where the loss is 0 for a correct classification and 1 for an incorrect classification. Minimizing the EPE with a 0-1 loss function leads to the MAP classification rule.

NN and Deep Learning

Lecture: S2-deepNNBasics
Version: current
Notes to Read: DNN Cheatsheets
Video: M1+ M2+ M3

3Classification Nonlinear Deep Discriminative

Neural Networks and Backpropagation: A Study Guide

Quiz: Short-Answer Questions

Answer each question in 2-3 sentences based on the provided source material.

How can a single neuron be conceptualized as an extension of logistic regression?
What was the historical Perceptron, and what type of activation function did it use?
Describe the structure of a Multi-Layer Perceptron (MLP). What makes a neural network “deep”?
What is the purpose of the Softmax function, particularly in the context of multi-class classification?
Identify two different types of loss functions mentioned in the text and the tasks they are suited for.
What is the core purpose of the backpropagation algorithm in training a neural network?
Briefly describe the two main phases of the training loop for a neural network using backpropagation.
What problem can occur with gradient magnitudes during training, and why is divergence considered the worse outcome?
What is “Dropout” and how does it function as a regularization technique?
Explain the concept of Batch Normalization and its benefits for the training process.

Answer Key

A single neuron can be seen as an expanded logistic regression unit. It performs a two-stage process: first, a summing function calculates a weighted sum of the inputs plus a bias (z = wᵀ · x + b), and second, a non-linear activation function (like sigmoid) “squashes” the result into a desired range.
The Perceptron, first proposed by Rosenblatt in 1958, was a simple one-neuron unit used to classify input into one of two categories. It used a step function as its activation function, which outputs +1 if the input is greater than or equal to zero and -1 otherwise.
A Multi-Layer Perceptron (MLP), or a feed-forward neural network, is composed of an input layer, one or more hidden layers, and an output layer. A neural network is considered “deep” when it contains many hidden layers.
The Softmax function is a normalizing function used as the final layer for multi-class classification. It converts the output of each class unit into a probability, ensuring that the sum of all output probabilities is equal to 1.
The text describes Sum of Squared Errors (SSE) loss, which is suitable for regression tasks, calculating (y₁-ŷ₁)²+(y₂-ŷ₂)². It also details the Cross-Entropy loss function (also called negative log-likelihood), which is used for binary and multi-class classification tasks.
The backpropagation algorithm is used to learn the optimal weights for a neural network by jointly optimizing all parameters. It accomplishes this by calculating the gradient of the loss function with respect to each weight in the network, even those in lower layers.
The first phase is the “Forward” pass, where inputs are fed through the network layer by layer to compute the final output and the loss. The second phase is the “Backward” pass, where the algorithm propagates local gradients from the final loss function back through the network to calculate each layer’s gradient for weight updates.
During training, gradients can become too big, leading to divergence, or too small, leading to slow convergence. Divergence is considered much worse because the training process fails to find a solution, whereas slow convergence simply takes more time.
Dropout is a regularization technique where some neurons are randomly set to zero during the forward pass of training. This process is akin to training a large ensemble of models that share parameters, which helps prevent overfitting.
Batch Normalization is a technique that standardizes the activations of a prior layer, which stabilizes and speeds up training. It improves gradient flow, allows for higher learning rates, reduces dependence on initialization, and acts as a form of regularization.

Essay Questions

Consider the following questions to synthesize the concepts from the course material. Formulate a detailed, evidence-based response for each.

Explain the journey from Logistic Regression as a linear classifier to a Multi-Layer Perceptron capable of modeling non-linear decision boundaries. How do hidden layers and non-linear activation functions enable this capability?
Describe the complete process of training a neural network for multi-class classification using Mini-batch Stochastic Gradient Descent (SGD) and backpropagation. Detail the roles of the forward pass, the Softmax layer, the cross-entropy loss function, the backward pass, and the weight update step.
Compare and contrast the four primary activation functions presented in the material: sigmoid, tanh, softplus, and rectify (ReLU). Discuss their mathematical forms, their derivatives, and the potential implications of these properties on the training process.
Discuss the importance of proper network initialization and regularization in deep learning. Using the concepts of Xavier initialization, Batch Normalization, and Dropout, explain how practitioners can avoid common training pitfalls like poor gradient flow and overfitting.
From a “block view,” a neural network can be seen as a series of differentiable, parameterized functions. Explain how the chain rule of calculus is the fundamental mathematical principle behind backpropagation, allowing gradients to be passed “backward” through these blocks to update parameters in the earliest layers of the network.

Glossary of Key Terms

Term	Definition
Activation Function	A function applied to the output of a neuron (the weighted sum of inputs) to introduce non-linearity into the network. Also known as a transfer function. Examples include Sigmoid, tanh, and ReLU.
Backpropagation	An algorithm for training neural networks by using the chain rule to efficiently compute the gradients of the loss function with respect to all the weights and biases in the network. It involves a forward pass to compute outputs and loss, and a backward pass to propagate gradients.
Batch Normalization	A technique to normalize the activations of a prior layer, which improves gradient flow, allows higher learning rates, and acts as a form of regularization.
Bias Term (b)	A constant value added to the product of inputs and weights in a neuron. It allows the activation function to be shifted to the left or right, which is critical for learning.
Cross-Entropy Loss	A loss function used for classification tasks. It measures the performance of a classification model whose output is a probability value between 0 and 1. Also known as negative log-likelihood or deviance.
Deep Neural Network (DNN)	A neural network with many hidden layers between the input and output layers.
Dropout	A regularization technique where randomly selected neurons are ignored (“dropped out”) during training. This prevents units from co-adapting too much and helps prevent overfitting.
Feed-Forward Neural Network	A type of neural network where connections between the nodes do not form a cycle. Information moves in only one direction: from the input nodes, through the hidden nodes, and to the output nodes. The Multi-Layer Perceptron (MLP) is a feed-forward network.
Loss Function	A function that computes a single number representing the “cost” or “error” associated with the network’s prediction (ŷ) versus the true label (y). The goal of training is to minimize this value.
Multi-Layer Perceptron (MLP)	A class of feed-forward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer.
Neuron	The fundamental processing unit of a neural network. It takes multiple inputs, computes a weighted sum, adds a bias, and then passes the result through an activation function.
Perceptron	A single-neuron unit proposed by Rosenblatt (1958) that uses a step function for activation. It is a linear classifier.
Rectified Linear Unit (ReLU)	An activation function defined as h(x) = max(0, x). It is computationally efficient and has become a default activation function for many types of neural networks. Variations include Leaky ReLU.
Sigmoid Function	An activation function that squashes its input value into a range between 0 and 1. Its formula is h(x) = 1 / (1 + exp(-x)).
Softmax Function	A function that converts a vector of K real numbers into a probability distribution of K possible outcomes. It is often used as the last activation function of a neural network to normalize the output for multi-class classification.
Stochastic Gradient Descent (SGD)	An iterative optimization algorithm used to find the minimum of a loss function. In each iteration, it estimates the gradient based on a single training example or a small “mini-batch” of examples to update the network’s weights.
Sum of Squared Errors (SSE)	A loss function used for regression tasks that measures the sum of the squared differences between the predicted values and the actual values.
tanh Function	An activation function that squashes its input value into a range between -1 and 1. Its formula is h(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)).
Weights (w)	Parameters within a neural network that transform input data within the network’s layers. The network learns by modifying its weights during training.
Xavier Initialization	A method for initializing the weights in a neural network to maintain a reasonable variance of activations and gradients across layers, which is important for stable training.

CNN

Lecture: S2-deepImageCNN
Version: current
More advanced to Read: S2-PCA
Notes to Read: Keras notebook + another Keras Kaggle notebook
Video: M1+ M2

3Classification Nonlinear Deep Discriminative

Notebooks to run and experienc:

Study Guide: Supervised Image Classification and Convolutional Neural Networks

Quiz

Answer the following questions in 2-3 sentences each, based on the provided lecture context.

What were the primary challenges that led to the “Winter of Neural Networks” around the 2000s?
Explain the three key properties of images that Convolutional Neural Networks are specifically designed to leverage.
What is the core function of a “filter” in a convolutional layer, and what property of images does it relate to?
Describe the concept of “translation invariance” and explain how a CNN achieves it.
What is the purpose of the Max Pooling operation in a CNN?
How does a convolutional layer differ from a fully-connected layer in terms of parameter usage?
What are the three main components of a dataset used for model selection and assessment in a “data-rich scenario”?
According to the lecture, who invented the CNN and in what year was the foundational paper published?
What happens to the output of the convolutional and max pooling layers before it is fed into a fully connected feedforward network?
In the context of a tabular dataset for classification, what are the three main types of columns identified in the lecture?

Answer Key

The “Winter of Neural Networks” was caused by several challenges. The optimization problem was non-convex, requiring many tricks for tuning hyperparameters like the number of layers and hidden units. Additionally, these networks were hard to analyze theoretically, and large labeled datasets were rare during that period.
CNNs leverage three key properties of images. The first is Locality, where patterns are smaller than the whole image. The second is Translation Invariance, meaning the same patterns can appear in different regions. The third is that Subsampling the pixels will not change the object, allowing the image to be made smaller.
A filter in a convolutional layer is a matrix of parameters (weights) designed to detect a small, specific pattern in the input image, such as an edge or a curve. This directly relates to the property of Locality, as the filter does not need to see the whole image to discover its target pattern.
Translation invariance is the principle that an object’s appearance is independent of its location within an image. A CNN achieves this through weight sharing, where the same set of filter parameters is applied across all regions of the image to detect a specific pattern regardless of where it appears.
The purpose of the Max Pooling operation is to subsample the feature map created by the convolutional layer. It reduces the spatial dimensions of the data, which makes the image smaller and results in fewer parameters for the network to process in subsequent layers.
A convolutional layer uses significantly fewer parameters than a fully-connected layer. This is because its neurons are only connected to a small local region of the input (locality) and it reuses the same parameters across the entire image (weight sharing), whereas every neuron in a fully-connected layer connects to every input from the previous layer.
In a data-rich scenario, the dataset is split into three parts for model selection and assessment. These are the Training set, the Validation set, and the Test set.
The Convolutional Neural Network (CNN) was invented by Professor Yann LeCun. The foundational paper, “Gradient-based learning applied to document recognition,” was published in 1998.
After the final max pooling layer, the resulting multi-channel, smaller image (a 3-D tensor) is processed by a Flatten operation. This converts the 3-D data into a one-dimensional vector, which can then be used as input for the subsequent fully connected feedforward network.
For a tabular dataset, the columns are identified as Features (also called attributes, predictors, or independent variables), which are all columns except the last. The last column is the Target (also called outcome, label, or dependent variable), which is the special column to be predicted. The rows are referred to as Data points or instances.

Essay Questions

Develop a detailed, essay-format response for each of the following prompts. No answers are provided for this section.

Trace the historical evolution of artificial intelligence and machine learning from Alan Turing’s 1950 paper to the landmark 2012 paper on ImageNet classification. Discuss the key milestones, influential figures, dominant technologies of each era (e.g., expert systems, SVMs), and the factors that contributed to the “Winter of Neural Networks.”
Provide a comprehensive explanation of the complete architecture of a standard CNN for image classification, as described in the lecture. Detail the journey of an input image through the network, explaining the purpose and mechanics of the Convolution layer, Max Pooling layer, Flatten operation, and the final Fully Connected network with softmax.
Elaborate on the three fundamental properties of images that make CNNs more effective than traditional Multilayer Perceptrons (MLPs) for image-related tasks. For each property (Locality, Translation Invariance, and Subsampling), explain what it is and which specific CNN mechanism (e.g., filters, weight sharing, pooling) is designed to exploit it.
Compare and contrast the processes of “Model Selection” and “Model Assessment.” Describe the different strategies available for these processes, such as the train/validation/test split for data-rich scenarios and methods like Cross-Validation for when data is insufficient.
Using the provided example of a 6x6 image and a 3x3 filter, explain the mathematical process of convolution. Describe how applying the filter across the image generates a “feature map” and how parameters like stride can alter the dimensions of this output.

Glossary of Key Terms

Term	Definition
Convolution	An operation where a filter (a small matrix of parameters) is applied across an input image to produce a feature map. This process leverages the properties of locality and translation invariance.
Convolutional Neural Networks (CNN)	A type of neural network invented by Yann LeCun, first successfully trained with many layers in 1998. It is specifically designed for data with grid-like topologies, such as 2D images, by leveraging properties of locality, translation invariance, and subsampling.
Cross-Validation (k-CV)	A method for model selection and assessment used when there is insufficient data to split into three parts. It involves efficiently reusing samples to choose hyperparameters.
Feature Map	The output of applying a filter across an image in a convolutional layer. Each feature map corresponds to a specific filter and represents the detection of that filter’s pattern across the input.
Features	In a tabular dataset, these are the columns used as predictors or independent variables to predict the target. They are also referred to as attributes, dimensions, covariates, or regressors.
Filter	A small matrix of learnable parameters (weights) used in a convolution layer to detect specific patterns (e.g., edges, textures) in an input image.
Flatten	An operation in a CNN that converts the multi-dimensional output of the convolutional/pooling layers into a single one-dimensional vector. This vector is then fed into a fully connected network.
Fully Connected Feedforward Network	The final part of a typical CNN architecture. It takes the flattened vector as input and performs classification, often ending with a softmax layer to output probabilities for each class.
Locality	A property of images where patterns of interest (like a bird’s beak) are much smaller than the whole image. CNNs exploit this by using small filters, meaning a neuron does not have to see the entire image to discover a pattern.
Max Pooling	A subsampling technique where a feature map is downsized by taking the maximum value over a defined window. This reduces the number of parameters and computational complexity, making the representation more robust to small shifts.
Model Assessment	The process of, having chosen a final model, estimating its prediction error on new, unseen data.
Model Selection	The process of estimating the performance of different models (or models with different hyperparameters) in order to choose the best one. This is also referred to as hyperparameter tuning.
Subsampling	The property that reducing the resolution of an image (e.g., by taking every other pixel) will not fundamentally change the object depicted. This is the principle behind pooling layers in a CNN.
Target	In a tabular dataset, this is the special column to be predicted. It is also referred to as the outcome, response, label, or dependent variable.
Translation Invariance	A property of images where the appearance of an object is independent of its location. CNNs model this by sharing weights, meaning the same filter is used to detect a pattern regardless of where it appears in the image.
Weight Sharing	The practice in a CNN where the same set of filter parameters is applied across all spatial locations in an input image. This drastically reduces the total number of parameters and enforces translation invariance.

Huggingface

Lecture: S3-hf-intro-2025-1016
Version: current
Notes to Read: S2-L0-pytorch
Video: M1

platform

Quick recap

The meeting covered an overview of Hugging Face’s model components and task interface, including demonstrations of how to explore models, datasets, and demo applications through the platform. Guangzhi provided detailed instructions on working with datasets in pipelines and using Hugging Face models through the Transformers library, including guidance on accessing model cards and recommended usage. The session concluded with a demonstration of a web-based demo application that enables real-time interaction with AI models through a graphical user interface, showcasing capabilities like text generation and image synthesis without requiring local resources.

Summary

Understanding Hugging Face Task Interface

Guangzhi explained the components of models, including architecture and learned parameter weights, and emphasized the importance of tasks, which vary based on input types like text or images. He demonstrated how Hugging Face’s task interface, accessible via AgentPase.co/task, categorizes tasks into natural language processing, computer vision, audio, and multimodal categories, with examples like any-to-any tasks. Guangzhi highlighted the overlap between datasets and other components in Hackerface and showed how users can explore models, datasets, and demo apps relevant to specific tasks.

Dataset Loading and Exploration

Guangzhi explained how to load and work with datasets in a pipeline, focusing on an any-to-any task dataset. He demonstrated how to use the dataset viewer to understand the data structure, including subsets and splits, and showed examples of loading datasets using the datasets package. Guangzhi also covered how to read instances from a loaded dataset by indexing and slicing, and he provided examples of accessing specific columns and labels.

Hugging Face Model Usage Tutorial

Guangzhi demonstrated how to use Hugging Face models, focusing on downloading and using models through the Transformers library. He explained how to access model cards, which provide basic information about models, and showed how to use the “Use This Model” button to download necessary files. Guangzhi also discussed using pipelines for higher-level tasks and emphasized the importance of checking model cards for recommended usage. Dr. asked about different ways to use models beyond the Hugging Face library, and Guangzhi confirmed that the Transformers library is a good starting point, with more advanced options available for experienced users.

AI Model Playground Demo

Guangzhi demonstrated a web-based demo application that allows users to interact with AI models directly through a graphical user interface without downloading the models. Dr. explained that the application consists of a frontend interface and a backend Hugging Face server, enabling real-time model interactions similar to ChatGPT. They discussed how the application serves as a useful playground for exploring model capabilities, with Guangzhi showing examples of text generation and image synthesis tasks that can be performed without requiring local resources.

pyTorch + Keras

Lecture: S2-L0-Keras
Version: current
Notes to Read: S2-L0-pytorch

platform

PCA, Feature Selection

Lecture: S2-dimReduce
Version: current
More advanced to Read: S2-PCA
Notes to Read: Great PCA Video + PCA Notebook
Extra Notes to Read: UMAP
Video: M1

DimenReduct

Notebook to try:

PCA Notebook

Study Guide: Dimensionality Reduction and PCA

running notebook

PCA Notebook

Short-Answer Quiz

Instructions: Answer the following questions in 2-3 sentences based on the provided source material.

What is the “Curse of Dimensionality,” and what is its practical effect on machine learning models?
Define the two primary methods of dimensionality reduction: feature extraction and feature selection.
Describe the three main approaches to feature selection discussed in the text.
What is the main objective of Principal Component Analysis (PCA), and what property of the data does it seek to maximize in its projections?
What are the two core components of an autoencoder, and what is its primary function in the context of feature learning?
When using PCA to project data onto a single line, what is the formal algebraic goal?
How are eigenvalues related to their corresponding principal components (eigenvectors) in terms of explaining the data’s structure?
List two criteria used to determine how many principal components should be retained after performing PCA.
Explain a scenario where the direction of maximum variance found by PCA might not be optimal for a classification task.
Briefly outline the process of using the “Eigenfaces” technique for face recognition.

Answer Key

The “Curse of Dimensionality” refers to the issue where the number of training examples required for a model increases exponentially with the number of features (dimensionality). In practice, including more features does not always improve accuracy and can lead to worse performance.
Feature extraction finds a new set of features by applying a mapping function to the existing features. Feature selection, in contrast, chooses a subset of the original, existing features without creating new ones.
The three approaches are Filtering, Wrapper, and Embedded. The Filtering approach ranks features independently of the predictor; the Wrapper approach uses a predictor to assess feature subsets; and the Embedded approach uses a predictor that internally selects a subset of features as it builds a single model.
The objective of PCA is to approximate a high-dimensional dataset with a lower-dimensional linear subspace. It achieves this by finding new axes (principal components) in the directions of the greatest variability, thereby maximizing the variance of the projected data.
An autoencoder consists of an encoder, which compresses the input into a lower-dimension latent space, and a decoder, which reconstructs the input from that latent space. Its function is to train the hidden layer units to become good and reliable feature detectors by minimizing the reconstruction loss between the original input and the reconstructed output.
The formal algebraic goal is to find a line (direction vector v) that maximizes the sum of squares of the data samples’ projections onto that line. This is equivalent to finding the eigenvector v associated with the largest eigenvalue λ of the matrix XᵀX.
The principal components (eigenvectors) are the new, uncorrelated features derived by PCA. The variance of the data along each principal component vₖ is equal to its corresponding eigenvalue λₖ, meaning that PCs with small eigenvalues correspond to directions of small variance in the data.
One method is to keep enough PCs to explain a cumulative variance greater than a certain threshold, such as 50-70%. A second method is to use a “Scree plot,” which visualizes the variance explained by each PC, and keep only the components with eigenvalues above a certain value (e.g., >1).
The direction of maximum variance is not always good for classification. For example, with two distinct classes represented as ellipsoidal Gaussian densities, the first principal component may capture the overall data trend, while the second, less significant component may be the one that best discriminates between the two classes.
The Eigenfaces technique first computes the principal components (“eigenfaces”) from a set of training face images. All training images are then projected into this new, lower-dimensional “face space.” A novel image is classified by projecting it into the same space and finding the nearest neighboring training face based on the distance between their low-dimensional coefficients.

Essay Questions

Instructions: The following questions are designed for a more in-depth, essay-style response. Answers are not provided.

Compare and contrast the three primary feature selection strategies: Filtering, Wrapper, and Embedded. Discuss the relative advantages and disadvantages of each in terms of computational cost, model dependency, and optimization for a specific learning algorithm.
Elaborate on the “Curse of Dimensionality” using the specific examples cited in the text (QSAR drug screening, Leukemia Diagnosis, and Text Categorization). Explain how high-dimensional feature spaces in these domains necessitate dimensionality reduction and what benefits are gained from applying these techniques.
Provide a detailed explanation of the mathematical and algebraic interpretation of Principal Component Analysis. Describe the roles of the centered data matrix, the covariance matrix, eigenvalues, and eigenvectors in the process of identifying the principal components that capture maximum variance.
Using the “Eigenfaces” method as a case study, describe the end-to-end application of PCA for image recognition. Explain how a high-dimensional image is transformed into a low-dimensional representation, how this representation is used for classification, and what information is potentially lost or preserved in this process.
Discuss the limitations of PCA. Explain why its objective of maximizing variance may not align with the objective of maximizing class separability in a supervised learning context, and illustrate with the examples provided in the source material.

Glossary of Key Terms

Term	Definition
Autoencoder	A neural network structure, trained to reproduce its input, consisting of an encoder and a decoder. It forces the ‘hidden layer’ units to become reliable feature detectors by minimizing reconstruction loss.
Covariance Matrix	A matrix representing the covariance between pairs of variables in a dataset. In PCA, the eigenvectors of this matrix (or of XᵀX for centered data) are the principal components.
Curse of Dimensionality	The phenomenon where the number of training examples required for a model increases exponentially with the number of features (dimensionality). In practice, this can lead to worse performance as more features are added.
Decoder	A component of an autoencoder that reconstructs the input from the compressed latent space representation generated by the encoder.
Dimensionality Reduction	The process of choosing an optimum set of features of lower dimensionality to create simpler, more interpretable, and better-generalizing models.
Eigenfaces	A term for the principal components (eigenvectors) computed from a covariance matrix of face images. This technique constructs a low-dimensional linear subspace to explain variation in a set of face images for recognition tasks.
Eigenvalue (λ)	A scalar associated with an eigenvector. In PCA, the eigenvalue represents the amount of variance in the data explained by its corresponding eigenvector (principal component).
Eigenvector (v)	A vector whose direction remains unchanged when a linear transformation is applied to it. In PCA, the eigenvectors of the covariance matrix are the principal components, representing the directions of maximum variance.
Embedded Approach	A feature selection method where a predictor is used to build a single model with a subset of features that are internally selected during the training process.
Encoder	A component of an autoencoder that compresses an input into a latent-space representation of a usually smaller dimension.
Feature Extraction	A method of dimensionality reduction that finds a set of new features by applying a mapping function (which can be linear or non-linear) to the existing features.
Feature Selection	A method of dimensionality reduction that chooses a subset of the original features.
Filtering Approach	A feature selection method that ranks features or feature subsets as a pre-processing step, independently of the learning algorithm (predictor) being used.
Principal Component Analysis (PCA)	An unsupervised linear feature extraction method that approximates a high-dimensional dataset with a lower-dimensional linear subspace. It seeks a projection that preserves as much of the information (variance) in the data as possible.
Principal Components (PCs)	The new, more informative, uncorrelated features found by PCA, which correspond to the eigenvectors of the data’s covariance matrix. The first PC explains the most variance, the second PC explains the next most, and so on.
Wrapper Approach	A feature selection method that uses a predictor (treated as a “black box”) to assess and score different subsets of features based on their predictive power.

Feature Selection

Lecture: S3-feaSelc
Version: current
Notes to Read: S3-QuizReview + ELS Ch3.4 and Ch3.3 + API
Video: (Extra M2 + M3 )

DimenReduct ModelSelection

Study Guide: Feature Selection in Machine Learning

The Purpose and Benefits of Feature Selection

In machine learning, datasets can contain thousands or even millions of low-level features. Feature selection is the process of selecting a subset of p’ original features from a total of p features. The primary goal is to identify the most relevant features to build better learning models for tasks like classification and regression.

This process yields several significant benefits:

Improved Generalization: Models with fewer, more relevant features are less sensitive to noise and tend to have lower variance. This aligns with Occam’s Razor (the law of parsimony).
Computational Efficiency: Fewer features reduce training and inference cost and often require fewer labeled examples.
Enhanced Interpretability: Simpler models are easier to understand and explain, important for trust and transparency.

Core Methodologies for Feature Selection

There are three primary approaches to feature selection, each differing in how they interact with the learning algorithm.

Approach	Description	Key Characteristics
Filter	Ranks features or feature subsets independently of the predictor.	Learner-agnostic; fast; used as pre-processing.
Wrapper	Uses a predictor to assess the quality of features or feature subsets.	Learner-dependent; computationally expensive; simple to apply.
Embedded	Performs feature selection during model training.	Learner-specific; selection integrated into training.

1. The Filter Approach

Filter methods select features as a pre-processing step, independently of the classifier. They include univariate and multivariate techniques.

Univariate Filtering

Pearson Correlation: Measures linear correlation between a feature and the target (range [-1, 1]); detects only linear dependencies.
T-test: Tests whether a feature distinguishes two classes (assumes normality and equal variance). Null hypothesis H0: class means are equal; T-statistic measures significance of mean difference.

Multivariate Filtering

Methods consider multiple variables jointly to capture interactions missed by univariate methods.

Core challenges:

Scoring Function: Measure the quality of a subset.
Search Strategy: Explore the 2^p possible subsets efficiently.

Finding the minimal optimal subset is NP-hard; good heuristics are essential.

2. The Wrapper Approach

Uses the predictive performance of a specific learner (treated as a black box) to score feature subsets.

Two key questions:

(a) Assessment: How to measure performance?
- Split into train/validation/test. Train on train, score on validation, choose best subset; optionally use cross-validation; assess final model once on test.
(b) Search: How to explore 2^p subsets?
- Heuristics:
  - Forward Selection: Start empty; add features iteratively.
  - Backward Elimination: Start full; remove features iteratively.
  - Advanced: Beam search, GSFS, PTA(l,r), floating search.

Main criticism: high computational cost due to repeated training/evaluation.

3. The Embedded Approach

Integrates selection into training a single model; selection occurs implicitly.

Key Characteristics: Learner-specific.
Example: L1-regularization shrinks many coefficients to zero, selecting features.

Feature Selection vs. Feature Extraction

Feature Selection: Selects a subset of original features
(X1, …, Xp → Xk1, …, Xkp’); features remain unchanged.
Feature Extraction / Dimensionality Reduction: Creates new features as (possibly weighted) combinations of originals
(X1, …, Xm → g1(X), …, gp’(X)).

Example: Principal Component Analysis (PCA) finds a linear mapping to a lower-dimensional space that maximally preserves variance; principal components are combinations of original features.

Model Selection and Assessment

Feature selection is tied to broader tasks of model selection and assessment.

Model Selection: Estimate performance to choose the best model (e.g., hyperparameters or feature subset).
Model Assessment: Estimate the chosen model’s prediction error on unseen data.

Data Handling Strategies

Data-Rich Scenario: Split into train (build), validation (select), test (final assess).
Insufficient Data: Use cross-validation or bootstrap; or analytical approximations (AIC/BIC) to mimic validation.

Review Quiz

Answer the following questions in 2-3 sentences each based on the provided material.

What is the primary objective of feature selection in machine learning?
According to the text, what are the three main benefits of applying feature selection?
Explain the fundamental difference between the Filter and Wrapper approaches.
What is a major limitation of univariate filtering methods like the Pearson Correlation?
Why is the task of finding a minimal optimal feature subset described as NP-hard?
In the Wrapper approach, what are the distinct roles of the training, validation, and test data sets?
Describe the core ideas behind the Forward Selection and Backward Elimination search strategies.
How does the Embedded approach to feature selection differ from both Filter and Wrapper methods?
Distinguish between Feature Selection and Feature Extraction, using Principal Components Analysis (PCA) as an example.
What is the purpose of model selection, and how does it differ from model assessment?

Answer Key

Objective: Select a subset of the most relevant original features to build models that are better, faster, and more interpretable for classification or regression.
Benefits: Improved generalization (lower variance, less noise sensitivity); computational efficiency (faster training/inference); enhanced interpretability (more explainable).
Filter vs Wrapper: Filter ranks/selects features independently of the learner as pre-processing; Wrapper uses the performance of a specific learner to score feature subsets.
Limitation of Univariate Filtering: Detects only linear, single-feature relationships; may miss features useful only in combination.
NP-hardness: With p features, there are 2^p subsets; exhaustive search is infeasible, requiring heuristics.
Data splits: Train to fit predictors per subset; Validation to select the best subset; Test used once for final unbiased performance estimate.
Search strategies: Forward Selection adds features iteratively from empty; Backward Elimination removes features iteratively from full.
Embedded: Selection is integrated into model training (learner-specific), unlike separate pre-processing (Filter) or external evaluation loops (Wrapper).
Selection vs Extraction: Selection keeps original features; Extraction creates new features (e.g., PCA components) as combinations that preserve variance.
Model Selection vs Assessment: Selection chooses the best model via validation; Assessment estimates final generalization on unseen data (test set).

Essay Questions

Formulate comprehensive responses for each:

Discuss Occam’s Razor and its relationship to the three primary goals of feature selection.
Compare univariate vs. multivariate Filter methods; give a scenario where multivariate succeeds but univariate fails.
With millions of features and limited compute, which approach (Filter, Wrapper, Embedded) would you choose and why? Discuss trade-offs.
Explain the inherent “search problem” in feature selection. Describe three heuristic strategies and when each is preferable.
Detail the complete Wrapper process with emphasis on data splitting. Why separate validation and test sets to avoid bias?

Glossary of Key Terms

Term	Definition
Backward Elimination	Heuristic that starts with all features and iteratively removes the least useful ones.
Cross-Validation	Efficient sample reuse to estimate learner performance when validation data is limited.
Dimensionality Reduction	Creating new features g(X) by combining originals; also called Feature Extraction.
Embedded Approach	Selection performed implicitly during model training; learner-specific.
Feature Extraction	Transforming X into new features g(X), often lower dimensional.
Feature Selection	Selecting a subset of p’ original features from p available features.
Filter Approach	Ranks features/subsets independent of the predictor as pre-processing.
Forward Selection	Heuristic that starts empty and iteratively adds the most useful features.
L1-Regularization	Embedded method that penalizes complexity and drives some coefficients to zero.
Model Assessment	Estimating prediction error of the final model on new data.
Model Selection	Estimating performance of candidate models to choose the best.
Multivariate Methods	Filter methods assessing multiple variables jointly.
NP-hard	Class of problems lacking known polynomial-time solutions; subset search is NP-hard.
Occam’s Razor	Prefer simpler explanations/models with fewer assumptions.
Pearson Correlation	Measures linear relationship between two variables (range [-1, 1]).
Principal Components Analysis (PCA)	Feature extraction mapping to a lower-dimensional space preserving variance.
T-test	Univariate test assessing whether a feature separates two normally distributed classes.
Univariate Methods	Filter methods assessing one variable at a time.
Wrapper Approach	Uses a learner as a black box to score feature subsets by predictive power.

Section 3 - Deep on 1D Sequence Type (e.g. Language Text)

Prob Review

Lecture: S3-ProbReview
Version: current
Video: M1 + M2

1Basic

Recent deep learning on Text

Lecture: S3-deepNNtext
Version: current
Notes to Read: Keras Notebook on DNN text + [Recent LLM-survey]
Video: M1 + M2 + M3

Nonlinear Deep Discriminative 4Unsupervised Generative

Notebook to try:

L15 Keras Notebook on DNN text

Extra reading: [Recent LLM-survey]

In this lecture, we cover:

What is NLP?
Typical NLP tasks / Challenges / Pipeline
f() on natural language
- Before Deep NLP (Pre 2012) • (BOW / LSI / Topic Modeling LDA )
- Word2Vec (2013-2016) • (GloVe/ FastText)
- Recurrent NN (2014-2016) • LSTM
- Seq2Seq
- Attention
- Self-Attention (2016 – now )
- Transformer (attention only Seq2Seq)
- BERT / RoBERTa/ XLNet/ GPT-2 / …

Study Guide: Deep Neural Networks for Natural Language Processing

Quiz: Short-Answer Questions

Instructions: Provide a concise answer (2-3 sentences) for each of the following questions based on the provided course material.

What is Natural Language Processing (NLP) and what fundamental goal does it aim to achieve beyond simple keyword matching?
Identify and briefly describe three major challenges that researchers and engineers face in the field of Natural Language Processing.
Explain the “Bag of Words” (BOW) representation and identify its two primary limitations for many NLP tasks.
What is the “one-hot vector” method for representing words, and what are its main drawbacks?
Describe the core characteristic of Recurrent Neural Networks (RNNs) and explain how this feature makes them suitable for processing sequence data.
Explain the purpose of the Encoder and Decoder components within a Seq2Seq architecture for a task like machine translation.
What is the core idea behind the “Attention Mechanism” in sequence-to-sequence models?
How does the Transformer model’s architecture fundamentally differ from that of a Recurrent Neural Network?
Describe the Masked Language Model (MLM) pre-training objective used by BERT.
Differentiate between the CBOW and SkipGram models within the Word2Vec framework.

Answer Key

Natural Language Processing (NLP) is a field of computer science, AI, and computational linguistics focused on the interaction between computers and human languages. Its goal is to go beyond simple keyword matching to identify the structure and meaning of words and sentences, enabling a deep understanding of broad language.
Three major challenges in NLP are ambiguity, such as pronoun references; the fact that language is not static and constantly changes with new slang or “cyber lingo”; and the immense scale of language data, with sources like Wikipedia containing billions of words.
The Bag of Words (BOW) is a method that represents text by counting the occurrences of each word, such as f()=c great 2 love 2. Its primary limitations are that it removes all word position information and cannot effectively represent word compositions.
The “one-hot vector” is a binary vector with a length equal to the vocabulary size, where a ‘1’ is placed in the position corresponding to a word’s ID and the rest are ‘0’s. Its main drawbacks are its extremely high dimensionality, its sparsity, and its inability to represent a word’s meaning.
Recurrent Neural Networks (RNNs) are networks containing loops, which allow information to persist. This structure enables them to operate over sequences of vectors with variable lengths, using recent history and current input to model dynamic temporal dependencies.
In a Seq2Seq architecture, the Encoder is an RNN that encodes an input sentence (e.g., in a source language) into a hidden state or feature vector. The Decoder is another RNN that takes this hidden state as input and generates the output sequence (e.g., the translated sentence).
The Attention Mechanism provides a weight for each input word for every single output timestep. This allows the model to create a context vector (C1) that is a weighted sum of the hidden encodings from the input, effectively letting the model focus on the most relevant parts of the input sequence when generating an output.
The Transformer model’s architecture is fundamentally different because it contains no recurrence. Instead of processing sequences step-by-step like an RNN, it relies entirely on self-attention mechanisms to map a sequence to itself.
The Masked Language Model (MLM) is a pre-training objective where some input tokens are masked with a unique [MASK] token. The model is then trained as a denoising autoencoder to predict these original masked tokens based on their surrounding context.
In Word2Vec, the Continuous Bag-of-Words (CBOW) model predicts the current input token based on its surrounding context tokens. Conversely, the SkipGram model does the opposite, predicting the surrounding context tokens based on the current input token.

Essay Questions

Instructions: Prepare a detailed, essay-format response for each of the following prompts. (Answers not provided).

Trace the evolution of natural language representation in machine learning as outlined in the course material, beginning with pre-2012 methods like Bag of Words and culminating in modern approaches like Transformer-based contextual embeddings. Discuss the key innovations and limitations at each major stage (BOW, Word2Vec, RNN/LSTM, Transformers).
Compare and contrast the architectural philosophies of Recurrent Neural Networks (like LSTM) and Transformer models for processing sequential data. What are the fundamental differences in how they handle sequences, and what specific advantages did the introduction of self-attention in Transformers provide over recurrence?
Discuss the primary challenges inherent in Natural Language Processing, specifically ambiguity, scale, and the dynamic nature of language. Using examples from the source, explain how modern deep learning approaches attempt to address these challenges more effectively than classic NLP pipeline components.
Explain the concept of the Seq2Seq (Encoder-Decoder) architecture and its wide range of applications in generative NLP tasks. How does the integration of an attention mechanism enhance the performance and interpretability of these models, particularly in a complex task like machine translation?
The source mentions several landmark pre-trained models, including BERT, ALBERT, and XLNet. Describe the concept of pre-training and fine-tuning. Explain the specific training innovations introduced by ALBERT (Sentence Order Prediction) and XLNet (Permutation Language Model) as attempts to improve upon the original BERT framework.

Glossary of Key Terms

Term	Definition
ALBERT	A “lite” version of BERT that proposes the Sentence Order Prediction (SOP) task to replace Next Sentence Prediction (NSP), making the model focus more on the semantic coherence between sentences.
Attention Mechanism	A technique used in sequence models where, for each output timestep, a weighted sum of the hidden encodings of the input sequence is calculated. This allows the model to focus on the most relevant parts of the input.
Bag of Words (BOW)	A text representation method that removes word position information and represents a document as a collection of its word counts. It is not applicable to many NLP tasks because it cannot represent word compositions.
BERT	Bidirectional Encoder Representations from Transformers. A pre-trained model for sentence embedding whose architecture is a Transformer’s encoder stack. It is trained using a Masked Language Model (MLM) objective.
CBOW (Continuous Bag-of-Words)	A Word2Vec model that predicts an input token based on its surrounding context tokens.
Co-reference Resolution	An NLP task that involves determining if different expressions in a text refer to the same entity (e.g., determining if “Chris” and “Mr. Robin” are the same person).
Decoder	In a Seq2Seq model, the component (typically an RNN) that takes the hidden state from the encoder as input and generates the output sequence.
Encoder	In a Seq2Seq model, the component (typically an RNN) that processes the input sentence and encodes it into a single hidden state or feature vector.
GPT (Generative Pre-Training)	A pre-trained model from OpenAI. The GPT-2 version has 1.5 billion parameters and was trained on millions of web pages.
Long Short-Term Memory (LSTM)	A type of Recurrent Neural Network (RNN) invented by Schmidhuber in 1997. It is highly successful in language modeling and sequence learning problems.
Masked Language Model (MLM)	A pre-training objective used by BERT, where some input tokens are masked and the model is trained to predict the original tokens based on their context. It functions as a Denoising Auto Encoder.
Natural Language Processing (NLP)	A field of computer science, AI, and linguistics concerned with the interactions between computers and human languages, aiming for a deep understanding of language structure and meaning.
One-hot vector	A basic method for representing a word as a binary vector whose length is the size of the vocabulary. It has a ‘1’ in the position of the word’s ID and ‘0’s elsewhere, but it is extremely high-dimensional and sparse.
Permutation Language Model (PLM)	The pre-training objective for XLNet, which learns bidirectional contexts by permuting the factorization order of the sequence.
Recurrent Neural Network (RNN)	A type of neural network with loops, allowing information to persist. This architecture allows RNNs to operate over sequences of vectors with variable length.
Self-Attention	An attention mechanism that creates attention layers mapping from a sequence to itself, forming the core component of the Transformer model.
Sentence Order Prediction (SOP)	A pre-training task used by ALBERT where the model must determine if two sentences are in their correct original order, which forces it to learn about semantic coherence.
Seq2Seq	An Encoder-Decoder architecture used for sequence-to-sequence generation tasks like machine translation, dialogue generation, and question answering.
SkipGram	A Word2Vec model that predicts context tokens based on a given input token.
Transformer	A sequence model from Google Brain that contains no recurrence and relies entirely on self-attention mechanisms. It is a Seq2Seq model that uses encoder self-attention, decoder self-attention, and encoder-decoder attention.
Word2Vec	A technique to learn distributed representations of words (word embeddings). It includes the CBOW and SkipGram models.
XLNet	A pre-trained language model that builds on Transformer-XL (adding recurrence between segments) and uses a Permutation Language Model to learn bidirectional contexts.

Generative Classification

Lecture: S3-GenerativeBayesClassify
Version: current
Notes to Read: text NBC notebook
Video: M1 + M2 + (Extra M3 )

3Classification Generative

Notebook to run:

L16 text NBC notebook

Study Guide for Generative Bayes Classifiers

This study guide provides a review of the core concepts related to Generative Bayes Classifiers, as detailed in the source material. It includes a short-answer quiz to test comprehension, a set of essay questions for deeper analysis, and a comprehensive glossary of key terms.

Quiz: Short-Answer Questions

Instructions: Answer the following ten questions based on the provided source material. Each answer should be approximately two to three sentences.

What are the three major types of classification approaches described in the material?
What is the fundamental goal of a Bayes Classifier when presented with a new data sample for testing?
Explain the primary difference between a generative and a discriminative probabilistic model for classification.
What is the core “naïve” assumption of the Naïve Bayes Classifier, and why is it crucial for the model’s functionality?
How does the Naïve Bayes conditional independence assumption affect the number of parameters the model needs to learn?
Describe the learning or training phase for a Naïve Bayes Classifier that handles discrete input attributes.
What is the Maximum A Posteriori (MAP) rule, and what role does it play in Generative Bayes Classifiers?
When using maximum likelihood estimates for training, what significant problem can arise if the training data is not perfectly representative?
What is the purpose of “smoothing” in the context of training a Naïve Bayes model?
According to the “Play Tennis” example, how many parameters were required for a general Generative Bayes Classifier compared to a Naïve Bayes Classifier?

Answer Key

The three major types of classification approaches are Discriminative classifiers, which directly estimate a decision boundary (e.g., SVM); Generative classifiers, which build a statistical model for the data (e.g., Naïve Bayes); and Instance-based classifiers, which use observations directly without a model (e.g., K-nearest neighbors).

The goal of a Bayes Classifier during testing is to predict the class for a given sample x. Specifically, it seeks to find the class c that maximizes the posterior probability p(c

x1, x2, …, xp).

A generative model builds a statistical model for each class, modeling the class-conditional probability P(X

C) and the prior probability P(C). A discriminative model, in contrast, directly estimates the posterior probability P(C

X) or learns a direct mapping from inputs to class labels without modeling the underlying data distribution.

The core assumption is that all input attributes (features) are conditionally independent of one another, given the class. This assumption is crucial because it simplifies the problem of estimating the joint probability P(X1, …, Xp

C) by allowing it to be calculated as the product of individual conditional probabilities P(Xi

C).

The assumption dramatically reduces the number of parameters. Instead of estimating a parameter for every possible combination of attribute values (multiplicative complexity), the model only needs to estimate parameters for each attribute conditioned on the class (additive complexity), making it feasible to train with much less data.

The learning phase for a discrete Naïve Bayes Classifier involves estimating probabilities from the training data using frequencies (maximum likelihood estimates). For each class ci, the model estimates the prior probability P(C=ci), and for every attribute value xjk, it estimates the conditional probability P(Xj=xjk

C=ci).

The MAP rule is a decision principle used to select the most probable class for a new instance. In Generative Bayes Classifiers, it means choosing the class cj that maximizes the product of the class-conditional probability P(x1, …, xp

cj) and the prior probability P(cj).

If a specific combination of an attribute value and a class never appears in the training data, its maximum likelihood probability estimate will be zero. This “zero probability” can then nullify the entire posterior probability calculation for that class, regardless of the evidence from other attributes.
Smoothing is a technique used to avoid the problem of zero probabilities. It adjusts the counts used for probability estimation, for example, by adding a small value (like 1) to the numerator, which ensures that no estimated conditional probability is ever exactly zero.
For the “Play Tennis” example, the general Generative Bayes Classifier required estimating 72 parameters. In contrast, the Naïve Bayes Classifier, with its conditional independence assumption, only required the estimation of 20 parameters.

Essay Questions

Instructions: The following questions are designed to encourage a deeper synthesis of the source material. Answers are not provided.

Compare and contrast the standard Generative Bayes Classifier and the Naïve Bayes Classifier. Discuss the trade-offs between model complexity, data requirements, and the validity of underlying assumptions, using the “Play Tennis” dataset as a specific case study.
The source material frames classification as a process involving a Task, Representation, Score Function, and Search/Optimization. Elaborate on how each of these components is realized within the Naïve Bayes Classifier framework for discrete attributes.
Explain the mathematical justification for the Naïve Bayes classification rule. Begin with the primary goal of any Bayes Classifier, apply Bayes’ Rule, and then explicitly incorporate the conditional independence assumption. Conclude by explaining why the P(X) term from Bayes’ Rule can be disregarded during the final MAP classification step.
Discuss the “zero probability” problem in detail. Explain why this is a critical issue for models trained with maximum likelihood estimation and how the smoothing techniques presented in the source material provide a robust solution to avoid overfitting and improve model generalization.
While the document primarily focuses on discrete attributes, it briefly mentions Gaussian Naïve Bayes. Based on the fundamental structure of the Naïve Bayes model, hypothesize how a Gaussian Naïve Bayes classifier would differ from the discrete model in its learning (parameter estimation) and testing phases. What kind of feature data would this variant be suited for?

Glossary of Key Terms

Term	Definition
Bayes’ Rule	A fundamental theorem of probability used to find a conditional probability. It states: `P(C \\| X) = P(X \\| C) * P(C) / P(X)`.
Bayes Classifier (BC)	A probabilistic classification approach that treats feature attributes and class labels as random variables, aiming to predict the class c that maximizes the posterior probability `p(c \\| x1, x2, ..., xp)`.
Conditional Independence Assumption	The core simplifying assumption of the Naïve Bayes Classifier, which posits that all feature attributes are independent of each other given the class label.
Discriminative Classifier	A type of classifier that directly estimates a decision boundary or models the posterior probability `P(C \\| X)` without explicitly modeling the data distribution.
Generative Bayes Classifier (GBC)	A classifier that works by building a generative statistical model. It models the prior probability of each class P(C) and the class-conditional probability of the data `P(X \\| C)`.
Instance-based Classifier	A classification approach that uses training observations directly to make predictions without building an explicit model. An example is the K-nearest neighbors (KNN) algorithm.
Learning Phase (Training)	The process of estimating the parameters of a model from a training dataset. For Naïve Bayes, this involves calculating the prior and conditional probabilities from data frequencies.
Maximum A Posteriori (MAP) Rule	A decision rule for classification that selects the class with the highest posterior probability given the observed data. It is equivalent to finding the class c that maximizes the product `P(X \\| c) * P(c)`.
Maximum Likelihood Estimates (MLE)	A method for estimating model parameters by finding the parameter values that maximize the likelihood of making the observations given the parameters. In this context, it involves using the frequencies in the data to estimate probabilities.
Naïve Bayes Classifier (NBC)	A type of Generative Bayes Classifier that simplifies learning by making the strong (naïve) assumption that all features are conditionally independent given the class.
Posterior Probability	The probability of a class C after observing the data X, denoted `P(C \\| X)`.
Prior Probability	The initial probability of a class C before any data has been observed, denoted P(C).
Smoothing	A technique used to prevent zero probabilities during model training, typically when a specific feature-value and class combination is absent from the training data. It adjusts probability estimates to ensure no probability is exactly zero.
Testing Phase	The process of using a trained model to assign a class label to new, previously unseen data instances.

NaiveBC on Text

Lecture: S3-NBCtext
Version: current
Notes to Read: text NBC notebook
Extra Notes to Read: Multinomial MLE
Video: M1 + M2 + (Extra M3 + M4)

3Classification Generative

Notebook to run:

L16 text NBC notebook

Study Guide: Naïve Bayes Classifier for Text Classification

Quiz: Short-Answer Questions

Instructions: Answer the following questions in 2-3 complete sentences, drawing only upon the provided course materials.

What is the core “naïve” assumption made by the Naïve Bayes classifier, particularly in the context of text classification?
Describe the “bag of words” representation for text documents and identify its primary simplifying assumption.
What are the two main probabilistic models discussed for applying the Naïve Bayes classifier to text, and how do their feature representations differ?
According to the materials, which of the two primary Naïve Bayes models is generally more effective for text classification tasks?
Explain the purpose of using log probabilities during the testing stage of a Naïve Bayes classifier.
How does Maximum Likelihood Estimation (MLE) determine the parameters for the Multivariate Bernoulli model?
In the Multinomial Naïve Bayes model, what is the concept of a “mega-document” and how is it used during training?
What is a primary strategy for handling out of vocabulary (OOV) or unknown words that appear in a test document but not in the training corpus?
Despite its simplifying assumption, why is the Naïve Bayes classifier considered robust and a good baseline model?
How can the Multinomial Naïve Bayes classifier be understood as a class-conditional unigram language model?

Answer Key

The core “naïve” assumption is that all input attributes are conditionally independent given the class. In text classification, this means that the appearance of one word in a document is assumed to be independent of the appearance of any other word, given the document’s topic or class.
The “bag of words” representation models a text document as a vector, where each dimension corresponds to a word in a dictionary. This vector can either contain the frequency of each word or a boolean value indicating its presence or absence. The model’s primary simplifying assumption is that word order is not important.
The two models are the Multivariate Bernoulli Naïve Bayes and the Multinomial Naïve Bayes. The Multivariate Bernoulli model uses binary features, representing each word as a boolean (true/false) value indicating whether it appears in the document. The Multinomial model uses integer features representing the frequency (count) of each word in the document.
The course materials state that the Multinomial model is “almost always more effective in text applications.” An experiment by M&N (1998) on classifying university web pages is cited as evidence supporting this conclusion.
Using log probabilities is a technique for underflow prevention. Multiplying many small probability values (which are between 0 and 1) can result in a number too small for standard floating-point representation. By summing the logs of the probabilities instead—since log(xy) = log(x) + log(y)—this numerical instability is avoided while preserving the ability to find the most probable class.

For the Multivariate Bernoulli model, Maximum Likelihood Estimation (MLE) is used to estimate the parameter P(word_i = true

class_j). This is calculated as the fraction of documents belonging to class j in which word i appears. It is the relative frequency of a binary event (the word’s presence).

In the Multinomial model, a “mega-document” for a specific class is created by conceptually concatenating all training documents belonging to that class. The frequency of a word w in this mega-document is then used to calculate the probability P(w

class) via MLE, which simplifies to the relative frequency of w across all words in all documents of that class.

A primary strategy is to train the model with an explicit symbol for an unknown word, such as . During preprocessing, words not in a pre-chosen vocabulary (or rare words) in the training corpus can be replaced with , allowing the model to learn a probability for it.
Naïve Bayes is considered a good baseline because it is very fast to train (one pass over the data) and test, has low storage requirements, and is robust to irrelevant features. For many text categorization tasks with numerous features, it performs well and was even a top performer in the KDD-CUP 97 competition.
The Multinomial model can be seen as a class-conditional unigram language model because it calculates the probability of a document by multiplying the probabilities of its individual words, given a class. This is equivalent to a unigram model (where each word’s probability is independent of others) where a separate set of word probabilities (a separate language model) is learned for each class.

Essay Questions

Instructions: Prepare a detailed, essay-format response for each of the following prompts.

Provide a comprehensive comparison of the Multivariate Bernoulli and Multinomial Naïve Bayes models for text classification. Discuss their underlying assumptions, how text documents are represented as features, the process for parameter estimation (MLE) in each, and the practical reasons one is often preferred over the other.
Explain the complete workflow for training and testing a Multinomial Naïve Bayes text classifier. Begin with raw text documents and a fixed set of classes, and detail the steps of text representation (including dictionary creation and the bag-of-words model), parameter estimation, and the final classification decision process for a new document.
Discuss the statement: “Naive Bayes is Not So Naive.” Elaborate on the strengths of the Naïve Bayes classifier that make it a powerful and dependable baseline for text classification, referencing its performance in competitions like KDD-CUP 97, its robustness to certain types of features, and its computational efficiency.

Describe the mathematical foundation of the Naïve Bayes classifier, starting from Bayes’ rule (argmax P(C

X)). Explain how the “naïve” assumption of conditional independence simplifies the P(X

C) term and makes computation tractable, especially for high-dimensional data like text.

What is a generative model in the context of classification? Explain how both the Multinomial and Multivariate Bernoulli Naïve Bayes classifiers can be viewed as generative models that approximate how a text document is produced, given a class label.

Glossary of Key Terms

Term	Definition
Bag of Words	A representation of a text document that models it as a high-dimensional vector. This vector can either represent the frequency of each word from a dictionary appearing in the document or a boolean value indicating the presence or absence of each word. This model simplifies text by assuming word order is not important.
Conditional Independence Assumption	The core “naïve” assumption of the Naïve Bayes classifier. It posits that, given the class variable, all features (e.g., words in a document) are independent of one another.
Generative Probabilistic Model	A type of statistical model that describes how a dataset is generated. In classification, a generative model learns the joint probability distribution P(X, C) or the class-conditional probability P(X\|C) and the class prior P(C), effectively modeling how to generate data X for each class C.
Maximum Likelihood Estimation (MLE)	A method for estimating the parameters of a probability distribution by maximizing a likelihood function. For Bernoulli distributions, this results in the relative frequency of an event. For Multinomial distributions, it corresponds to the relative frequency of each category’s occurrence.
Multinomial Naïve Bayes	A Naïve Bayes classification model that uses a multinomial distribution for its features. It is well-suited for text classification where features are word counts or frequencies. It is generally considered more effective for text tasks than the Bernoulli model.
Multivariate Bernoulli Naïve Bayes	A Naïve Bayes classification model appropriate for binary feature variables. In text classification, each feature represents a word from the dictionary, and its value is true if the word appears in the document and false otherwise, ignoring frequency.
Out of Vocabulary (OOV)	Words that appear in test data but were not present in the training data used to build the model’s dictionary. These are often handled by mapping them to a special (unknown) token.
Parameter Estimation	The process of using training data to calculate the probability values (parameters) needed by the model. For Naïve Bayes, this involves calculating class priors P(C) and conditional probabilities P(word \| C).
Stochastic Language Model	A probabilistic model of a sequence of words. A unigram language model, for instance, models the probability of a string by multiplying the independent probabilities of each word in it. Multinomial Naïve Bayes acts as a class-conditional unigram language model.
Underflow Prevention	A computational technique used to avoid numerical errors when multiplying many small probabilities. This is typically achieved by converting probabilities to their logarithms and summing them instead of multiplying the original values.

Gaussian GBC

Lecture: S3-GaussianGBC
Version: current
Notes to Read: Paper Discr vs. Genera
Video: Extra M4

3Classification Gaussian Generative

Quick survey of recent deep learning

Lecture: S3-deepNNSurvey
Version: current
More advanced to Read: <a href="/2025Fall-UVA-CS-MachineLearningDeep//Lectures/ [Recent LLM-survey] .pdf" target="_blank"> [Recent LLM-survey] </a>
Notes to Read: DNN Cheatsheets + [Recent LLM-survey]
Video: M1 + M2 + M3 + M4

3Classification Nonlinear Deep Discriminative 4Unsupervised Generative

Extra reaing:

[Recent LLM-survey]

Summary:

This lecture covers 10 deep learning trends that go beyond classic machine learning:

1. Popular CNN, RNN, Transformer models are not covered much here
1. DNN on graphs / trees / sets
1. NTM 4program induction
1. Deep Generative models/ DeepFake
1. Deep reinforcement learning
5 . Few-shots / Meta learning / AGI?
1. pretraining workflow / Autoencoder / self-supervised training
1. Generative Adversarial Networks (GAN) workflow
1. AutoML workflow / Learning to optimize /to search architecture
1. Validate / Evade / Test / Verify / Understand DNNs
1. Model Compression / Efficient Net

Disclaimer: it is quite hard to make important topics of deep learning fit on a one-session schedule. We aim to make the content reasonably digestible in an introductory manner. We try to focus on a modularity view by introducing important variables to digest deep learning into chunks regarding data/ model architecture /tasks / training workflows and model characteristics. We think this teaching style provides students with context concerning those choices and helps them build a much deeper understanding.

Briefing on Recent Trends in Deep Neural Networks

Executive Summary

This document synthesizes an overview of ten significant trends in Deep Neural Networks (DNNs), based on a lecture from the University of Virginia’s CS 4774 Machine Learning course. The analysis is structured around the “Deep Learning in a Nutshell” framework, which deconstructs the machine learning process into core components: Task, Representation (Topology), Score Function, Search/Optimization, Data, Models/Parameters, Hyperparameters/Metrics, and Hardware.

The key trends indicate that innovation in deep learning is occurring across this entire spectrum. Advancements are not limited to model architecture but also encompass the types of problems being solved, the methods used for training and optimization, and the practical challenges of deployment. The ten identified trends are:

Expanding Input Types: Moving beyond traditional grid-like data (images, sequences) to handle complex structures like graphs, trees, and sets.
Symbolic Reasoning: Employing architectures like Neural Turing Machines (NTMs) for tasks requiring program induction and manipulation of symbolic data.
Deep Generative Models: Creating realistic data for applications like image super-resolution, text-to-image synthesis, and DeepFakes, while also aiding in semi-supervised learning and handling missing data.
Deep Reinforcement Learning (RL): Enabling agents to learn complex behaviors from raw observations by maximizing reward signals, exemplified by AlphaGo’s success.
Meta-Learning and AGI: Shifting focus from narrow, task-specific AI towards “learning to learn” and the pursuit of Artificial General Intelligence (AGI), defined by autonomy and the ability to solve novel problems.
Self-Supervised Pre-training: Using autoencoders and reconstruction loss to pre-train network layers on unlabeled data, improving feature detection and mitigating training issues like vanishing gradients.
Generative Adversarial Networks (GANs): A powerful training workflow for generative tasks, with notable advancements seen in models like CycleGAN, Progressive GAN, and StyleGAN.
Automated Machine Learning (AutoML): Developing methods to automate the optimization process and the search for effective DNN architectures, reducing manual effort.
Robustness and Trustworthiness: A growing focus on understanding, verifying, and securing DNNs against adversarial attacks, as well as developing methods for model interpretation and bias detection.
Hardware Adaptation and Efficiency: Compressing large models for deployment on resource-constrained hardware like IoT and mobile devices, using techniques such as pruning, quantization, and filter decomposition.

The “Deep Learning in a Nutshell” Framework

The lecture organizes deep learning concepts and trends within a modular framework that breaks the field into its constituent parts. This structure provides a holistic view of where innovation is occurring.

Component	Description	Examples from Source
Task	The ultimate goal or problem to be solved.	Prediction, Generation, Reinforcement, Reasoning, Classification
Data (X)	The input fed into the system.	2D/3D Grids, 1D Grids (sequences), Graphs, Sets, Tabular Data
Representation (f())	The model’s topology or architecture that transforms the data.	CNN, RNN, Transformer, GNN, NTM, Multilayer Network
Score Function (L())	A function that measures the model’s performance or error.	Cross-Entropy, Mean Squared Error (MSE), Reconstruction Error
Search/Optimization	The algorithm used to find the best model parameters by minimizing the score.	Stochastic Gradient Descent (SGD), Backpropagation, Asynchronous SGD
Models, Parameters	The learnable components of the model.	Weights, Biases
Hyperparameter, Metrics	Settings that configure the learning process and metrics to evaluate success.	Network architecture choices, learning rate / Accuracy, F1 Score
Hardware	The physical infrastructure used for training and deployment.	GPU, TPU, Edge devices

Module I: Innovations in Representation and Architecture

This module focuses on trends related to the Representation component of the framework, specifically the types of data a DNN can process and the architectures designed for them.

Trend 1: DNNs for Graphs, Trees, and Sets

Deep learning is expanding beyond grid-structured data. New architectures are being developed to handle more complex and irregular data formats:

Graph Neural Networks (GNNs) are used for graph data.
Other specialized topologies are designed for tree and set-based inputs.

Trend 2: Symbolic Reasoning and Program Induction

This trend addresses tasks that require reasoning over symbolic inputs and outputs, such as computer programs.

Neural Turing Machines (NTMs) are a key architecture in this area. An NTM combines a neural network controller with external memory, which it can read from and write to using “blurry” read/write heads. This enables it to handle tasks involving sequential symbolic forms and decision-making.

Foundational Architectures

The lecture also recaps foundational architectures that remain central to the field:

Convolutional Neural Networks (CNNs): Highly effective for grid-like data. Applications mentioned include:
- Image Processing: Standard use case.
- Playing Go: A 19x19 game board is treated as an image to predict the next move.
- Speech Processing: A spectrogram (representing frequency over time) is treated as an image, with filters moving in the frequency direction.
Recurrent Neural Networks (RNNs), Attention, Seq2Seq, and Transformers: These architectures are dominant in natural language processing (NLP).
- Notable Pre-trained Models: The source highlights ELMo (pre-trained biLSTM for contextual embeddings) and BERT (pre-trained transformer encoder for sentence embeddings) as significant examples.

Module II: Expanding the Scope of Solvable Tasks

This module explores trends related to the Task component, highlighting new types of problems that deep learning is being applied to.

Trend 3: Deep Generative Models

These models are designed to generate new, realistic data. Their applications and benefits include:

Applications: Image Super-Resolution, Label-to-Image, Edges-to-Image, Text-to-Image, and DeepFakes (swapping faces in videos).
Utility:
- Testing the ability to handle high-dimensional probability distributions.
- Simulating possible futures for planning or reinforcement learning.
- Handling missing data and enabling semi-supervised learning.
- Producing multi-modal outputs.

Trend 4: Deep Reinforcement Learning (RL)

Deep RL combines deep neural networks with reinforcement learning principles.

Core Concept: An agent interacts with an environment, receiving raw observations (not hand-crafted states) and a scalar reward. It learns by taking actions to maximize this cumulative reward.
AlphaGo Case Study: The success of AlphaGo is attributed to a learning pipeline that combines Supervised Learning (SL) and Reinforcement Learning (RL) to guide its Monte Carlo Tree Search:
- SL Policy Network: Provides prior search probability.
- Rollout: Conducts quick simulations on leaf nodes of the search tree.
- Value Network: Evaluates the “global feeling” or state of a leaf node.
- Subsequent work, referenced as “Mastering the Game of Go without Human Knowledge,” further advanced this approach.

Trend 5: Meta-Learning and the Pursuit of Artificial General Intelligence (AGI)

This trend focuses on “learning to learn” and moving beyond task-specific “Narrow AI.”

Narrow AI: Models designed for specific tasks like game-playing, medical diagnosis, or car-driving.
Artificial General Intelligence (AGI): Defined as “the ability to achieve complex goals in complex environments using limited computational resources.” Key characteristics of AGI include:
- Autonomy.
- Practical understanding of self and others.
- Ability to understand “what the problem is,” not just solve problems posed by programmers.
- Capability to solve problems unknown to its original programmers.

Module III: Evolving Training and Optimization Workflows

This module covers trends related to the Search/Optimization and Score Function components, focusing on new ways to train models effectively.

Trend 6: Self-Supervised Learning and Autoencoders

This approach involves training models on unlabeled data by creating a supervisory signal from the data itself.

Autoencoders: An auto-encoder is trained to reproduce its input. The Reconstruction Loss (difference between input and output) forces the hidden layers to learn reliable and descriptive features.
Unsupervised Pre-training: A layer-wise training workflow:
1. Train the first hidden layer using a self-supervised loss, then fix its parameters.
2. Repeat this process for subsequent layers, working bottom-up.
3. After pre-training, perform Supervised Fine-tuning on labeled data to refine the learned features for the specific end-task.
Benefits: This method helps overcome common DNN training challenges like non-convexity and vanishing gradients.

Trend 7: Generative Adversarial Networks (GANs)

GANs represent a specific training framework for generative models, involving a competition between a generator and a discriminator. Notable advanced GAN architectures cited include:

CycleGAN: Translates characteristics between two image collections without paired examples (e.g., style transfer, object transfiguration).
Progressive GAN: Improves quality, stability, and variation in generated images.
StyleGAN: A generator architecture that produces high-quality images and achieves an unsupervised separation of high-level attributes (disentanglement).

Trend 8: Automated Machine Learning (AutoML)

AutoML aims to automate aspects of the machine learning pipeline, particularly optimization and architecture design.

Key Focus: “Learning to Optimize” and “Learning to Search” for optimal DNN architectures.
Example: “Neural Optimizer Search with Reinforcement Learning” is cited as a method for hyperparameter search.

Module IV: Addressing Deployment and Real-World Challenges

This final module addresses trends related to the practical deployment of DNNs, including robustness and hardware efficiency.

Trend 9: Trust, Robustness, and Interpretability

As DNNs are deployed in critical systems, ensuring their reliability and understanding their behavior has become paramount.

Understanding DNNs:
- Post-hoc Explanations: Methods like feature visualization and attribution to explain model decisions after training.
- Inherently Interpretable Models: Designing models that are transparent by construction.
Adversarial Examples (AE): DNNs can be fooled by adding imperceptible noise to inputs. The source provides an example where adding 0.007 * noise to an image of a “panda” causes a CNN to misclassify it as a “gibbon.”
Verification: Formal methods are being developed to prove properties about DNN behavior, such as the “Reluplex” SMT solver for verifying networks.

Trend 10: Hardware-Aware DNNs and Model Compression

This trend is driven by the need to deploy powerful DNNs on resource-constrained hardware, such as the projected 20 billion connected IoT devices by 2020.

Goal: Enable efficient edge inference for applications like language translation, speech recognition, and object detection on mobile and IoT devices.
Model Compression Methods:
- (a) Model Pruning: Removing redundant neurons, filters, or layers. Methods include magnitude-based pruning and Hessian-based pruning, often involving iterative pruning and retraining.
- (b) Simpler Filter Construction: Using techniques like matrix factorization, singular value decomposition (SVD), and flattened convolutions.
- (c) Quantization: Reducing the precision of weights, activations, and gradients using fixed-point formats or codebooks.

Section 4 - More Advanced Supervsied on Tabular Type

SVM

Lecture: S4-SVM-basic
Version: current
Notes to Read: PCA+SVM Notebook
Video: M1 + M2

3Classification Linear Discriminative Regularization Optimization 5Theory

Notebook to Run: L14 PCA+SVM Notebook

Support Vector Machines (Basics) Study Guide

This guide is designed to review the fundamental concepts of Support Vector Machines (SVMs) as presented in the source material. It includes a short-answer quiz, an answer key, suggested essay questions for deeper exploration, and a glossary of key terms.

Quiz: Short-Answer Questions

Instructions: Answer the following ten questions in 2-3 sentences each, based on the provided source context.

What are the three major types of classification approaches discussed, and what is the core principle of each?
When was the Support Vector Machine (SVM) first introduced, and what theoretical field inspired its development?
What specific application was instrumental in popularizing SVMs, and how did its performance compare to a contemporary neural network?
Describe the MNIST dataset, including its contents and original purpose.
What is the “margin” of a linear classifier?
What are “support vectors” and what role do they play in a maximum margin classifier?
What is the primary goal of the simplest type of SVM, known as a Large Margin Linear Classifier (LSVM)?
According to the source, why is maximizing the margin a desirable approach for a classifier?
How is the optimization problem for a linear SVM formally stated? What is being minimized, and what are the constraints?
What mathematical object represents the decision boundary in a linear classifier, and how is the classification function expressed in terms of parameters w and b?

Answer Key

The three major types are Discriminative, Generative, and Instance-based classifiers. Discriminative approaches directly estimate a decision boundary (e.g., SVM). Generative approaches build a generative statistical model (e.g., Naïve Bayes). Instance-based classifiers use observations directly without building models (e.g., K-nearest neighbors).
The SVM was first introduced in 1992. Its development was inspired by statistical learning theory, as detailed in V. Vapnik’s work, “The Nature of Statistical Learning Theory.”
SVMs became popular due to their success in handwritten digit recognition. An SVM achieved a 1.1% test error rate, which was the same as the error rate of a carefully constructed neural network known as LeNet 4.
The Mixed National Institute of Standards and Technology (MNIST) dataset is a database for evaluating handwritten digit recognition. It contains 60,000 training instances and 10,000 test instances of hand-written digits, each encoded as a 28x28 pixel grayscale image.
The margin of a linear classifier is defined as the width that the decision boundary could be increased by before it hits a datapoint from either class. A maximum margin classifier seeks to make this width as large as possible.
Support vectors are the specific datapoints that the margin pushes up against. These data points are critical because they alone define, or “support,” the position and orientation of the optimal decision boundary.
The goal of a maximum margin linear classifier (or LSVM) is to identify the single linear classifier (decision boundary) that has the maximum possible margin. Instead of fitting all data points, it focuses on the points at the boundary between classes.
Maximizing the margin is considered a good strategy because it is intuitive, has theoretical support from concepts like VC dimension, and has been shown to work well in practice.
The optimization problem is to minimize (w^T * w) / 2. This is subject to the constraints that for all training samples xi, the expression yi * (w^T * xi + b) must be greater than or equal to 1, ensuring all points are correctly classified outside the margin.
A hyperplane represents the decision boundary. The classification function is expressed as f(x,w,b) = sign(w^T * x + b), where w is a vector of weights and b is a scalar bias term.

Glossary of Key Terms

Term	Definition
Discriminative Classifier	A type of classification approach that directly estimates a decision rule or boundary. Examples include SVM, decision trees, and neural networks.
Generative Classifier	A type of classification approach that builds a generative statistical model. Examples include Bayesian networks and Naïve Bayes classifiers.
Instance-based Classifier	A classification approach that uses observations directly without building an explicit model. An example is K-nearest neighbors.
Hyperplane	A geometric entity that can be represented as the solution to a single linear equation of degree 1. In SVMs, it serves as the decision boundary.
Kernel Methods	A class of algorithms for pattern analysis, of which SVM is considered an important example. The “Kernel Trick” allows linear classifiers to be applied to non-linear problems.
Margin	The width by which the boundary of a linear classifier could be increased before hitting a datapoint. The goal of an SVM is to maximize this margin.
Maximum Margin Linear Classifier	The linear classifier with the maximum possible margin. This is the simplest form of a Support Vector Machine (LSVM).
MNIST	(Mixed National Institute of Standards and Technology) A large database of handwritten digits commonly used for training and testing image processing systems. It contains 60,000 training images and 10,000 testing images.
Quadratic Programming (QP)	A type of optimization problem that involves minimizing a quadratic objective function subject to linear constraints. This method is used to find the optimal parameters for an SVM.
Support Vector Machine (SVM)	A discriminative classifier inspired by statistical learning theory, first introduced in 1992. It operates by finding the hyperplane that best divides a dataset into classes, aiming for the largest possible margin between the hyperplane and the nearest points of any class.
Support Vectors	The specific datapoints that lie closest to the decision boundary, against which the margin pushes. These vectors are the critical elements that “support” and define the optimal hyperplane.

SVM, Kernel

Lecture: S4-SVM-kerneltrick
Version: current
Notes to Read: Practical Guide
Video: M1 + M2 + M3 + (Extra M4)

3Classification Discriminative Regularization Optimization 5Theory

Study Guide for Non-Linear Support Vector Machines and the Kernel Trick

This guide provides a comprehensive review of the concepts presented in the source material on non-linear Support Vector Machines (SVMs). It includes a quiz to test your understanding, an answer key for verification, suggested essay questions for deeper exploration, and a glossary of key terms.

Quiz: Short-Answer Questions

Instructions: Answer each of the following questions in 2-3 sentences, based on the provided source material.

What is the fundamental problem that non-linear SVMs are designed to solve, and what is the general strategy used to address it?
Explain the concept of Vapnik-Chervonenkis (VC) dimension in the context of making data linearly separable.
What are the two primary computational problems that arise from explicitly mapping data to a high-dimensional feature space?
Define the “kernel trick” and explain how it resolves the computational issues associated with high-dimensional feature spaces.
What is the role of the dual formulation in making the kernel trick possible for SVMs?
Name two types of non-linear kernel functions mentioned in the text and describe their mathematical form.
According to the practical guide for LIBSVM, why is scaling features before applying an SVM considered very important?
Describe the effect of using a large versus a small value for the penalty parameter C in SVM model selection.
Despite using potentially infinite-dimensional feature spaces, why do SVMs with kernels not necessarily overfit the data?
What condition must a similarity measure satisfy to be used as a kernel function, and what does this condition guarantee?

Answer Key

What is the fundamental problem that non-linear SVMs are designed to solve, and what is the general strategy used to address it? Non-linear SVMs are designed to classify data that is not linearly separable in its original input space. The strategy is to map the original input space (x) to a higher-dimensional feature space (φ(x)) where the training set becomes linearly separable by a hyperplane.
Explain the concept of Vapnik-Chervonenkis (VC) dimension in the context of making data linearly separable. The source material states that the VC dimension of separating hyperplanes in an N-dimensional space is at least N+1. This theory supports the idea that if data is mapped into a sufficiently high dimension, the samples will generally become linearly separable. Specifically, N data points are generally separable in a space of N-1 dimensions or more.
What are the two primary computational problems that arise from explicitly mapping data to a high-dimensional feature space? The two main problems are a high computational burden due to the high dimensionality and the significantly increased number of parameters that need to be estimated. These issues make the explicit transformation computationally intensive and complex.
Define the “kernel trick” and explain how it resolves the computational issues associated with high-dimensional feature spaces. The kernel trick is a method for efficient computation that allows SVMs to operate in a high-dimensional feature space without ever explicitly representing the features. It achieves this by defining a kernel function, K(x, z), that computes the dot product Φ(x)TΦ(z) directly in the original input space, avoiding the high computational cost of the explicit transformation.
What is the role of the dual formulation in making the kernel trick possible for SVMs? The dual formulation of the SVM optimization problem is crucial because it expresses the objective function in terms of the dot product of input samples (xi * Txj). This structure allows the dot product to be directly replaced by a kernel function K(xi, xj), enabling the model to learn a non-linear boundary without ever computing in the high-dimensional space.

Name two types of non-linear kernel functions mentioned in the text and describe their mathematical form. The text mentions the Polynomial kernel, with the form K(x, z) = (1 + xTz)^d, and the Radial Basis Function (RBF) kernel, with the form K(x, z) = exp(-r

x-z

^2). The RBF kernel is noted to have a feature space with an infinite number of dimensions.

According to the practical guide for LIBSVM, why is scaling features before applying an SVM considered very important? Feature scaling is very important for two main reasons. First, it prevents attributes in greater numeric ranges from dominating those in smaller numeric ranges. Second, it helps to avoid numerical difficulties that can occur during the calculation process.
Describe the effect of using a large versus a small value for the penalty parameter C in SVM model selection. A large value of C heavily penalizes misclassifications, resulting in a smaller margin and less training error, which can lead to overfitting. Conversely, a small value for C is more tolerant of training errors, resulting in a larger margin and potentially better true error (generalization).
Despite using potentially infinite-dimensional feature spaces, why do SVMs with kernels not necessarily overfit the data? SVMs avoid overfitting for three key reasons. The number of parameters (dual weights αi) remains the same and most are set to zero; the final model only depends on the support vectors, which are usually a small subset of samples; and the objective of maximizing the margin acts as a form of regularization.
What condition must a similarity measure satisfy to be used as a kernel function, and what does this condition guarantee? A similarity measure must satisfy Mercer’s condition, which states that the kernel K(x, y) must be positive semi-definite. This condition guarantees that the function can be expressed as a dot product in a high-dimensional space, making it a valid kernel for an SVM.

Glossary of Key Terms

Term	Definition
Dual Formulation	An alternative representation of the SVM optimization problem that is expressed in terms of Lagrange multipliers (αi) and dot products of the input data points (xi * Txj). This formulation is essential for applying the kernel trick.
Feature Space	A high-dimensional space (φ(x)) to which the original input data is mapped. The goal is for the data to become linearly separable in this new space.
Input Space	The original space where the input data points (x) reside.
Kernel Function	A function K(x, z) that computes the dot product of two vectors, x and z, in a high-dimensional feature space (Φ(x)TΦ(z)) without explicitly calculating the transformation Φ.
Kernel Trick	The technique of using kernel functions to efficiently perform calculations in a high-dimensional feature space, avoiding the computational burden of explicit transformation. It works by replacing all dot products in the dual formulation with the kernel function.
LIBSVM	A software implementation for Support Vector classification that supports multi-class classification and is known for its fast SMO (Sequential Minimal Optimization) implementation.
Linear Kernel	A kernel function defined as K(x, z) = xTz. This is the standard dot product used in a linear SVM.
Mercer’s Condition	A theorem stating that any positive semi-definite kernel function K(x, y) can be expressed as a dot product in a high-dimensional space, thereby ensuring it is a valid kernel for use in an SVM.
Penalty Parameter (C)	A hyperparameter in SVMs that controls the trade-off between maximizing the margin and minimizing the classification error. A large C imposes a high penalty for misclassified training examples, while a small C allows for a wider margin at the cost of more misclassifications.
Polynomial Kernel	A kernel function defined as K(x, z) = (1 + xTz)^d. It maps data into a feature space of d-th order polynomial terms.
Radial Basis Function (RBF) Kernel	A kernel function defined as `K(x, z) = exp(-r \\|\\|x-z\\|\\|^2)`. It maps data into an infinite-dimensional feature space and is particularly effective for non-linear classification problems.
Support Vectors	The training data points that lie closest to the decision boundary (hyperplane). In the dual formulation, these are the points for which the Lagrange multipliers (αi) are non-zero, and they are the only points that determine the final model.
Vapnik-Chervonenkis (VC) Dimension	A measure of the capacity or flexibility of a classifier. The theory suggests that N data points are generally linearly separable in a space of N-1 dimensions or more, providing a theoretical justification for mapping data to higher dimensions.

DecisionTree and Bagging

Lecture: S4-DecisionTree
Version: current
Video: M1 + M2 + M3

3Classification Ensemble 5Theory

Summary:

Module1: Basic Tree4Classifier @ https://youtu.be/JKcTiyvIpp8
Module2: How2Learn Tree https://youtu.be/iKTxnJU0L1E
Module3: Bagged DT https://youtu.be/WaWTw07Luzs

Study Guide for Decision Trees and Ensemble Methods

Quiz

Answer the following questions in 2-3 sentences, based on the provided lecture material.

What are the three major types of classification approaches described in the lecture, and which category do decision trees belong to?
Describe the basic anatomy of a decision tree as explained in the “Play Tennis” example.
What is Entropy in the context of information theory, and what does a lower entropy value signify for a node in a decision tree?
Explain the concept of Information Gain (IG) and its function in the process of building a decision tree.
What is a primary caveat or disadvantage of using Information Gain as a purity measure for selecting splits?
Identify two common problems associated with decision trees and list the two general approaches mentioned for addressing overfitting.
Define Bootstrap Aggregation (Bagging) and explain its primary purpose.
How does the “instability” or “high variance” of a model like a decision tree relate to the effectiveness of Bagging?
What are the main characteristics of the CART (Classification and Regression Trees) algorithm?
What is the purpose of “pruning” in decision tree construction, and what are the two main types mentioned?

Answer Key

The three major types of classification approaches are Discriminative, Generative, and Instance-based. Decision trees are a discriminative approach, meaning they directly estimate a decision rule or boundary to separate classes.
A decision tree consists of nodes, which represent tests on a specific feature or attribute. The branches from a node correspond to the possible values of that attribute, and the leaves of the tree represent the final decisions or classifications. As data is passed down the tree, the sample size at each node gets progressively smaller.
Entropy is defined as the expected amount of information when observing the output of a random variable. In a decision tree, entropy measures the purity of a node; a lower entropy value indicates that the distribution of classes within that node is less uniform and therefore purer.
Information Gain, calculated as entropy(parent) – [average entropy(children)], measures the reduction in uncertainty about the target variable (Y) after knowing a feature variable (X). It is used as a criterion to select which attribute provides the best split at each node, with the attribute offering the highest IG being chosen.
A primary caveat is that the number of possible values for an attribute influences the information gain. An attribute with more possible values is more likely to have a higher gain because it can more easily form small, pure partitions, which can be misleading.
Two common problems are the instability of trees (high variance) and overfitting. The two approaches to combat overfitting are: 1) stop growing the tree when further splitting provides no improvement, or 2) grow a full tree and then prune it by eliminating nodes.
Bootstrap Aggregation, or Bagging, is a technique for reducing the variance of a prediction function. It works by creating multiple bootstrap samples (datasets drawn with replacement from the original training data), fitting a model to each sample, and then having the models vote on the final prediction.
Model instability is considered beneficial for Bagging. High-variability methods like decision trees see more improvement from Bagging because the process of averaging predictions from different models (fit on different data samples) helps to smooth out their instability and reduce overall variance.
The CART algorithm is a popular tree-builder that constructs binary trees. It can work with discrete, continuous, and categorical values, handles missing values using “surrogate splits,” and uses cost-complexity pruning to reduce overfitting. The Scikit-learn library uses CART for its tree implementations.
Pruning is a technique used to reduce the size of decision trees by removing branches with little predictive power, which helps to reduce overfitting. The two main types mentioned are Reduced Error Pruning, which replaces nodes starting from the leaves, and Cost Complexity Pruning, which removes the subtree that minimizes a specific error-to-complexity ratio.

Essay Questions

Formulate comprehensive, essay-style answers to the following prompts. No answer key is provided.

Using the “Play Tennis” example, explain how a decision tree represents a “disjunction of conjunctions.” Detail the logical structure and show how a new test case is classified by being passed down the tree.
Describe the role of information theory in the greedy construction of a decision tree. Define Information, Entropy, and Information Gain, explaining the mathematical relationship between them and how they guide the selection of the optimal attribute at each split.
Discuss the problem of overfitting in decision trees. Elaborate on why decision trees are particularly susceptible to this issue, mentioning concepts like model instability and error propagation.
Explain the Bagging technique in detail. Describe the process of creating bootstrap samples, training a “committee of trees,” and making a final prediction. Why is this method particularly effective for high-variance models?
Compare and contrast the ID3, C4.5, and CART tree-building algorithms. Discuss their historical progression, the purity metrics they commonly use, and their respective capabilities in handling different data types (continuous vs. discrete), missing values, and overfitting.

Glossary of Key Terms

Term	Definition
Bagging (Bootstrap Aggregation)	A technique for reducing the variance of a prediction function by creating a committee of models (e.g., trees) trained on bootstrap samples of the data and having them vote on the predicted class.
Bootstrap	The process of randomly drawing datasets with replacement from the original training data. Each new sample is the same size as the original training set.
C4.5	A tree-building algorithm and successor to ID3. It can handle both continuous and discrete features, manage missing values, and uses pruning to reduce overfitting.
CART (Classification and Regression Trees)	A popular tree-building algorithm that constructs binary trees. It can handle various value types, uses surrogate splits for missing data, and employs cost-complexity pruning. It is used by Scikit-learn.
Conditional Entropy	The entropy of a variable Y given that another variable X is known, expressed as H(Y\|X). It represents the average entropy of the children nodes after a split on variable X.
Discriminative Classifier	A type of classification approach that directly estimates a decision rule or boundary. Examples include decision trees, SVMs, and logistic regression.
Entropy (H(X))	The expected amount of information when observing the output of a random variable. In machine learning, it measures the impurity or non-uniformity of a set of examples; lower entropy signifies higher purity.
Generative Classifier	A type of classification approach that builds a generative statistical model for the data. Examples include Naïve Bayes classifiers and Bayesian networks.
Gini (Population Impurity)	An alternative purity measure to Information Gain used for splitting nodes in a decision tree.
ID3 (Iterative Dichotomiser 3)	An early tree-building algorithm that uses a top-down, greedy approach with the Information Gain metric. It only works with discrete features.
Information	A measure of the reduction in uncertainty or the amount of surprise in an outcome. It is calculated as I(X) = -log₂(p(x)).
Information Gain (IG)	The reduction in uncertainty of a target variable achieved by knowing a feature variable. Calculated as entropy(parent) – average entropy(children), it is used to select the best attribute for a split.
Instance-based Classifier	A type of classification approach that uses observations directly without building an explicit model. An example is the K-Nearest Neighbors (KNN) algorithm.
Overfitting	A phenomenon where a model learns the training data so perfectly that it captures noise and random fluctuations, leading to poor performance on new, unseen data. Decision trees are susceptible to this.
Pruning	The process of reducing the size of a decision tree by removing branches that have little predictive power. This is a technique to combat overfitting.
Purity	A measure of how uniform the class distribution is within a node. A completely pure node contains samples of only one class.
Regression Tree	A type of decision tree used for regression tasks, where the prediction is typically the average of the values at a given leaf node. It can be considered a piecewise constant regression model.
Surrogate Splits	A technique used in the CART algorithm to handle missing data. A surrogate split is a mimic of the original split in a node but uses a different predictor, allowing an observation with a missing value to proceed down the tree.

RF and Boosting

Lecture: S4-DT-Ensemble
Version: next
More advanced to Read: S4-DT-moreBoosting
Notes to Read: xgboost
Video: M1 + M2 + M3 + M4

3Classification Ensemble Discriminative

Summary:

Module1: Committee of Models and Analysis https://youtu.be/NuxS9SycZG8
Module2: Random Forest and Analysis https://youtu.be/m52bovS_eNA
Module3: Stacking https://youtu.be/sRLyzg5hmuM
Module4: Boosting https://youtu.be/VwFTEW_NM4o

Study Guide: Ensemble Methods in Machine Learning

This guide provides a comprehensive review of ensemble methods in machine learning, including Bagging, Random Forests, Stacking, and Boosting, based on the provided lecture materials. It includes a quiz to test your knowledge, an answer key, a set of essay questions for deeper consideration, and a glossary of key terms.

Short-Answer Quiz

Instructions: Answer the following questions in 2-3 sentences each, based on the provided source material.

What is the core two-step framework of an ensemble method in machine learning?
Explain the process of “bootstrap sampling” as it is used in Bagging.
Why is model instability considered a beneficial trait for the base classifiers used in Bagging?
What is the primary limitation of Bagging that the Random Forest algorithm is designed to address?
How does the Random Forest algorithm de-correlate the individual decision trees within its ensemble?
Explain the role that pairwise correlation (ρ) between trees plays in determining the overall variance of an ensemble model.
Describe the general architecture of Stacking, identifying its main components.
What are the three core principles that guide boosting strategies?
Contrast the training processes of Boosting and Bagging, highlighting a key difference in their approach to the training data.
List at least three key features of the XGBoost implementation that contribute to its efficiency.

Answer Key

The framework of an ensemble method involves two main steps. First, a set of diverse classifiers is generated. Second, the predictions of these individual classifiers are aggregated, for instance by taking a majority vote, to form a final prediction.
Bootstrap sampling is a technique of randomly drawing datasets with replacement from the original training data. Each new sample is the same size as the original training set, and because the sampling is done with replacement, individual data points can be duplicated within a sample.
Model instability is beneficial in Bagging because the more variable or unstable a basic model is, the more improvement can be gained by averaging multiple versions of it. High-variability models like decision trees see more improvement from Bagging than low-variability methods.
The primary limitation of Bagging is that the bagged trees are often correlated with each other. Random Forest is an extension of Bagging that was specifically designed to reduce this correlation between the trees.
Random Forest de-correlates its trees by introducing an additional layer of randomness during the tree-building process. At each node, when choosing a feature for a split, the algorithm only considers a random subset of m features from the total p available features, rather than all of them.
Pairwise correlation (ρ) between classifiers directly impacts the variance of the ensemble’s average prediction. Even as the number of classifiers (B) approaches infinity, the variance term related to correlation (ρσ²) remains, meaning that higher correlation prevents the total variance from being minimized.
Stacking involves learning and combining multiple classifiers. It uses a set of “base learners” trained on the data, and their predictions are then fed as input into a “meta learner” or “blender,” which is a higher-level classifier that makes the final prediction.
The three core principles of boosting are: (1) using many base classifiers to vote on a decision, (2) sequentially training classifiers where each corrects the mistakes of the previous ones, thereby focusing on hard examples, and (3) giving higher weight to the better-performing base classifiers.
Unlike Bagging, which trains classifiers in parallel on different bootstrap samples of the data, Boosting trains classifiers sequentially on the entire training set. Boosting adaptively re-weights the training examples at each step to focus on instances that previous classifiers misclassified.
Key features of the XGBoost implementation include: the ability to use L1 or L2 regularization, a sparsity-aware split finding algorithm for handling sparse data, and a block structure system design that enables parallel learning across multiple CPU cores. Other features include cache awareness and out-of-core computing for very large datasets.

Essay Questions

Instructions: The following questions are designed for longer, more detailed responses. Formulate your answers by synthesizing information from across the lecture materials.

Compare and contrast the Bagging and Boosting ensemble methods. Discuss their approaches to training data, model creation (parallel vs. sequential), and their primary goals in improving model performance.
Explain the Bias-Variance Tradeoff in the context of ensemble methods. How do techniques like Bagging leverage complex, high-variance base models (like full decision trees) to create a final classifier with significantly lower variance?
Describe the evolution from a simple Bagged Decision Tree to a Random Forest. What specific mechanism was introduced in Random Forest, and why is it mathematically significant for reducing the ensemble’s variance, as explained by the role of pairwise tree correlation?
Detail the sequential nature of boosting algorithms like AdaBoost. How does the process focus on “hard examples” from one iteration to the next, and what roles do example weights and classifier weights play in constructing the final, powerful model?
Discuss the practical implementation and features of modern boosting libraries like XGBoost and LGBM. What technical innovations allow them to be so efficient and popular in machine learning competitions, and what is a key difference in their tree growth strategies?

Glossary of Key Terms

Term	Definition
AdaBoost	An adaptive boosting algorithm that sequentially trains base classifiers, focusing on hard examples from previous iterations by updating example weights, and combines them into a final weighted sum.
Bagging	Stands for “Bootstrap Aggregation.” An ensemble technique for reducing the variance of a prediction function by generating multiple versions of a predictor from bootstrap samples and aggregating them.
Base Learner	An individual classifier (e.g., a decision tree) that is part of a larger ensemble model, particularly used in the context of Stacking.
Boosting	An ensemble method that sequentially trains base classifiers, where each new classifier is trained to correct the mistakes of the previous ones. It combines a weighted sum of many classifiers, often shallow decision trees.
Bootstrap Sampling	A re-sampling technique that involves randomly drawing datasets with replacement from the original training data. Each sample is the same size as the original set.
Discriminative Classifier	A type of classification approach that directly estimates a decision rule or boundary. Examples include SVM, decision trees, and neural networks.
Ensemble Method	A machine learning framework that involves generating a set of diverse classifiers and aggregating their predictions to produce a final result. Examples include Bagging, Boosting, and Stacking.
Generative Classifier	A type of classification approach that builds a generative statistical model. Examples include Bayesian networks and Naïve Bayes classifiers.
Instance-based Classifier	A type of classification approach that uses observations directly without building an explicit model. An example is the K-nearest neighbors algorithm.
LGBM	Stands for Light Gradient Boosted Machines. An efficient library for training GBMs developed by Microsoft that uses leaf-wise tree growth and novel techniques like Gradient-based one-side sampling.
Meta Learner (Blender)	In Stacking, the higher-level classifier that is trained on the predictions generated by the base learners to make the final prediction.
Random Forest	An ensemble classifier that extends Bagging by de-correlating the trees. It does this by considering only a random subset of features at each split when building the individual decision trees.
Stacking	An ensemble method where multiple base learners are trained, and a meta learner is then trained on the predictions of the base learners to produce the final output.
XGBoost	Stands for Extreme Gradient Boosting. A highly efficient and popular implementation of a Gradient Boosting Decision Tree that incorporates features like regularization, parallel learning, and handling of sparse data.

SVM, Dual

Lecture: S4-SVM-optimDual
Version: current
Notes to Read: SMO
Extra Notes to Read: Paper SMO
Video: Extra M4

3Classification Discriminative Regularization Optimization 5Theory

convex optim with Dual

Lecture: S4-SVMoptimDual-Extra
Version: current
Notes to Read: VC Theory

3Classification Discriminative Regularization Optimization 5Theory

Section 5 - Not Supervised

Clustering Hier

Lecture: S5-clustering1-Hier
Version: next
Notes to Read: compare Hier clusterings
Video: M1 + M2

4Unsupervised

Summary:

Hierarchical:

Module1: Basics of Unsupervised Clustering @ https://youtu.be/mxrcPHWCZAI
Module2: Hierarchical Clustering @ https://youtu.be/2K6gXLf4Em4

K-means

Module1: Basic K-Means Algorithm @ https://youtu.be/239r6zlZYhY
Module2: Theory/Complexity of K-Means @ https://youtu.be/OFKKeVhCQfA
Extra Module3: Gaussian Mixture Models @ https://youtu.be/pmNPXpj0eK4

Unsupervised Learning: A Study Guide on Hierarchical Clustering

Quiz: Short Answer Questions

Answer the following questions in 2-3 sentences based on the provided course material.

What is the core difference between unsupervised and supervised machine learning?
Define the primary goal of clustering and describe the two main properties of an ideal clustering solution.
List the four desirable properties of a distance measure and provide a brief, intuitive explanation for one of them.
What is the Minkowski metric, and what are its three most common forms?
Under what conditions is Hamming distance used, and how does it relate to the Manhattan distance?
Explain the fundamental difference between the operational approaches of partitional and hierarchical clustering algorithms.
What is a dendrogram, and what role does it play in hierarchical clustering?
Describe the process of bottom-up agglomerative clustering.
Name and briefly define the three common methods for calculating the distance between two clusters in hierarchical clustering.
What is the primary limitation of hierarchical clustering methods regarding their performance on large datasets?

Answer Key

Unsupervised learning operates on raw, unlabeled data (an unlabeled dataset X with no Y variable), aiming to find inherent structures or patterns. In contrast, supervised learning is provided with labeled examples (both X and Y variables) and aims to learn a function that maps inputs to outputs, such as in regression (continuous Y) or classification (discrete Y).
The goal of clustering is to group a set of objects into classes where objects within a class are similar to each other and dissimilar to objects in other classes. An ideal solution achieves high intra-class similarity (minimized intra-cluster distances) and low inter-class similarity (maximized inter-cluster distances).
The four properties are: Symmetry, Constancy of Self-Similarity, Positivity Separation, and Triangular Inequality. Symmetry (D(A,B) = D(B,A)) provides the intuition that if “Alex looks like Bob,” then “Bob must look like Alex.”
The Minkowski metric is a generalized formula for calculating the distance between two objects, x and y, that each have p features. Its three most common forms are determined by the parameter ‘r’: Euclidean distance (r=2), Manhattan distance (r=1), and “sup” distance (r=+∞).
Hamming distance is used when all features of the data are binary or discrete, such as gene expression levels recorded as either high (1) or low (0). It is a special case of the Manhattan distance applied to this type of data.
Partitional algorithms typically start with a random partitioning of the data into a pre-specified number of clusters and then refine these clusters iteratively. Hierarchical algorithms do not require a fixed number of clusters in advance and instead build a tree-based structure, either by starting with individual objects and merging them (agglomerative) or by starting with a single group and splitting it (divisive).
A dendrogram is a tree-based diagram that represents the hierarchical taxonomy produced by a clustering algorithm. It illustrates the history of merges or splits, showing which objects and clusters were grouped together and at what distance, allowing users to form partitions by “cutting” the tree at a desired level.
Bottom-up agglomerative clustering begins with each data object as its own separate cluster. In each step, the algorithm identifies the two closest clusters based on a chosen distance metric and merges them into a single new cluster, repeating this process until only one cluster containing all objects remains.
The three methods are Single-Link, Complete-Link, and Average-Link. Single-Link defines cluster distance as the distance between the two closest members of the clusters. Complete-Link uses the distance between the two farthest members. Average-Link calculates the average distance of all cross-cluster pairs.
The primary limitation is that they do not scale well to large datasets. The time complexity is at least O(n²), where n is the number of objects, because in the first iteration, the algorithm must compute the similarity for all pairs of instances.

Essay Questions

The following questions are designed for a more in-depth, essay-format response. No answers are provided.

Discuss the subjective nature of clustering. Using the concepts of “groupness,” “similarity/distance,” and the determination of the number of clusters, explain why different yet equally valid clustering solutions can exist for the same dataset.
Compare and contrast the use of Euclidean distance and the Pearson correlation coefficient as similarity measures. Explain a scenario where one would be significantly more appropriate than the other, referencing the provided discussion on gene expression time series data.
Provide a comprehensive explanation of the bottom-up agglomerative hierarchical clustering process, from the initial pairwise distance matrix to the final dendrogram. In your explanation, detail how the choice between Single-Link, Complete-Link, and Average-Link methods influences the process and the potential shapes of the resulting clusters.
Analyze the computational complexity and scalability challenges of hierarchical clustering. Elaborate on why the time complexity is at least O(n²) and can become O(n³) if implemented naively. Discuss the practical implications of this complexity when applying these methods to modern, large-scale datasets.
Machine learning is framed as a combination of Representation, a Score Function, and a Search/Optimization strategy. Deconstruct hierarchical clustering using this framework. What constitutes the data representation, what is the search strategy, and why is its score function described as having “no clearly defined loss”?

Glossary of Key Terms

Term	Definition
Agglomerative Clustering	A bottom-up hierarchical clustering method that starts with each object in its own cluster and repeatedly joins the closest pair of clusters until only one remains.
Average-Link	A method for computing the distance between two clusters based on the average distance of all cross-cluster pairs. It is noted as the most widely used measure and is robust against noise.
Clustering	The process of grouping a set of objects into classes of similar objects. It is the most common form of unsupervised learning.
Complete-Link	A method for computing the distance between two clusters based on the distance of their two furthest members. This method tends to produce tight clusters.
Correlation Coefficient	A similarity measure that quantifies the linear correlation between two sequences, giving a value between +1 (total positive correlation) and -1 (total negative correlation). It is unit-independent.
Data Matrix	A representation of data in a tabular format with n observations (rows) on p variables (columns). Rows can be called points, instances, or samples, while columns are features, attributes, or predictors.
Dendrogram	A tree-based hierarchical taxonomy that is the output of a hierarchical clustering algorithm. It visually represents the history of merges or splits and the distances at which they occurred.
Divisive Clustering	A top-down hierarchical clustering method that starts with all data in a single cluster, considers every possible way to divide it into two, chooses the best division, and recursively operates on the new clusters.
Distance Measure Properties	A set of four desirable properties for any distance measure: Symmetry (D(A,B) = D(B,A)), Constancy of Self-Similarity (D(A,A) = 0), Positivity Separation (D(A,B) = 0 if and only if A=B), and Triangular Inequality (D(A,B) <= D(A,C) + D(B,C)).
Edit Distance	A generic technique for measuring similarity by calculating the “effort” it takes to transform one object into another. The measure of effort becomes the distance.
Euclidean Distance	A common distance measure, corresponding to the Minkowski metric with r=2. It calculates the straight-line distance between two points in a p-dimensional space.
Hamming Distance	A distance measure used when all features are binary or discrete. It is a special case of the Manhattan distance.
Hierarchical Clustering	An algorithm that builds a tree-based hierarchy (dendrogram) from a set of objects. It can be performed bottom-up (agglomerative) or top-down (divisive).
Inter-cluster Distance	The distance between different groups or clusters. An effective clustering algorithm seeks to maximize this distance.
Intra-cluster Distance	The distance between data points within the same group or cluster. An effective clustering algorithm seeks to minimize this distance.
Manhattan Distance	A common distance measure, corresponding to the Minkowski metric with r=1. It is calculated as the sum of the absolute differences between the coordinates of two points.
Minkowski Metric	A generalized formula for defining the distance between two objects in a p-dimensional space, controlled by a parameter ‘r’.
Partitional Algorithms	A class of clustering algorithms (e.g., K-means) that usually start with a random partitioning of the data and refine it iteratively to find the clusters.
Single-Link	A method for computing the distance between two clusters based on the distance of their two closest members (nearest neighbors). This method can result in long and skinny clusters.
Unsupervised Learning	A type of machine learning that involves learning from raw, unlabeled, or unannotated data, as opposed to supervised learning where the label of each example is given.

Clustering Partition

Lecture: S5-clustering2-kMeans
Version: next
More advanced to Read: L19d-EMextra-EM
Notes to Read: compare clusterings
Video: M1 + M2

4Unsupervised Generative

Summary:

Module1: Basic K-Means Algorithm @ https://youtu.be/239r6zlZYhY
Module2: Theory/Complexity of K-Means @ https://youtu.be/OFKKeVhCQfA
Extra Module3: Gaussian Mixture Models @ https://youtu.be/pmNPXpj0eK4

K-Means and Partitional Clustering Study Guide

Quiz: Short Answer Questions

Answer the following questions in 2-3 sentences, based on the provided source material.

What is the fundamental difference between unsupervised and supervised machine learning?
Describe the primary goal of clustering in terms of intra-cluster and inter-cluster distances.
How do partitional clustering algorithms, like K-means, generally operate?
Outline the five core steps of the K-means algorithm.
What role does the centroid play in the K-means algorithm, and how is its position updated?
Explain the concept of “elbow finding” or “knee finding” for determining the number of clusters.
Why is the initial selection of cluster centers (seeds) a critical consideration in K-means?
What is the mathematical objective of the K-means algorithm, and how does it relate to convergence?
Briefly describe the Gaussian Mixture Model (GMM) approach to clustering.
Define “purity” as an external criterion for evaluating cluster quality.

Answer Key

Unsupervised learning works with raw, unlabeled data, where no target variable (Y) is given. In contrast, supervised learning uses labeled data where a classification label or a continuous target variable is provided for each example.
The goal of clustering is to find groups where data points within a single group are similar to one another and different from data points in other groups. This is achieved by minimizing intra-cluster distances (distances within a cluster) and maximizing inter-cluster distances (distances between different clusters).
Partitional algorithms construct a partition of n objects into a set of K clusters, where the user must specify the number of clusters K. They typically start with a random partitioning and refine it iteratively until a stopping condition is met.
The K-means algorithm follows these steps: 1) Decide on a value for k. 2) Initialize k cluster centers randomly. 3) Assign each object to its nearest cluster centroid. 4) Re-estimate the cluster centers based on the new memberships. 5) Repeat steps 3 and 4 until no objects change membership.
A centroid is the center of gravity or mean of all data points belonging to a cluster. In each iteration of the K-means algorithm, after data points are assigned to their nearest centroid, each centroid’s position is updated by “jumping” to the mean location of the points it now “owns”.
“Elbow finding” is a heuristic method to determine the right number of clusters (k). It involves running K-means for a range of k values and plotting the objective function (sum of squared distances) for each k. The optimal k is often found at the “elbow” or “knee” of the plot, where the rate of decrease in the objective function value abruptly slows.
The initial selection of random seeds can significantly impact the results of K-means. Some seed selections can lead to a poor convergence rate or cause the algorithm to converge to a sub-optimal clustering solution. To mitigate this, it is very important to try multiple starting points or use heuristics to select good seeds.
The mathematical objective is to minimize the sum of squared distances between each data point and the centroid of its assigned cluster. The algorithm converges because each reassignment step monotonically decreases this objective function, and it will eventually reach a fixed point where cluster memberships no longer change.
A Gaussian Mixture Model (GMM) is a partitional clustering method that assumes data is generated from a mixture of Gaussian distributions. It uses the Expectation-Maximization (EM) algorithm to iteratively revise each point’s proportional membership to clusters and update each cluster’s mean (centroid) and covariance (shape).
Purity is an external evaluation measure that assesses clustering quality with respect to ground truth (gold standard classes). For a single cluster, it is calculated as the ratio between the number of objects from the dominant class within that cluster and the total size of the cluster.

Essay Questions

Develop a detailed response to the following prompts, synthesizing information from the source material.

Compare and contrast the K-means algorithm with the Gaussian Mixture Model (GMM) approach to partitional clustering. Discuss their core assumptions, iterative processes, and how they define cluster membership for data points.
Discuss the computational complexity and potential challenges of the K-means algorithm. How does its time complexity scale with the number of objects, dimensions, clusters, and iterations, and what are the primary strategies for overcoming its sensitivity to initialization?
Explain the mathematical objective function that the K-means algorithm seeks to optimize. Detail how the two main iterative steps—assigning memberships and updating centroids—work in conjunction to minimize this function and guarantee convergence.
Describe the methods for evaluating the quality of a clustering result. Differentiate between internal and external criteria, providing a specific example of an external criterion like “purity” and explaining how it is calculated and what it signifies.
Imagine you are tasked with clustering an unlabeled tabular dataset. Outline a complete workflow using the K-means methodology, from defining the problem to evaluating the final output. Your plan should address data representation, determining the value of K, the iterative clustering process, and dealing with potential convergence issues.

Glossary of Key Terms

Term	Definition
Centroid	The center of gravity or mean of the data points assigned to a specific cluster. In K-means, centroids are iteratively re-calculated and updated.
Clustering	An unsupervised learning task aimed at finding groups (clusters) in data such that points within a group are similar to one another, while points in different groups are dissimilar.
Covariance	In the context of Gaussian Mixture Models (GMM), covariance defines the shape and orientation of the Gaussian distribution for a specific cluster.
Elbow Method	A heuristic for finding the optimal number of clusters (K) by plotting the value of the objective function for different values of K and identifying the “elbow” point where the rate of improvement sharply decreases. Also known as “knee finding”.
Expectation-Maximization (EM) Algorithm	A general iterative procedure for finding maximum likelihood estimates of parameters in statistical models. K-means is considered a special case of the EM algorithm, which is known to converge.
External Criterion	A measure of clustering quality that assesses the result against a known ground truth or gold standard data. Purity is an example of an external criterion.
Gaussian Mixture Model (GMM)	A probabilistic model for clustering that assumes the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
Hierarchical Algorithms	A class of clustering algorithms that create a nested sequence of clusters, either through a bottom-up (agglomerative) or top-down (divisive) approach.
Inter-cluster Distance	The measure of similarity or distance between different clusters. The goal of clustering is to maximize this distance.
Internal Criterion	A measure of clustering quality that relies only on the clustered data itself, such as evaluating for high intra-cluster similarity and low inter-cluster similarity.
Intra-cluster Distance	The measure of similarity or distance among data points within the same cluster. The goal of clustering is to minimize this distance.
K-means	A partitional clustering algorithm that aims to partition n observations into K clusters in which each observation belongs to the cluster with the nearest mean (centroid).
Objective Function	In K-means, this is the function the algorithm seeks to minimize. It is defined as the sum of the squared distances from each data point to its assigned cluster’s centroid.
Partitional Algorithms	A class of clustering algorithms that divide a dataset into a set of K non-overlapping clusters, where K is specified by the user. They typically start with a random partition and refine it iteratively.
Purity	An external measure of cluster quality. For a given cluster, it is the ratio of the count of the most frequent class (from the ground truth) to the total number of data points in that cluster.
Seed Choice	The initial placement of the K cluster centers. This choice can be random and can significantly affect the final clustering result and convergence rate.
Time Complexity	A measure of the computational cost of an algorithm. For K-means, it is O(lKnp), where l is the number of iterations, K is the number of clusters, n is the number of data points, and p is the number of dimensions.
Unsupervised Learning	A type of machine learning that involves learning from raw, unlabeled, or unannotated data, without a given output variable (Y).

Reinforcement Learning

Lecture: S5_RL22
Version: next

notSupervised RL 5Theory

Reinforcement Learning Study Guide

Quiz

Answer the following questions in 2-3 sentences each, based on the provided lecture materials.

How does Reinforcement Learning (RL) fundamentally differ from Supervised and Unsupervised Learning?
Describe the four core elements of the Reinforcement Learning setup.
What is a Markov Decision Process (MDP) and what is its key property?
Explain the difference between a “reward function” and a “value function” in RL.
What was the significance of the TD-Gammon program developed by Gerry Tesauro?
Compare the primary requirements of Dynamic Programming (DP) versus Monte Carlo (MC) methods for solving RL problems.
What is the “explore vs. exploit” dilemma, and what is the ε-greedy policy strategy for managing it?
How does Temporal-Difference (TD) learning combine ideas from both Dynamic Programming and Monte Carlo methods?
Define Q-learning and explain why it is considered an “off-policy” algorithm.
What is reward shaping and what is the guiding principle for designing good reward functions?

Answer Key

Unlike Supervised Learning which learns a map from data x to a label y, and Unsupervised Learning which learns underlying structure from unlabeled data x, Reinforcement Learning’s goal is to maximize future rewards over many steps. The data for RL consists of state-action pairs and the resulting rewards, as the agent learns by interacting with an environment.
The core elements are the agent, which is the learner or decision-maker; the environment, which the agent interacts with; the action, which is what the agent chooses to do; and the reward and new state, which are the feedback the environment provides in response to an action.
A Markov Decision Process (MDP) is a discrete-time stochastic control process that formally defines the environment in an RL problem. Its key property is that every outcome (the new state and reward) depends only on the current state and the chosen action, not on any previous states or actions.
A reward function maps a state or state-action pair to an immediate, single numerical value called a reward. In contrast, a value function represents the total expected future reward an agent can accumulate over the long run, starting from a specific state or state-action pair.
TD-Gammon was a Backgammon-playing program that became a major success story for RL. It used temporal-difference learning and a neural network, training itself to a world-class level simply by playing 1.5 million games against itself and learning from the outcomes.
Dynamic Programming requires a perfect and complete model of the environment, including knowledge of state transition probabilities and rewards. Monte Carlo methods do not need a model of the environment; instead, they learn directly from experience (or simulated experience) by averaging sample returns from completed episodes.
The “explore vs. exploit” dilemma is the challenge of choosing between actions that are known to yield good rewards (exploit) and trying new, unknown actions that might lead to better rewards (explore). The ε-greedy policy is a strategy where the agent chooses the known best action with probability 1-ε, but chooses a random action with a small probability ε to ensure continued exploration.
TD learning combines ideas from both methods by learning directly from experience without a model, similar to MC methods. However, it updates its value estimates based on the values of successor states, a technique known as bootstrapping that is characteristic of DP.
Q-learning is a temporal-difference method that directly approximates the optimal state-action value function, Q*. It is considered “off-policy” because it can learn the optimal policy regardless of the policy the agent is actually following to explore the environment, as long as every state-action pair continues to be updated.
Reward shaping is the practice of providing rewards for achieving subgoals to guide the agent, which is useful when the primary positive reward is very “far away” or difficult to achieve. The principle for good rewards is that they should indicate what the agent needs to accomplish, not how it should be accomplished.

Essay Questions

Construct detailed, essay-format answers for the following prompts, synthesizing information from across the lecture materials.

Discuss the historical development of Reinforcement Learning, citing the key contributions of Thorndike, Bellman, Samuel, and Watkins. Explain how each contribution added a foundational piece to the modern understanding of RL.
Compare and contrast the three main approaches for solving RL problems: Dynamic Programming, Monte Carlo methods, and Temporal-Difference learning. For each, describe its core idea, requirements (e.g., model of the environment), and its primary advantages and disadvantages.
Using the “Robot in a room” example, explain the roles of states, actions, rewards, the policy, and the value function. Describe how a discount factor (γ) affects the calculation of the value function and why it is a critical component in defining the agent’s goal.
AlphaGo is presented as a major success for Deep Reinforcement Learning. Describe its learning pipeline, explaining how it combines Supervised Learning (SL) and Reinforcement Learning (RL) and the specific roles of the SL policy network, the value network, and the Monte Carlo Tree Search (MCTS).
What are the key challenges in Reinforcement Learning? Discuss at least three challenges mentioned in the lecture, such as delayed rewards, the explore vs. exploit trade-off, state representation, and reward design, providing examples for each.

Glossary of Key Terms

Term	Definition
Action	A choice made by the agent from a set of available options in a given state.
Agent	The learner or decision-maker that interacts with the environment.
Bellman Equation	A fundamental equation in RL that expresses the value of a state or state-action pair in terms of the expected reward plus the discounted value of successor states.
Deep Reinforcement Learning	The combination of Reinforcement Learning with deep neural networks, where a neural network can learn the policy or value function.
Discount Factor (γ)	A value between 0 and 1 that determines the present value of future rewards. It models the trade-off between immediate and future rewards.
Dynamic Programming (DP)	A method for solving RL problems that uses value functions to structure the search for good policies, but requires a perfect model of the environment.
Environment	The external system with which the agent interacts, providing states and rewards in response to the agent’s actions.
ε-greedy policy	An exploration strategy where the agent chooses the action with the highest estimated value with probability 1-ε and a random action with probability ε.
Episodic Task	A task that has a clear starting and ending point, breaking the agent’s experience into distinct episodes (e.g., a game of chess).
Explore vs. Exploit	The fundamental trade-off in RL between taking actions with known high rewards (exploitation) and trying new actions to discover potentially better rewards (exploration).
Markov Decision Process (MDP)	A mathematical framework for modeling decision-making in an environment where outcomes are partly random and partly under the control of a decision-maker. It assumes that future states depend only on the current state and action.
Monte Carlo (MC) Methods	A class of RL algorithms that learn from complete episodes of experience without needing a model of the environment. They estimate value functions by averaging the sample returns observed.
Off-policy Learning	An RL algorithm (like Q-learning) that learns the value of the optimal policy independently of the agent’s actions.
On-policy Learning	An RL algorithm (like Sarsa) that learns the value of the policy being carried out by the agent.
Policy (π)	A mapping from states to actions, defining the agent’s behavior. It can be deterministic (always the same action in a state) or stochastic (a probability distribution over actions).
Q-function	Also known as the state-action value function, Q(s,a). It represents the expected total future reward for taking action a in state s and following the policy thereafter.
Q-learning	An off-policy, temporal-difference control algorithm that directly approximates the optimal Q-function, independent of the policy being followed.
Reinforcement Learning (RL)	A class of machine learning where an agent learns to interact with an environment to maximize a cumulative reward over many steps.
Reward	A numerical signal sent from the environment to the agent in response to an action. The agent’s sole objective is to maximize the total reward it receives.
Reward Shaping	The process of designing a reward function, often by adding rewards for achieving subgoals, to guide the learning process. The principle is to specify what to do, not how to do it.
Sarsa	An on-policy temporal-difference control algorithm that updates the Q-value based on the quintuple: State, Action, Reward, next State, next Action.
State	A representation of the environment’s current situation as perceived by the agent.
Temporal-Difference (TD) Learning	An RL method that combines ideas from Monte Carlo methods and Dynamic Programming. It learns directly from experience (like MC) but updates values based on estimates of successor states (like DP).
Value Function	A function that estimates the total expected future reward from a given state (V(s)) or from a given state-action pair (Q(s,a)).