---- ----
1Basic
Lecture 1 Study Guide
Course Logistics and Overview
Course Staff and Communication
- Instructor: Prof. Yanjun Qi (Pronounced: /ch ee/)
- Communication: Email and Slack (details on Course overview)
- TA and Office Hours: Information available on the Course Website
Course Materials
- Textbook: None required; multiple reference books shared via Course Website
- Official Content: Anything not mentioned in Prof. Qi’s slides is not an official topic
- Resources:
- All slides, videos, readings, and homework on Course Website
- Assignments/Project: UVA CourseSite
- Quizzes: Google Forms
Background Needed
- Required:
- Calculus
- Basic linear algebra
- Basic probability
- Basic Algorithm
- Recommended: Statistics
- Programming: Python (required for all assignments)
Assessment Breakdown
- Weekly Quizzes: 20% (top 10 scores count, 14 total, 10 minutes each, closed book via Google Form)
- Homework Assignments: 65% (HW1-HW5, 10% each)
- Final Project: 10% (apply ML to real-world problems)
- Weekly Reading Question Submissions: 5%
- Final Exam: 15% (December 9th)
- See CourseWeb for detailed policy
In-Class Activities
- Pre-class question submission
- Quizzes
- Quiz review
- Assignment review
- In-class solution examples
- Short presentations/code demos
Emphasis
- Active learning
- Understanding core ML concepts
- Applying models (Python, scikit-learn, Keras)
- Analyzing performance and ethics
- Becoming an ML tool builder
Key Objectives
- Learn fundamental principles, algorithm design, methods, and applications
- Build simple machine learning tools from scratch (not just tool users)
- Understand complex machine learning methods at the source code level
Machine Learning History and Concepts
Artificial Intelligence (AI)
- Definition: The study of computer systems that attempt to model and apply the intelligence of the human mind
- Attributes of Intelligence (for machines):
- Perceive
- Understand
- Interact
- Think/Reason/Learn
- Imagine/Make analogy
- Impact: Economic, cultural, social, health disruptions; potential for job automation; existential threat concerns
History of AI/ML (Timeline)
- 1950: Alan Turing’s “Computing Machinery and Intelligence” (“Can machines think?”)
- 1956: Dartmouth workshop, John McCarthy coins “artificial intelligence”
- 1950s-60s: Enthusiasm, early Neural Networks (NNs) popular
- 70s-80s: Rule-based/Knowledge-based systems, Expert systems
- 1959: Arthur Samuel popularizes “machine learning” (Samuel Checkers-playing Program)
- 90s: Classic ML (SVM, RF, BN), Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) invented
- 2000s: ImageNet, Deep Learning breakthroughs
- 2010s: Deep Learning breakthroughs continue (2015-2016 CNN/RNN invented, 2017 Deep Learning, Deep Reinforcement Learning, GANs)
- 2023-Now: ChatGPT / Generative AI / Agentic AI
Reasons for 2010s Deep Learning Breakthroughs
- Plenty of good quality (labeled) data (text, visual, audio, user activity, knowledge graphs, genomics, medical imaging)
- Advanced computer architecture that fits Deep Learning (e.g., GPUs for faster, more accurate results)
- Powerful, well-engineered machine learning libraries (easy to learn, use, extend)
Recent Advances (2023-Now) on Generative AI
- Definition: AI systems that create content (text, images, audio, video, code), powered by foundation models (transformers, diffusion, LLMs). Distinct from discriminative AI
- Key Applications:
- LLMs (text/chat)
- Diffusion models (images)
- Gen-2/Sora (video)
- GitHub Copilot (code)
- Multi-modal models
- Enabling Factors: Scaling law, data, advanced architecture, powerful libraries
Machine Learning Basics
Goal
Build computer systems that learn and adapt from “experience” (data examples)
Traditional Programming vs. Machine Learning
- Traditional: Computer + Data + Program → Output
- ML (Training Phase): Computer + Input Data (X) + Output (Y) → Program/Model f()
Common Notations
- Inputs (X):
- Matrix (bold capital)
- p variables, n observations
- Xⱼ is jth element
- Xⱼ,ᵢ is ith observed value
- Vectors are column vectors
- Outputs:
- Quantitative (Y)
- Qualitative/Categorical (C)
- Observed Variables: Lowercase
Supervised Learning
Find a function f() to map input space X to output space Y such that the difference between true y and f(x) is small
Example: Linear Binary Classifier
f(x,w,b) = sign(wᵀx + b)
- wᵀx + b > 0 (e.g., +1 point)
- wᵀx + b < 0 (e.g., -1 point)
Training
Learning parameters w, b by minimizing a loss/cost function L() on the training set (available x and y pairs)
Testing
Evaluating performance on “future” (unseen) points by comparing true y to predicted f(x). Key: testing examples are not in the training set
Basic Concepts
- Loss Function: Measures difference between y and f(x) (e.g., hinge loss for binary classification)
- Regularization: Additional information added to the loss function to control f
- Generalization: The ability to learn a function from past data to “explain,” “predict,” “model,” or “control” new, unseen data examples (inductive reasoning)
Two Modes of ML
- Training: Input-output pairs (X, Y) are fed into a model to learn f()
- Testing/Production: Unseen input X’ is fed into the learned model f() to produce f(X’) (predicted output)
Three Aimed Features of ML Models
- Robustness
- Computation
- Prediction
General Lessons for Excellence
- Good breadth in fundamentals is key
- Strength in particular targeted topics helps stand out
- Master Algorithm by Pedro Domingos (explores different “tribes” of ML: Symbolists, Connectionists, Evolutionists, Bayesians, Analogizers)
- Homo Deus: A Brief History of Tomorrow by Yuval Noah Harari
Quiz: UVA CS 4774: Machine Learning - Introduction
Instructions: Answer each question in 2-3 sentences.
- What is the main distinction between “Traditional Programming” and the “Machine Learning (training phase)” as illustrated in the lecture?
- Name three specific reasons cited in the lecture for the significant breakthroughs in Deep Learning during the 2010s.
- Define “Generative AI” and explain how it differs from “discriminative AI” as presented in the source material.
- In the context of supervised learning, what is the purpose of the “training phase” for a linear binary classifier, and what is being minimized?
- What is the concept of “generalization” in machine learning, and why is it crucial for model performance?
Quiz Answer Key
-
Machine Learning Goal
The primary goal of machine learning is to build computer systems that can learn and adapt from their “experience.” In this context, “experience” refers to the available data examples, also known as instances or samples, which are described with various properties.
-
Traditional vs ML Programming
Traditional Programming involves a computer executing a pre-defined program on data to produce an output. In contrast, the Machine Learning training phase takes input data (X) and corresponding outputs (Y) to learn or create the program/model f().
-
Deep Learning Breakthroughs
The significant breakthroughs in Deep Learning during the 2010s were primarily due to the availability of plenty of good quality labeled data, advanced computer architectures suitable for Deep Learning (like GPUs), and powerful, well-engineered machine learning libraries.
-
Generative AI
Generative AI refers to AI systems that are capable of creating new content, such as text, images, or code, powered by foundation models. This differs from discriminative AI, which primarily focuses on tasks like classification rather than content generation.
-
Training Phase Purpose
In supervised learning, the training phase for a linear binary classifier aims to learn the parameters w and b. This is achieved by minimizing a loss or cost function L(), which quantifies the difference between the true output y and the model’s prediction f(x) on the available training examples.
-
Generalization
Generalization in machine learning is the ability of a model to apply the knowledge learned from past, observed data to effectively “explain,” “predict,” or “model” new, unseen data examples. It is crucial because it indicates whether the model has truly learned the underlying patterns rather than just memorizing the training data.
1Basic
Comprehensive Linear Algebra and Matrix Calculus Review
Study Guide
This study guide is designed to reinforce your understanding of fundamental concepts in linear algebra and matrix calculus, as presented in the source material. Mastering these topics is crucial for advanced studies in machine learning and related fields.
I. Fundamental Definitions
- Scalar: Understand what a scalar is and how it’s denoted.
- Vector: Define a vector, differentiate between row and column vectors, and understand its representation in real space (R^n).
- Matrix: Define a matrix, understand its dimensions (m-by-n), how its entries are denoted, and the concept of a square matrix.
II. Special Matrices
- Identity Matrix (I): Understand its structure and significance.
- Diagonal Matrix: Recognize its form.
- Upper/Lower Triangular Matrices: Identify these structures.
- Symmetric Matrix: Understand the property that defines a symmetric matrix.
- Orthogonal Matrix: Define an orthogonal matrix and its key property related to its inverse.
III. Matrix Operations
Transposition
- Concept: “Flipping” rows and columns.
- Notation: A^T.
- Application: Understand how to transpose a given matrix.
Addition and Subtraction
- Condition: Matrices must be of the same size.
- Process: Entry-wise operation.
- Examples: Be able to perform addition and subtraction.
Multiplication
- Notation: AB (pre-multiplying B by A, post-multiplying A by B).
- Conformability: Understand the condition for multiplication (number of columns in premultiplier must equal number of rows in postmultiplier).
- Dimensions: Predict the dimensions of the resulting matrix (m × n * n × p = m × p).
- Non-Commutativity: Recognize that AB ≠ BA, generally.
- Associativity: Understand A(BC) = (AB)C.
- Properties: (AB)^T = B^T A^T; multiplication with Identity Matrix.
Special Uses
- Scalar & Matrix Product: Understand bA = Ab.
- Dot (Inner) Product of Vectors: a^T b (result is a scalar).
- Outer Product of Vectors: ab^T (result is a matrix).
- Sum of Squared Elements of a Vector: a^T a (useful for L2 norm).
Norm (of Vector)
- Concept: Measure of “length” of a vector.
- Common Norms: L1, L2 (Euclidean), L∞.
-
L2 Norm (Euclidean): |
|
x |
|
= √(x^T x). |
- General Definition: Understand the properties a function must satisfy to be a norm.
Matrix Inversion
- Notation: A^(-1) or inv A.
- Definition: AA^(-1) = I = A^(-1)A.
- Nonsingular vs. Singular: Differentiate between matrices that have an inverse and those that don’t.
-
Determinant Condition: An inverse exists if and only if |
A |
≠ 0. |
- Properties: (AB)^(-1) = B^(-1)A^(-1); (A^T)^(-1) = (A^(-1))^T; (A^(-1))^(-1) = A.
- Special Cases: Inverse of diagonal matrices, orthogonal matrices (A^(-1) = A^T).
- Pseudo-inverse (Extra): Basic understanding of its existence for non-square matrices.
Matrix Rank
- Linear Independence: Define linearly independent vectors.
- Definition of Rank: Maximal number of linearly independent columns (or rows).
- Properties: rank(A) ≤ min(m,n). Full row rank, full column rank.
- Determinant Connection: Rank is the dimension of the largest square sub-matrix with a non-zero determinant.
- Relationship to Singular Matrices: How rank relates to singularity.
- Rank of Product: rank(AB) ≤ min(rank(A), rank(B)).
Matrix Calculus
- Review of Derivatives: Understand the basic definition of a derivative for single-variable functions.
- Multivariate Calculus Concepts:
- Partial Derivative: Extension of derivative to functions of multiple variables.
- Gradient: Vector of partial derivatives (first derivative to gradient).
- Hessian Matrix: Matrix of second partial derivatives (second derivative to Hessian).
- Denominator Layout: Understand that the gradient’s size is the same as the variable’s size.
- Examples: Be able to calculate simple gradients.
- Trace(): Sum of diagonal elements (not detailed in this source, but mentioned as “must know”).
- Eigenvalue / Eigenvectors: Important for future topics.
- Positive Definite Matrix, Gram Matrix, Quadratic Form, Projection: Future topics.
- Khan Academy: A recommended resource for review.
- Linear Transformation and Determinant: Connection to matrix determinant and Jacobian determinant.
- Laplacian of a function: Trace of the Hessian.
- Harmonic function: Function whose Laplacian is 0.
Quiz: Short-Answer Questions
- What is the primary difference between a scalar and a vector in linear algebra notation?
- Define a “column vector” and provide an example of a vector in R^4.
- Explain the condition for two matrices to be “conformable” for multiplication. Why is this condition important?
- Given matrices A (3×2) and B (2×4), what are the dimensions of the resulting matrix C = AB? What about C = BA?
- State two key properties of matrix multiplication that distinguish it from scalar multiplication.
- Describe the concept of a “norm” of a vector. What does the L2 norm represent?
- What is the definition of the inverse of a matrix A (denoted A^(-1))? Under what crucial condition does a matrix inverse exist?
- Differentiate between a “nonsingular” and a “singular” matrix.
- Explain the concept of “linear independence” in the context of vectors. How does this relate to the rank of a matrix?
- What is the gradient of a multivariate function? How does it relate to partial derivatives?
Quiz Answer Key
-
Scalar vs. Vector: A scalar is a single number (e.g., 1 or 22), denoted with regular type. A vector is a single row or column of numbers (e.g., [1 2 3]), denoted with bold small letters.
-
Column Vector and R^4 Example: A column vector is a vector arranged vertically. An example of a vector in R^4 (a column vector) is v = (1,6,3,4)^T.
-
Conformable Matrices: For two matrices A and B to be conformable for multiplication (AB), the number of columns in the premultiplier (A) must equal the number of rows in the postmultiplier (B). This ensures that the dot product for each entry in the resulting matrix can be calculated.
-
Matrix Multiplication Dimensions: If A is (3×2) and B is (2×4), then C = AB will have dimensions (3×4). C = BA cannot be performed because the number of columns in B (4) does not equal the number of rows in A (3).
- Properties of Matrix Multiplication:
- Matrix multiplication is generally not commutative (AB ≠ BA).
- Matrices must be conformable for multiplication to occur.
-
Vector Norm: A norm of a vector |
|
x |
|
is a measure of its “length.” The L2 norm, also known as the Euclidean norm, represents the straight-line distance from the origin to the vector’s endpoint (calculated as the square root of the sum of the squared elements). |
-
Matrix Inverse Definition and Condition: The inverse of an n × n matrix A is the matrix A^(-1) such that AA^(-1) = I = A^(-1)A, where I is the identity matrix. A crucial condition for an inverse to exist is that the determinant of A ( |
A |
) must not be zero. |
-
Nonsingular vs. Singular Matrix: A “nonsingular” matrix is an n × n matrix that has an inverse. A “singular” matrix is an n × n matrix that does not have an inverse.
-
Linear Independence and Rank: A set of vectors is linearly independent if none of them can be written as a linear combination of the others. The rank of a matrix is defined as the maximal number of linearly independent columns (or rows) in that matrix.
- Gradient of a Multivariate Function: The gradient of a multivariate function is a vector containing its partial derivatives with respect to each variable. It extends the concept of a derivative to functions of multiple variables, indicating the direction of the steepest ascent of the function.
1Basic
Machine Learning in a Nutshell: Comprehensive Study Guide
I. Core Concepts of Machine Learning
A. Definition and Goal
- Machine Learning (ML): A field of artificial intelligence where systems learn from data to optimize a performance criterion, aiming to generalize to unseen data.
- Goal: To enable computers to learn from example data or past experience without being explicitly programmed, and then apply this learned knowledge to make predictions or decisions on new, unseen data.
B. The Machine Learning Pipeline (Nutshell Components)
- Data: The raw material for learning, consisting of examples/instances/samples, each with features/attributes and potentially a target/label.
- Task: The specific problem ML aims to solve (e.g., classification, regression, clustering).
- Representation: The way data and the learning function are structured (e.g., linear function, neural network).
- Score Function (Loss/Cost Function): A metric used to evaluate how well the model’s predictions align with the actual values. The goal is often to minimize this function.
- Search/Optimization: The algorithms used to find the best model parameters by minimizing the score function (e.g., gradient descent).
Models, Parameters, Hyperparameters, Metrics
- Models: The learned functions or structures.
- Parameters: Internal variables of the model learned from data (e.g., w, b in a linear classifier).
- Hyperparameters: Parameters whose values are set before the learning process begins (e.g., learning rate, regularization strength).
- Metrics: Measures used to evaluate model performance (e.g., accuracy, precision, recall, MSE).
C. Two Phases of Machine Learning
Training (Learning Parameters)
- Uses a training set of input-output pairs (X, Y).
- The objective is to find model parameters (e.g., w, b) that minimize the loss/cost function, representing the difference between predicted f(x) and true y on the training examples.
- Uses a testing set of examples (x?) that were not included in the training set.
- Evaluates the model’s performance by measuring the difference between true y? and predicted f(x?).
- Crucial for assessing the model’s ability to generalize to new data.
D. When to Use Machine Learning
- Extracting knowledge from data: Discovering hidden relationships and correlations in large datasets.
- Learning tasks difficult to formalize: Problems hard to define explicitly with rules (e.g., face recognition).
- Creating software that improves over time: Systems that can adapt to new knowledge and data without continuous manual redesign.
II. Data Types in Machine Learning
A. General Terminology
- Data/points/instances/examples/samples/records: Rows in a dataset.
- Features/attributes/dimensions/independent variables/covariates/predictors/regressors: Columns in a dataset (excluding the target).
- Target/outcome/response/label/dependent variable: The special column to be predicted.
B. Main Types of Columns
- Continuous: Real numbers (e.g., weight, temperature).
- Discrete: Symbols or categories (e.g., “Good”, “Bad”, “Cat”, “Dog”).
C. Examples of Data Types
- Tabular Data: Structured data in rows and columns (e.g., patient records with gene quantities).
- 1D Sequence Data: Data with a sequential order (e.g., language text, genome sequences, audio).
- 2D Grid Data: Data arranged in a grid (e.g., images).
- Graph Data: Data representing relationships between entities (e.g., social networks).
III. Machine Learning Tasks
A. Supervised Learning
Definition: Learning from labeled input-output pairs (X, Y). The goal is to predict Y for unseen X.
Classification: Predicting a discrete target variable (Y)
- Binary Classification: Two possible output classes (e.g., positive/negative review, disease present/absent).
- Multi-class Classification: More than two possible output classes (e.g., categorizing text into Sports, Business, Education).
- Multi-label Classification: Assigning a set of target labels to a single input (e.g., an image being tagged with “Castle” and “Mountains”).
- Hierarchical Classification: Classification where classes are organized in a hierarchy.
Regression: Predicting a continuous target variable (Y)
Generating: Creating new data instances based on learned patterns (e.g., Text2Image, Edges2Image)
B. Unsupervised Learning
- Definition: Learning from unlabeled data (no Y provided).
- Goal: Finding patterns, structures, or relationships within the data.
- Example: Clustering – grouping instances into “natural” clusters.
C. Structured Output Learning
- Definition: Prediction tasks where output labels have complex structured correlations or constraints (e.g., spatial, temporal, relational dependencies).
- Example: Language parsing (predicting a parse tree), sequence labeling (predicting a sequence of labels for a sequence of inputs).
D. Reinforcement Learning (RL)
- Definition: An agent interacts with an environment and learns to maximize a scalar reward signal.
- Key Characteristics: Not independent and identically distributed (IID) data, sequential decision-making.
- Variations: Basic RL (no labels/supervision), imitation learning (supervised).
IV. Representation Types
A. Concept of Representation
How the input data (X) and the function (f) mapping X to Y are structured.
B. Examples of Representations
- Linear functions: f(x,w,b) = sign(w^T x + b) for binary classification.
- Nonlinear functions: Polynomial expansion, logistic function, trees, multi-layer networks (neural networks).
- Prob-density family: Bernoulli, multinomial, Gaussian, mixture of Gaussians.
- Vector Space Representation: Representing text as vectors (e.g., Bag of Words).
- Representation Learning (Deep Learning/Feature Learning): Automatically learning useful features from raw data, often through multi-layer architectures. This replaces traditional, time-consuming “Feature Engineering.”
V. Loss/Cost Types
A. Purpose of Loss Functions
- Quantify the difference between the model’s prediction and the true target value.
- Minimizing the loss function guides the learning process.
B. Examples of Loss Functions
- Mean Squared Error (MSE): Common for regression tasks.
- Hinge Loss: Used in binary classification (e.g., Support Vector Machines).
- Log-likelihood / Cross-entropy: Common for classification tasks, especially with probabilistic models.
- 0-1 Loss: Simple loss for classification, counts misclassifications.
- Regularized loss functions (L1, L2): Add penalty terms to the loss function to control model complexity and prevent overfitting.
VI. Optimization and Model Properties
A. Search/Optimization Algorithms
Purpose: To find the optimal model parameters by minimizing the loss/cost function.
Examples:
- Gradient Descent (GD): An iterative first-order optimization algorithm that takes steps proportional to the negative of the gradient.
- Stochastic Gradient Descent (SGD): A variant of GD that updates parameters using a single training example or a small batch at a time.
- Newton’s Method: A second-order optimization algorithm.
- Linear Programming, Quadratic Programming: For specific types of optimization problems.
- EM (Expectation-Maximization): For models with latent variables.
- Backpropagation: Used to train neural networks by efficiently computing gradients.
B. Model Properties / Basic Concepts
- Generalization: The ability of a model to perform well on unseen data. This is the ultimate goal of ML.
- Overfitting: When a model learns the training data too well, including noise, leading to poor performance on new data.
- Underfitting: When a model is too simple to capture the underlying patterns in the data, leading to poor performance on both training and test data.
- Bias-Variance Tradeoff: A fundamental concept balancing model simplicity (high bias, low variance) and complexity (low bias, high variance).
- Big gap between train/validation loss suggests overfitting (high variance).
- No gap, both bad, suggests underfitting (high bias).
- Regularization: Techniques (e.g., L1, L2) that add a penalty to the loss function to discourage overly complex models and prevent overfitting.
VII. Evaluation and Advanced Considerations
A. Measuring Prediction Accuracy
- Evaluate models on test data using appropriate metrics.
- For supervised classification, common metrics include accuracy, precision, recall, F1-score.
B. Human-Centric ML/AI Research
- Trustworthy ML: Ensuring models are reliable and fair.
- Explainable ML: Understanding why a model makes certain predictions.
- Robustness: How well a model performs under adversarial conditions (e.g., adversarial examples).
- Interacting with/Understanding/Modeling Human Intent: Developing AI that better collaborates with humans.
Quiz: Machine Learning Fundamentals
Instructions: Answer each question in 2-3 sentences.
- What is the primary goal of machine learning, and how does it differ from traditional programming?
- Explain the concept of “generalization” in machine learning. Why is it important, and what happens if a model fails to generalize?
- Describe the two main phases of a machine learning process. What is the key difference in the data used during these phases?
- Define a “loss function” and explain its role in training a machine learning model.
- What is the difference between “parameters” and “hyperparameters” in a machine learning model? Provide a brief example for each.
- List three types of data commonly encountered in machine learning. Give a real-world example for each.
- Distinguish between supervised and unsupervised learning tasks. Provide a task example for each.
- Briefly explain what “representation learning” is and why it has become more prominent compared to traditional “feature engineering.”
- What is the “bias-variance tradeoff”? How might you diagnose a model that is suffering from high variance?
- Describe the purpose of “regularization” in machine learning. How does it help in model training?
Answer Key
-
Primary Goal of ML: The primary goal of ML is to optimize a performance criterion using example data or past experience, aiming to generalize to unseen data. It differs from traditional programming where rules are explicitly coded; instead, ML systems learn patterns from data to make predictions or decisions.
-
Generalization Concept: Generalization refers to a model’s ability to perform accurately on new, unseen data, not just the data it was trained on. It is crucial because the ultimate purpose of an ML model is to make predictions in real-world scenarios. A model that fails to generalize is said to be overfitting, meaning it has learned the training data too specifically, including noise, and will perform poorly on new examples.
-
Two Main Phases: The two main phases are training and testing/deployment. During training, the model learns parameters from a labeled “training set.” During testing, the model’s performance is evaluated on a separate “testing set” of examples that were not part of the training data.
-
Loss Function: A loss function quantifies the discrepancy between a model’s predicted output and the actual true output for a given example. Its role in training is to provide a metric that the optimization algorithm (e.g., gradient descent) attempts to minimize, thereby iteratively adjusting the model’s parameters to improve its accuracy.
-
Parameters vs Hyperparameters: Parameters are internal variables of the model learned from the data during training, like the weights (w) and bias (b) in a linear classifier. Hyperparameters are external configurations set before the training process begins, such as the learning rate in gradient descent or the regularization strength.
-
Three Data Types: Three common data types are tabular, 2D grid, and 1D sequence data. Tabular data can be patient records with various medical attributes. 2D grid data commonly refers to images, like photographs for object recognition. 1D sequence data includes text documents for sentiment analysis or genome sequences.
-
Supervised vs Unsupervised: Supervised learning involves learning from labeled input-output pairs (X, Y) to predict Y for new X; an example is classifying emails as spam or not spam. Unsupervised learning, conversely, deals with unlabeled data to find patterns or structures within it; an example is clustering customer data to identify distinct market segments.
-
Representation Learning: Representation learning is the process where a machine learning model automatically discovers the representations (features) needed for detection or classification from raw data. It has become prominent because it often yields more effective features and reduces the laborious, time-consuming, and often task-dependent manual effort required in traditional “feature engineering.”
-
Bias-Variance Tradeoff: The bias-variance tradeoff refers to the dilemma of simultaneously minimizing two sources of error that prevent models from generalizing well: bias (error from overly simplistic assumptions) and variance (error from sensitivity to small fluctuations in the training data). High variance (overfitting) can be diagnosed by observing a large gap between the model’s performance on the training data (very good) and its performance on validation or test data (significantly worse).
-
Regularization Purpose: Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function during training. This penalty discourages the model from becoming too complex or assigning excessively large weights to features. By controlling model complexity, regularization helps the model generalize better to unseen data.
2Regression
Linear
1Basic
Understanding Linear Regression: A Study Guide
Quiz
Instructions: Answer the following questions in 2-3 sentences each.
- What is the primary goal of Machine Learning in a nutshell, as described in the lecture?
- Define “regression” in the context of machine learning, specifically mentioning the nature of the target variable.
- Explain the purpose of the “score function” (or loss/cost function) in linear regression.
- Describe what the training dataset consists of for supervised regression.
- In the linear regression equation f(x) = wx + b, what do w and b represent?
- How does the concise notation ŷ = f(x) = x^T θ = θ^T x simplify the representation of linear regression with multiple features?
- What is the “Sum of Squared Error” (SSE) and why is it used as a loss function in linear regression?
- Briefly explain the concept of “Normal Equations” in the context of finding optimal regression coefficients.
- Why is it important for a learned model to “Generalize Well” to unseen data?
- What are the three main aims for a learned model besides generalizing well, as discussed in the lecture?
Answer Key
-
Primary Goal of Machine Learning: The primary goal of Machine Learning is to optimize a performance criterion using example data or past experience. The ultimate aim is to generalize this learned knowledge effectively to unseen data.
-
Regression Definition: In machine learning, regression refers to tasks where the target variable (Y) is continuous. This means the model predicts a numerical value rather than a category or class.
-
Score Function Purpose: The score function (or loss/cost function), denoted as L(), quantifies the difference between the model’s predictions and the actual target values. The goal of the learning algorithm is to minimize this function, thereby finding the optimal model parameters.
-
Training Dataset: The training dataset for supervised regression consists of input-output pairs. These pairs include available examples (input features, X) and their corresponding continuous target labels (output, Y), which the model uses to learn.
-
Linear Regression Parameters: In the linear regression equation f(x) = wx + b (for a single variable), w represents the weight or slope, indicating how much Y changes for a unit change in X. b represents the bias or intercept, which is the value of Y when X is zero.
-
Concise Notation: The concise notation ŷ = f(x) = x^T θ = θ^T x uses vector/matrix products to represent multiple features and their corresponding weights in a single expression. This compact form simplifies mathematical derivations and computational implementation for multivariate linear regression.
-
Sum of Squared Error (SSE): The Sum of Squared Error (SSE) measures the sum of the squared differences between the predicted values (f(x)) and the actual target values (y). Squaring the errors ensures that all errors contribute positively to the total loss and penalizes larger errors more heavily, providing a clear objective for minimization.
-
Normal Equations: The Normal Equations provide a closed-form solution to find the optimal regression coefficients (θ) by setting the derivative (gradient) of the cost function with respect to θ to zero. This method directly calculates the θ that minimizes the SSE for linear regression.
-
Generalization: Generalizing well means that the model performs accurately not only on the data it was trained on but also on new, unseen data. This indicates that the model has learned underlying patterns rather than just memorizing the training examples, making it useful in real-world applications.
-
Additional Model Aims: Besides generalizing well, the lecture suggests that a learned model should also be computationally scalable and efficient, and robust, trustworthy, and interpretable. These attributes contribute to the practical utility and reliability of machine learning models.
Linear Regression Basics
1. Machine Learning in a Nutshell
Definition: Machine Learning (ML) is a field that grew out of Artificial Intelligence (AI). Its core purpose is to “Optimize a performance criterion using example data or past experience, aiming to generalize to unseen data.”
Key Components of ML: The process of machine learning can be broken down into several fundamental elements:
- Task: What needs to be predicted or achieved (e.g., y in regression).
- Representation: How the data and model are structured (e.g., x, f()).
- Score Function: A measure of how well the model is performing (e.g., L()).
- Search/Optimization: The method used to find the best model parameters (e.g., argmin()).
- Models, Parameters: The specific mathematical structure and adjustable values of the model (e.g., w, b for regression coefficients).
- Data: The input used for training and testing (e.g., X: Tabular).
2. Regression Fundamentals
Definition: Regression specifically deals with predicting a continuous target variable y.
Linear Regression: In linear regression, the target Y is modeled as a “Weighted linear sum of Xs.”
- For a single variable x, the function is f(x) = wx + b. Here, w is the weight (slope) and b is the bias (intercept). A slope of 2 means “every 1-unit change in X yields a 2-unit change in Y.”
- For multiple features x₁, x₂, …, the model becomes ŷ = f(x) = θ₀ + θ₁x₁ + θ₂x₂. Here, θ₀ is the intercept, and θ₁, θ₂ are coefficients for x₁, x₂ respectively.
Supervised Learning: Linear Regression is a type of supervised learning. This means the training dataset consists of “input-output pairs” (X, Y), where Y (the target) is known for each input X.
3. Data Representation for Regression
Tabular Dataset Structure
- Data/points/instances/examples/samples/records: Rows of the dataset.
- Features/attributes/dimensions/independent variables/covariates/predictors/regressors: Columns of the dataset, excluding the target.
- Target/outcome/response/label/dependent variable: A special column (usually the last) that is continuous-valued and to be predicted.
Concise Vector/Matrix Notation
To simplify computations, especially with multiple features:
- Each data sample x is represented as a column vector.
- A pseudo feature x₀=1 (the intercept term) is added to the feature vector, making it x^T = [(x₀=1), x₁, x₂, …, xₚ].
- The parameter vector θ is also a column vector [θ₀, θ₁, …, θₚ]^T.
- The linear regression function can then be concisely written as ŷ = f(x) = x^T θ = θ^T x.
Training Set in Matrix Form: The entire training set with n samples can be represented as a matrix X (where each row is a feature vector x_i^T) and a target vector y_train = [y₁, y₂, …, yₙ]^T.
4. Training and Optimization
Training Goal: To find the optimal parameters (w, b or θ) by minimizing a Loss/Cost function L(). This function quantifies the “difference between y and f(x) on available examples in the training set.”
Sum of Squared Error (SSE) / Mean Squared Error (MSE)
The most common loss function for linear regression:
- J(θ) = 1/2 × Σ (f(xᵢ) - yᵢ)²
- In matrix form, J(θ) = 1/2 × (Xθ - y)^T (Xθ - y)
Finding Optimal Parameters (θ)
Since J(θ) is a convex function, its minimum can be found by taking the derivative (gradient) with respect to θ and setting it to zero (∇θJ(θ) = 0). This leads to the “normal equations”: X^T Xθ = X^T y.
The solution for θ is then θ* = (X^T X)^(-1) X^T y.
Convexity of J()
The loss function J(θ) is convex. “Intuitively, a convex function (1D case) has a single point at which the derivative goes to zero, and this point is a minimum.” For multivariate functions, if the Hessian matrix is positive definite (which X^T X is, given certain conditions), the function is convex and the point where the gradient is zero is a minimum.
Gradient Descent (GD) / Stochastic Gradient Descent (SGD)
These are iterative optimization algorithms used when a closed-form solution is not feasible or computationally too expensive. (Mentioned in outline, but not detailed in excerpts).
Generalization: A primary aim is for the “learned model [to] Generalize Well” to unseen data. It should also be “Computational Scalable and Efficient” and “Robust / Trustworthy / Interpretable.”
Testing Mode: After training, the model is tested on “Production Data” (also called test data), which consists of new inputs X? for which the output f(X?) is predicted.
Metrics for Regression
- Test MSE Error: A common metric reported for evaluating regression models, calculated on the test set: J_test = 1/m × Σ (x_i^T θ* - y_i)².
- Other metrics exist (referencing scikit-learn documentation).
6. Computational Considerations
Why Vector/Matrix Form? It allows for “Concise Notation,” “Closed form solution” for training, and efficient computation for testing (ŷ_test = X_test θ*).
Computational Cost (Naive Matrix Multiply)
A naive matrix multiplication C = C + A×B has a complexity of O(n³) flops.
Optimizing Computational Cost
Data Parallelization
Utilizing CPU SIMD (Single Instruction, Multiple Data), multithreading, and GPU parallelization. SIMD allows “one operation [to produce] multiple results.”
Memory Hierarchy/Locality
Exploiting spatial (accessing nearby data) and temporal (reusing data) locality to improve average memory access speed through caches.
Better Algorithms
Examples include Strassen’s Matrix Multiply, which reduces complexity to O(n^2.81) (asymptotically faster than naive O(n³)).
BLAS (Basic Linear Algebra Subroutines)
Industry-standard interfaces (e.g., numpy wraps BLAS) providing optimized implementations for vector (BLAS1), matrix-vector (BLAS2), and matrix-matrix (BLAS3) operations. BLAS3 is generally preferred for its potential for higher computational intensity.
7. Advanced/Optional Concepts
Probabilistic Interpretation: Linear regression can be viewed probabilistically by assuming the error term ε (representing unmodeled effects or random noise) follows a Gaussian distribution N(0, σ). This leads to the concept of Maximum Likelihood Estimation for finding θ.
Gram Matrix: X^T X is referred to as a Gram Matrix and is always positive semi-definite. If X has full rank, X^T X is positive definite and invertible, ensuring a unique minimum for J(θ).
Regularization: Used when X has less than full column rank to handle non-invertible X^T X (not detailed in these excerpts but mentioned as a solution).
Scalability to Big Data: While polynomial time algorithms are good, O(n) can still be too slow for truly massive datasets, leading to research in “low rank, sparse, hardware, sampling, randomized…” solutions.
2Regression
Optimization
Optimization for Linear Regression: A Study Guide
I. Course Overview & Core Concepts
This section of the course focuses on optimization techniques for Linear Regression (LR), specifically for tabular data. It revisits fundamental machine learning concepts within this context.
A. Course Sectioning Context
- Section 1: Basic Supervised Regression on Tabular Data
- Section 2: Basic Deep Learning on 2D Imaging Data
- Section 3: Advanced Supervised Learning on Tabular Data
- Section 4: Generative and Deep Learning on 1D Sequence Text Data
- Section 5: Unsupervised Learning (Mostly on Tabular Data)
B. Regression Fundamentals
- Regression: Deals with predicting a continuous output variable (y).
- Linear Regression (LR): Models the output (Y) as a weighted linear sum of input features (Xs).
- Sum of Squared Error (Least Squares): The standard cost function used to measure the difference between predicted and actual values in LR.
- Metrics, Implementation, Regression Coefficients (w, b): Key aspects to consider in LR.
C. Machine Learning Task Components
- Task: Predicting ‘y’.
- Representation: Input features ‘x’ and the function ‘f()’.
- Score Function: The loss function ‘L()’.
- Search/Optimization: The process of finding ‘argmin()’ for the loss function.
- Models, Parameters: The chosen model and its adjustable parameters.
- Data: ‘X’, typically tabular.
- Prediction: ŷ = f(x) = θ^T x (where θ represents the parameters/weights).
D. Optimization Principles
- Objective Function: The function to be minimized or maximized (e.g., the cost function).
- Variables: The parameters being optimized (e.g., weights ‘w’ or ‘θ’).
- Constraints: Conditions that the variables must satisfy.
- Goal: Find variable values that optimize the objective function while satisfying constraints.
II. Methods for Optimizing Linear Regression
The lecture covers two main approaches to optimizing Linear Regression: Direct Optimization (Normal Equation) and Iterative Optimization (Gradient Descent variants).
A. Method 1: Directly Optimize (Normal Equation)
Concept: For convex functions (like the Sum of Squared Error in LR), the minimum occurs where the derivative (slope) is zero.
Process:
- Write the cost function J(θ) in matrix form.
- Take the gradient of J(θ) with respect to θ and set it to zero.
- Solve the resulting equation for θ.
Formula: θ = (X^T X)^(-1) X^T y
Pros: A single-shot algorithm, easiest to implement.
Cons:
- Requires computing the pseudo-inverse (X^T X)^(-1), which can be computationally expensive (especially for large datasets).
- Prone to numerical issues if the matrix is singular or ill-conditioned.
Direct vs. Iterative: A direct method, meaning it finds the solution in a finite number of steps.
B. Method 2: Iteratively Optimize via Gradient Descent
Concept: An iterative algorithm that moves towards the minimum of a function by taking steps proportional to the negative of the gradient at the current point.
Gradient Definition: A vector whose entries contain the partial derivatives of the function with respect to each variable. It points in the direction of the greatest rate of increase of the function.
General Algorithm:
- Initialize k=0, choose an initial x₀ (randomly or by prior).
- While k < k_max (or until a stopping criterion is met): x_k = x_{k-1} - α∇x F(x{k-1})
Key Factors Influencing GD:
- Learning Rate (α): Determines the step size. A critical parameter.
- Objective Function: Its shape (e.g., convexity) affects convergence.
- Starting Point: Can influence which local minimum is found (though not an issue for LR’s convex cost function).
- Stop Criterion: When to stop the iterative process.
Pros:
- Works on any objective function as long as its gradient can be evaluated.
- Useful for minimizing complex functions.
- Guaranteed convergence for convex functions like LR’s cost function.
Cons:
- Can be slow converging (especially batch GD).
- Can get stuck in local minima for non-convex functions (not an issue for LR).
- Learning rate selection is crucial.
C. Variants of Gradient Descent
1. Batch Gradient Descent (GD)
Concept: Calculates the gradient of the cost function with respect to the parameters for all training examples in each iteration.
Update Rule: θ_{t+1} = θ_t - α∇_θ J(θ_t) = θ_t + αX^T (y - Xθ_t)
Characteristics:
- Computes the exact gradient in each step.
- Smooth convergence path.
- Can be computationally expensive for very large datasets, as it needs to process the entire dataset per update.
2. Stochastic Gradient Descent (SGD)
Concept: Calculates the gradient and updates the parameters for each individual training example (or a very small subset) at a time.
Update Rule (for a single training point i): θ_{t+1} = θ_t + α(y_i - x_i^T θ_t)x_i
Characteristics:
- “Noisier” updates due to processing one example at a time.
- Faster per-step computation compared to batch GD.
- Can escape shallow local minima due to the noise (advantage for non-convex problems).
- Useful for massive datasets that don’t fit in memory or for online/streaming data.
Epoch: One complete pass over the entire training data when using SGD for offline training.
Pros:
- Low per-step cost.
- Can be faster converging to an approximate solution.
- Less prone to getting stuck in local optima (for non-convex cases).
- Efficient for large datasets and online learning.
Cons:
- Convergence to the exact optimum is not always guaranteed (it tends to oscillate around the minimum).
- More variation in the objective function during training.
3. Mini-Batch Gradient Descent
Concept: A compromise between Batch GD and SGD. Computes the gradient on a small “mini-batch” of samples (e.g., B=32, 64, 128).
Update Rule: θ_{t+1} = θ_t + (1/B)αX_B^T (y_B - X_B θ_t) (where B refers to the mini-batch size and X_B, y_B are the mini-batch data)
Characteristics:
- Combines benefits of both: more stable updates than SGD, but computationally more efficient than Batch GD (especially with GPU acceleration).
- Commonly used in practice.
- Special Cases: B=1 is standard SGD, B=N (total data size) is standard Batch GD.
Pros:
- Much faster computationally than single-point SGD (better use of computer architecture like GPUs).
- Low per-step cost.
- Fast convergence.
D. Stopping Criteria for Gradient Descent
- Predetermined Maximum Number of Iterations: A common and simple stopping rule.
- Threshold for Improvement: Stop when the improvement in the objective function (e.g., MSE) drops below a certain threshold.
- Monitoring Training Error: Plotting Train-MSE per sample to observe its decrease over epochs is helpful for understanding behavior.
Quiz: Optimization for Linear Regression
- What is the primary goal of optimization in the context of Linear Regression?
- Briefly explain the main idea behind the “Normal Equation” method for Linear Regression.
- What are two significant disadvantages of using the Normal Equation, especially for large datasets?
- Define “Gradient Descent” as an iterative optimization algorithm.
- List three critical factors that influence the behavior and performance of Gradient Descent.
- How does Stochastic Gradient Descent (SGD) differ from Batch Gradient Descent (GD) in terms of gradient calculation and parameter updates?
- Provide two advantages of using Stochastic Gradient Descent (SGD) compared to Batch Gradient Descent for very large datasets.
- What is “Mini-Batch Gradient Descent,” and why is it often preferred in practice?
- Explain the concept of an “epoch” in the context of Stochastic Gradient Descent.
- Name two common stopping criteria for iterative optimization algorithms like Gradient Descent.
Quiz Answer Key
-
Primary Goal of Optimization: The primary goal of optimization in Linear Regression is to find the set of model parameters (weights ‘w’ or ‘θ’) that minimize the chosen cost function, typically the Sum of Squared Error (Least Squares). This minimization results in the best-fit line or hyperplane for the given data.
-
Normal Equation Method: The Normal Equation method directly solves for the optimal Linear Regression parameters by setting the gradient of the cost function to zero. It analytically calculates the parameters in a single step using matrix operations, without needing an iterative process.
-
Disadvantages of Normal Equation: Two significant disadvantages of the Normal Equation are its high computational cost due to the requirement of computing a matrix inverse, particularly for large numbers of features, and its susceptibility to numerical issues if the (X^T X) matrix is singular or ill-conditioned.
-
Gradient Descent Definition: Gradient Descent is an iterative optimization algorithm that minimizes an objective function by repeatedly moving in the direction opposite to the gradient of the function. Each step adjusts the parameters by a small amount proportional to the negative of the gradient, gradually approaching a local (or global) minimum.
-
Critical Factors for GD: Three critical factors influencing Gradient Descent are the learning rate (α), which determines the step size; the shape of the objective function, which dictates the gradient landscape; and the starting point, which can affect the path to the minimum or which local minimum is found.
-
SGD vs Batch GD: Stochastic Gradient Descent (SGD) updates parameters using the gradient calculated from a single training example at a time, leading to noisy but frequent updates. In contrast, Batch Gradient Descent (GD) calculates the gradient using all training examples before making a single parameter update, resulting in smoother but less frequent updates.
-
Advantages of SGD: Two advantages of SGD for very large datasets are its low per-step computational cost, as it processes only one sample at a time, making it suitable for data that doesn’t fit in memory, and its ability to converge faster to an approximate solution.
-
Mini-Batch Gradient Descent: Mini-Batch Gradient Descent is an optimization technique that computes the gradient and updates parameters using a small, randomly selected subset (mini-batch) of the training data. It is often preferred because it balances the stability of Batch GD with the computational efficiency of SGD, leading to faster training and better utilization of hardware like GPUs.
-
Epoch Concept: In the context of Stochastic Gradient Descent, an “epoch” refers to one complete pass through the entire training dataset. If SGD is used for offline training, it repeatedly cycles through all samples in the dataset, and each such pass constitutes an epoch.
-
Stopping Criteria: Two common stopping criteria for Gradient Descent algorithms include a predetermined maximum number of iterations (epochs) and stopping when the improvement in the objective function (e.g., Train MSE) between consecutive iterations drops below a specified small threshold.
2Regression
Nonlinear
ModelSelection
Local
Understanding Linear Regression with Basis Functions Expansion
Study Guide
This study guide is designed to review your understanding of Linear Regression with Basis Functions Expansion, as presented in the provided lecture material.
I. Core Concepts of Linear Regression
Definition of Regression
What is the primary characteristic of the y variable in a regression task?
Goal of Linear Regression with Basis Expansion
How does Linear Regression (LR) achieve the ability to model non-linear relationships using basis functions?
Key Components of a Machine Learning Task
Identify and briefly describe the four fundamental components mentioned for any machine learning task:
- Task: What needs to be predicted
- Representation: How data and functions are structured
- Score Function: Metric to evaluate model performance
- Search/Optimization: Algorithms to find optimal parameters
Parameters vs. Models
Differentiate between “models” and “parameters” in the context of machine learning.
Optimization Methods
List the three primary methods mentioned for optimizing the Sum of Squared Error:
- Normal Equation
- Gradient Descent (GD)
- Stochastic Gradient Descent (SGD)
“KEY” Insight
Explain the fundamental implication of having predefined basis functions on the nature of the learning problem.
II. Basis Functions Expansion
Purpose of Basis Functions
Why are basis functions introduced in linear regression?
Types of Basis Functions
Name at least three different types of basis functions mentioned in the lecture:
- Polynomial basis functions
- Radial Basis Functions (RBFs)
- Trigonometric basis functions
Polynomial Regression
What is the specific form of the basis function φ(x) for polynomial regression up to degree two (d=2)?
For degree 2: φ(x) = [1, x, x²]ᵀ
How is the prediction ŷ formulated using polynomial basis functions?
ŷ = θᵀφ(x) = θ₀ + θ₁x + θ₂x²
Radial Basis Functions (RBFs)
Define what an RBF is in terms of its dependency.
RBFs are functions whose output depends only on the distance from a central point.
Describe the characteristic behavior of a Gaussian RBF as distance from its center increases.
The output decreases exponentially as the distance from the center increases, creating a bell-shaped curve.
What are the “hyperparameters” that users need to define for RBF basis functions?
- Centers (μⱼ): The central points where RBFs are located
- Widths (λⱼ): Parameters controlling the spread of the RBF
Provide the general mathematical form of a Gaussian RBF, specifically φⱼ(x).
φⱼ(x) = exp(-λⱼ |
|
x - μⱼ |
|
²) |
III. Parametric vs. Non-Parametric Models
Parametric Learning Algorithm
Define a parametric learning algorithm, using (unweighted) linear regression as an example. What is a key characteristic regarding the need for training data after fitting?
Parametric algorithms have a fixed number of parameters that are learned from data. Once trained, the training data can be discarded as predictions are made using only the learned parameters.
Non-Parametric Learning Algorithm
Define a non-parametric learning algorithm, providing examples like K-Nearest Neighbor (KNN) or Locally Weighted Linear Regression. How does the “amount of knowledge” required to represent the hypothesis differ from parametric models?
Non-parametric algorithms require the entire training dataset to be kept for making predictions. The amount of knowledge grows with the size of the training set.
General Prediction Equation
Write down the general equation for ŷ (predicted output) using basis functions φ(x) and weights θ.
ŷ = θᵀφ(x)
Normal Equation
Provide the formula for calculating the optimal weight vector θ* using the Normal Equation, given φ(x) and y.
θ* = (ΦᵀΦ)⁻¹Φᵀy
where Φ is the design matrix with rows φ(xᵢ)ᵀ
Quiz: Linear Regression with Basis Functions
Instructions: Answer each question in 2-3 sentences.
-
How does Linear Regression with basis functions expansion allow for the modeling of non-linear relationships, despite being fundamentally a “linear” model?
-
What is the significance of the “KEY” insight mentioned in the lecture regarding predefined basis functions?
-
Describe the primary goal of the “Search/Optimization” component in a machine learning task, as it relates to the score function.
-
If you are performing polynomial regression with a degree up to three, what would be the form of the basis function vector φ(x)?
-
Explain why Radial Basis Functions (RBFs) are often described as “bell-shaped” or having an output that decreases with distance from a center.
-
What role do the “centers” and “widths” play in defining Radial Basis Functions?
-
Differentiate between the “Task” and “Representation” components of a machine learning problem.
-
Briefly explain the main difference in how parametric and non-parametric learning algorithms utilize training data after the initial fitting phase.
-
Which optimization methods, other than the Normal Equation, are mentioned for finding the optimal regression coefficients?
-
In the context of the general prediction equation ŷ = θᵀφ(x), what do θ and φ(x) represent, respectively?
Quiz Answer Key
-
Non-linear Modeling: Linear Regression with basis functions models non-linear relationships by transforming the input features x into a higher-dimensional space using non-linear basis functions φ(x). The relationship between the transformed features φ(x) and the output y remains linear, allowing linear regression techniques to be applied.
-
KEY Insight: The “KEY” insight is that even when non-linear basis functions are used, the problem of learning the parameters from the data is still considered Linear Regression. This is because the model is linear with respect to the parameters θ, even if it’s non-linear with respect to the original input x.
-
Search/Optimization Goal: The “Search/Optimization” component’s primary goal is to find the set of model parameters that minimize (or maximize, depending on the objective) the score function. This process iteratively adjusts the parameters to achieve the best possible fit to the training data according to the defined score.
-
Polynomial Basis Vector: For polynomial regression with a degree up to three, the basis function vector φ(x) would be [1, x, x², x³]ᵀ. This expands the single input feature x into a set of features including its powers.
-
RBF Bell Shape: RBFs are “bell-shaped” because their output typically reaches a maximum at a central point and then decreases symmetrically as the input moves further away from that center. For Gaussian RBFs, this decay is exponential, creating a characteristic bell-like curve.
-
RBF Centers and Widths: In RBFs, “centers” (μⱼ) define the point in the input space where the basis function’s influence is maximal, while “widths” (λⱼ) determine how rapidly the function’s output decreases as the input moves away from that center, thus controlling the spread or local influence of the RBF.
-
Task vs Representation: “Task” defines what needs to be predicted (e.g., y is continuous for regression). “Representation” refers to how the input data x and its potential transformations f() are structured for the model, essentially defining the features used for prediction.
-
Parametric vs Non-parametric Data Usage: Parametric algorithms, once trained, can discard the training data as predictions are made solely using the learned, fixed parameters. Non-parametric algorithms, in contrast, require the entire training dataset to be kept available for making future predictions, as their “knowledge” grows with the data size.
-
Other Optimization Methods: Besides the Normal Equation, the lecture also mentions Gradient Descent (GD) and Stochastic Gradient Descent (SGD) as optimization methods for finding optimal regression coefficients by minimizing the Sum of Squared Error.
-
Prediction Equation Components: In ŷ = θᵀφ(x), θ represents the vector of regression coefficients (weights) that the model learns from the data. φ(x) represents the basis function expansion of the input x, which transforms the original features into a potentially higher-dimensional space.
2Regression
Nonlinear
ModelSelection
Local
Model Selection and Validation Study Guide
Quiz: Short Answer Questions
-
What is the primary goal of model selection in machine learning?
The primary goal is to choose the right model type and its hyperparameters to ensure good generalization, meaning the model can accurately predict future data drawn from the same distribution, rather than just fitting the training data well. This involves balancing between underfitting and overfitting.
-
Explain the difference between overfitting and underfitting.
Underfitting occurs when a model is too simple to capture the underlying patterns in the data, leading to high training and test errors (high bias). Overfitting happens when a model is too complex and fits noise in the training data, resulting in low training error but high test error (high variance).
-
Why is simply choosing the model with the best fit to the training data not a reliable strategy for model selection?
Choosing the model with the best fit to the training data is unreliable because it often leads to overfitting. An overfit model memorizes the training data, including noise, and performs poorly on new, unseen data, failing to generalize effectively.
-
Describe the process of the “Train-Validation” (Hold-out) method for model selection.
The train-validation method involves splitting the labeled data into two sets: a training set (e.g., 70%) and a validation set (e.g., 30%). The model is trained on the training set, and its performance is then estimated using the validation set to evaluate its future performance.
-
What are the main advantages and disadvantages of the “Train-Validation” method?
Advantages include its simplicity and ease of implementation. Disadvantages are that it “wastes data” (the validation set is not used for training the final model) and can lead to a high variance in performance estimation if the validation set is small or unrepresentative.
-
What problem does K-Fold Cross-Validation aim to solve compared to the simple train-validation method?
K-Fold Cross-Validation addresses the issue of data wastage and high variance in performance estimation inherent in the train-validation method, especially when data is scarce. It ensures that each data point is used for both training and validation, providing a more robust estimate.
-
Briefly explain how K-Fold Cross-Validation works.
In K-Fold Cross-Validation, the dataset is divided into K equal “folds.” The model is trained K times; in each iteration, one fold is used as the validation set, and the remaining K-1 folds are used as the training set. The scores from each validation step are collected, and their mean is typically reported as the overall performance estimate.
-
What is Leave-One-Out Cross-Validation (LOOCV), and how does it relate to K-Fold Cross-Validation?
Leave-One-Out Cross-Validation (LOOCV) is a specific type of K-Fold Cross-Validation where K is equal to n, the number of data points in the dataset. In LOOCV, each data point individually serves as the validation set, and the model is trained on the remaining n-1 points.
-
When analyzing a validation curve (score vs. model complexity), what typically happens to the training and validation scores as model complexity increases, and why?
As model complexity increases, the training score generally increases (or error decreases) because a more complex model can fit the training data better. The validation score will initially increase alongside the training score but will eventually start to decrease after a certain point due to overfitting, as the model begins to learn noise from the training data.
-
Define Bias and Variance in the context of machine learning models.
Bias refers to the error introduced by approximating a real-world problem, which may be complex, by a simplified model. It represents how much the average model over all training sets differs from the true model. Variance refers to how much the models estimated from different training sets differ from each other, reflecting the model’s sensitivity to small fluctuations in the training data.
-
Compare and contrast the Train-Validation (Hold-out), K-Fold Cross-Validation, and Leave-One-Out Cross-Validation methods for model selection. Discuss their respective advantages, disadvantages, and typical use cases, particularly considering data availability.
-
Elaborate on the concepts of overfitting and underfitting. Explain how these phenomena manifest in model performance metrics (training error, test error) and discuss various strategies for mitigating each, referencing the bias-variance tradeoff.
-
Discuss the “generalization” goal in machine learning. Why is it more important than achieving a perfect fit to training data, and what techniques are employed to ensure a model generalizes well to new, unseen data?
-
Explain the significance of the bias-variance tradeoff in model selection. Provide examples of how model complexity influences bias and variance, and describe how one might navigate this tradeoff using techniques discussed in the source material.
-
Imagine you are tasked with selecting the optimal polynomial degree for a regression model on a limited dataset. Describe a step-by-step process you would follow, incorporating at least two different validation techniques, to make an informed decision while addressing potential pitfalls.
Glossary of Key Terms
-
Model Selection: The process of choosing the right model type and its hyperparameters to achieve the best performance on unseen data.
-
Hyperparameter: A parameter whose value is used to control the learning process, which is set before the learning process begins (e.g., polynomial degree, number of folds in cross-validation).
-
Overfitting: A phenomenon where a model learns the training data too well, including its noise and irrelevant details, leading to excellent performance on training data but poor performance on new, unseen data. Characterized by low bias and high variance.
-
Underfitting: A phenomenon where a model is too simple to capture the underlying patterns in the training data, resulting in poor performance on both training and new data. Characterized by high bias and low variance.
-
Generalization: The ability of a machine learning model to perform accurately on new, unseen data, reflecting its capacity to “explain,” “predict,” or “model” new examples from the same distribution as the training data.
-
Training Set: The portion of the labeled dataset used to train the machine learning model.
-
Validation Set (Hold-out Set): A separate portion of the labeled dataset used to estimate the model’s performance during training and hyperparameter tuning, helping to avoid overfitting to the test set.
-
Test Set: A completely independent portion of the labeled dataset, kept separate from both training and validation, used to provide an unbiased evaluation of the final chosen model’s performance on new data.
-
Mean Squared Error (MSE): A common metric for regression models, calculated as the average of the squares of the differences between predicted and actual values. Lower MSE indicates better model performance.
-
Variance (of performance estimator): Refers to how much an estimate of performance might change if a different validation set or data split were used. High variance implies an unreliable estimate.
-
K-Fold Cross-Validation: A resampling procedure used to evaluate machine learning models on a limited data sample. The dataset is divided into K folds, and the model is trained and validated K times, with each fold serving as the validation set once.
-
Leave-One-Out Cross-Validation (LOOCV): A special case of K-Fold Cross-Validation where K equals the number of data points (n). Each data point is used as a validation set, with the remaining n-1 points used for training.
-
Bias: The error due to inaccurate assumptions or simplifications made by the model. A high bias model typically underfits the data.
-
Variance: The error due to the model’s sensitivity to small fluctuations in the training data. A high variance model typically overfits the data.
-
Bias-Variance Tradeoff: The inherent conflict in machine learning where reducing one type of error (bias or variance) tends to increase the other. The goal is to find a balance that minimizes overall prediction error.
-
Learning Curve: A plot showing the performance of a learning model (e.g., score or error) on both the training set and validation set as a function of the number of training examples or model complexity.
-
Validation Curve: A plot showing the performance of a learning model (e.g., score or error) on both the training set and validation set as a function of a model hyperparameter (e.g., model complexity).
2Regression
Optimization
Regularization
ModelSelection
Study Guide: Regularized Linear Regression
This guide provides a comprehensive review of the concepts surrounding regularized multivariate linear regression, based on the provided source material. It includes a quiz to test understanding, essay questions for deeper exploration, and a glossary of key terminology.
Quiz
Answer the following questions in 2-3 sentences each, based on the provided source material.
- Why is standard linear regression often problematic for datasets where the number of features (p) is greater than the number of samples (n)?
- What is regularization in the context of machine learning, and what problem does it primarily address?
- Describe the penalty term used in Ridge (L2) regression and explain how it affects the model’s coefficients.
- Describe the penalty term used in LASSO (L1) regression and explain its key advantage over Ridge regression.
- What is the primary purpose of the regularization parameter, λ (lambda), and how is its optimal value typically determined?
- Explain the concept of a “regularization path” and what it visualizes.
- What is Elastic Net regularization and why was it developed?
- From a geometric perspective, why does LASSO (L1) regularization tend to produce sparse models (i.e., set some coefficients to exactly zero)?
- What are the three main goals for a trained machine learning model mentioned in the lecture?
- What is the “grouping effect” in the context of Elastic Net, and when is it particularly useful?
Quiz Answer Key
-
When p > n, the matrix X^T X in the normal equation for linear regression has less than full column rank and is therefore not invertible. This prevents the direct calculation of a unique solution for the regression coefficients using Ordinary Least Squares.
-
Regularization is an additional criterion added to the loss function to ensure the model does not overfit the training data. It primarily addresses the issue of overfitting by keeping the model’s parameters more “normal” or regular, effectively adding a bias that prefers certain types of weights (e.g., smaller ones) over others.
-
Ridge regression adds an L2 penalty, which is proportional to the sum of the squared coefficient weights (λ Σ βj²). This penalty shrinks the coefficients towards zero, with the effect becoming stronger as λ increases. However, it does not set the coefficients exactly to zero.
-
**LASSO regression uses an L1 penalty, proportional to the sum of the absolute values of the weights (λ Σ |
βj |
).** Its key advantage is that it performs both shrinkage and variable selection simultaneously, as the penalty can force some coefficients to be exactly zero, creating a more interpretable and sparse model. |
-
The regularization parameter λ controls the strength of the penalty; λ=0 corresponds to standard least squares, while a large λ pushes coefficients towards zero. The optimal value is chosen to ensure good generalization ability on new data and is typically determined using k-fold cross-validation.
-
A regularization path is a plot that visualizes how the model’s regression coefficients (βj) change in value as the regularization parameter λ is varied. It illustrates the effect of the penalty on each feature’s weight, from the unregularized solution to a highly constrained one.
-
Elastic Net is a hybrid regularization method that combines both L1 (LASSO) and L2 (Ridge) penalties. It was developed to address disadvantages of LASSO, such as its limitation of selecting at most n variables in a p>n scenario and its tendency to arbitrarily select only one variable from a group of highly correlated ones.
-
Geometrically, the L1 constraint region is a diamond shape with sharp corners that lie on the axes. The elliptical contour lines of the least squares error function are more likely to first make contact with this region at one of its corners. This contact point corresponds to a solution where coefficients for other axes are zero, thus producing sparsity.
-
The three main goals for a trained model are to (1) generalize well to new data, (2) be computationally scalable and efficient, and (3) be trustworthy, meaning it is robust and interpretable.
- The “grouping effect” is a property of Elastic Net that encourages the model to select or deselect groups of highly correlated variables together. This is useful when features are naturally grouped, as LASSO by itself tends to arbitrarily select only one variable from such a group.
Essay Questions
The following questions are designed for longer, essay-style responses to encourage a deeper synthesis of the material. No answers are provided.
-
Compare and contrast Ridge, LASSO, and Elastic Net regularization. Discuss the mathematical formulation of their penalty terms, their impact on model coefficients, and the specific scenarios where one might be preferred over the others.
-
Explain the problem of overfitting in multivariate linear regression, particularly in the context of “large p, small n” datasets. How does regularization serve as a solution to this problem, and what is the trade-off involved?
-
Describe the role and importance of the hyperparameter λ in regularized regression. Detail the process of using k-fold cross-validation to select an optimal λ and explain why this process is crucial for building a model that generalizes well.
-
Using the geometric interpretation of regularized regression, explain why L1 regularization (LASSO) leads to sparse solutions while L2 regularization (Ridge) does not. Use the concepts of constraint regions and objective function contour lines in your explanation.
-
Discuss the practical application of regularized regression in a real-world problem like text regression for predicting movie revenues. Explain how model interpretability, a key feature of LASSO and Elastic Net, is valuable in such a domain.
Glossary of Key Terms
Term |
Definition |
Regularization |
An additional criterion added to a model’s loss function to prevent overfitting by keeping the parameters “regular” or small. |
Overfitting |
A state where a model is too complex and learns “noise” from the training data, resulting in low training error but high error on new, unseen data. |
Underfitting |
A state where a model is too simple, resulting in high errors on both the training and test datasets. |
Ridge Regression (L2) |
A regularized linear regression method that adds a penalty proportional to the sum of the squared coefficient weights (λ Σ βj²). It performs parameter shrinkage. |
LASSO (L1) |
Stands for “Least Absolute Shrinkage and Selection Operator.” A regularized regression method that adds a penalty proportional to the sum of the absolute values of the weights (λ Σ |βj|). It performs both shrinkage and variable selection. |
Elastic Net |
A regularized regression method that combines the L1 and L2 penalties. It was developed to handle highly correlated variables and p>n scenarios more effectively than LASSO alone. |
Normal Equation |
The analytical solution for Ordinary Least Squares regression: β = (XᵀX)⁻¹ Xᵀy. It is not solvable if XᵀX is non-invertible. |
p > n problem |
A common scenario in datasets, such as in gene expression or text analysis, where the number of features (p) is greater than the number of data samples (n). |
λ (Lambda) |
The regularization parameter, also known as a hyperparameter, that controls the strength of the penalty term in regularized models. Its value is often chosen via cross-validation. |
K-fold Cross Validation |
A technique used to evaluate a model’s ability to generalize to new data and to select optimal hyperparameters like λ. |
Regularization Path |
A plot that shows how the values of the estimated coefficients (βj) change as the regularization parameter λ is varied from zero to a large value. |
Shrinkage |
The effect of regularization where coefficient values are reduced in magnitude, pulling them closer to zero. This helps to reduce model complexity. |
Sparsity |
A property of a model where many of its coefficients are exactly zero. This is a primary feature of LASSO, which effectively performs feature selection. |
Grouping Effect |
A property of Elastic Net where it tends to select groups of highly correlated predictor variables together, rather than arbitrarily choosing one from the group as LASSO might. |
Pearson Correlation (r) |
A statistical measure of the linear correlation between two variables, yielding a value between -1 (total negative correlation) and +1 (total positive correlation). |
3Classification
5Theory
Local
ModelSelection
K-Nearest Neighbor (KNN) Study Guide
Short-Answer Quiz
Answer each question in 2-3 sentences.
- Describe the three main categories of supervised classifiers and provide an example for each.
- Explain the concept of “lazy learning” as it applies to the K-Nearest Neighbor algorithm.
- What are the three essential components required for a K-Nearest Neighbor (Instance-based) method?
- Contrast the process for making a prediction in KNN for a regression task versus a classification task.
- What happens during the “Training Mode” of a naïve K-Nearest Neighbor algorithm?
- Outline the three steps involved in the “Testing Mode” when using KNN to classify an unknown sample.
- Why is feature scaling important for K-Nearest Neighbor algorithms? Provide a brief example.
- Discuss the bias-variance tradeoff when selecting the value of k. What is the effect of a k that is too small or too large?
- What is the “curse of dimensionality,” and how can it negatively impact the performance of a KNN model?
- According to the Cover and Hart theorem (1967), what is the asymptotic relationship between the error rate of 1-nearest-neighbor classification and the Bayes rate?
Quiz Answer Key
-
The three major types of supervised classifiers are Discriminative, Generative, and Instance-based. Discriminative classifiers, like logistic regression or support vector machines, directly estimate a decision boundary. Generative classifiers, such as Naïve Bayes, build a statistical model of the data. Instance-based classifiers, like KNN, use the training observations directly without building an explicit model.
-
Instance-based learning is referred to as “lazy learning” because the decision of the output value is delayed until a new instance arrives for prediction. In KNN, no model is built during a training phase; instead, the original training instances themselves represent the knowledge, and computation is deferred until a prediction is requested.
-
The three essential components for a KNN method are: (1) the set of stored training samples, (2) a distance metric to compute the distance between samples (e.g., Euclidean distance), and (3) the value of k, which specifies the number of nearest neighbors to retrieve.
-
For a regression task, where the target variable is continuous, KNN makes a prediction by taking the mean value of the k nearest training examples. For a classification task, where the target is discrete, the prediction is determined by a majority vote among the class labels of the k nearest neighbors.
-
In the naïve version of the K-Nearest Neighbor algorithm, the training mode involves doing nothing other than storing the training samples. The algorithm does not build a model but simply retains the dataset, which will be used directly during the testing or deployment phase.
-
The three steps in the testing mode are: (1) Compute the distance from the unknown sample to all records in the training set. (2) Identify the k nearest neighbors based on these computed distances. (3) Use the class labels of these nearest neighbors to determine the class label of the unknown record, typically by taking a majority vote.
-
Feature scaling is important because attributes with larger ranges can dominate distance calculations, skewing the model’s performance. For example, if one feature is a person’s income (e.g., $10K to $1M) and another is height (e.g., 1.5m to 1.8m), the income attribute would have a much larger influence on the Euclidean distance calculation unless the features are scaled to a comparable range.
-
When selecting k, a small value can lead to a model that is sensitive to noise points, resulting in high variance and an unstable, though potentially accurate, local model. Conversely, a large k can cause the neighborhood to include points from other classes, leading to a model with high bias and an inaccurate estimation, though it will be more stable.
-
The “curse of dimensionality” refers to the problem where a nearest neighbor algorithm can be easily misled when working with high-dimensional data (i.e., data with many features). In a high-dimensional space, most points are far apart, and the concept of a “close” neighbor becomes less meaningful, especially if many attributes are irrelevant to the target function.
-
The Cover and Hart theorem states that asymptotically (with an infinite number of training samples), the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate. The Bayes rate is the error rate of an optimal classifier that knows the true data-generating model, representing the lowest achievable error.
Essay Questions
- Discuss the core principles of instance-based learning. How does KNN exemplify these principles, and what are the primary advantages and disadvantages of this “lazy” approach compared to “eager” learning models like linear regression?
- Explain the role of distance metrics in the KNN algorithm. Compare and contrast at least three different distance metrics mentioned in the source material (e.g., Euclidean, Manhattan, Mahalanobis) and discuss scenarios where one might be preferable over the others.
- Elaborate on the different variations of the standard KNN algorithm, such as Distance-Weighted KNN and RBF kernel Weighted KNN. How do these methods attempt to improve upon the naïve majority-vote approach?
- Describe the relationship between KNN, non-parametric density estimation, and the Bayes classifier. How can the principles of the Bayes classifier be used to provide a probabilistic interpretation for KNN?
- While KNN is conceptually simple, its computational cost can be a significant drawback. Discuss the computational time cost of the naïve KNN search and explain how a data structure like a KD Tree can be used to optimize the search for nearest neighbors.
Glossary of Key Terms
Term |
Definition |
Asymptotic Analysis |
The study of an algorithm’s properties (like error rate) as the amount of data approaches infinity. In KNN, it shows the error rate of 1-NN is less than twice the Bayes rate. |
Bayes Classifier |
A theoretically optimal classifier that minimizes the probability of classification error. It is achieved when density estimates converge to the true densities with an infinite number of samples. |
Bayes Error |
The lower bound of the probability of classification error, which is the error achieved by the Bayes classifier. |
Curse of Dimensionality |
A phenomenon where a nearest neighbor algorithm is easily misled when the input data has a high number of features (dimensions), making the concept of proximity less meaningful. |
Discriminative Classifier |
A type of supervised classifier that directly estimates a decision rule or boundary. Examples include support vector machines, logistic regression, and neural networks. |
Distance Metric |
A function used to compute the distance or similarity between two data samples. Examples include Euclidean, L1 norm (Manhattan), L∞ norm, and Mahalanobis distance. |
Feature Scaling |
The process of scaling attributes to prevent distance measures from being dominated by features with large ranges. |
Features |
The columns in a tabular dataset used for prediction, also known as attributes, dimensions, independent variables, covariates, predictors, or regressors. |
Generative Classifier |
A type of supervised classifier that builds a generative statistical model of the data. Examples include Bayesian networks and Naïve Bayes classifiers. |
Instance-based Learning |
A form of learning where the training instances themselves represent the knowledge. Predictions are made by searching for training instances that most closely resemble a new instance. It is also known as lazy learning. |
K-Nearest Neighbor (KNN) |
An instance-based algorithm for classification or regression. It predicts the output for a new data point based on the outputs of its k closest neighbors in the training data. |
KD Tree |
A data structure used to optimize the search for nearest neighbors (NN Search), reducing the computational time cost compared to a naïve search across all data points. |
Lazy Learning |
A learning method where the generalization of the training data is deferred until a query is made to the system, as opposed to eager learning, where a model is constructed explicitly. KNN is a form of lazy learning. |
Local Smoothness |
An underlying assumption in KNN that points that are close to each other in the feature space are likely to have similar target values. |
Regression |
A supervised learning task where the target or outcome variable is a continuous, numerical value. |
Supervised Classifier |
An algorithm that learns from a labeled dataset (where both features and outcomes are known) to classify new, unseen data. |
Target |
The special column in a tabular dataset that is to be predicted, also known as the outcome, response, label, or dependent variable. |
Voronoi Diagram |
A partitioning of a plane into regions based on the distance to points in a specific subset of the plane. It can be used to visualize the decision boundaries of a 1-nearest neighbor classifier. |
Nonlinear
Deep
Discriminative
4Unsupervised
Generative
In this lecture, we cover:
- What is NLP?
- Typical NLP tasks / Challenges / Pipeline
- f() on natural language
- Before Deep NLP (Pre 2012) • (BOW / LSI / Topic Modeling LDA )
- Word2Vec (2013-2016) • (GloVe/ FastText)
- Recurrent NN (2014-2016) • LSTM
- Seq2Seq
- Attention
- Self-Attention (2016 – now )
- Transformer (attention only Seq2Seq)
- BERT / RoBERTa/ XLNet/ GPT-2 / …
3Classification
Ensemble
5Theory
- Module1: Basic Tree4Classifier @ https://youtu.be/JKcTiyvIpp8
- Module2: How2Learn Tree https://youtu.be/iKTxnJU0L1E
- Module3: Bagged DT https://youtu.be/WaWTw07Luzs
3Classification
Ensemble
Discriminative
- Module1: Committee of Models and Analysis https://youtu.be/NuxS9SycZG8
- Module2: Random Forest and Analysis https://youtu.be/m52bovS_eNA
- Module3: Stacking https://youtu.be/sRLyzg5hmuM
- Module4: Boosting https://youtu.be/VwFTEW_NM4o
4Unsupervised
Hierarchical:
- Module1: Basics of Unsupervised Clustering @ https://youtu.be/mxrcPHWCZAI
- Module2: Hierarchical Clustering @ https://youtu.be/2K6gXLf4Em4
K-means
- Module1: Basic K-Means Algorithm @ https://youtu.be/239r6zlZYhY
- Module2: Theory/Complexity of K-Means @ https://youtu.be/OFKKeVhCQfA
- Extra Module3: Gaussian Mixture Models @ https://youtu.be/pmNPXpj0eK4
4Unsupervised
Generative
- Module1: Basic K-Means Algorithm @ https://youtu.be/239r6zlZYhY
- Module2: Theory/Complexity of K-Means @ https://youtu.be/OFKKeVhCQfA
- Extra Module3: Gaussian Mixture Models @ https://youtu.be/pmNPXpj0eK4
3Classification
Nonlinear
Deep
Discriminative
4Unsupervised
Generative
This lecture covers 10 deep learning trends that go beyond classic machine learning:
Disclaimer: it is quite hard to make important topics of deep learning fit on a one-session schedule. We aim to make the content reasonably digestible in an introductory manner. We try to focus on a modularity view by introducing important variables to digest deep learning into chunks regarding data/ model architecture /tasks / training workflows and model characteristics. We think this teaching style provides students with context concerning those choices and helps them build a much deeper understanding.