1Basic
Comprehensive Linear Algebra and Matrix Calculus Review
Study Guide
This study guide is designed to reinforce your understanding of fundamental concepts in linear algebra and matrix calculus, as presented in the source material. Mastering these topics is crucial for advanced studies in machine learning and related fields.
I. Fundamental Definitions
- Scalar: Understand what a scalar is and how it’s denoted.
- Vector: Define a vector, differentiate between row and column vectors, and understand its representation in real space (R^n).
- Matrix: Define a matrix, understand its dimensions (m-by-n), how its entries are denoted, and the concept of a square matrix.
II. Special Matrices
- Identity Matrix (I): Understand its structure and significance.
- Diagonal Matrix: Recognize its form.
- Upper/Lower Triangular Matrices: Identify these structures.
- Symmetric Matrix: Understand the property that defines a symmetric matrix.
- Orthogonal Matrix: Define an orthogonal matrix and its key property related to its inverse.
III. Matrix Operations
Transposition
- Concept: “Flipping” rows and columns.
- Notation: A^T.
- Application: Understand how to transpose a given matrix.
Addition and Subtraction
- Condition: Matrices must be of the same size.
- Process: Entry-wise operation.
- Examples: Be able to perform addition and subtraction.
Multiplication
- Notation: AB (pre-multiplying B by A, post-multiplying A by B).
- Conformability: Understand the condition for multiplication (number of columns in premultiplier must equal number of rows in postmultiplier).
- Dimensions: Predict the dimensions of the resulting matrix (m × n * n × p = m × p).
- Non-Commutativity: Recognize that AB ≠ BA, generally.
- Associativity: Understand A(BC) = (AB)C.
- Properties: (AB)^T = B^T A^T; multiplication with Identity Matrix.
Special Uses
- Scalar & Matrix Product: Understand bA = Ab.
- Dot (Inner) Product of Vectors: a^T b (result is a scalar).
- Outer Product of Vectors: ab^T (result is a matrix).
- Sum of Squared Elements of a Vector: a^T a (useful for L2 norm).
Norm (of Vector)
- Concept: Measure of “length” of a vector.
- Common Norms: L1, L2 (Euclidean), L∞.
-
L2 Norm (Euclidean): |
|
x |
|
= √(x^T x). |
- General Definition: Understand the properties a function must satisfy to be a norm.
Matrix Inversion
- Notation: A^(-1) or inv A.
- Definition: AA^(-1) = I = A^(-1)A.
- Nonsingular vs. Singular: Differentiate between matrices that have an inverse and those that don’t.
-
Determinant Condition: An inverse exists if and only if |
A |
≠ 0. |
- Properties: (AB)^(-1) = B^(-1)A^(-1); (A^T)^(-1) = (A^(-1))^T; (A^(-1))^(-1) = A.
- Special Cases: Inverse of diagonal matrices, orthogonal matrices (A^(-1) = A^T).
- Pseudo-inverse (Extra): Basic understanding of its existence for non-square matrices.
Matrix Rank
- Linear Independence: Define linearly independent vectors.
- Definition of Rank: Maximal number of linearly independent columns (or rows).
- Properties: rank(A) ≤ min(m,n). Full row rank, full column rank.
- Determinant Connection: Rank is the dimension of the largest square sub-matrix with a non-zero determinant.
- Relationship to Singular Matrices: How rank relates to singularity.
- Rank of Product: rank(AB) ≤ min(rank(A), rank(B)).
Matrix Calculus
- Review of Derivatives: Understand the basic definition of a derivative for single-variable functions.
- Multivariate Calculus Concepts:
- Partial Derivative: Extension of derivative to functions of multiple variables.
- Gradient: Vector of partial derivatives (first derivative to gradient).
- Hessian Matrix: Matrix of second partial derivatives (second derivative to Hessian).
- Denominator Layout: Understand that the gradient’s size is the same as the variable’s size.
- Examples: Be able to calculate simple gradients.
- Trace(): Sum of diagonal elements (not detailed in this source, but mentioned as “must know”).
- Eigenvalue / Eigenvectors: Important for future topics.
- Positive Definite Matrix, Gram Matrix, Quadratic Form, Projection: Future topics.
- Khan Academy: A recommended resource for review.
- Linear Transformation and Determinant: Connection to matrix determinant and Jacobian determinant.
- Laplacian of a function: Trace of the Hessian.
- Harmonic function: Function whose Laplacian is 0.
Quiz: Short-Answer Questions
- What is the primary difference between a scalar and a vector in linear algebra notation?
- Define a “column vector” and provide an example of a vector in R^4.
- Explain the condition for two matrices to be “conformable” for multiplication. Why is this condition important?
- Given matrices A (3×2) and B (2×4), what are the dimensions of the resulting matrix C = AB? What about C = BA?
- State two key properties of matrix multiplication that distinguish it from scalar multiplication.
- Describe the concept of a “norm” of a vector. What does the L2 norm represent?
- What is the definition of the inverse of a matrix A (denoted A^(-1))? Under what crucial condition does a matrix inverse exist?
- Differentiate between a “nonsingular” and a “singular” matrix.
- Explain the concept of “linear independence” in the context of vectors. How does this relate to the rank of a matrix?
- What is the gradient of a multivariate function? How does it relate to partial derivatives?
Quiz Answer Key
-
Scalar vs. Vector: A scalar is a single number (e.g., 1 or 22), denoted with regular type. A vector is a single row or column of numbers (e.g., [1 2 3]), denoted with bold small letters.
-
Column Vector and R^4 Example: A column vector is a vector arranged vertically. An example of a vector in R^4 (a column vector) is v = (1,6,3,4)^T.
-
Conformable Matrices: For two matrices A and B to be conformable for multiplication (AB), the number of columns in the premultiplier (A) must equal the number of rows in the postmultiplier (B). This ensures that the dot product for each entry in the resulting matrix can be calculated.
-
Matrix Multiplication Dimensions: If A is (3×2) and B is (2×4), then C = AB will have dimensions (3×4). C = BA cannot be performed because the number of columns in B (4) does not equal the number of rows in A (3).
- Properties of Matrix Multiplication:
- Matrix multiplication is generally not commutative (AB ≠ BA).
- Matrices must be conformable for multiplication to occur.
-
Vector Norm: A norm of a vector |
|
x |
|
is a measure of its “length.” The L2 norm, also known as the Euclidean norm, represents the straight-line distance from the origin to the vector’s endpoint (calculated as the square root of the sum of the squared elements). |
-
Matrix Inverse Definition and Condition: The inverse of an n × n matrix A is the matrix A^(-1) such that AA^(-1) = I = A^(-1)A, where I is the identity matrix. A crucial condition for an inverse to exist is that the determinant of A ( |
A |
) must not be zero. |
-
Nonsingular vs. Singular Matrix: A “nonsingular” matrix is an n × n matrix that has an inverse. A “singular” matrix is an n × n matrix that does not have an inverse.
-
Linear Independence and Rank: A set of vectors is linearly independent if none of them can be written as a linear combination of the others. The rank of a matrix is defined as the maximal number of linearly independent columns (or rows) in that matrix.
- Gradient of a Multivariate Function: The gradient of a multivariate function is a vector containing its partial derivatives with respect to each variable. It extends the concept of a derivative to functions of multiple variables, indicating the direction of the steepest ascent of the function.
1Basic
Machine Learning in a Nutshell: Comprehensive Study Guide
I. Core Concepts of Machine Learning
A. Definition and Goal
- Machine Learning (ML): A field of artificial intelligence where systems learn from data to optimize a performance criterion, aiming to generalize to unseen data.
- Goal: To enable computers to learn from example data or past experience without being explicitly programmed, and then apply this learned knowledge to make predictions or decisions on new, unseen data.
B. The Machine Learning Pipeline (Nutshell Components)
- Data: The raw material for learning, consisting of examples/instances/samples, each with features/attributes and potentially a target/label.
- Task: The specific problem ML aims to solve (e.g., classification, regression, clustering).
- Representation: The way data and the learning function are structured (e.g., linear function, neural network).
- Score Function (Loss/Cost Function): A metric used to evaluate how well the model’s predictions align with the actual values. The goal is often to minimize this function.
- Search/Optimization: The algorithms used to find the best model parameters by minimizing the score function (e.g., gradient descent).
Models, Parameters, Hyperparameters, Metrics
- Models: The learned functions or structures.
- Parameters: Internal variables of the model learned from data (e.g., w, b in a linear classifier).
- Hyperparameters: Parameters whose values are set before the learning process begins (e.g., learning rate, regularization strength).
- Metrics: Measures used to evaluate model performance (e.g., accuracy, precision, recall, MSE).
C. Two Phases of Machine Learning
Training (Learning Parameters)
- Uses a training set of input-output pairs (X, Y).
- The objective is to find model parameters (e.g., w, b) that minimize the loss/cost function, representing the difference between predicted f(x) and true y on the training examples.
- Uses a testing set of examples (x?) that were not included in the training set.
- Evaluates the model’s performance by measuring the difference between true y? and predicted f(x?).
- Crucial for assessing the model’s ability to generalize to new data.
D. When to Use Machine Learning
- Extracting knowledge from data: Discovering hidden relationships and correlations in large datasets.
- Learning tasks difficult to formalize: Problems hard to define explicitly with rules (e.g., face recognition).
- Creating software that improves over time: Systems that can adapt to new knowledge and data without continuous manual redesign.
II. Data Types in Machine Learning
A. General Terminology
- Data/points/instances/examples/samples/records: Rows in a dataset.
- Features/attributes/dimensions/independent variables/covariates/predictors/regressors: Columns in a dataset (excluding the target).
- Target/outcome/response/label/dependent variable: The special column to be predicted.
B. Main Types of Columns
- Continuous: Real numbers (e.g., weight, temperature).
- Discrete: Symbols or categories (e.g., “Good”, “Bad”, “Cat”, “Dog”).
C. Examples of Data Types
- Tabular Data: Structured data in rows and columns (e.g., patient records with gene quantities).
- 1D Sequence Data: Data with a sequential order (e.g., language text, genome sequences, audio).
- 2D Grid Data: Data arranged in a grid (e.g., images).
- Graph Data: Data representing relationships between entities (e.g., social networks).
III. Machine Learning Tasks
A. Supervised Learning
Definition: Learning from labeled input-output pairs (X, Y). The goal is to predict Y for unseen X.
Classification: Predicting a discrete target variable (Y)
- Binary Classification: Two possible output classes (e.g., positive/negative review, disease present/absent).
- Multi-class Classification: More than two possible output classes (e.g., categorizing text into Sports, Business, Education).
- Multi-label Classification: Assigning a set of target labels to a single input (e.g., an image being tagged with “Castle” and “Mountains”).
- Hierarchical Classification: Classification where classes are organized in a hierarchy.
Regression: Predicting a continuous target variable (Y)
Generating: Creating new data instances based on learned patterns (e.g., Text2Image, Edges2Image)
B. Unsupervised Learning
- Definition: Learning from unlabeled data (no Y provided).
- Goal: Finding patterns, structures, or relationships within the data.
- Example: Clustering – grouping instances into “natural” clusters.
C. Structured Output Learning
- Definition: Prediction tasks where output labels have complex structured correlations or constraints (e.g., spatial, temporal, relational dependencies).
- Example: Language parsing (predicting a parse tree), sequence labeling (predicting a sequence of labels for a sequence of inputs).
D. Reinforcement Learning (RL)
- Definition: An agent interacts with an environment and learns to maximize a scalar reward signal.
- Key Characteristics: Not independent and identically distributed (IID) data, sequential decision-making.
- Variations: Basic RL (no labels/supervision), imitation learning (supervised).
IV. Representation Types
A. Concept of Representation
How the input data (X) and the function (f) mapping X to Y are structured.
B. Examples of Representations
- Linear functions: f(x,w,b) = sign(w^T x + b) for binary classification.
- Nonlinear functions: Polynomial expansion, logistic function, trees, multi-layer networks (neural networks).
- Prob-density family: Bernoulli, multinomial, Gaussian, mixture of Gaussians.
- Vector Space Representation: Representing text as vectors (e.g., Bag of Words).
- Representation Learning (Deep Learning/Feature Learning): Automatically learning useful features from raw data, often through multi-layer architectures. This replaces traditional, time-consuming “Feature Engineering.”
V. Loss/Cost Types
A. Purpose of Loss Functions
- Quantify the difference between the model’s prediction and the true target value.
- Minimizing the loss function guides the learning process.
B. Examples of Loss Functions
- Mean Squared Error (MSE): Common for regression tasks.
- Hinge Loss: Used in binary classification (e.g., Support Vector Machines).
- Log-likelihood / Cross-entropy: Common for classification tasks, especially with probabilistic models.
- 0-1 Loss: Simple loss for classification, counts misclassifications.
- Regularized loss functions (L1, L2): Add penalty terms to the loss function to control model complexity and prevent overfitting.
VI. Optimization and Model Properties
A. Search/Optimization Algorithms
Purpose: To find the optimal model parameters by minimizing the loss/cost function.
Examples:
- Gradient Descent (GD): An iterative first-order optimization algorithm that takes steps proportional to the negative of the gradient.
- Stochastic Gradient Descent (SGD): A variant of GD that updates parameters using a single training example or a small batch at a time.
- Newton’s Method: A second-order optimization algorithm.
- Linear Programming, Quadratic Programming: For specific types of optimization problems.
- EM (Expectation-Maximization): For models with latent variables.
- Backpropagation: Used to train neural networks by efficiently computing gradients.
B. Model Properties / Basic Concepts
- Generalization: The ability of a model to perform well on unseen data. This is the ultimate goal of ML.
- Overfitting: When a model learns the training data too well, including noise, leading to poor performance on new data.
- Underfitting: When a model is too simple to capture the underlying patterns in the data, leading to poor performance on both training and test data.
- Bias-Variance Tradeoff: A fundamental concept balancing model simplicity (high bias, low variance) and complexity (low bias, high variance).
- Big gap between train/validation loss suggests overfitting (high variance).
- No gap, both bad, suggests underfitting (high bias).
- Regularization: Techniques (e.g., L1, L2) that add a penalty to the loss function to discourage overly complex models and prevent overfitting.
VII. Evaluation and Advanced Considerations
A. Measuring Prediction Accuracy
- Evaluate models on test data using appropriate metrics.
- For supervised classification, common metrics include accuracy, precision, recall, F1-score.
B. Human-Centric ML/AI Research
- Trustworthy ML: Ensuring models are reliable and fair.
- Explainable ML: Understanding why a model makes certain predictions.
- Robustness: How well a model performs under adversarial conditions (e.g., adversarial examples).
- Interacting with/Understanding/Modeling Human Intent: Developing AI that better collaborates with humans.
Quiz: Machine Learning Fundamentals
Instructions: Answer each question in 2-3 sentences.
- What is the primary goal of machine learning, and how does it differ from traditional programming?
- Explain the concept of “generalization” in machine learning. Why is it important, and what happens if a model fails to generalize?
- Describe the two main phases of a machine learning process. What is the key difference in the data used during these phases?
- Define a “loss function” and explain its role in training a machine learning model.
- What is the difference between “parameters” and “hyperparameters” in a machine learning model? Provide a brief example for each.
- List three types of data commonly encountered in machine learning. Give a real-world example for each.
- Distinguish between supervised and unsupervised learning tasks. Provide a task example for each.
- Briefly explain what “representation learning” is and why it has become more prominent compared to traditional “feature engineering.”
- What is the “bias-variance tradeoff”? How might you diagnose a model that is suffering from high variance?
- Describe the purpose of “regularization” in machine learning. How does it help in model training?
Answer Key
-
Primary Goal of ML: The primary goal of ML is to optimize a performance criterion using example data or past experience, aiming to generalize to unseen data. It differs from traditional programming where rules are explicitly coded; instead, ML systems learn patterns from data to make predictions or decisions.
-
Generalization Concept: Generalization refers to a model’s ability to perform accurately on new, unseen data, not just the data it was trained on. It is crucial because the ultimate purpose of an ML model is to make predictions in real-world scenarios. A model that fails to generalize is said to be overfitting, meaning it has learned the training data too specifically, including noise, and will perform poorly on new examples.
-
Two Main Phases: The two main phases are training and testing/deployment. During training, the model learns parameters from a labeled “training set.” During testing, the model’s performance is evaluated on a separate “testing set” of examples that were not part of the training data.
-
Loss Function: A loss function quantifies the discrepancy between a model’s predicted output and the actual true output for a given example. Its role in training is to provide a metric that the optimization algorithm (e.g., gradient descent) attempts to minimize, thereby iteratively adjusting the model’s parameters to improve its accuracy.
-
Parameters vs Hyperparameters: Parameters are internal variables of the model learned from the data during training, like the weights (w) and bias (b) in a linear classifier. Hyperparameters are external configurations set before the training process begins, such as the learning rate in gradient descent or the regularization strength.
-
Three Data Types: Three common data types are tabular, 2D grid, and 1D sequence data. Tabular data can be patient records with various medical attributes. 2D grid data commonly refers to images, like photographs for object recognition. 1D sequence data includes text documents for sentiment analysis or genome sequences.
-
Supervised vs Unsupervised: Supervised learning involves learning from labeled input-output pairs (X, Y) to predict Y for new X; an example is classifying emails as spam or not spam. Unsupervised learning, conversely, deals with unlabeled data to find patterns or structures within it; an example is clustering customer data to identify distinct market segments.
-
Representation Learning: Representation learning is the process where a machine learning model automatically discovers the representations (features) needed for detection or classification from raw data. It has become prominent because it often yields more effective features and reduces the laborious, time-consuming, and often task-dependent manual effort required in traditional “feature engineering.”
-
Bias-Variance Tradeoff: The bias-variance tradeoff refers to the dilemma of simultaneously minimizing two sources of error that prevent models from generalizing well: bias (error from overly simplistic assumptions) and variance (error from sensitivity to small fluctuations in the training data). High variance (overfitting) can be diagnosed by observing a large gap between the model’s performance on the training data (very good) and its performance on validation or test data (significantly worse).
-
Regularization Purpose: Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function during training. This penalty discourages the model from becoming too complex or assigning excessively large weights to features. By controlling model complexity, regularization helps the model generalize better to unseen data.
2Regression
Linear
1Basic
Understanding Linear Regression: A Study Guide
Quiz
Instructions: Answer the following questions in 2-3 sentences each.
- What is the primary goal of Machine Learning in a nutshell, as described in the lecture?
- Define “regression” in the context of machine learning, specifically mentioning the nature of the target variable.
- Explain the purpose of the “score function” (or loss/cost function) in linear regression.
- Describe what the training dataset consists of for supervised regression.
- In the linear regression equation f(x) = wx + b, what do w and b represent?
- How does the concise notation ŷ = f(x) = x^T θ = θ^T x simplify the representation of linear regression with multiple features?
- What is the “Sum of Squared Error” (SSE) and why is it used as a loss function in linear regression?
- Briefly explain the concept of “Normal Equations” in the context of finding optimal regression coefficients.
- Why is it important for a learned model to “Generalize Well” to unseen data?
- What are the three main aims for a learned model besides generalizing well, as discussed in the lecture?
Answer Key
-
Primary Goal of Machine Learning: The primary goal of Machine Learning is to optimize a performance criterion using example data or past experience. The ultimate aim is to generalize this learned knowledge effectively to unseen data.
-
Regression Definition: In machine learning, regression refers to tasks where the target variable (Y) is continuous. This means the model predicts a numerical value rather than a category or class.
-
Score Function Purpose: The score function (or loss/cost function), denoted as L(), quantifies the difference between the model’s predictions and the actual target values. The goal of the learning algorithm is to minimize this function, thereby finding the optimal model parameters.
-
Training Dataset: The training dataset for supervised regression consists of input-output pairs. These pairs include available examples (input features, X) and their corresponding continuous target labels (output, Y), which the model uses to learn.
-
Linear Regression Parameters: In the linear regression equation f(x) = wx + b (for a single variable), w represents the weight or slope, indicating how much Y changes for a unit change in X. b represents the bias or intercept, which is the value of Y when X is zero.
-
Concise Notation: The concise notation ŷ = f(x) = x^T θ = θ^T x uses vector/matrix products to represent multiple features and their corresponding weights in a single expression. This compact form simplifies mathematical derivations and computational implementation for multivariate linear regression.
-
Sum of Squared Error (SSE): The Sum of Squared Error (SSE) measures the sum of the squared differences between the predicted values (f(x)) and the actual target values (y). Squaring the errors ensures that all errors contribute positively to the total loss and penalizes larger errors more heavily, providing a clear objective for minimization.
-
Normal Equations: The Normal Equations provide a closed-form solution to find the optimal regression coefficients (θ) by setting the derivative (gradient) of the cost function with respect to θ to zero. This method directly calculates the θ that minimizes the SSE for linear regression.
-
Generalization: Generalizing well means that the model performs accurately not only on the data it was trained on but also on new, unseen data. This indicates that the model has learned underlying patterns rather than just memorizing the training examples, making it useful in real-world applications.
-
Additional Model Aims: Besides generalizing well, the lecture suggests that a learned model should also be computationally scalable and efficient, and robust, trustworthy, and interpretable. These attributes contribute to the practical utility and reliability of machine learning models.
Linear Regression Basics
1. Machine Learning in a Nutshell
Definition: Machine Learning (ML) is a field that grew out of Artificial Intelligence (AI). Its core purpose is to “Optimize a performance criterion using example data or past experience, aiming to generalize to unseen data.”
Key Components of ML: The process of machine learning can be broken down into several fundamental elements:
- Task: What needs to be predicted or achieved (e.g., y in regression).
- Representation: How the data and model are structured (e.g., x, f()).
- Score Function: A measure of how well the model is performing (e.g., L()).
- Search/Optimization: The method used to find the best model parameters (e.g., argmin()).
- Models, Parameters: The specific mathematical structure and adjustable values of the model (e.g., w, b for regression coefficients).
- Data: The input used for training and testing (e.g., X: Tabular).
2. Regression Fundamentals
Definition: Regression specifically deals with predicting a continuous target variable y.
Linear Regression: In linear regression, the target Y is modeled as a “Weighted linear sum of Xs.”
- For a single variable x, the function is f(x) = wx + b. Here, w is the weight (slope) and b is the bias (intercept). A slope of 2 means “every 1-unit change in X yields a 2-unit change in Y.”
- For multiple features x₁, x₂, …, the model becomes ŷ = f(x) = θ₀ + θ₁x₁ + θ₂x₂. Here, θ₀ is the intercept, and θ₁, θ₂ are coefficients for x₁, x₂ respectively.
Supervised Learning: Linear Regression is a type of supervised learning. This means the training dataset consists of “input-output pairs” (X, Y), where Y (the target) is known for each input X.
3. Data Representation for Regression
Tabular Dataset Structure
- Data/points/instances/examples/samples/records: Rows of the dataset.
- Features/attributes/dimensions/independent variables/covariates/predictors/regressors: Columns of the dataset, excluding the target.
- Target/outcome/response/label/dependent variable: A special column (usually the last) that is continuous-valued and to be predicted.
Concise Vector/Matrix Notation
To simplify computations, especially with multiple features:
- Each data sample x is represented as a column vector.
- A pseudo feature x₀=1 (the intercept term) is added to the feature vector, making it x^T = [(x₀=1), x₁, x₂, …, xₚ].
- The parameter vector θ is also a column vector [θ₀, θ₁, …, θₚ]^T.
- The linear regression function can then be concisely written as ŷ = f(x) = x^T θ = θ^T x.
Training Set in Matrix Form: The entire training set with n samples can be represented as a matrix X (where each row is a feature vector x_i^T) and a target vector y_train = [y₁, y₂, …, yₙ]^T.
4. Training and Optimization
Training Goal: To find the optimal parameters (w, b or θ) by minimizing a Loss/Cost function L(). This function quantifies the “difference between y and f(x) on available examples in the training set.”
Sum of Squared Error (SSE) / Mean Squared Error (MSE)
The most common loss function for linear regression:
- J(θ) = 1/2 × Σ (f(xᵢ) - yᵢ)²
- In matrix form, J(θ) = 1/2 × (Xθ - y)^T (Xθ - y)
Finding Optimal Parameters (θ)
Since J(θ) is a convex function, its minimum can be found by taking the derivative (gradient) with respect to θ and setting it to zero (∇θJ(θ) = 0). This leads to the “normal equations”: X^T Xθ = X^T y.
The solution for θ is then θ* = (X^T X)^(-1) X^T y.
Convexity of J()
The loss function J(θ) is convex. “Intuitively, a convex function (1D case) has a single point at which the derivative goes to zero, and this point is a minimum.” For multivariate functions, if the Hessian matrix is positive definite (which X^T X is, given certain conditions), the function is convex and the point where the gradient is zero is a minimum.
Gradient Descent (GD) / Stochastic Gradient Descent (SGD)
These are iterative optimization algorithms used when a closed-form solution is not feasible or computationally too expensive. (Mentioned in outline, but not detailed in excerpts).
Generalization: A primary aim is for the “learned model [to] Generalize Well” to unseen data. It should also be “Computational Scalable and Efficient” and “Robust / Trustworthy / Interpretable.”
Testing Mode: After training, the model is tested on “Production Data” (also called test data), which consists of new inputs X? for which the output f(X?) is predicted.
Metrics for Regression
- Test MSE Error: A common metric reported for evaluating regression models, calculated on the test set: J_test = 1/m × Σ (x_i^T θ* - y_i)².
- Other metrics exist (referencing scikit-learn documentation).
6. Computational Considerations
Why Vector/Matrix Form? It allows for “Concise Notation,” “Closed form solution” for training, and efficient computation for testing (ŷ_test = X_test θ*).
Computational Cost (Naive Matrix Multiply)
A naive matrix multiplication C = C + A×B has a complexity of O(n³) flops.
Optimizing Computational Cost
Data Parallelization
Utilizing CPU SIMD (Single Instruction, Multiple Data), multithreading, and GPU parallelization. SIMD allows “one operation [to produce] multiple results.”
Memory Hierarchy/Locality
Exploiting spatial (accessing nearby data) and temporal (reusing data) locality to improve average memory access speed through caches.
Better Algorithms
Examples include Strassen’s Matrix Multiply, which reduces complexity to O(n^2.81) (asymptotically faster than naive O(n³)).
BLAS (Basic Linear Algebra Subroutines)
Industry-standard interfaces (e.g., numpy wraps BLAS) providing optimized implementations for vector (BLAS1), matrix-vector (BLAS2), and matrix-matrix (BLAS3) operations. BLAS3 is generally preferred for its potential for higher computational intensity.
7. Advanced/Optional Concepts
Probabilistic Interpretation: Linear regression can be viewed probabilistically by assuming the error term ε (representing unmodeled effects or random noise) follows a Gaussian distribution N(0, σ). This leads to the concept of Maximum Likelihood Estimation for finding θ.
Gram Matrix: X^T X is referred to as a Gram Matrix and is always positive semi-definite. If X has full rank, X^T X is positive definite and invertible, ensuring a unique minimum for J(θ).
Regularization: Used when X has less than full column rank to handle non-invertible X^T X (not detailed in these excerpts but mentioned as a solution).
Scalability to Big Data: While polynomial time algorithms are good, O(n) can still be too slow for truly massive datasets, leading to research in “low rank, sparse, hardware, sampling, randomized…” solutions.
2Regression
Optimization
Optimization for Linear Regression: A Study Guide
I. Course Overview & Core Concepts
This section of the course focuses on optimization techniques for Linear Regression (LR), specifically for tabular data. It revisits fundamental machine learning concepts within this context.
A. Course Sectioning Context
- Section 1: Basic Supervised Regression on Tabular Data
- Section 2: Basic Deep Learning on 2D Imaging Data
- Section 3: Advanced Supervised Learning on Tabular Data
- Section 4: Generative and Deep Learning on 1D Sequence Text Data
- Section 5: Unsupervised Learning (Mostly on Tabular Data)
B. Regression Fundamentals
- Regression: Deals with predicting a continuous output variable (y).
- Linear Regression (LR): Models the output (Y) as a weighted linear sum of input features (Xs).
- Sum of Squared Error (Least Squares): The standard cost function used to measure the difference between predicted and actual values in LR.
- Metrics, Implementation, Regression Coefficients (w, b): Key aspects to consider in LR.
C. Machine Learning Task Components
- Task: Predicting ‘y’.
- Representation: Input features ‘x’ and the function ‘f()’.
- Score Function: The loss function ‘L()’.
- Search/Optimization: The process of finding ‘argmin()’ for the loss function.
- Models, Parameters: The chosen model and its adjustable parameters.
- Data: ‘X’, typically tabular.
- Prediction: ŷ = f(x) = θ^T x (where θ represents the parameters/weights).
D. Optimization Principles
- Objective Function: The function to be minimized or maximized (e.g., the cost function).
- Variables: The parameters being optimized (e.g., weights ‘w’ or ‘θ’).
- Constraints: Conditions that the variables must satisfy.
- Goal: Find variable values that optimize the objective function while satisfying constraints.
II. Methods for Optimizing Linear Regression
The lecture covers two main approaches to optimizing Linear Regression: Direct Optimization (Normal Equation) and Iterative Optimization (Gradient Descent variants).
A. Method 1: Directly Optimize (Normal Equation)
Concept: For convex functions (like the Sum of Squared Error in LR), the minimum occurs where the derivative (slope) is zero.
Process:
- Write the cost function J(θ) in matrix form.
- Take the gradient of J(θ) with respect to θ and set it to zero.
- Solve the resulting equation for θ.
Formula: θ = (X^T X)^(-1) X^T y
Pros: A single-shot algorithm, easiest to implement.
Cons:
- Requires computing the pseudo-inverse (X^T X)^(-1), which can be computationally expensive (especially for large datasets).
- Prone to numerical issues if the matrix is singular or ill-conditioned.
Direct vs. Iterative: A direct method, meaning it finds the solution in a finite number of steps.
B. Method 2: Iteratively Optimize via Gradient Descent
Concept: An iterative algorithm that moves towards the minimum of a function by taking steps proportional to the negative of the gradient at the current point.
Gradient Definition: A vector whose entries contain the partial derivatives of the function with respect to each variable. It points in the direction of the greatest rate of increase of the function.
General Algorithm:
- Initialize k=0, choose an initial x₀ (randomly or by prior).
- While k < k_max (or until a stopping criterion is met): x_k = x_{k-1} - α∇x F(x{k-1})
Key Factors Influencing GD:
- Learning Rate (α): Determines the step size. A critical parameter.
- Objective Function: Its shape (e.g., convexity) affects convergence.
- Starting Point: Can influence which local minimum is found (though not an issue for LR’s convex cost function).
- Stop Criterion: When to stop the iterative process.
Pros:
- Works on any objective function as long as its gradient can be evaluated.
- Useful for minimizing complex functions.
- Guaranteed convergence for convex functions like LR’s cost function.
Cons:
- Can be slow converging (especially batch GD).
- Can get stuck in local minima for non-convex functions (not an issue for LR).
- Learning rate selection is crucial.
C. Variants of Gradient Descent
1. Batch Gradient Descent (GD)
Concept: Calculates the gradient of the cost function with respect to the parameters for all training examples in each iteration.
Update Rule: θ_{t+1} = θ_t - α∇_θ J(θ_t) = θ_t + αX^T (y - Xθ_t)
Characteristics:
- Computes the exact gradient in each step.
- Smooth convergence path.
- Can be computationally expensive for very large datasets, as it needs to process the entire dataset per update.
2. Stochastic Gradient Descent (SGD)
Concept: Calculates the gradient and updates the parameters for each individual training example (or a very small subset) at a time.
Update Rule (for a single training point i): θ_{t+1} = θ_t + α(y_i - x_i^T θ_t)x_i
Characteristics:
- “Noisier” updates due to processing one example at a time.
- Faster per-step computation compared to batch GD.
- Can escape shallow local minima due to the noise (advantage for non-convex problems).
- Useful for massive datasets that don’t fit in memory or for online/streaming data.
Epoch: One complete pass over the entire training data when using SGD for offline training.
Pros:
- Low per-step cost.
- Can be faster converging to an approximate solution.
- Less prone to getting stuck in local optima (for non-convex cases).
- Efficient for large datasets and online learning.
Cons:
- Convergence to the exact optimum is not always guaranteed (it tends to oscillate around the minimum).
- More variation in the objective function during training.
3. Mini-Batch Gradient Descent
Concept: A compromise between Batch GD and SGD. Computes the gradient on a small “mini-batch” of samples (e.g., B=32, 64, 128).
Update Rule: θ_{t+1} = θ_t + (1/B)αX_B^T (y_B - X_B θ_t) (where B refers to the mini-batch size and X_B, y_B are the mini-batch data)
Characteristics:
- Combines benefits of both: more stable updates than SGD, but computationally more efficient than Batch GD (especially with GPU acceleration).
- Commonly used in practice.
- Special Cases: B=1 is standard SGD, B=N (total data size) is standard Batch GD.
Pros:
- Much faster computationally than single-point SGD (better use of computer architecture like GPUs).
- Low per-step cost.
- Fast convergence.
D. Stopping Criteria for Gradient Descent
- Predetermined Maximum Number of Iterations: A common and simple stopping rule.
- Threshold for Improvement: Stop when the improvement in the objective function (e.g., MSE) drops below a certain threshold.
- Monitoring Training Error: Plotting Train-MSE per sample to observe its decrease over epochs is helpful for understanding behavior.
Quiz: Optimization for Linear Regression
- What is the primary goal of optimization in the context of Linear Regression?
- Briefly explain the main idea behind the “Normal Equation” method for Linear Regression.
- What are two significant disadvantages of using the Normal Equation, especially for large datasets?
- Define “Gradient Descent” as an iterative optimization algorithm.
- List three critical factors that influence the behavior and performance of Gradient Descent.
- How does Stochastic Gradient Descent (SGD) differ from Batch Gradient Descent (GD) in terms of gradient calculation and parameter updates?
- Provide two advantages of using Stochastic Gradient Descent (SGD) compared to Batch Gradient Descent for very large datasets.
- What is “Mini-Batch Gradient Descent,” and why is it often preferred in practice?
- Explain the concept of an “epoch” in the context of Stochastic Gradient Descent.
- Name two common stopping criteria for iterative optimization algorithms like Gradient Descent.
Quiz Answer Key
-
Primary Goal of Optimization: The primary goal of optimization in Linear Regression is to find the set of model parameters (weights ‘w’ or ‘θ’) that minimize the chosen cost function, typically the Sum of Squared Error (Least Squares). This minimization results in the best-fit line or hyperplane for the given data.
-
Normal Equation Method: The Normal Equation method directly solves for the optimal Linear Regression parameters by setting the gradient of the cost function to zero. It analytically calculates the parameters in a single step using matrix operations, without needing an iterative process.
-
Disadvantages of Normal Equation: Two significant disadvantages of the Normal Equation are its high computational cost due to the requirement of computing a matrix inverse, particularly for large numbers of features, and its susceptibility to numerical issues if the (X^T X) matrix is singular or ill-conditioned.
-
Gradient Descent Definition: Gradient Descent is an iterative optimization algorithm that minimizes an objective function by repeatedly moving in the direction opposite to the gradient of the function. Each step adjusts the parameters by a small amount proportional to the negative of the gradient, gradually approaching a local (or global) minimum.
-
Critical Factors for GD: Three critical factors influencing Gradient Descent are the learning rate (α), which determines the step size; the shape of the objective function, which dictates the gradient landscape; and the starting point, which can affect the path to the minimum or which local minimum is found.
-
SGD vs Batch GD: Stochastic Gradient Descent (SGD) updates parameters using the gradient calculated from a single training example at a time, leading to noisy but frequent updates. In contrast, Batch Gradient Descent (GD) calculates the gradient using all training examples before making a single parameter update, resulting in smoother but less frequent updates.
-
Advantages of SGD: Two advantages of SGD for very large datasets are its low per-step computational cost, as it processes only one sample at a time, making it suitable for data that doesn’t fit in memory, and its ability to converge faster to an approximate solution.
-
Mini-Batch Gradient Descent: Mini-Batch Gradient Descent is an optimization technique that computes the gradient and updates parameters using a small, randomly selected subset (mini-batch) of the training data. It is often preferred because it balances the stability of Batch GD with the computational efficiency of SGD, leading to faster training and better utilization of hardware like GPUs.
-
Epoch Concept: In the context of Stochastic Gradient Descent, an “epoch” refers to one complete pass through the entire training dataset. If SGD is used for offline training, it repeatedly cycles through all samples in the dataset, and each such pass constitutes an epoch.
-
Stopping Criteria: Two common stopping criteria for Gradient Descent algorithms include a predetermined maximum number of iterations (epochs) and stopping when the improvement in the objective function (e.g., Train MSE) between consecutive iterations drops below a specified small threshold.
3Classification
Nonlinear
Deep
Discriminative
4Unsupervised
Generative
This lecture covers 10 deep learning trends that go beyond classic machine learning:
Disclaimer: it is quite hard to make important topics of deep learning fit on a one-session schedule. We aim to make the content reasonably digestible in an introductory manner. We try to focus on a modularity view by introducing important variables to digest deep learning into chunks regarding data/ model architecture /tasks / training workflows and model characteristics. We think this teaching style provides students with context concerning those choices and helps them build a much deeper understanding.