Logistic Regression and NN
- Lecture: S2-LogisticRegression
- Version: current
- Please Read: compare classifiers
- Recorded Videos: M1 + M2
- Advanced to Read: S2-LogisticRegression
Attention: the following markdown text was generated automatically from the corresponding PowerPoint lecture file, so errors and misformatting do occur (a lot!).
Study Guide: Logistic Regression and Bayes Classifiers
Short-Answer Quiz
Instructions: Answer the following questions in 2-3 sentences each, based on the provided source context.
- What is the primary goal of a Bayes classifier when presented with a new data sample?
- Explain the fundamental difference between discriminative and generative classifiers.
- What is the core principle of Maximum Likelihood Estimation (MLE) for determining model parameters?
- Describe the concept of “log-odds” or the “logit” function in the context of logistic regression.
- How does the Bernoulli distribution serve as a conceptual model for the target variable in logistic regression?
- In logistic regression, what is the decision boundary and what is its geometric shape?
- Describe the two main stages of logistic regression when viewed as a single “neuron” or block.
- What is the Expected Prediction Error (EPE) and how does a 0-1 loss function lead to the Bayes classifier rule?
- Why is it often more convenient to work with the log-likelihood function rather than the likelihood function itself during MLE?
- Briefly compare Newton’s method to Gradient Descent for optimization.
Answer Key
- The primary goal of a Bayes classifier is to predict the class of a new data sample x. It achieves this by finding the class C that maximizes the posterior probability p(C | x), a principle known as the Maximum A Posteriori (MAP) rule.
- Discriminative classifiers, such as logistic regression and neural networks, directly estimate a decision rule or boundary to separate classes. In contrast, generative classifiers, like Naïve Bayes, build a statistical model of how the data for each class is generated and use that model to make predictions.
- Maximum Likelihood Estimation (MLE) is a method for estimating the unknown parameters of a model. The core principle is to choose the set of parameters that is most likely to have produced the observed data set. This is achieved by maximizing the joint probability, or likelihood, of the observed samples.
- The “log-odds,” or “logit,” function is the natural logarithm of the odds P(y=1 | x) / (1 − P(y=1 | x)). In logistic regression, this logit is modeled as a linear function of the input features x, forming the core representation of the model: ln[p/(1−p)] = β₀ + β₁x₁ + … + βₚxₚ.
- Logistic regression is suitable for target variables coded as 0 or 1. The Bernoulli distribution, which models binary outcomes like a coin flip, is used to conceptually model this target variable. The probability parameter p of the Bernoulli distribution is not fixed but is instead modeled as a function of the input features, p = P(y=1 | x).
- The decision boundary in logistic regression is the set of points where the log-odds equation equals zero, which corresponds to a probability of 0.5. Because the log-odds are modeled as a linear function of the features, the resulting decision boundary is also linear.
- When viewed as a single block or “neuron,” logistic regression consists of two stages. The first is a summing function that calculates a weighted sum of the inputs (z = wᵀx + b). The second stage is the application of a sigmoid function (y = sigmoid(z)), which squashes the output of the summing function to a value between 0 and 1, representing a probability (see the first sketch after this answer key).
- Expected Prediction Error (EPE) is the expected value of a loss function over the joint distribution of inputs and outputs. When using a 0-1 loss function, which penalizes any misclassification, minimizing the EPE is equivalent to choosing the class with the maximum posterior probability, which is precisely the MAP rule used by the Bayes classifier.
- Working with the log-likelihood function is convenient because the joint probability of independent samples is a product of individual probabilities. Taking the logarithm transforms this product into a sum (log(Π P) = Σ log(P)), which is mathematically simpler to differentiate and maximize (a worked derivation follows the answer key).
- Both are optimization algorithms. Gradient Descent (GD) uses a first-order approximation to update parameters. In contrast, Newton’s method uses a second-order (quadratic) Taylor series approximation, incorporating curvature information via the Hessian matrix, which often allows it to find a more direct route to the minimum (a short numerical sketch follows the answer key).
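The two-stage “neuron” view of logistic regression and its linear decision boundary can be made concrete with a minimal NumPy sketch. The weights `w`, bias `b`, and input `x` below are arbitrary illustrative values, not taken from the lecture; the only assumption is the standard model z = wᵀx + b followed by a sigmoid.

```python
import numpy as np

def sigmoid(z):
    """Stage 2: squash the real-valued score z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Stage 1: weighted sum z = w^T x + b, then Stage 2: sigmoid(z)."""
    z = np.dot(w, x) + b          # z is the log-odds (logit), linear in x
    return sigmoid(z)

# Illustrative parameters and sample (not from the lecture).
w = np.array([1.5, -2.0])
b = 0.5
x = np.array([0.8, 0.3])

p = predict_proba(x, w, b)        # P(y=1 | x)
label = int(p >= 0.5)             # decision boundary: w^T x + b = 0  <=>  p = 0.5
print(f"logit = {np.dot(w, x) + b:.3f}, P(y=1|x) = {p:.3f}, predicted class = {label}")
```

Because z is linear in x, the set of inputs with P(y=1 | x) = 0.5 (i.e., z = 0) is a hyperplane, which is exactly the linear decision boundary of view (IV).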
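As a companion to the MLE and log-likelihood answers above, here is a sketch of the standard derivation of the conditional log-likelihood for logistic regression, assuming the usual i.i.d. training set {(xᵢ, yᵢ)} with yᵢ ∈ {0, 1} and pᵢ = P(y=1 | xᵢ) = sigmoid(βᵀxᵢ).

```latex
% Bernoulli model for one sample, with p_i = P(y_i = 1 \mid x_i) = \sigma(\beta^{\top} x_i):
P(y_i \mid x_i, \beta) = p_i^{\,y_i} (1 - p_i)^{\,1 - y_i}

% Likelihood of the whole (i.i.d.) training set is a product over samples:
L(\beta) = \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{\,1 - y_i}

% Taking the logarithm turns the product into a sum (the conditional log-likelihood):
\ell(\beta) = \sum_{i=1}^{n} \Big[\, y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \,\Big]

% MLE chooses the parameters that maximize this sum:
\hat{\beta} = \arg\max_{\beta} \; \ell(\beta)
```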
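To make the gradient-descent versus Newton comparison concrete, the following minimal NumPy sketch computes the gradient and Hessian of the negative log-likelihood on a tiny made-up data set, then performs one full-batch gradient step and one Newton step; SGD would compute the same kind of gradient from a single sample or small batch. The data, learning rate, and variable names are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny illustrative data set: X includes a column of ones for the bias term.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]])
y = np.array([1.0, 0.0, 1.0, 0.0])
beta = np.zeros(X.shape[1])       # parameters to estimate

p = sigmoid(X @ beta)             # current P(y=1 | x) for every sample

# First-order information: gradient of the negative log-likelihood.
grad = X.T @ (p - y)

# Second-order information: Hessian of the negative log-likelihood.
W = np.diag(p * (1.0 - p))
hessian = X.T @ W @ X

# One (batch) gradient-descent step: needs only the gradient.
lr = 0.1                          # illustrative learning rate
beta_gd = beta - lr * grad

# One Newton step: uses both the gradient and the Hessian (curvature).
beta_newton = beta - np.linalg.solve(hessian, grad)

print("GD step:     ", beta_gd)
print("Newton step: ", beta_newton)
```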
Essay Questions
Instructions: Formulate detailed responses to the following questions, synthesizing information from across the source material.
- Synthesize the five different “views” of logistic regression presented: (I) the logit as a linear function, (II) Y as a Bernoulli variable, (III) the “S”-shaped sigmoid function, (IV) a linear classification boundary, and (V) a two-stage neuron model. How do these perspectives complement each other to provide a comprehensive understanding of the model?
- Discuss the role of Maximum Likelihood Estimation (MLE) in training a logistic regression model. Trace the process from defining the likelihood for a basic Bernoulli trial to deriving the conditional log-likelihood function for the entire logistic regression training set.
- Explain the concept of the MAP (Maximum A Posteriori) rule within the framework of Bayes classifiers. How is this rule justified by the principles of Statistical Decision Theory and the minimization of Expected Prediction Error (EPE) with a 0-1 loss function?
- Compare and contrast the three major groups of classifiers: discriminative, generative, and instance-based. Provide examples of each and explain where logistic regression fits within this taxonomy and why.
- Describe the optimization challenge in fitting a logistic regression model. Compare the second-order Newton’s method with the first-order Stochastic Gradient Descent (SGD) method, discussing their respective approaches, requirements (e.g., first and second derivatives), and characteristics.
Glossary of Key Terms
| Term | Definition |
|---|---|
| Bayes Classifier | A probabilistic classifier that treats each feature and the class label as random variables. It predicts the class of a sample x by finding the class C that maximizes the posterior probability p(C \| x). |
| Bernoulli Distribution | A probability distribution for a binary random variable, such as a coin flip that can be “Head” (with probability p) or “Tail.” In logistic regression, the target variable Y is modeled as a Bernoulli random variable where p is a function of the input features x. |
| Decision Boundary | In logistic regression, this is the boundary where the log-odds equation equals zero. Because the log-odds are a linear function of the input features, the decision boundary is linear. |
| Discriminative Classifier | A type of classifier that directly estimates a decision rule or boundary to separate classes. Logistic regression, support vector machines, and neural networks are examples. |
| Expected Prediction Error (EPE) | A measure of a model’s performance, defined as the expected value of a loss function L(Y, f(X)) over the joint distribution Pr(X,Y). Minimizing EPE is the central goal of statistical decision theory. |
| Generative Classifier | A type of classifier that builds a generative statistical model of the data for each class. It models P(X \| Y) and P(Y) to make predictions. |
| Hessian Matrix | For a multivariate function, the Hessian is the matrix of second-order partial derivatives. It is the multivariate equivalent of the second derivative and is used in second-order optimization methods like Newton’s method. |
| Instance-Based Classifier | A type of classifier that uses observations directly for classification without building an explicit model. K-Nearest Neighbors is a key example. |
| Log-Likelihood | The natural logarithm of the likelihood function. It is often used in MLE because it converts products into sums, which are mathematically easier to differentiate and optimize. |
| Logistic Regression | A discriminative probabilistic classifier for binary classification tasks. It models the log-odds (logit) of the class probability as a linear function of the input features. |
| Logit (Log-odds) | The core function in logistic regression, defined as the natural logarithm of the odds: ln[P(y=1\|x) / P(y=0\|x)]. |
| Maximum A Posteriori (MAP) Rule | The decision rule used by Bayes classifiers. It assigns a new instance X to the class c* that has the maximum posterior probability P(c* \| X). |
| Maximum Likelihood Estimation (MLE) | A method for estimating the parameters of a statistical model by finding the parameter values that maximize the likelihood function of the observed data. It chooses parameters that are “most likely” to have produced the observed results. |
| Newton’s Method | A second-order optimization algorithm that finds the minimum of a function by making a quadratic Taylor series approximation at each step. It requires both first (gradient) and second (Hessian) derivatives and often converges faster than first-order methods. |
| Sigmoid Function | An “S”-shaped function used in logistic regression to compress a real-valued number (the output of the linear model) into the range [0, 1], allowing it to be interpreted as a probability. The formula is 1 / (1 + e⁻ᶻ). |
| Stochastic Gradient Descent (SGD) | A first-order iterative optimization algorithm used for finding the minimum of a function. It is a variant of gradient descent that updates model parameters based on the gradient computed from a single sample or a small batch of samples. |
| 0-1 Loss Function | A loss function used for classification where the loss is 0 for a correct classification and 1 for an incorrect classification. Minimizing the EPE with a 0-1 loss function leads to the MAP classification rule (illustrated by the short sketch after this table). |
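The glossary entries for the Bayes classifier, the MAP rule, and the 0-1 loss fit together in one line of reasoning: the expected 0-1 loss of predicting class c is 1 − P(c | x), so minimizing the expected loss is the same as maximizing the posterior. A tiny sketch with assumed, illustrative posterior values (not from the lecture):

```python
# Illustrative posterior probabilities P(C | x) for one sample x (assumed numbers).
posterior = {"class_A": 0.2, "class_B": 0.7, "class_C": 0.1}

# Expected 0-1 loss of predicting class c is the probability of being wrong:
#   E[loss | predict c] = 1 - P(c | x)
expected_loss = {c: 1.0 - p for c, p in posterior.items()}

# MAP rule: pick the class with the highest posterior ...
map_choice = max(posterior, key=posterior.get)
# ... which is the same as the class with the lowest expected 0-1 loss.
min_loss_choice = min(expected_loss, key=expected_loss.get)

assert map_choice == min_loss_choice
print(map_choice, round(expected_loss[map_choice], 2))   # -> class_B 0.3
```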