Naïve Bayes Classifier on Text

Generative Classification


L16 text NBC notebook

Study Guide: Naïve Bayes Classifier for Text Classification

Quiz: Short-Answer Questions

Instructions: Answer the following questions in 2-3 complete sentences, drawing only upon the provided course materials.

  1. What is the core “naïve” assumption made by the Naïve Bayes classifier, particularly in the context of text classification?
  2. Describe the “bag of words” representation for text documents and identify its primary simplifying assumption.
  3. What are the two main probabilistic models discussed for applying the Naïve Bayes classifier to text, and how do their feature representations differ?
  4. According to the materials, which of the two primary Naïve Bayes models is generally more effective for text classification tasks?
  5. Explain the purpose of using log probabilities during the testing stage of a Naïve Bayes classifier.
  6. How does Maximum Likelihood Estimation (MLE) determine the parameters for the Multivariate Bernoulli model?
  7. In the Multinomial Naïve Bayes model, what is the concept of a “mega-document” and how is it used during training?
  8. What is a primary strategy for handling out of vocabulary (OOV) or unknown words that appear in a test document but not in the training corpus?
  9. Despite its simplifying assumption, why is the Naïve Bayes classifier considered robust and a good baseline model?
  10. How can the Multinomial Naïve Bayes classifier be understood as a class-conditional unigram language model?

Answer Key

  1. The core “naïve” assumption is that all input attributes are conditionally independent given the class. In text classification, this means that the appearance of one word in a document is assumed to be independent of the appearance of any other word, given the document’s topic or class.

  2. The “bag of words” representation models a text document as a vector, where each dimension corresponds to a word in a dictionary. This vector can either contain the frequency of each word or a boolean value indicating its presence or absence. The model’s primary simplifying assumption is that word order is not important.

  3. The two models are the Multivariate Bernoulli Naïve Bayes and the Multinomial Naïve Bayes. The Multivariate Bernoulli model uses binary features, representing each word as a boolean (true/false) value indicating whether it appears in the document. The Multinomial model uses integer features representing the frequency (count) of each word in the document.

  4. The course materials state that the Multinomial model is “almost always more effective in text applications.” An experiment by McCallum and Nigam (1998) on classifying university web pages is cited as evidence supporting this conclusion.

  5. Using log probabilities is a technique for underflow prevention. Multiplying many small probability values (which are between 0 and 1) can result in a number too small for standard floating-point representation. By summing the logs of the probabilities instead—since log(xy) = log(x) + log(y)—this numerical instability is avoided while preserving the ability to find the most probable class.

  6. For the Multivariate Bernoulli model, Maximum Likelihood Estimation (MLE) is used to estimate the parameter P(word_i = true | class_j). This is calculated as the fraction of documents belonging to class j in which word i appears. It is the relative frequency of a binary event (the word’s presence).
  7. In the Multinomial model, a “mega-document” for a specific class is created by conceptually concatenating all training documents belonging to that class. The frequency of a word w in this mega-document is then used to calculate the probability P(w | class) via MLE, which simplifies to the relative frequency of w across all words in all documents of that class (see the code sketch following this answer key).
  8. A primary strategy is to train the model with an explicit symbol for an unknown word, such as a special UNK token. During preprocessing, words in the training corpus that are not in a pre-chosen vocabulary (or that are rare) can be replaced with UNK, allowing the model to learn a probability for it.

  9. Naïve Bayes is considered a good baseline because it is very fast to train (one pass over the data) and test, has low storage requirements, and is robust to irrelevant features. For many text categorization tasks with numerous features, it performs well and was even a top performer in the KDD-CUP 97 competition.

  10. The Multinomial model can be seen as a class-conditional unigram language model because it calculates the probability of a document by multiplying the probabilities of its individual words, given a class. This is equivalent to a unigram model (where each word’s probability is independent of others) where a separate set of word probabilities (a separate language model) is learned for each class.
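
The estimation and classification steps described in answers 5 through 8 can be tied together in code. The sketch below is a minimal illustration rather than the lecture notebook’s implementation; the function names, the UNK spelling, and the data layout are my own choices, and it uses plain MLE with no smoothing, exactly as the answers above describe.

```python
from collections import Counter
from math import log

UNK = "<UNK>"  # assumed spelling of the unknown-word symbol

def train_multinomial_nb(docs, labels, vocab):
    """Estimate P(c) and P(w | c) by MLE.

    docs   -- list of tokenized documents (lists of words)
    labels -- class label of each document, in the same order
    vocab  -- set of dictionary words; anything else is mapped to UNK
    """
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    cond = {}
    for c in classes:
        # "Mega-document": concatenate every training document of class c,
        # mapping out-of-vocabulary words to the UNK symbol.
        mega = [w if w in vocab else UNK
                for doc, y in zip(docs, labels) if y == c
                for w in doc]
        counts, total = Counter(mega), len(mega)
        # MLE: relative frequency of each word within the class mega-document.
        cond[c] = {w: counts[w] / total for w in vocab | {UNK}}
    return priors, cond

def classify(doc, priors, cond):
    """Return argmax over c of log P(c) + sum over words of log P(w | c)."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = log(prior)
        for w in doc:
            p = cond[c].get(w, cond[c][UNK])   # unseen word -> UNK probability
            if p == 0.0:                       # pure MLE can give zero; rule this class out
                score = float("-inf")
                break
            score += log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Calling train_multinomial_nb on tokenized training documents with their labels and a chosen vocabulary returns the class priors and conditional probability tables, which classify then combines in log space to avoid underflow.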

Essay Questions

Instructions: Prepare a detailed, essay-format response for each of the following prompts.

  1. Provide a comprehensive comparison of the Multivariate Bernoulli and Multinomial Naïve Bayes models for text classification. Discuss their underlying assumptions, how text documents are represented as features, the process for parameter estimation (MLE) in each, and the practical reasons one is often preferred over the other.

  2. Explain the complete workflow for training and testing a Multinomial Naïve Bayes text classifier. Begin with raw text documents and a fixed set of classes, and detail the steps of text representation (including dictionary creation and the bag-of-words model), parameter estimation, and the final classification decision process for a new document.

  3. Discuss the statement: “Naive Bayes is Not So Naive.” Elaborate on the strengths of the Naïve Bayes classifier that make it a powerful and dependable baseline for text classification, referencing its performance in competitions like KDD-CUP 97, its robustness to certain types of features, and its computational efficiency.

  4. Describe the mathematical foundation of the Naïve Bayes classifier, starting from Bayes’ rule (argmax_C P(C | X)). Explain how the “naïve” assumption of conditional independence simplifies the P(X | C) term and makes computation tractable, especially for high-dimensional data like text (a formula sketch follows these prompts).
  5. What is a generative model in the context of classification? Explain how both the Multinomial and Multivariate Bernoulli Naïve Bayes classifiers can be viewed as generative models that approximate how a text document is produced, given a class label.
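
As a reference point for prompt 4, the chain of steps can be written compactly. This is a standard-notation sketch consistent with the answers above, not a reproduction of the slides:

```latex
\begin{aligned}
c^{*} &= \arg\max_{c \in C} P(c \mid X)
       = \arg\max_{c \in C} \frac{P(X \mid c)\,P(c)}{P(X)}
       = \arg\max_{c \in C} P(X \mid c)\,P(c) \\
P(X \mid c) &= P(x_1, \ldots, x_n \mid c) \approx \prod_{i=1}^{n} P(x_i \mid c)
  \qquad \text{(conditional independence)} \\
c^{*} &= \arg\max_{c \in C}\Big[\log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c)\Big]
  \qquad \text{(log form for underflow prevention)}
\end{aligned}
```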

Glossary of Key Terms

Bag of Words: A representation of a text document that models it as a high-dimensional vector. This vector can either represent the frequency of each word from a dictionary appearing in the document or a boolean value indicating the presence or absence of each word. This model simplifies text by assuming word order is not important.
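
As a minimal illustration with a toy dictionary and document of my own choosing (not from the course materials), the two vector variants look like this:

```python
from collections import Counter

dictionary = ["buy", "cheap", "meeting", "viagra"]   # toy dictionary (assumed)
document = "buy cheap cheap viagra".split()          # toy document; word order is discarded

counts = Counter(document)
frequency_vector = [counts[w] for w in dictionary]   # multinomial view: [1, 2, 0, 1]
boolean_vector = [w in counts for w in dictionary]   # Bernoulli view: [True, True, False, True]
```
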
Conditional Independence Assumption: The core “naïve” assumption of the Naïve Bayes classifier. It posits that, given the class variable, all features (e.g., words in a document) are independent of one another.
Generative Probabilistic Model: A type of statistical model that describes how a dataset is generated. In classification, a generative model learns the joint probability distribution P(X, C) or the class-conditional probability P(X|C) and the class prior P(C), effectively modeling how to generate data X for each class C.
Maximum Likelihood Estimation (MLE): A method for estimating the parameters of a probability distribution by maximizing a likelihood function. For Bernoulli distributions, this results in the relative frequency of an event. For Multinomial distributions, it corresponds to the relative frequency of each category’s occurrence.
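
For the two text models in this guide, the MLE estimates reduce to relative frequencies; written out (notation mine, following the definitions above):

```latex
\begin{gathered}
\hat{P}(w_i = \text{true} \mid c_j)
  = \frac{\#\{\text{documents of class } c_j \text{ containing } w_i\}}
         {\#\{\text{documents of class } c_j\}}
  \qquad \text{(Multivariate Bernoulli)} \\[6pt]
\hat{P}(w \mid c_j)
  = \frac{\text{count of } w \text{ in the class-}c_j\text{ mega-document}}
         {\text{total word tokens in that mega-document}}
  \qquad \text{(Multinomial)}
\end{gathered}
```
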
Multinomial Naïve Bayes: A Naïve Bayes classification model that uses a multinomial distribution for its features. It is well-suited for text classification where features are word counts or frequencies. It is generally considered more effective for text tasks than the Bernoulli model.
Multivariate Bernoulli Naïve Bayes: A Naïve Bayes classification model appropriate for binary feature variables. In text classification, each feature represents a word from the dictionary, and its value is true if the word appears in the document and false otherwise, ignoring frequency.
Out of Vocabulary (OOV): Words that appear in test data but were not present in the training data used to build the model’s dictionary. These are often handled by mapping them to a special UNK (unknown) token.
Parameter Estimation: The process of using training data to calculate the probability values (parameters) needed by the model. For Naïve Bayes, this involves calculating class priors P(C) and conditional probabilities P(word | C).
Stochastic Language Model: A probabilistic model of a sequence of words. A unigram language model, for instance, models the probability of a string by multiplying the independent probabilities of each word in it. Multinomial Naïve Bayes acts as a class-conditional unigram language model.
Underflow Prevention: A computational technique used to avoid numerical errors when multiplying many small probabilities. This is typically achieved by converting probabilities to their logarithms and summing them instead of multiplying the original values.
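
A quick numeric demonstration of the problem and the fix, using toy probabilities rather than anything from the lecture:

```python
from math import log

# 2,000 word probabilities of 0.01 each: the direct product underflows to exactly 0.0,
# so every class would tie at zero and the argmax becomes meaningless.
probs = [0.01] * 2000
product = 1.0
for p in probs:
    product *= p
print(product)       # 0.0 (double-precision underflow)

# Summing logs instead keeps the score finite and comparable across classes.
log_score = sum(log(p) for p in probs)
print(log_score)     # approximately -9210.34
```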