Feature Selection
- Lecture: S3-feaSelc
- Version: current
- Please Read: S3-QuizReview + ELS Ch3.3 and Ch3.4 + API
- Recorded Videos: (Extra M2 + M3)
Note: the following markdown text was generated automatically from the corresponding PowerPoint lecture file, so some errors and misformatting may remain.
Study Guide: Feature Selection in Machine Learning
The Purpose and Benefits of Feature Selection
In machine learning, datasets can contain thousands or even millions of low-level features. Feature selection is the process of selecting a subset of p’ original features from a total of p features. The primary goal is to identify the most relevant features to build better learning models for tasks like classification and regression.
This process yields several significant benefits:
- Improved Generalization: Models with fewer, more relevant features are less sensitive to noise and tend to have lower variance. This aligns with Occam’s Razor (the law of parsimony).
- Computational Efficiency: Fewer features reduce training and inference cost and often require fewer labeled examples.
- Enhanced Interpretability: Simpler models are easier to understand and explain, important for trust and transparency.
Core Methodologies for Feature Selection
There are three primary approaches to feature selection, each differing in how they interact with the learning algorithm.
| Approach | Description | Key Characteristics |
|---|---|---|
| Filter | Ranks features or feature subsets independently of the predictor. | Learner-agnostic; fast; used as pre-processing. |
| Wrapper | Uses a predictor to assess the quality of features or feature subsets. | Learner-dependent; computationally expensive; simple to apply. |
| Embedded | Performs feature selection during model training. | Learner-specific; selection integrated into training. |
1. The Filter Approach
Filter methods select features as a pre-processing step, independently of the classifier. They include univariate and multivariate techniques.
Univariate Filtering
- Pearson Correlation: Measures linear correlation between a feature and the target (range [-1, 1]); detects only linear dependencies.
- T-test: Tests whether a feature distinguishes two classes (assumes normality and equal variance). Null hypothesis H0: class means are equal; T-statistic measures significance of mean difference.
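Both univariate filters above can be sketched in a few lines. This is a minimal illustration on synthetic data, assuming SciPy/NumPy; the feature layout and signal strength are made up for the example.

```python
# Univariate filtering sketch: rank features by |Pearson correlation| with
# the target, and compute a per-feature two-sample t-test. Synthetic data;
# feature 0 is constructed to be informative, features 1-2 are noise.
import numpy as np
from scipy.stats import pearsonr, ttest_ind

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)              # binary class labels
X = rng.normal(size=(n, 3))
X[:, 0] += 2.0 * y                          # only feature 0 carries signal

# Pearson correlation of each feature with the target
corrs = [pearsonr(X[:, j], y)[0] for j in range(X.shape[1])]

# Two-sample t-test per feature: H0 is that the class means are equal
tstats = [ttest_ind(X[y == 1, j], X[y == 0, j]).statistic
          for j in range(X.shape[1])]

ranking = np.argsort(-np.abs(corrs))        # rank features by |correlation|
print(ranking)                              # feature 0 should rank first
```

Note that both scores look at one feature at a time, which is exactly the limitation the multivariate methods below address.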
Multivariate Filtering
Methods consider multiple variables jointly to capture interactions missed by univariate methods.
Core challenges:
- Scoring Function: Measure the quality of a subset.
- Search Strategy: Explore the 2^p possible subsets efficiently.
Finding the minimal optimal subset is NP-hard; good heuristics are essential.
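To make the 2^p search space concrete, here is a toy exhaustive search for a small p. The scoring function (in-sample accuracy of a nearest-centroid rule) is an arbitrary choice for this sketch, not something prescribed by the lecture.

```python
# Exhaustive multivariate subset search for small p: score every non-empty
# subset of features. For p features there are 2^p subsets, which is why
# exhaustive search is infeasible beyond tiny p and heuristics are needed.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 300
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, 0] += 1.5 * y
X[:, 1] -= 1.5 * y                          # features 0 and 1 carry signal

def score(cols):
    """Accuracy of classifying each point by its nearer class centroid."""
    Z = X[:, cols]
    c0, c1 = Z[y == 0].mean(0), Z[y == 1].mean(0)
    pred = (np.linalg.norm(Z - c1, axis=1) <
            np.linalg.norm(Z - c0, axis=1)).astype(int)
    return (pred == y).mean()

subsets = [c for r in range(1, p + 1) for c in combinations(range(p), r)]
print(len(subsets))                         # 2^p - 1 = 15 non-empty subsets
best = max(subsets, key=lambda c: score(list(c)))
```

With p = 4 this is 15 evaluations; with p = 100 it would be roughly 10^30, which is the point of the NP-hardness remark.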
2. The Wrapper Approach
Uses the predictive performance of a specific learner (treated as a black box) to score feature subsets.
Two key questions:
- (a) Assessment: How to measure performance?
- Split the data into training/validation/test sets. Fit the learner on the training set, score each candidate subset on the validation set, and choose the best; cross-validation can replace the fixed validation split. Assess the final model once on the test set.
- (b) Search: How to explore 2^p subsets?
- Heuristics:
- Forward Selection: Start empty; add features iteratively.
- Backward Elimination: Start full; remove features iteratively.
- Advanced: Beam search, GSFS, PTA(l,r), floating search.
Main criticism: high computational cost due to repeated training/evaluation.
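The forward-selection heuristic above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the choice of logistic regression as the black-box learner and the synthetic dataset are made up for the example.

```python
# Wrapper-style forward selection: the learner is treated as a black box,
# refit on each candidate subset, and scored on a held-out validation split.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

def val_score(cols):
    """Train on the training split, score on the validation split."""
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    return clf.score(X_val[:, cols], y_val)

selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    # try adding each remaining feature; keep the best one if it improves
    cand = max(remaining, key=lambda j: val_score(selected + [j]))
    s = val_score(selected + [cand])
    if s <= best_score:
        break                               # no improvement: stop adding
    selected.append(cand)
    remaining.remove(cand)
    best_score = s
```

Each pass over `remaining` retrains the learner once per candidate feature, which is exactly where the wrapper approach's computational cost comes from.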
3. The Embedded Approach
Integrates selection into training a single model; selection occurs implicitly.
- Key Characteristics: Learner-specific.
- Example: L1-regularization shrinks many coefficients to zero, selecting features.
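The L1-regularization example can be sketched with scikit-learn. The dataset, penalty strength `C`, and solver are illustrative choices for this sketch.

```python
# Embedded selection via L1-regularization: fitting an L1-penalized
# logistic regression drives many coefficients exactly to zero, so the
# surviving (nonzero) coefficients define the selected features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0])     # features with nonzero weight
print(len(selected), "of", X.shape[1], "features kept")
```

Unlike the Filter and Wrapper approaches, no separate selection loop is needed: a single training run produces both the model and the feature subset.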
Feature Selection vs. Feature Extraction
- Feature Selection: Selects a subset of original features (X1, …, Xp → Xk1, …, Xkp’); the features themselves remain unchanged.
- Feature Extraction / Dimensionality Reduction: Creates new features as (possibly weighted) combinations of the originals (X1, …, Xp → g1(X), …, gp’(X)).
Example: Principal Component Analysis (PCA) finds a linear mapping to a lower-dimensional space that maximally preserves variance; principal components are combinations of original features.
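A minimal PCA sketch, assuming scikit-learn; the dimensions and noise level are illustrative. Note that the extracted components are linear combinations of the original features, not a subset of them.

```python
# Feature extraction with PCA: project 5-dimensional data onto the 2
# directions of maximal variance. The data is built to lie near a 2-D
# subspace, so 2 components capture almost all the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples in 5 dimensions, with most variance in 2 latent directions
X = (rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
     + 0.05 * rng.normal(size=(200, 5)))

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)                        # the new, extracted features
print(Z.shape)                              # (200, 2)
print(pca.explained_variance_ratio_.sum())  # near 1: variance preserved
```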
Model Selection and Assessment
Feature selection is tied to broader tasks of model selection and assessment.
- Model Selection: Estimate performance to choose the best model (e.g., hyperparameters or feature subset).
- Model Assessment: Estimate the chosen model’s prediction error on unseen data.
Data Handling Strategies
- Data-Rich Scenario: Split into train (build), validation (select), test (final assess).
- Insufficient Data: Use cross-validation or bootstrap; or analytical approximations (AIC/BIC) to mimic validation.
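The data-rich strategy above can be sketched as follows, assuming scikit-learn; the split fractions, hyperparameter grid, and learner are illustrative choices.

```python
# Data handling sketch: 60/20/20 train/validation/test split. The
# validation set drives model selection; the test set is touched only once
# for model assessment. Cross-validation is shown as the scarce-data option.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)

# hold out 20% for the final test, then carve 25% of the rest as validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2,
                                                random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                            random_state=0)

# model selection: pick the regularization strength by validation score
best = max([0.01, 0.1, 1.0],
           key=lambda C: LogisticRegression(C=C, max_iter=1000)
                         .fit(X_tr, y_tr).score(X_val, y_val))

# with insufficient data, cross-validation reuses samples instead
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# model assessment: the chosen model is scored once on the untouched test set
final = LogisticRegression(C=best, max_iter=1000).fit(X_tr, y_tr)
print(final.score(X_test, y_test))
```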
Review Quiz
Answer the following questions in 2-3 sentences each based on the provided material.
- What is the primary objective of feature selection in machine learning?
- According to the text, what are the three main benefits of applying feature selection?
- Explain the fundamental difference between the Filter and Wrapper approaches.
- What is a major limitation of univariate filtering methods like the Pearson Correlation?
- Why is the task of finding a minimal optimal feature subset described as NP-hard?
- In the Wrapper approach, what are the distinct roles of the training, validation, and test data sets?
- Describe the core ideas behind the Forward Selection and Backward Elimination search strategies.
- How does the Embedded approach to feature selection differ from both Filter and Wrapper methods?
- Distinguish between Feature Selection and Feature Extraction, using Principal Components Analysis (PCA) as an example.
- What is the purpose of model selection, and how does it differ from model assessment?
Answer Key
- Objective: Select a subset of the most relevant original features to build models that are better, faster, and more interpretable for classification or regression.
- Benefits: Improved generalization (lower variance, less noise sensitivity); computational efficiency (faster training/inference); enhanced interpretability (more explainable).
- Filter vs Wrapper: Filter ranks/selects features independently of the learner as pre-processing; Wrapper uses the performance of a specific learner to score feature subsets.
- Limitation of Univariate Filtering: Pearson correlation detects only linear dependencies and, like all univariate filters, evaluates each feature in isolation; it can miss nonlinear effects and features that are useful only in combination with others.
- NP-hardness: With p features, there are 2^p subsets; exhaustive search is infeasible, requiring heuristics.
- Data splits: Train to fit predictors per subset; Validation to select the best subset; Test used once for final unbiased performance estimate.
- Search strategies: Forward Selection adds features iteratively from empty; Backward Elimination removes features iteratively from full.
- Embedded: Selection is integrated into model training (learner-specific), unlike separate pre-processing (Filter) or external evaluation loops (Wrapper).
- Selection vs Extraction: Selection keeps original features; Extraction creates new features (e.g., PCA components) as combinations that preserve variance.
- Model Selection vs Assessment: Selection chooses the best model via validation; Assessment estimates final generalization on unseen data (test set).
Essay Questions
Formulate comprehensive responses for each:
- Discuss Occam’s Razor and its relationship to the three primary goals of feature selection.
- Compare univariate vs. multivariate Filter methods; give a scenario where multivariate succeeds but univariate fails.
- With millions of features and limited compute, which approach (Filter, Wrapper, Embedded) would you choose and why? Discuss trade-offs.
- Explain the inherent “search problem” in feature selection. Describe three heuristic strategies and when each is preferable.
- Detail the complete Wrapper process with emphasis on data splitting. Why separate validation and test sets to avoid bias?
Glossary of Key Terms
| Term | Definition |
|---|---|
| Backward Elimination | Heuristic that starts with all features and iteratively removes the least useful ones. |
| Cross-Validation | Efficient sample reuse to estimate learner performance when validation data is limited. |
| Dimensionality Reduction | Creating new features g(X) by combining originals; also called Feature Extraction. |
| Embedded Approach | Selection performed implicitly during model training; learner-specific. |
| Feature Extraction | Transforming X into new features g(X), often lower dimensional. |
| Feature Selection | Selecting a subset of p’ original features from p available features. |
| Filter Approach | Ranks features/subsets independent of the predictor as pre-processing. |
| Forward Selection | Heuristic that starts empty and iteratively adds the most useful features. |
| L1-Regularization | Embedded method that penalizes complexity and drives some coefficients to zero. |
| Model Assessment | Estimating prediction error of the final model on new data. |
| Model Selection | Estimating performance of candidate models to choose the best. |
| Multivariate Methods | Filter methods assessing multiple variables jointly. |
| NP-hard | Class of problems for which no polynomial-time algorithm is known; optimal feature-subset search is NP-hard. |
| Occam’s Razor | Prefer simpler explanations/models with fewer assumptions. |
| Pearson Correlation | Measures linear relationship between two variables (range [-1, 1]). |
| Principal Components Analysis (PCA) | Feature extraction mapping to a lower-dimensional space preserving variance. |
| T-test | Univariate test assessing whether a feature separates two normally distributed classes. |
| Univariate Methods | Filter methods assessing one variable at a time. |
| Wrapper Approach | Uses a learner as a black box to score feature subsets by predictive power. |