Feature Selection
- Lecture: S3-feaSelc
- Version: current
- Please Read: S3-QuizReview + ELS Ch3.3 and Ch3.4 + API
- Recorded Videos: (Extra M2 + M3)
Note: the following markdown text was generated automatically from the corresponding PowerPoint lecture file, so some errors and misformatting may remain.
Study Guide: Feature Selection in Machine Learning
The Purpose and Benefits of Feature Selection
In machine learning, datasets can contain thousands or even millions of low-level features. Feature selection is the process of selecting a subset of p’ original features from a total of p features. The primary goal is to identify the most relevant features to build better learning models for tasks like classification and regression.
This process yields several significant benefits:
- Improved Generalization: Models with fewer, more relevant features are less sensitive to noise and tend to have lower variance. This aligns with Occam’s Razor (the law of parsimony).
- Computational Efficiency: Fewer features reduce training and inference cost and often require fewer labeled examples.
- Enhanced Interpretability: Simpler models are easier to understand and explain, important for trust and transparency.
Core Methodologies for Feature Selection
There are three primary approaches to feature selection, each differing in how they interact with the learning algorithm.
| Approach | Description | Key Characteristics |
|---|---|---|
| Filter | Ranks features or feature subsets independently of the predictor. | Learner-agnostic; fast; used as pre-processing. |
| Wrapper | Uses a predictor to assess the quality of features or feature subsets. | Learner-dependent; computationally expensive; simple to apply. |
| Embedded | Performs feature selection during model training. | Learner-specific; selection integrated into training. |
1. The Filter Approach
Filter methods select features as a pre-processing step, independently of the classifier. They include univariate and multivariate techniques.
Univariate Filtering
- Pearson Correlation: Measures linear correlation between a feature and the target (range [-1, 1]); detects only linear dependencies.
- T-test: Tests whether a feature distinguishes two classes (assumes normality and equal variance). Null hypothesis H0: class means are equal; T-statistic measures significance of mean difference.
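Both univariate filters above can be sketched in a few lines. This is a minimal illustration on synthetic data, assuming SciPy/NumPy; the feature layout and signal strength are made up for the example.

```python
# Univariate filtering sketch: rank features by |Pearson correlation| with
# the target, and compute a per-feature two-sample t-test. Synthetic data;
# feature 0 is constructed to be informative, features 1-2 are noise.
import numpy as np
from scipy.stats import pearsonr, ttest_ind

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)              # binary class labels
X = rng.normal(size=(n, 3))
X[:, 0] += 2.0 * y                          # only feature 0 carries signal

# Pearson correlation of each feature with the target
corrs = [pearsonr(X[:, j], y)[0] for j in range(X.shape[1])]

# Two-sample t-test per feature: H0 is that the class means are equal
tstats = [ttest_ind(X[y == 1, j], X[y == 0, j]).statistic
          for j in range(X.shape[1])]

ranking = np.argsort(-np.abs(corrs))        # rank features by |correlation|
print(ranking)                              # feature 0 should rank first
```

Note that both scores look at one feature at a time, which is exactly the limitation the multivariate methods below address.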
Multivariate Filtering
Methods consider multiple variables jointly to capture interactions missed by univariate methods.
Core challenges:
- Scoring Function: Measure the quality of a subset.
- Search Strategy: Explore the 2^p possible subsets efficiently.
Finding the minimal optimal subset is NP-hard; good heuristics are essential.
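To make the 2^p search space concrete, here is a toy exhaustive search for a small p. The scoring function (in-sample accuracy of a nearest-centroid rule) is an arbitrary choice for this sketch, not something prescribed by the lecture.

```python
# Exhaustive multivariate subset search for small p: score every non-empty
# subset of features. For p features there are 2^p subsets, which is why
# exhaustive search is infeasible beyond tiny p and heuristics are needed.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 300
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, 0] += 1.5 * y
X[:, 1] -= 1.5 * y                          # features 0 and 1 carry signal

def score(cols):
    """Accuracy of classifying each point by its nearer class centroid."""
    Z = X[:, cols]
    c0, c1 = Z[y == 0].mean(0), Z[y == 1].mean(0)
    pred = (np.linalg.norm(Z - c1, axis=1) <
            np.linalg.norm(Z - c0, axis=1)).astype(int)
    return (pred == y).mean()

subsets = [c for r in range(1, p + 1) for c in combinations(range(p), r)]
print(len(subsets))                         # 2^p - 1 = 15 non-empty subsets
best = max(subsets, key=lambda c: score(list(c)))
```

With p = 4 this is 15 evaluations; with p = 100 it would be roughly 10^30, which is the point of the NP-hardness remark.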
2. The Wrapper Approach
Uses the predictive performance of a specific learner (treated as a black box) to score feature subsets.
Two key questions:
- (a) Assessment: How to measure performance?
- Split the data into training/validation/test sets. Fit the learner on the training set, score each candidate subset on the validation set, and choose the best; cross-validation can replace the fixed validation split. Assess the final model once on the test set.
- (b) Search: How to explore 2^p subsets?
- Heuristics:
- Forward Selection: Start empty; add features iteratively.
- Backward Elimination: Start full; remove features iteratively.
- Advanced: Beam search, GSFS, PTA(l,r), floating search.
Main criticism: high computational cost due to repeated training/evaluation.
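The forward-selection heuristic above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the choice of logistic regression as the black-box learner and the synthetic dataset are made up for the example.

```python
# Wrapper-style forward selection: the learner is treated as a black box,
# refit on each candidate subset, and scored on a held-out validation split.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

def val_score(cols):
    """Train on the training split, score on the validation split."""
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    return clf.score(X_val[:, cols], y_val)

selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    # try adding each remaining feature; keep the best one if it improves
    cand = max(remaining, key=lambda j: val_score(selected + [j]))
    s = val_score(selected + [cand])
    if s <= best_score:
        break                               # no improvement: stop adding
    selected.append(cand)
    remaining.remove(cand)
    best_score = s
```

Each pass over `remaining` retrains the learner once per candidate feature, which is exactly where the wrapper approach's computational cost comes from.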
3. The Embedded Approach
Integrates selection into training a single model; selection occurs implicitly.
- Key Characteristics: Learner-specific.
- Example: L1-regularization shrinks many coefficients to zero, selecting features.
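The L1-regularization example can be sketched with scikit-learn. The dataset, penalty strength `C`, and solver are illustrative choices for this sketch.

```python
# Embedded selection via L1-regularization: fitting an L1-penalized
# logistic regression drives many coefficients exactly to zero, so the
# surviving (nonzero) coefficients define the selected features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0])     # features with nonzero weight
print(len(selected), "of", X.shape[1], "features kept")
```

Unlike the Filter and Wrapper approaches, no separate selection loop is needed: a single training run produces both the model and the feature subset.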
Feature Selection vs. Feature Extraction
- Feature Selection: Selects a subset of original features (X1, …, Xp → Xk1, …, Xkp’); the features themselves remain unchanged.
- Feature Extraction / Dimensionality Reduction: Creates new features as (possibly weighted) combinations of the originals (X1, …, Xp → g1(X), …, gp’(X)).
Example: Principal Component Analysis (PCA) finds a linear mapping to a lower-dimensional space that maximally preserves variance; principal components are combinations of original features.
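A minimal PCA sketch, assuming scikit-learn; the dimensions and noise level are illustrative. Note that the extracted components are linear combinations of the original features, not a subset of them.

```python
# Feature extraction with PCA: project 5-dimensional data onto the 2
# directions of maximal variance. The data is built to lie near a 2-D
# subspace, so 2 components capture almost all the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples in 5 dimensions, with most variance in 2 latent directions
X = (rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
     + 0.05 * rng.normal(size=(200, 5)))

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)                        # the new, extracted features
print(Z.shape)                              # (200, 2)
print(pca.explained_variance_ratio_.sum())  # near 1: variance preserved
```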
Model Selection and Assessment
Feature selection is tied to broader tasks of model selection and assessment.
- Model Selection: Estimate performance to choose the best model (e.g., hyperparameters or feature subset).
- Model Assessment: Estimate the chosen model’s prediction error on unseen data.
Data Handling Strategies
- Data-Rich Scenario: Split into train (build), validation (select), test (final assess).
- Insufficient Data: Use cross-validation or bootstrap; or analytical approximations (AIC/BIC) to mimic validation.
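The data-rich strategy above can be sketched as follows, assuming scikit-learn; the split fractions, hyperparameter grid, and learner are illustrative choices.

```python
# Data handling sketch: 60/20/20 train/validation/test split. The
# validation set drives model selection; the test set is touched only once
# for model assessment. Cross-validation is shown as the scarce-data option.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)

# hold out 20% for the final test, then carve 25% of the rest as validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2,
                                                random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                            random_state=0)

# model selection: pick the regularization strength by validation score
best = max([0.01, 0.1, 1.0],
           key=lambda C: LogisticRegression(C=C, max_iter=1000)
                         .fit(X_tr, y_tr).score(X_val, y_val))

# with insufficient data, cross-validation reuses samples instead
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# model assessment: the chosen model is scored once on the untouched test set
final = LogisticRegression(C=best, max_iter=1000).fit(X_tr, y_tr)
print(final.score(X_test, y_test))
```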
Review Quiz
Answer the following questions in 2-3 sentences each based on the provided material.
- What is the primary objective of feature selection in machine learning?
- According to the text, what are the three main benefits of applying feature selection?
- Explain the fundamental difference between the Filter and Wrapper approaches.
- What is a major limitation of univariate filtering methods like the Pearson Correlation?
- Why is the task of finding a minimal optimal feature subset described as NP-hard?
- In the Wrapper approach, what are the distinct roles of the training, validation, and test data sets?
- Describe the core ideas behind the Forward Selection and Backward Elimination search strategies.
- How does the Embedded approach to feature selection differ from both Filter and Wrapper methods?
- Distinguish between Feature Selection and Feature Extraction, using Principal Components Analysis (PCA) as an example.
- What is the purpose of model selection, and how does it differ from model assessment?
Answer Key
- Objective: Select a subset of the most relevant original features to build models that are better, faster, and more interpretable for classification or regression.
- Benefits: Improved generalization (lower variance, less noise sensitivity); computational efficiency (faster training/inference); enhanced interpretability (more explainable).
- Filter vs Wrapper: Filter ranks/selects features independently of the learner as pre-processing; Wrapper uses the performance of a specific learner to score feature subsets.
- Limitation of Univariate Filtering: Pearson correlation detects only linear dependencies and, like all univariate filters, evaluates each feature in isolation; it can miss nonlinear effects and features that are useful only in combination with others.
- NP-hardness: With p features, there are 2^p subsets; exhaustive search is infeasible, requiring heuristics.
- Data splits: Train to fit predictors per subset; Validation to select the best subset; Test used once for final unbiased performance estimate.
- Search strategies: Forward Selection adds features iteratively from empty; Backward Elimination removes features iteratively from full.
- Embedded: Selection is integrated into model training (learner-specific), unlike separate pre-processing (Filter) or external evaluation loops (Wrapper).
- Selection vs Extraction: Selection keeps original features; Extraction creates new features (e.g., PCA components) as combinations that preserve variance.
- Model Selection vs Assessment: Selection chooses the best model via validation; Assessment estimates final generalization on unseen data (test set).
Essay Questions
Formulate comprehensive responses for each:
- Discuss Occam’s Razor and its relationship to the three primary goals of feature selection.
- Compare univariate vs. multivariate Filter methods; give a scenario where multivariate succeeds but univariate fails.
- With millions of features and limited compute, which approach (Filter, Wrapper, Embedded) would you choose and why? Discuss trade-offs.
- Explain the inherent “search problem” in feature selection. Describe three heuristic strategies and when each is preferable.
- Detail the complete Wrapper process with emphasis on data splitting. Why separate validation and test sets to avoid bias?
Glossary of Key Terms
| Term | Definition |
|---|---|
| Backward Elimination | Heuristic that starts with all features and iteratively removes the least useful ones. |
| Cross-Validation | Efficient sample reuse to estimate learner performance when validation data is limited. |
| Dimensionality Reduction | Creating new features g(X) by combining originals; also called Feature Extraction. |
| Embedded Approach | Selection performed implicitly during model training; learner-specific. |
| Feature Extraction | Transforming X into new features g(X), often lower dimensional. |
| Feature Selection | Selecting a subset of p’ original features from p available features. |
| Filter Approach | Ranks features/subsets independent of the predictor as pre-processing. |
| Forward Selection | Heuristic that starts empty and iteratively adds the most useful features. |
| L1-Regularization | Embedded method that penalizes complexity and drives some coefficients to zero. |
| Model Assessment | Estimating prediction error of the final model on new data. |
| Model Selection | Estimating performance of candidate models to choose the best. |
| Multivariate Methods | Filter methods assessing multiple variables jointly. |
| NP-hard | Class of problems for which no polynomial-time algorithm is known; optimal feature-subset search is NP-hard. |
| Occam’s Razor | Prefer simpler explanations/models with fewer assumptions. |
| Pearson Correlation | Measures linear relationship between two variables (range [-1, 1]). |
| Principal Components Analysis (PCA) | Feature extraction mapping to a lower-dimensional space preserving variance. |
| T-test | Univariate test assessing whether a feature separates two normally distributed classes. |
| Univariate Methods | Filter methods assessing one variable at a time. |
| Wrapper Approach | Uses a learner as a black box to score feature subsets by predictive power. |