16.1 Introduction
16.1.1 Supervised Learning
Objective: Prediction
- Access to a set of variables, \(x_1,x_2,\dots,x_p\), measured on \(n\) observations, and a response \(y\) also measured on those same \(n\) observations.
- Objective is to predict \(y\) using \(x_1,x_2,\dots,x_p\)
In many common situations there is a well-developed set of tools for supervised learning:
- Regression and classification: logistic regression, trees, random forests, bagging, boosting
- Clear understanding of how to assess the quality of the results obtained: cross-validation, model fit, etc.
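As a rough illustration of why supervised results are easy to assess, the sketch below estimates prediction accuracy by cross-validation on simulated data. Everything in it is an assumption made for illustration: the toy data, the helper names `knn_predict` and `cross_val_accuracy`, and the choice of a k-nearest-neighbors classifier, which is not one of the methods listed above and is used only because it fits in a few lines of standard-library Python.

```python
import random


def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Squared Euclidean distance from x to every training observation.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)


def cross_val_accuracy(X, y, n_folds=5, k=3, seed=0):
    """Estimate out-of-sample accuracy with n_folds-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    # Each fold is held out once while the model is fit on the remaining folds.
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train_X = [X[i] for i in idx if i not in held]
        train_y = [y[i] for i in idx if i not in held]
        correct += sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in held_out
        )
    return correct / len(X)


# Simulated data (hypothetical): p = 2 predictors, n = 60 observations,
# and a known binary response y.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)] + [
    [rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(30)
]
y = [0] * 30 + [1] * 30

accuracy = cross_val_accuracy(X, y)
print(f"5-fold cross-validated accuracy: {accuracy:.2f}")
```

Because the response \(y\) is observed, each held-out prediction can be compared with the truth, which is what makes this kind of assessment possible.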
16.1.2 Unsupervised Learning
Objective: Description
- A set of statistical tools for the setting in which we have only a set of variables, \(x_1,x_2,\dots,x_p\), measured on \(n\) observations.
- Objective is to uncover interesting patterns in the measurements of \(x_1,x_2,\dots,x_p\)
- Not interested in prediction because we do not have an associated response variable \(y\)
- Can we discover subgroups among the variables or among the observations?
- Is there an informative way to visualize the data?
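To make the question about subgroups among the observations concrete, here is a minimal sketch that groups simulated observations without using any response. It assumes K-means clustering as the tool, which this section does not name, and the function `kmeans`, the toy data, and the choice of two clusters are all hypothetical.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two observations."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def assign(X, centers):
    """Label each observation with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: sq_dist(x, centers[c])) for x in X]


def kmeans(X, n_clusters, n_iter=20, seed=0):
    """Basic K-means: alternate between assigning points and updating centers."""
    rng = random.Random(seed)
    # Start from n_clusters distinct observations chosen at random.
    centers = [list(x) for x in rng.sample(X, n_clusters)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        for c in range(n_clusters):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(X, centers), centers


# Simulated data (hypothetical): p = 2 variables, n = 40 observations, no response y.
rng = random.Random(2)
X = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(20)] + [
    [rng.gauss(5, 0.5), rng.gauss(5, 0.5)] for _ in range(20)
]

labels, centers = kmeans(X, n_clusters=2)
print("Cluster sizes:", [labels.count(c) for c in range(2)])
```

Note that the number of clusters had to be chosen by hand, and in real data there is no observed response against which to check the resulting labels.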
Unsupervised learning is, in some ways, more challenging than supervised learning.
- Tends to be more subjective.
- No simple goal for the analysis, such as prediction of a response
- Often hard to assess the results obtained from unsupervised learning methods because there is no universally accepted mechanism for validating the results
- No way to check our work because we do not know the true answer: the problem is unsupervised
The importance of reliable unsupervised learning methods is growing in a number of fields:
- A cancer researcher might assay gene expression levels in patients with breast cancer and look for subgroups among the samples to better understand the disease.
- An online shopping site might try to identify groups of shoppers with similar browsing and purchase histories so that coupons, sales, etc. can be targeted to each group.
- A search engine might choose what search results to display to a particular individual based on the click histories of other individuals with similar search patterns.