16.1 Introduction

16.1.1 Supervised Learning

Objective: Prediction

  • Access to a set of variables, \(x_1,x_2,\dots,x_p\), measured on \(n\) observations, and a response \(y\) also measured on those same \(n\) observations.
  • Objective is to predict \(y\) using \(x_1,x_2,\dots,x_p\)

In many common situations there is a well-developed set of tools for supervised learning:

  • Regression and classification
    • logistic regression, trees, random forests, bagging, boosting
  • Clear understanding of how to assess the quality of obtained results
    • cross-validation, model fit, etc.
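As a rough illustration of predicting \(y\) from a predictor and then assessing the result, the sketch below fits a least-squares line with a single predictor and estimates its test error by k-fold cross-validation. The synthetic data, the helper names `fit_line` and `cv_mse`, and the choice of 5 folds are all assumptions made for this example; they do not come from the notes.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def cv_mse(xs, ys, k=5, seed=0):
    """Estimate test mean squared error by k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
    sq_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the training folds only, then predict the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_errors.extend((ys[i] - (a + b * xs[i])) ** 2 for i in held_out)
    return sum(sq_errors) / len(sq_errors)

# Synthetic data (assumed for illustration): y = 2 + 3x + Gaussian noise.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2 + 3 * x + rng.gauss(0, 0.5) for x in xs]

a, b = fit_line(xs, ys)
mse = cv_mse(xs, ys)
print(f"intercept={a:.2f}, slope={b:.2f}, CV estimate of test MSE={mse:.3f}")
```

Because the response is observed, the held-out errors give a direct check on how well the fitted model predicts.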

16.1.2 Unsupervised Learning

Objective: Description

  • A set of statistical tools intended for settings in which we have only a set of variables, \(x_1,x_2,\dots,x_p\), measured on \(n\) observations.
  • Objective is to uncover interesting patterns in the measurements of \(x_1,x_2,\dots,x_p\)
    • Not interested in prediction because we do not have an associated response variable \(y\)
    • Can we discover subgroups among the variables or among the observations?
    • Is there an informative way to visualize the data?
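One way to look for subgroups among the observations is clustering. The sketch below uses k-means (Lloyd's algorithm), chosen here only as a familiar example of such a tool; the notes above do not name a specific method. The synthetic two-group data, the function name `kmeans`, and the choice \(K = 2\) are assumptions for illustration.

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen observations
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each observation joins its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its assigned observations.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Synthetic data (assumed for illustration): two groups of 20 observations,
# each measured on p = 2 variables. No response y is used anywhere.
rng = random.Random(42)
group_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(20)]
group_b = [(rng.gauss(6, 0.5), rng.gauss(6, 0.5)) for _ in range(20)]
points = group_a + group_b

labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster labels themselves are arbitrary (0 or 1); only the grouping of observations is meaningful. In this toy example the groups are known because the data were simulated, which is not the case in a real unsupervised problem.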

Unsupervised learning is, in some ways, more challenging than supervised learning.

  • Tends to be more subjective.
  • No simple goal for the analysis, such as prediction of a response
  • Often hard to assess the results obtained from unsupervised learning methods because there is no universally accepted mechanism for validating the results
  • No way to check our work because we do not know the true answer: the problem is unsupervised

The importance of reliable unsupervised learning methods is growing in a number of fields:

  • A cancer researcher might assay gene expression levels in patients with breast cancer and look for subgroups among the samples to better understand the disease.
  • An online shopping site might try to identify groups of shoppers with similar browsing and purchase histories to target coupons, sales, etc.
  • A search engine might choose what search results to display to a particular individual based on the click histories of other individuals with similar search patterns.