ETC3250/5250 Tutorial 5

Logistic regression and discriminant analysis

Author

Prof. Di Cook

Published

25 March 2024

Load the libraries and avoid conflicts
# Load libraries used everywhere
library(tidyverse)
library(tidymodels)
library(patchwork)
library(mulgar)
library(palmerpenguins)
library(GGally)
library(tourr)
library(MASS)
library(discrim)
library(classifly)
library(detourr)
library(crosstalk)
library(plotly)
library(viridis)
library(colorspace)
library(conflicted)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(palmerpenguins::penguins)
conflicts_prefer(viridis::viridis_pal)

options(digits=2)
p_tidy <- penguins |>
  select(species, bill_length_mm:body_mass_g) |>
  rename(bl=bill_length_mm,
         bd=bill_depth_mm,
         fl=flipper_length_mm,
         bm=body_mass_g) |>
  filter(!is.na(bl)) |>
  arrange(species) |>
  na.omit()
p_tidy_std <- p_tidy |>
    mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))

🎯 Objectives

The goal for this week is to learn to fit, diagnose, assess the assumptions of, and predict from logistic regression and linear discriminant analysis models.

🔧 Preparation

  • Make sure you have all the necessary libraries installed. There are a few new ones this week!

Exercises:

Open your project for this unit called iml.Rproj. For all the work we will use the penguins data. Start by splitting it into training and test sets, as follows.

set.seed(1148)
p_split <- initial_split(p_tidy_std, 2/3, strata = species)
p_tr <- training(p_split)
p_ts <- testing(p_split)

1. LDA

This problem uses linear discriminant analysis on the penguins data.

  1. Is the assumption of equal variance-covariance reasonable to make for this data?
  2. Fit the LDA model to the training data, using this code:
lda_spec <- discrim_linear() |>
  set_mode("classification") |>
  set_engine("MASS", prior = c(1/3, 1/3, 1/3))
lda_fit <- lda_spec |> 
  fit(species ~ ., data = p_tr)
  3. Compute the confusion matrices for training and test sets, and thus the error for the test set.
  4. Plot the training and test data in the discriminant space, using symbols to indicate which set. See if you can mark the misclassified cases, too.
  5. Redo the plot of the discriminant space to examine the boundary between groups. You’ll need to generate a set of random points in the domain of the data, predict their class, and project them into the discriminant space. The explore() function in the classifly package can help you generate the box of random points.
  6. What happens to the boundary if you change the prior probabilities, and why does this happen? Change the prior probabilities to 1.999/3, 0.001/3 and 1/3 for Adelie, Chinstrap and Gentoo, respectively, and redo the plot of the boundaries in the discriminant space.
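The effect of the prior in the last question can also be checked numerically before plotting. The following is a minimal sketch, not the tutorial's solution: it calls MASS::lda() directly instead of going through discrim_linear(), and it uses all complete cases rather than the training split to keep the code short. The object names (p, prior_equal, prior_skew, lda_equal, lda_skew) are chosen here for illustration.

```r
library(MASS)
library(palmerpenguins)

# Standardised penguin measurements, complete cases only
# (assumption: all cases are used here, not the training split)
p <- as.data.frame(na.omit(penguins[, c("species", "bill_length_mm",
                                        "bill_depth_mm", "flipper_length_mm",
                                        "body_mass_g")]))
p[, -1] <- scale(p[, -1])

# Priors are given in the order of the factor levels:
# Adelie, Chinstrap, Gentoo
prior_equal <- c(1, 1, 1) / 3
prior_skew  <- c(1.999, 0.001, 1) / 3

lda_equal <- lda(species ~ ., data = p, prior = prior_equal)
lda_skew  <- lda(species ~ ., data = p, prior = prior_skew)

pred_equal <- predict(lda_equal, p)$class
pred_skew  <- predict(lda_skew, p)$class

# The prior enters each discriminant function only through log(prior),
# so changing it shifts the boundaries without rotating them
log(prior_skew) - log(prior_equal)

# Cross-tabulate the predictions under the two sets of priors
table(equal = pred_equal, skewed = pred_skew)
```

Off-diagonal counts in the table are penguins whose predicted class changes when the prior changes, so they lie in the region that the boundary has moved across.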

2. Logistic

  1. Fit a logistic discriminant model to the training set. You can use this code:
log_fit <- multinom_reg() |> 
  fit(species ~ ., 
      data = p_tr)
  2. Compute the confusion matrices for training and test sets, and thus the error for the test set. You can use this code to make the predictions.
p_tr_pred <- log_fit |> 
  augment(new_data = p_tr) |>
  rename(pspecies = .pred_class)
p_ts_pred <- log_fit |> 
  augment(new_data = p_ts) |>
  rename(pspecies = .pred_class)
  3. Check the boundaries produced by logistic regression, and how they differ from those of LDA. Using the 2D projection produced by the LDA rule (with equal priors), predict the class of your set of random points using the logistic model.
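Once the predictions are made, the confusion matrix and error follow from a cross-tabulation of true against predicted species. The sketch below is one possible way to do this, assuming the same data preparation, seed and split as earlier in the tutorial. It rebuilds those objects so that it runs on its own; the names p_std, ts_cm and ts_err are introduced here for illustration.

```r
library(tidymodels)
library(palmerpenguins)

# Rebuild the standardised data, following the preparation used in this tutorial
p_std <- penguins |>
  dplyr::select(species, bl = bill_length_mm, bd = bill_depth_mm,
                fl = flipper_length_mm, bm = body_mass_g) |>
  na.omit() |>
  arrange(species) |>
  mutate(across(where(is.numeric), \(x) (x - mean(x)) / sd(x)))

set.seed(1148)
p_split <- initial_split(p_std, 2/3, strata = species)
p_tr <- training(p_split)
p_ts <- testing(p_split)

# Multinomial logistic regression, as in the exercise
log_fit <- multinom_reg() |>
  fit(species ~ ., data = p_tr)

p_ts_pred <- log_fit |>
  augment(new_data = p_ts) |>
  rename(pspecies = .pred_class)

# Confusion matrix: rows are true species, columns are predicted species
ts_cm <- table(truth = p_ts_pred$species, predicted = p_ts_pred$pspecies)
ts_cm

# Test error: proportion of cases off the diagonal
ts_err <- 1 - sum(diag(ts_cm)) / sum(ts_cm)
ts_err
```

The same two steps applied to p_tr give the training confusion matrix, and they work unchanged on LDA predictions stored in a pspecies column.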

3. Interactively explore misclassifications

Here you are going to use interactive graphics to explore the misclassifications from the linear discriminant analysis. We’ll need to use detourr to accomplish this. The code below makes a scatterplot of the confusion matrix, where points corresponding to a class have been spread apart by jittering. This plot is linked to a tour plot. Try:

  1. Selecting penguins that have been misclassified, from the display of the confusion matrix. Observe where they are in the data space. Are they in an area where it is hard to distinguish the groups?
  2. Selecting neighbouring points in the tour, and examining where they are in the confusion matrix.
p_cl <- p_tidy_std |>
  mutate(pspecies = predict(lda_fit$fit, p_tidy_std)$class) |>
  dplyr::select(bl:bm, species, pspecies) |>
  mutate(sp_jit = jitter(as.numeric(species)),
         psp_jit = jitter(as.numeric(pspecies)))
p_cl_shared <- SharedData$new(p_cl)

detour_plot <- detour(p_cl_shared, tour_aes(
  projection = bl:bm,
  colour = species)) |>
  tour_path(grand_tour(2),
            max_bases = 50, fps = 60) |>
  show_scatter(alpha = 0.9, axes = FALSE,
               width = "100%", height = "450px")

conf_mat <- plot_ly(p_cl_shared,
                    x = ~psp_jit,
                    y = ~sp_jit,
                    color = ~species,
                    colors = viridis_pal(option = "D")(3),
                    height = 450) |>
  highlight(on = "plotly_selected",
            off = "plotly_doubleclick") |>
  add_trace(type = "scatter",
            mode = "markers")

bscols(
  detour_plot, conf_mat,
  widths = c(5, 6)
)

4. Exploring the math

Slide 23 of the lecture notes has the steps to go from Bayes rule to the discriminant functions. Explain what was done at each step to get to the next one.
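As a reference point while working through the slide, here is the standard textbook route from Bayes' rule to the linear discriminant functions. The slide itself is not reproduced here, so its notation and the order of its steps may differ. The sketch assumes $K$ classes with prior probabilities $\pi_k$, class means $\mu_k$, a common variance-covariance matrix $\Sigma$ and $p$ variables. Note that $\pi$ inside $(2\pi)^{p/2}$ is the usual constant, not a prior.

```latex
\begin{align*}
P(Y = k \mid x)
  &= \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}
  && \text{(Bayes' rule)} \\
f_k(x)
  &= \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}
     \exp\!\left( -\tfrac{1}{2} (x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) \right)
  && \text{(normal density, common } \Sigma \text{)} \\
\log\{\pi_k f_k(x)\}
  &= \log \pi_k - \tfrac{1}{2} (x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) + c
  && \text{(take logs; } c \text{ is free of } k \text{)} \\
  &= \log \pi_k - \tfrac{1}{2} x^\top \Sigma^{-1} x + x^\top \Sigma^{-1} \mu_k
     - \tfrac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + c
  && \text{(expand the quadratic)} \\
\delta_k(x)
  &= x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k
  && \text{(drop terms free of } k \text{)}
\end{align*}
```

The denominator in the first line is the same for every class, so the class with the largest posterior probability is the one with the largest $\pi_k f_k(x)$, and hence the largest $\delta_k(x)$.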

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.