# Objective

The purpose of this lab is to

• explore the process of fitting a categorical response.
• understand the the training/test split procedure for estimating your model error.
• practice plotting of data and models to help understand model fit, in particular boundaries between groups induced by a classification model.
• learn about dimension reduction, in the presence of a categorical response variable.

# Class discussion

Textbook question, chapter 4 Q8

# Do it yourself

1. Run the K-Nearest Neighbours classification example, in the textbook section 4.6.5. The code below, fits the model for $$k=1$$.
library(tidyverse)
library(ISLR)
library(class)
data(Smarket)
Smarket_tr <- Smarket %>%
dplyr::filter(Year < 2005) %>%
dplyr::select(Lag1, Lag2, Direction)
Smarket_ts <- Smarket %>%
dplyr::filter(Year >= 2005) %>%
dplyr::select(Lag1, Lag2, Direction)
knn.pred <- knn(Smarket_tr[,1:2], Smarket_ts[,1:2],
Smarket_tr[,3], k=1)
table(knn.pred, Smarket_ts[,3])
1. Compute the test error for $$k=1$$
2. Re-fit the model for $$k=3$$, and compute the test error. How does this compare with the smaller $$k$$?
3. Fit a range of values for $$k$$, and find the best value.
4. Would you put your money on this classification model, to invest in stock purchases?
1. Run the linear discriminant analysis for the chocolates data from the lecture notes (it starts at line 579 in the Rmd file), and compute the training and test error. What does each line of code do? Think about what the dashed line corresponds to. Find the names of the dark chocolates in the test set that are misclassified as Milk, and try to understand why they are predicted incorrectly.

# Practice

Details about the data for these two problems can be found at http://ggobi.org/book/chap-data.pdf.

1. This data is an oldy, but a goody, and contains physical measurements on three species of flea beetles. You can find it at http://www.ggobi.org/book/data/flea.csv.

Source: Lubischew, A. A. (1962), On the Use of Discriminant Functions in Taxonomy, Biometrics 18, 455–477.

Variable Explanation
species Ch. concinna, Ch. heptapotamica, and Ch. heikertingeri
tars1 width of the first joint of the first tarsus in microns
tars2 width of the second joint of the first tarsus in microns
head the maximal width of the head between the external edges of the eyes in 0.01 mm
aede1 the maximal width of the aedeagus in the fore-part in microns
aede2 the front angle of the aedeagus (1 unit = 7.5 degrees)
aede3 the aedeagus width from the side in microns

Where you see “???” in the code you need to replace it with the appropriate code to do the analysis.

1. Read in the data, and make a scatterplot matrix, with the points coloured by species. Write a few sentences explaining what you learn about the data, and which variables seem to be most promising for distinguishing the species.

2. Split the data into training and test sets. Fit an LDA model, and compute training and test error. Use equal prior probabilities.

3. Plot the data in the discriminant space.

4. Write a few sentences explaining the difference between the species in scatterplot matrix, and the 2D projection provided by the discriminant space.

5. Determine which variables are most important in separating the species, by computing the correlation between each variable, and the two variables defining the discriminant space.

1. This data consists of the percentage composition of fatty acids found in the lipid fraction of Italian olive oils. The data arises from a study to determine the authenticity of an olive oil. You can find it at http://www.ggobi.org/book/data/olive.csv.

Source: Forina, M., Armanino, C., Lanteri, S. & Tiscornia, E. (1983), Classi- fication of Olive Oils from their Fatty Acid Composition, in Martens, H. and Russwurm Jr., H., eds, Food Research and Data Analysis, Applied Science Publishers, London, pp. 189–214. It was brought to our attention by Glover & Hopke (1992).

Variable Explanation
region Three “super-classes” of Italy: North, South, and the island of Sardinia
area Nine collection areas: three from the region North (Umbria, East and West Liguria), four from South (North and South Apulia, Calabria, and Sicily), and two from the island of Sardinia (inland and coastal Sardinia).
palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic fatty acids, % × 100
1. Download the data, and select just the variables, region, eicosenoic and linoleic. Make a plot of eicosenoic vs linoleic, coloured by region. (You will need to set region to be a factor variable.)

2. Split the data into traing and test sets. Fit a linear discriminant classifier. Compute the training and test error.

3. Examine the boundaries between groups. Generate a grid of points between the minimum and maximum values for the two predictors. Predict the region at these locations. Make a plot of the this data, coloured by predicted region. Overlay the data, using different plotting symbols on the grid.

4. Write a few sentences on why, despite the big gap between region 1 and the other two regions, LDA misclassifies several of the region 1 observations.