The purpose of this lab is to
Textbook question, chapter 4 Q8
library(tidyverse)
library(ISLR)
library(class)
data(Smarket)
Smarket_tr <- Smarket %>%
  dplyr::filter(Year < 2005) %>%
  dplyr::select(Lag1, Lag2, Direction)
Smarket_ts <- Smarket %>%
  dplyr::filter(Year >= 2005) %>%
  dplyr::select(Lag1, Lag2, Direction)
knn.pred <- knn(Smarket_tr[,1:2], Smarket_ts[,1:2], Smarket_tr[,3], k=1)
table(knn.pred, Smarket_ts[,3])
Details about the data for these two problems can be found at http://ggobi.org/book/chap-data.pdf.
Source: Lubischew, A. A. (1962), On the Use of Discriminant Functions in Taxonomy, Biometrics 18, 455–477.
| Variable | Description |
|---|---|
| species | Ch. concinna, Ch. heptapotamica, and Ch. heikertingeri |
| tars1 | width of the first joint of the first tarsus in microns |
| tars2 | width of the second joint of the first tarsus in microns |
| head | the maximal width of the head between the external edges of the eyes in 0.01 mm |
| aede1 | the maximal width of the aedeagus in the fore-part in microns |
| aede2 | the front angle of the aedeagus (1 unit = 7.5 degrees) |
| aede3 | the aedeagus width from the side in microns |
Where you see “???” in the code, replace it with the appropriate code to do the analysis.
Read in the data, and make a scatterplot matrix, with the points coloured by species. Write a few sentences explaining what you learn about the data, and which variables seem to be most promising for distinguishing the species.
Split the data into training and test sets. Fit an LDA model, and compute training and test error. Use equal prior probabilities.
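The split, fit and error computation can be sketched as follows. This is only an illustration under stated assumptions: the built-in `iris` data stands in for the lab data (it also has three species), and the 2/3 stratified split, the seed, and the names `d`, `tr_idx`, `d_tr`, `d_ts` are arbitrary choices, not part of the lab.

```r
library(MASS)  # provides lda()

set.seed(2024)  # arbitrary seed so the split is reproducible

# Stand-in data (assumption): iris has three classes, like the lab data
d <- iris

# Stratified split: sample about 2/3 of the row indices within each species
tr_idx <- unlist(lapply(
  split(seq_len(nrow(d)), d$Species),
  function(i) sample(i, size = round(2 * length(i) / 3))
))
d_tr <- d[tr_idx, ]
d_ts <- d[-tr_idx, ]

# Equal prior probabilities for the three classes
fit <- lda(Species ~ ., data = d_tr, prior = rep(1/3, 3))

# Error = proportion of observations whose predicted class is wrong
err_tr <- mean(predict(fit, d_tr)$class != d_tr$Species)
err_ts <- mean(predict(fit, d_ts)$class != d_ts$Species)
c(train = err_tr, test = err_ts)
```

Sampling within each class keeps the class proportions the same in the training and test sets, which matters when some groups are small.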
Plot the data in the discriminant space.
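One possible way to get the discriminant space is through the `x` component returned by `predict()` on an `lda` fit, which holds the discriminant scores. The sketch below again assumes the built-in `iris` data as a stand-in for the lab data, and uses base graphics only to keep it free of extra dependencies; the name `ld` is illustrative.

```r
library(MASS)  # provides lda()

# Stand-in data (assumption): iris in place of the lab data
fit <- lda(Species ~ ., data = iris, prior = rep(1/3, 3))

# With three classes there are two discriminant variables, LD1 and LD2
ld <- as.data.frame(predict(fit, iris)$x)
ld$Species <- iris$Species

# Scatterplot of the data in the discriminant space, coloured by species
plot(ld$LD1, ld$LD2,
     col = as.integer(ld$Species), pch = 16, asp = 1,
     xlab = "LD1", ylab = "LD2")
legend("topright", legend = levels(ld$Species),
       col = seq_along(levels(ld$Species)), pch = 16)
```

Setting `asp = 1` keeps the two axes on the same scale, so distances between groups in the plot are not distorted.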
Write a few sentences explaining the difference between the species in the scatterplot matrix, and the 2D projection provided by the discriminant space.
Determine which variables are most important in separating the species by computing the correlation between each variable and the two variables defining the discriminant space.
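A minimal sketch of this correlation step, assuming the built-in `iris` data as a stand-in for the lab data; the names `scores`, `vars` and `cors` are illustrative.

```r
library(MASS)  # provides lda()

# Stand-in data (assumption): iris in place of the lab data
fit <- lda(Species ~ ., data = iris, prior = rep(1/3, 3))

# Discriminant scores: one column per discriminant variable (LD1, LD2)
scores <- predict(fit, iris)$x

# Keep only the numeric predictors
vars <- iris[, sapply(iris, is.numeric)]

# Rows are the original variables, columns are LD1 and LD2
cors <- cor(vars, scores)
round(cors, 2)
```

Variables with correlations close to 1 or -1 in a column contribute most to the separation seen along that discriminant direction.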
Source: Forina, M., Armanino, C., Lanteri, S. & Tiscornia, E. (1983), Classification of Olive Oils from their Fatty Acid Composition, in Martens, H. and Russwurm Jr., H., eds, Food Research and Data Analysis, Applied Science Publishers, London, pp. 189–214. It was brought to our attention by Glover & Hopke (1992).
| Variable | Description |
|---|---|
| region | Three “super-classes” of Italy: North, South, and the island of Sardinia |
| area | Nine collection areas: three from the region North (Umbria, East and West Liguria), four from South (North and South Apulia, Calabria, and Sicily), and two from the island of Sardinia (inland and coastal Sardinia). |
| palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic | fatty acids, % × 100 |
Download the data, and select just the variables, region, eicosenoic and linoleic. Make a plot of eicosenoic vs linoleic, coloured by region. (You will need to set region to be a factor variable.)
Split the data into training and test sets. Fit a linear discriminant classifier. Compute the training and test error.
Examine the boundaries between groups. Generate a grid of points between the minimum and maximum values for the two predictors. Predict the region at these locations. Make a plot of the grid, coloured by predicted region. Overlay the data on the grid, using different plotting symbols.
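The grid approach might look like the sketch below. It is an illustration only: two `iris` predictors stand in for eicosenoic and linoleic, `Species` stands in for region, and the grid resolution of 50 points per axis is an arbitrary choice.

```r
library(MASS)  # provides lda()

# Stand-in data (assumption): two iris predictors in place of the olive oil variables
d <- iris[, c("Petal.Length", "Petal.Width", "Species")]
fit <- lda(Species ~ Petal.Length + Petal.Width, data = d, prior = rep(1/3, 3))

# Regular grid spanning the observed range of each predictor
grid <- expand.grid(
  Petal.Length = seq(min(d$Petal.Length), max(d$Petal.Length), length.out = 50),
  Petal.Width  = seq(min(d$Petal.Width),  max(d$Petal.Width),  length.out = 50)
)

# Predicted class at every grid location
grid$pred <- predict(fit, grid)$class

# Grid as small dots coloured by prediction, data overlaid with distinct symbols
plot(grid$Petal.Length, grid$Petal.Width,
     col = as.integer(grid$pred), pch = ".", cex = 2,
     xlab = "Petal.Length", ylab = "Petal.Width")
points(d$Petal.Length, d$Petal.Width,
       col = as.integer(d$Species), pch = as.integer(d$Species))
```

The places where the grid colour changes trace the classification boundaries, and any observation whose colour differs from the grid colour underneath it is misclassified.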
Write a few sentences on why, despite the big gap between region 1 and the other two regions, LDA misclassifies several of the region 1 observations.