---
title: "ETC3250/5250 Assignment 4"
date: "DUE: Friday, May 22 5pm"
output: html_document
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = FALSE,
  eval = FALSE,
  message = FALSE,
  warning = FALSE)
```
## Instructions
- The assignment needs to be turned in as R Markdown, and as HTML, to Moodle. That is, two files need to be submitted.
- You need to list your team members on the report. For each of the four assignments, one team member needs to be nominated as the leader, and is responsible for coordinating the efforts of other team members, and submitting the assignment.
- It is strongly recommended that you complete the assignment individually, and then compare your answers and explanations with your team mates. Each student will have the opportunity to report on other team members' efforts on the assignment, and if a member does not substantially contribute to the team submission they may receive a reduced mark, or even a zero mark.
- R code should be hidden in the final report, unless it is specifically requested.
- Original work is expected. Any material used from external sources needs to be acknowledged. Cite the R packages that you use for your work.
- To make it a little easier for you, a skeleton of R code is provided in the `Rmd` file. Where you see `???`, something is missing and you will need to fill it in with the appropriate function, argument or operator. You will also need to rearrange the code as necessary to do the calculations needed.
## Marks
- Total mark will be out of 25
- 3 points will be reserved for readability, and appropriate citing of external sources
- 2 points will be reserved for reproducibility: that the report can be regenerated from the submitted R Markdown file.
- Accuracy and completeness of answers, and clarity of explanations will be the basis for the remaining 20 points.
## Exercises
### 1.
Here we explore the maximal margin classifier on a toy data set.
```{r eval=TRUE}
library(tidyverse)
library(knitr)
library(kableExtra)
df <- tibble(id = 1:15,
             x1 = c(2, 4, 1, 4, 3, 4, 2, 5, 3, 1, 1, 2, 0, 1, 0),
             x2 = c(4, 2, 5, 4, 3, 3, 5, 3, 1, 3, 1, 1, 2, 2, 3),
             class = c(rep(-1, 8), rep(1, 7)))
kable(df) %>% kable_styling()
```
a. We are given $n = 15$ observations in $p = 2$ dimensions. For each observation, there is an associated class label. Sketch the observations.
b. Sketch the optimal separating hyperplane, and provide the equation for this hyperplane in the form of textbook equation 9.1.
c. Write the classification rule for the maximal margin classifier.
d. On your sketch, indicate the margin for the maximal margin hyperplane.
e. Indicate the support vectors for the maximal margin classifier.
f. Argue that a slight movement of observation 4 would not affect the maximal margin hyperplane.
g. Sketch a separating hyperplane that is *not* the optimal separating hyperplane, and provide the equation for this hyperplane.
h. How would the separating hyperplane change if a 16th observation (id=16, x1=2, x2=2.5, class=1) were added to the data?
i. Using the `svm` function of the `e1071` package, fit the linear SVM model to this data. Compare the result with your hand calculation.
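If you have not used `e1071` before, the general pattern for a linear SVM fit looks like this sketch (shown on simulated data rather than the assignment data; a large `cost` is one way to approximate the hard-margin maximal margin classifier when the classes are separable):

```r
library(e1071)

# Simulated, separable two-class data (not the assignment data)
set.seed(1)
sim <- data.frame(x1 = c(rnorm(10, 0), rnorm(10, 4)),
                  x2 = c(rnorm(10, 0), rnorm(10, 4)),
                  class = factor(rep(c(-1, 1), each = 10)))

# kernel = "linear" gives a linear boundary; scale = FALSE keeps the
# coefficients on the original variable scale for hand comparison
fit <- svm(class ~ x1 + x2, data = sim, kernel = "linear",
           cost = 1000, scale = FALSE)

fit$SV      # the support vectors
fit$coefs   # their coefficients
fit$rho     # the (negative) intercept of the decision boundary
```

The weight vector of the boundary can be recovered as `t(fit$coefs) %*% fit$SV`, which together with `fit$rho` is what you would compare against your hand calculation.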
### 2.
Use the `Caravan` data from the `ISLR` package. Read the data description.
a. Compute the proportion of caravans purchased relative to not purchased. Is this a balanced class data set? What problem might be encountered in assessing the accuracy of the model as a consequence?
```{r}
library(tidyverse)
library(ISLR)
data(Caravan)
library(xgboost)
library(caret)
Caravan %>% count(???)
```
b. Convert the response variable from a factor to an integer variable, where 1 indicates that the person purchased a caravan.
```{r}
mycaravan <- Caravan %>%
  mutate(Purchase = as.integer(ifelse(Caravan$Purchase == "Yes", ???, ???)))
```
c. Break the data into a 2/3 training and 1/3 test set, ensuring that the same ratio of the response variable is achieved in both sets. Check that your sampling has produced this.
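One common approach to a stratified split is `caret::createDataPartition`, which samples within each level of the response so the purchase ratio carries over to both sets. A sketch, assuming `mycaravan` from part b (the 0/1 response is converted to a factor so the stratification is exact):

```r
library(caret)

set.seed(2020)
# sample within each class so both sets keep the same purchase ratio
tr_indx <- createDataPartition(factor(mycaravan$Purchase),
                               p = 2/3, list = FALSE)
c_tr <- mycaravan[tr_indx, ]
c_ts <- mycaravan[-tr_indx, ]

# check that the proportions match
mean(c_tr$Purchase)
mean(c_ts$Purchase)
```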
d. The solution code on the unofficial solution web site:
```
library(ISLR)
train = 1:1000
Caravan$Purchase = ifelse(Caravan$Purchase == "Yes", 1, 0)
Caravan.train = Caravan[train, ]
Caravan.test = Caravan[-train, ]
```
would use just the first 1000 cases for the training set. What is wrong about doing this?
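One way to see the issue is to compare the purchase rate in the first 1000 rows with the rest of the data, along these lines (the rows of `Caravan` are in no guaranteed random order, so the two pieces need not have the same class balance):

```r
library(ISLR)
data(Caravan)

# purchase rate in the rows taken for training vs the rest
mean(Caravan$Purchase[1:1000] == "Yes")
mean(Caravan$Purchase[-(1:1000)] == "Yes")
```

Beyond any imbalance, a fixed, non-random split also makes the test error depend on whatever ordering the data happened to arrive in.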
### 3.
Here we will fit a boosted tree model, using the `gbm` package.
a. Use 1000 trees, and a shrinkage value of 0.01.
```{r}
library(gbm)
c_boost <- gbm(???, data = c_tr, n.trees = ???, shrinkage = ???,
               distribution = "bernoulli")
head(summary(???, plotit = FALSE), 6)
```
b. Make a plot of the out-of-bag (OOB) improvement against iteration number. What does this suggest about the number of iterations needed? Why do you think the OOB improvement value varies so much, and can also be negative?
```{r}
c_boost_diag <- tibble(iter=1:1000, tr_err=???, oob_improve=???)
ggplot(c_boost_diag, aes(x=iter, y=oob_improve)) + geom_point() + geom_smooth()
```
c. Compute the error for the test set, and for each class. Consider a proportion 0.2 or greater to indicate that the customer will purchase a caravan.
```{r}
boost.prob = predict(c_boost, ???, n.trees = 1000, type = "response")
boost.pred = ifelse(boost.prob > ???, 1, 0)
addmargins(table(c_ts$Purchase, ???))
```
d. What are the 6 most important variables? Make a plot of each to examine the relationship between these variables and the response. Explain what you learn from these plots.
### 4.
Here we will fit a random forest model, using the `randomForest` package.
a. Use 1000 trees, using a numeric response so that predictions will be a number between 0 and 1, and set `importance=TRUE`. (Ignore the warning about not having enough distinct values to use regression.)
```{r}
library(randomForest)
c_rf <- randomForest(???, data = c_tr, ntree = ???, importance = ???)
```
b. Compute the error for the test set, and for each class. Consider a proportion 0.2 or greater to indicate that the customer will purchase a caravan.
```{r}
rf.prob <- predict(c_rf, newdata=???)
rf.pred <- ifelse(rf.prob > ???, 1, 0)
addmargins(table(c_ts$Purchase, ???))
```
c. What are the 6 most important variables? Make a plot of any that differ from those chosen by `gbm`. How does the set of variables compare with the `gbm` choices?
```{r}
as_tibble(c_rf$importance) %>%
  bind_cols(var = rownames(c_rf$importance)) %>%
  arrange(desc(???)) %>%
  print(n = 6)
```
### 5.
Here we will fit a gradient boosted model, using the `xgboost` package.
a. Read the description of the XGBoost technique at https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/, or other sources. Explain how this algorithm might differ from earlier boosted tree algorithms.
b. Tune the model fit to determine how many iterations to make. Then fit the model, using the parameter set provided.
```{r}
c_tr_xg <- xgb.DMatrix(data = as.matrix(c_tr[,???]), label = c_tr[,???])
c_ts_xg <- xgb.DMatrix(data = as.matrix(c_ts[,???]), label = c_ts[,???])
params <- list(booster = "gbtree", objective = "binary:logistic",
               eta = 0.3, gamma = 0, max_depth = 6, min_child_weight = 1,
               subsample = 1, colsample_bytree = 1)
xgbcv <- xgb.cv(params = params, data = c_tr_xg, nrounds = 100, nfold = 5,
                showsd = TRUE, stratified = TRUE, maximize = FALSE)
c_xgb <- xgb.train(params = ???, data = ???, nrounds = 10,
                   watchlist = list(val = c_ts_xg, train = c_tr_xg),
                   maximize = FALSE, eval_metric = "error")
```
c. Compute the error for the test set, and for each class. Consider a proportion 0.2 or greater to indicate that the customer will purchase a caravan.
```{r}
xgbpred <- predict(c_xgb, ???)
xgbpred <- ifelse(xgbpred > ???, 1, 0)
addmargins(table(c_ts$Purchase, ???))
```
d. Compute the variable importance. What are the 6 most important variables? Make a plot of any that are different from those chosen by `gbm` or `randomForest`. How does the set of variables compare with the other two methods?
```{r}
c_xgb_importance <- xgb.importance(feature_names = colnames(c_tr[,???]), model = ???)
head(c_xgb_importance, 6)
```
### 6.
Compare and summarise the results of the three model fits.
### 7.
Now scramble the response variable (Purchase) using permutation. The resulting data has no true relationship between the response and predictors. Re-do Q5 with this data set. Write a paragraph explaining what you learn about the true data from analysing this permuted data.
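The permutation itself can be done with `sample()`, along these lines (assuming `mycaravan` from Q2; the new object name is a placeholder):

```r
library(tidyverse)

set.seed(2020)
# shuffling the response destroys any real relationship with the predictors
c_permuted <- mycaravan %>%
  mutate(Purchase = sample(Purchase))
```

Whatever accuracy or variable importance the models report on `c_permuted` then gives a baseline for what these methods produce when there is no signal at all.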