Instructions

Marks

Exercises

1.

Here we explore the maximal margin classifier on a toy data set.

id  x1  x2  class
 1   2   4     -1
 2   4   2     -1
 3   1   5     -1
 4   4   4     -1
 5   3   3     -1
 6   4   3     -1
 7   2   5     -1
 8   5   3     -1
 9   3   1      1
10   1   3      1
11   1   1      1
12   2   1      1
13   0   2      1
14   1   2      1
15   0   3      1

  1. We are given \(n = 15\) observations in \(p = 2\) dimensions. For each observation, there is an associated class label. Sketch the observations.
  2. Sketch the optimal separating hyperplane, and provide the equation for this hyperplane in the form of textbook equation 9.1, that is, \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0\).
  3. Write the classification rule for the maximal margin classifier.
  4. On your sketch, indicate the margin for the maximal margin hyperplane.
  5. Indicate the support vectors for the maximal margin classifier.
  6. Argue that a slight movement of observation 4 would not affect the maximal margin hyperplane.
  7. Sketch a separating hyperplane that is not the optimal separating hyperplane, and provide the equation for this hyperplane.
  8. How would the separating hyperplane change if a new observation (16, 2, 2.5, 1) was added to the data?
  9. Using the svm function of the e1071 package, fit a linear SVM to these data. Compare the result with your hand calculation.
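
As a starting point for checking hand calculations, the sketch below enters the toy data and computes the smallest signed distance from the observations to a candidate hyperplane. The helper name `min_signed_distance` and the candidate coefficients are illustrative assumptions, not part of the exercise. The optional part fits a linear SVM with a large cost, assuming e1071 is installed; the sign of the recovered coefficients depends on how the factor levels are ordered.

```r
# Toy data from the table above.
toy <- data.frame(
  x1 = c(2, 4, 1, 4, 3, 4, 2, 5, 3, 1, 1, 2, 0, 1, 0),
  x2 = c(4, 2, 5, 4, 3, 3, 5, 3, 1, 3, 1, 1, 2, 2, 3),
  class = c(rep(-1, 8), rep(1, 7))
)

# Smallest value of y_i * f(x_i) / ||beta|| over all observations, for the
# hyperplane beta0 + beta1 * x1 + beta2 * x2 = 0.
# A positive result means every observation lies on its correct side.
min_signed_distance <- function(beta0, beta1, beta2, data = toy) {
  f <- beta0 + beta1 * data$x1 + beta2 * data$x2
  min(data$class * f) / sqrt(beta1^2 + beta2^2)
}

# Hypothetical candidate hyperplane, chosen only for illustration.
candidate <- min_signed_distance(5.5, -1, -1)

# Optional comparison with a fitted linear SVM (assumes e1071 is installed).
if (requireNamespace("e1071", quietly = TRUE)) {
  fit <- e1071::svm(factor(class) ~ x1 + x2, data = toy, kernel = "linear",
                    cost = 1e5, scale = FALSE)
  w <- drop(t(fit$coefs) %*% fit$SV)  # coefficients of x1 and x2
  b <- -fit$rho                       # intercept
  print(fit$index)                    # row numbers of the support vectors
  print(c(intercept = b, w))
}
```

A positive `candidate` value confirms that the hyperplane separates the classes; the maximal margin hyperplane is the one that makes this value as large as possible.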

2.

Use the Caravan data from the ISLR package. Read the data description.

  1. Compute the proportion of caravans purchased relative to not purchased. Is this a balanced class data set? What problem might be encountered in assessing the accuracy of the model as a consequence?

  2. Convert the response variable from a factor to an integer variable, where 1 indicates that the person purchased a caravan.

  3. Split the data into a 2/3 training set and a 1/3 test set, ensuring that the ratio of the response classes is the same in both sets. Check that your sampling has produced this.

  4. The solution code on the unofficial solution web site:

library(ISLR)
train = 1:1000
Caravan$Purchase = ifelse(Caravan$Purchase == "Yes", 1, 0)
Caravan.train = Caravan[train, ]
Caravan.test = Caravan[-train, ]

would use just the first 1000 cases for the training set. What is wrong with doing this?
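
One way to keep the class ratio the same in both sets is to sample within each class separately. The sketch below shows the idea on a made-up imbalanced response; `stratified_train_index` is a hypothetical helper written for this illustration, not a function from any package. For the exercise it could be applied to the converted `Caravan$Purchase` column.

```r
# Return row indices for a training set that keeps the class ratio of y.
# `stratified_train_index` is a hypothetical helper, not a package function.
stratified_train_index <- function(y, prop = 2 / 3, seed = 1) {
  set.seed(seed)
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # sample.int avoids the sample() pitfall when a class has one member
    idx[sample.int(length(idx), size = round(prop * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Made-up imbalanced response, used only to show the check.
y <- rep(c(0, 1), times = c(940, 60))
train <- stratified_train_index(y)

# Proportion of class 1 overall, in the training set, and in the test set.
prop_all <- mean(y)
prop_train <- mean(y[train])
prop_test <- mean(y[-train])
print(c(all = prop_all, train = prop_train, test = prop_test))
```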

3.

Here we will fit a boosted tree model, using the gbm package.

  1. Use 1000 trees, and a shrinkage value of 0.01.

  2. Make a plot of the out-of-bag (OOB) improvement against iteration number. What does this suggest about the number of iterations needed? Why do you think the OOB improvement value varies so much, and can also be negative?

  3. Compute the error for the test set, and for each class. Consider a proportion 0.2 or greater to indicate that the customer will purchase a caravan.

  4. What are the 6 most important variables? Make a plot of each to examine the relationship between these variables and the response. Explain what you learn from these plots.
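
The steps above might be sketched as follows. The helper `class_errors` is an assumption introduced here to apply the 0.2 cutoff and report the overall and per-class error; the gbm part runs only if gbm and ISLR are installed, and it uses a simple random split for brevity rather than the stratified split asked for in Question 2.

```r
# Overall and per-class error for predicted probabilities `prob`,
# true 0/1 labels `truth`, and a probability cutoff `threshold`.
class_errors <- function(prob, truth, threshold = 0.2) {
  pred <- as.integer(prob >= threshold)
  c(overall = mean(pred != truth),
    class0 = mean(pred[truth == 0] != 0),
    class1 = mean(pred[truth == 1] != 1))
}

# Made-up numbers to show the output shape.
example_err <- class_errors(prob = c(0.05, 0.30, 0.25, 0.10),
                            truth = c(0, 0, 1, 1))

if (requireNamespace("gbm", quietly = TRUE) &&
    requireNamespace("ISLR", quietly = TRUE)) {
  Caravan <- ISLR::Caravan
  Caravan$Purchase <- ifelse(Caravan$Purchase == "Yes", 1, 0)
  set.seed(1)
  # Simple random split for brevity; the exercise asks for a stratified one.
  train <- sample(nrow(Caravan), round(2 / 3 * nrow(Caravan)))
  fit <- gbm::gbm(Purchase ~ ., data = Caravan[train, ],
                  distribution = "bernoulli",
                  n.trees = 1000, shrinkage = 0.01)
  # OOB improvement per iteration (available because bag.fraction < 1).
  plot(fit$oobag.improve, type = "l",
       xlab = "Iteration", ylab = "OOB improvement")
  prob <- predict(fit, newdata = Caravan[-train, ],
                  n.trees = 1000, type = "response")
  print(class_errors(prob, Caravan$Purchase[-train]))
  # Relative influence, largest first; keep the top six.
  print(head(summary(fit, plotit = FALSE), 6))
}
```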

4.

Here we will fit a random forest model, using the randomForest package.

  1. Use 1000 trees, with a numeric response so that predictions will be a number between 0 and 1, and set importance=TRUE. (Ignore the warning about not having enough distinct values to use regression.)

  2. Compute the error for the test set, and for each class. Consider a proportion 0.2 or greater to indicate that the customer will purchase a caravan.

  3. What are the 6 most important variables? Make a plot of any that are different from those chosen by gbm. How does the set of variables compare with those chosen by gbm?
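
A possible outline of this fit is given below, again assuming the packages are installed and using a simple random split for brevity. The helper `compare_top` and the variable names passed to it are placeholders for illustration; the real names would come from the gbm and randomForest importance output.

```r
# Compare two sets of top-ranked variable names.
# `compare_top` is a hypothetical helper written for this illustration.
compare_top <- function(a, b) {
  list(both = intersect(a, b), only_a = setdiff(a, b), only_b = setdiff(b, a))
}

# Placeholder names, not real results.
cmp <- compare_top(c("V1", "V2", "V3"), c("V2", "V3", "V4"))

if (requireNamespace("randomForest", quietly = TRUE) &&
    requireNamespace("ISLR", quietly = TRUE)) {
  Caravan <- ISLR::Caravan
  Caravan$Purchase <- ifelse(Caravan$Purchase == "Yes", 1, 0)
  set.seed(1)
  train <- sample(nrow(Caravan), round(2 / 3 * nrow(Caravan)))
  # Numeric response, so this is a regression forest (expect a warning).
  rf <- randomForest::randomForest(Purchase ~ ., data = Caravan[train, ],
                                   ntree = 1000, importance = TRUE)
  prob <- predict(rf, newdata = Caravan[-train, ])
  truth <- Caravan$Purchase[-train]
  # Confusion table at the 0.2 cutoff; per-class errors follow from its rows.
  print(table(truth = truth, predicted = as.integer(prob >= 0.2)))
  imp <- randomForest::importance(rf)
  print(head(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ], 6))
}
```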

5.

Here we will fit a gradient boosted model, using the xgboost package.

  1. Read the description of the XGBoost technique at https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/, or other sources. Explain how this algorithm might differ from earlier boosted tree algorithms.

  2. Tune the model fit to determine how many iterations to make. Then fit the model, using the parameter set provided.

  3. Compute the error for the test set, and for each class. Consider a proportion 0.2 or greater to indicate that the customer will purchase a caravan.

  4. Compute the variable importance. What are the 6 most important variables? Make a plot of any that are different from those chosen by gbm or randomForest. How does the set of variables compare with the other two methods?
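
The sketch below shows one way the tuning and fitting steps could look. The parameter list is a placeholder and is NOT the parameter set provided with the exercise; `best_round` is a hypothetical helper. The xgboost part runs only if the package is installed, and the column name `test_logloss_mean` in the cross-validation log is an assumption that may differ between xgboost versions.

```r
# Pick the boosting round with the smallest cross-validated loss.
# `log` is a data frame with one row per round; `column` names the loss.
best_round <- function(log, column) which.min(log[[column]])

# Made-up evaluation log to show the helper.
toy_log <- data.frame(iter = 1:5,
                      test_logloss_mean = c(0.60, 0.41, 0.33, 0.35, 0.38))
toy_best <- best_round(toy_log, "test_logloss_mean")

if (requireNamespace("xgboost", quietly = TRUE) &&
    requireNamespace("ISLR", quietly = TRUE)) {
  Caravan <- ISLR::Caravan
  y <- ifelse(Caravan$Purchase == "Yes", 1, 0)
  x <- as.matrix(Caravan[, names(Caravan) != "Purchase"])
  set.seed(1)
  train <- sample(nrow(x), round(2 / 3 * nrow(x)))
  dtrain <- xgboost::xgb.DMatrix(data = x[train, ], label = y[train])
  # Placeholder parameters, NOT the set provided with the exercise.
  params <- list(objective = "binary:logistic", eval_metric = "logloss",
                 eta = 0.1, max_depth = 3)
  # Cross-validate to choose the number of boosting rounds.
  cv <- xgboost::xgb.cv(params = params, data = dtrain, nrounds = 200,
                        nfold = 5, verbose = 0)
  nrounds <- best_round(as.data.frame(cv$evaluation_log), "test_logloss_mean")
  fit <- xgboost::xgb.train(params = params, data = dtrain, nrounds = nrounds)
  prob <- predict(fit, xgboost::xgb.DMatrix(data = x[-train, ]))
  print(table(truth = y[-train], predicted = as.integer(prob >= 0.2)))
  print(head(xgboost::xgb.importance(model = fit), 6))
}
```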

6.

Compare and summarise the results of the three model fits.

7.

Now scramble the response variable (Purchase) using permutation. The resulting data has no true relationship between the response and predictors. Re-do Q5 with this data set. Write a paragraph explaining what you learn about the true data from analysing this permuted data.
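
Permuting the response can be done with `sample()`, which shuffles the labels while leaving the class counts unchanged. The sketch below uses a made-up stand-in response; for the exercise the same call would be applied to the Purchase column of the training and test data.

```r
set.seed(1)

# Made-up stand-in for the 0/1 Purchase response.
purchase <- rep(c(0, 1), times = c(94, 6))

# Random permutation: same values, shuffled order, so any link
# between the response and the predictors is broken.
permuted <- sample(purchase)

print(table(original = purchase))
print(table(permuted = permuted))
```

Because the class proportions are preserved, any predictive accuracy on the permuted data reflects the class imbalance alone, which gives a baseline for judging the fits on the true data.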