Instructions

Marks

Exercises

  1. About the data: The chocolates data was compiled by students in a previous class of Prof Cook, by collecting nutrition information on the chocolates as listed on their internet sites. All numbers were normalised to be equivalent to a 100g serving. Units of measurement are listed in the variable name.

    1. Use the tour, with type of chocolate mapped to colour, and write a paragraph on whether the two types of chocolate differ on the nutritional variables.

    2. Make a parallel coordinate plot of the chocolates, coloured by type, with the variables sorted by how well they separate the groups. The “uniminmax” scaling may work best for this data. Write a paragraph explaining how the types of chocolate differ in nutritional characteristics.

    3. Identify one chocolate labelled as dark that is only masquerading as dark, that is, one that nutritionally looks more like a milk chocolate. Explain your answer.

    4. Fit a linear discriminant analysis model, using equal prior probability for each group.

    5. Write down the LDA rule. Make it clear which type of chocolate is class 1 and which is class 2 relative to the formula in the notes.

  2. This question is about decision trees. Here is a sample data set to work with:

    1. Write down the formula for the Gini impurity metric for a two-group problem. Show that the Gini function attains its highest value when the proportion of one class is 0.5. Explain why a proportion of 0.5 corresponds to the worst possible split.
    2. Write an R function to compute the impurity measure. The input should be a data frame containing a vector of numeric values and a vector of the associated classes.
    3. Use your function to compute the Gini impurity measure for every possible split of the sample data.
    4. Make a plot of your splits and the impurity measure. What partition of the data would yield the best split?
    5. Fit a classification tree to the chocolates data. Print the tree model.
    6. Compute the Gini impurity measure for all possible splits on the Fiber variable in the chocolates data. Plot the impurity against the split values. Explain where the best split is.
    7. Compute the Gini impurity measure for all possible splits on all of the other nutrition variables. Plot these values against the splits, one plot per variable, for 10 plots in all. Are there other candidate splits that are almost as good as the one chosen by the tree? Explain your reasoning.
  3. For each of the simulated data sets provided, use the tour, a parallel coordinate plot, a scatterplot matrix or any other technique you like to determine the main structure in the data: how many groups there are, whether there are any outliers, and the overall shape. Write a paragraph on what you find in the data and your approach.
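
As a starting point for the impurity computations in Exercise 2, here is a minimal sketch in R. It is one possible approach, not the expected solution. The toy data frame `toy` and the helper names `gini`, `split_impurity` and `candidate_splits` are hypothetical and do not come from the assignment. The sketch assumes the common definition of Gini impurity as one minus the sum of squared class proportions, which for two classes with proportion p equals 2p(1 - p); the course notes may scale it differently.

```r
# Hypothetical toy data (not the assignment's sample data):
# one numeric variable and a two-class label.
toy <- data.frame(
  x  = c(1, 2, 3, 4, 5, 6),
  cl = factor(c("A", "A", "A", "B", "B", "B"))
)

# Gini impurity of a vector of class labels:
# 1 - sum of squared class proportions (assumed definition).
gini <- function(cl) {
  if (length(cl) == 0) return(0)
  p <- table(cl) / length(cl)
  1 - sum(p^2)
}

# Impurity of a split: observations with x < split go left, the rest
# go right, and each side's Gini is weighted by its share of the data.
split_impurity <- function(x, cl, split) {
  left  <- cl[x < split]
  right <- cl[x >= split]
  n <- length(cl)
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

# Candidate splits: midpoints between consecutive sorted unique values.
candidate_splits <- function(x) {
  u <- sort(unique(x))
  (head(u, -1) + tail(u, -1)) / 2
}

splits   <- candidate_splits(toy$x)
impurity <- sapply(splits, function(s) split_impurity(toy$x, toy$cl, s))
result   <- data.frame(split = splits, impurity = impurity)
print(result)
```

On this toy data the two classes separate perfectly at 3.5, so that split has impurity 0, while the unsplit data has impurity 0.5. The same helpers could be applied to a single column of the chocolates data, and `result` plotted as impurity against split value.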