```r
# Load libraries used everywhere
library(tidyverse)
library(tidymodels)
library(patchwork)
library(mulgar)
library(GGally)
library(tourr)
library(geozoo)
library(keras)
library(uwot)
library(colorspace)
library(ggthemes)
library(conflicted)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
```
🎯 Objectives
The goal for this week is to learn to fit, diagnose, and predict from a neural network model.
🔧 Preparation
Make sure you have all the necessary libraries installed. There are a few new ones this week!
Exercises:
Open your project for this unit called iml.Rproj. We will be working through the tutorial at TensorFlow for R for fitting and predicting the fashion MNIST image data.
1. Get the data
We use the Fashion MNIST dataset which contains 70,000 grayscale images in 10 categories of articles sold on Zalando’s multi-brand, digital platform for fashion, beauty, and lifestyle.
```r
# download the data
fashion_mnist <- dataset_fashion_mnist()

# split into input variables and response
c(train_images, train_labels) %<-% fashion_mnist$train
c(test_images, test_labels) %<-% fashion_mnist$test

# for interpretation we also define the category names
class_names <- c('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot')
```
2. What’s in the data?
Check how many observations are in the training and test sets, and plot some of the images.
```r
dim(train_images)
dim(train_labels)
dim(test_images)
dim(test_labels)

# Choose an image randomly
img <- as.data.frame(train_images[sample(1:60000, 1), , ])
colnames(img) <- seq_len(ncol(img))
img$y <- seq_len(nrow(img))
img <- img |>
  pivot_longer(cols = -y, names_to = "x", values_to = "value") |>
  mutate(x = as.integer(x))
ggplot(img, aes(x = x, y = y, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "black", na.value = NA) +
  scale_y_reverse() +
  theme_map() +
  theme(legend.position = "none")
```
Solution
[1] 60000 28 28
[1] 60000
[1] 10000 28 28
[1] 10000
3. Pre-process the data
It may not be necessary, says Patrick, but we'll scale the data to 0-1 before modeling.
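Since the pixel values are stored as integers between 0 and 255, rescaling to 0-1 amounts to dividing by 255. A minimal sketch, using a small random matrix as a stand-in for the image arrays:

```r
# Pixel intensities range over 0-255; dividing by 255 rescales to [0, 1].
# A small random matrix stands in for the 28x28 image arrays.
imgs <- matrix(sample(0:255, 28 * 28, replace = TRUE), nrow = 28)
imgs_scaled <- imgs / 255
range(imgs_scaled)  # all values now lie in [0, 1]
```

In the tutorial the same division is applied to `train_images` and `test_images` before fitting.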
- one hidden layer with 128 nodes with (rectified) linear activation
- final layer with 10 nodes and softmax (multiclass logistic) activation
Why 10 nodes in the last layer? Why 128 nodes in the hidden layer?
```r
model_fashion_mnist <- keras_model_sequential()
model_fashion_mnist |>
  # flatten the image data into a long vector
  layer_flatten(input_shape = c(28, 28)) |>
  # hidden layer with 128 units
  layer_dense(units = 128, activation = 'relu') |>
  # output layer for 10 categories
  layer_dense(units = 10, activation = 'softmax')
```
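The layer sizes fix the number of trainable parameters: each dense layer has inputs × units weights plus one bias per unit. A quick arithmetic check for this architecture:

```r
# Parameters in a dense layer: inputs * units (weights) + units (biases)
hidden <- 784 * 128 + 128   # flattened 28x28 input into 128 units
output <- 128 * 10 + 10     # 128 hidden units into 10 classes
hidden + output             # total trainable parameters: 101770
```

This matches the count `summary()` reports for the model, and shows that almost all of the parameters sit in the first layer.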
Set the optimizer to be adam, the loss function to be sparse_categorical_crossentropy, and accuracy as the metric. What other optimizers could be used? What is sparse_categorical_crossentropy?
There are 10 classes, so we need 10 nodes in the final layer.
The choice of 128 nodes in the hidden layer is arbitrary. It means that we are reducing the dimension down from 784 to 128 at this point.
Sparse categorical cross-entropy is a variant of the categorical cross-entropy loss function used when the labels are supplied as integer class indices rather than one-hot encoded (binary) vectors. Each label is a single index value rather than a row of a binary matrix.
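The two label formats give exactly the same loss; only the bookkeeping differs. A base-R sketch for a single observation over three classes (1-based indexing here, whereas keras labels are 0-based):

```r
# Cross-entropy for one observation: -log(predicted prob of the true class)
probs <- c(0.1, 0.7, 0.2)     # softmax output over 3 classes
label_sparse <- 2             # true class as an integer index (1-based)
label_onehot <- c(0, 1, 0)    # same label, one-hot encoded

loss_sparse <- -log(probs[label_sparse])
loss_onehot <- -sum(label_onehot * log(probs))
all.equal(loss_sparse, loss_onehot)  # identical losses, different label formats
```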
https://keras.io/api/optimizers/ has a list of optimizers available.
Check with other people in class. Do you get the same result? If not, why would this be?
Solution
loss accuracy
0.5517961 0.8203000
Each person has started the optimizer with a different random seed, since we didn’t set one. You could try to set the seed using tensorflow::set_random_seed(), and have your neighbour do the same, to check if you get the same result. You will need to clean your environment before attempting this because if you fit the model again it will update the current one rather than starting afresh.
There are several classes that have some confusion with other classes, particularly 6 with 0, 2, 4. But other classes are most often confused with at least one other. Classes 1, 5, 7, 8, 9 are rarely confused.
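A confusion matrix of this kind can be computed with base R's `table()`. A sketch with short hypothetical label vectors (not the Fashion MNIST results):

```r
# Hypothetical true and predicted labels for a 3-class toy problem.
# Rows are truth, columns are predictions; off-diagonal cells show confusion.
truth <- c(0, 0, 1, 1, 2, 2, 2, 0)
pred  <- c(0, 2, 1, 1, 2, 0, 2, 0)
cm <- table(truth = truth, pred = pred)
cm
```

Reading along a row shows which classes a given article gets mistaken for, which is how the confusions noted above (e.g. 6 with 0, 2, 4) are identified.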
8. Compute metrics
Compute the accuracy of the model on the test set. How does this compare with the accuracy reported when you fitted the model?
Is the model equally accurate on all classes? If not, which class(es) is(are) poorly fitted?
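Overall accuracy is just the proportion of correct predictions, and `tapply()` breaks that down by the true class. A sketch with hypothetical labels (substitute the test-set predictions and `test_labels` in practice):

```r
# Hypothetical labels; replace with the model's test-set predictions.
truth <- c(0, 0, 1, 1, 2, 2, 2, 0)
pred  <- c(0, 2, 1, 1, 2, 0, 2, 0)
mean(pred == truth)                 # overall accuracy
tapply(pred == truth, truth, mean)  # accuracy within each true class
```

A class whose within-class accuracy sits well below the overall figure is one the model fits poorly.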
This section is motivated by the examples in Cook and Laa (2024). Focus on the test data to investigate the fit, and lack of fit.
PCA can be used to reduce the dimension down from 784 to a small number of PCs, to examine the nature of differences between the classes. Compute the scree plot to decide on a reasonable number of PCs to examine in a tour. Plot the first two statically. Explain how the class structure matches any clustering.
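The mechanics can be sketched with `prcomp()` on simulated data standing in for the flattened 784-variable test images; the component variances give the scree plot, and the "elbow" suggests how many PCs to carry into the tour:

```r
# Simulated stand-in for the flattened image matrix (200 obs, 20 vars).
set.seed(1)
X <- matrix(rnorm(200 * 20), ncol = 20)
pca <- prcomp(X, scale. = TRUE)
screeplot(pca, type = "lines")            # look for the elbow
head(summary(pca)$importance[2, ])        # proportion of variance per PC
```

For the real data, the scores in `pca$x` (here called `images_pc`) are what get passed to `animate_xy()` below.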
```r
animate_xy(images_pc[, 1:5], col = test_tags, cex = 0.2, palette = "Dynamic")
```
Solution
There isn't much separation between classes in the PCs: there are some differences between classes, but with substantial overlap. The classes look less separated than the confusion matrix would suggest.
UMAP can also be used to understand the class structure. Make a 2D UMAP representation and explain how the class structure matches cluster structure.
There are multiple well-separated clusters in the representation. Mostly these are mixtures of several classes. Only one cluster mostly matches an article, Trouser.
Interestingly, the nodes in the hidden layer can be thought of as 128 new variables, each a linear combination of the original 784 variables passed through the ReLU activation. This is too many to visualise, but we can again use PCA to reduce the dimension and make plots.
```r
animate_xy(activations_pc[, 1:5], col = test_tags, cex = 0.2, palette = "Dynamic")
```
Solution
There is substantial separation between classes in the PCs of these new variables. It now looks plausible that the classes are distinguishable, as the confusion matrix suggests.
Similarly, we can generate a 2D representation of these new variables using UMAP.
There is a lot of clustering in this view, but it mostly doesn’t match the classes. Trouser is the only class that appears to be primarily in one cluster.
The last task is to use what was learned from the confusion matrix to examine the uncertainty in predictions from the predictive probabilities. Because there are 10 classes, these fall in a 9D simplex. Each vertex is the spot where the model is completely certain about the prediction. Points along an edge indicate confusion only between two classes, and points on a triangular face indicate confusion between three classes. The code below creates the visualisation of the predictive probabilities, focusing on four of the 10 classes to make it a little simpler to digest.
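The geometry is easy to check in base R: any vector of 10 predictive probabilities is non-negative and sums to 1, which is exactly the definition of a point in the 9D simplex. A small sketch with random probability vectors standing in for model output:

```r
# Random probability vectors: non-negative values normalised to sum to 1,
# i.e. points inside the 9D simplex spanned by the 10 class vertices.
set.seed(1)
p <- matrix(rexp(5 * 10), ncol = 10)
p <- p / rowSums(p)
rowSums(p)               # every row sums to 1

# A vertex of the simplex: complete certainty about one class.
vertex <- c(1, rep(0, 9))
```

Points near `vertex` are confident predictions; points spread along an edge between two vertices are the two-class confusions described above.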