```r
# Load libraries used everywhere
library(tidyverse)
library(tidymodels)
library(patchwork)
library(mulgar)
library(GGally)
library(tourr)
library(geozoo)
library(keras)
library(uwot)
library(colorspace)
library(ggthemes)
library(conflicted)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
```
🎯 Objectives
The goal for this week is to learn to fit, diagnose, and predict from a neural network model.
🔧 Preparation
Make sure you have all the necessary libraries installed. There are a few new ones this week!
Exercises:
Open your project for this unit called iml.Rproj. We will be working through the tutorial at TensorFlow for R for fitting and predicting the fashion MNIST image data.
1. Get the data
We use the Fashion MNIST dataset which contains 70,000 grayscale images in 10 categories of articles sold on Zalando’s multi-brand, digital platform for fashion, beauty, and lifestyle.
```r
# download the data
fashion_mnist <- dataset_fashion_mnist()

# split into input variables and response
c(train_images, train_labels) %<-% fashion_mnist$train
c(test_images, test_labels) %<-% fashion_mnist$test

# for interpretation we also define the category names
class_names <- c('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot')
```
2. What’s in the data?
Check how many observations are in the training and test sets, and plot some of the images.
```r
dim(train_images)
dim(train_labels)
dim(test_images)
dim(test_labels)

# Choose an image randomly
img <- as.data.frame(train_images[sample(1:60000, 1), , ])
colnames(img) <- seq_len(ncol(img))
img$y <- seq_len(nrow(img))
img <- img |>
  pivot_longer(cols = -y, names_to = "x", values_to = "value") |>
  mutate(x = as.integer(x))
ggplot(img, aes(x = x, y = y, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "black", na.value = NA) +
  scale_y_reverse() +
  theme_map() +
  theme(legend.position = "none")
```
Solution
[1] 60000 28 28
[1] 60000
[1] 10000 28 28
[1] 10000
3. Pre-process the data
It may not be necessary, says Patrick, but we'll scale the data to 0-1 before modeling.
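Since the pixel values are stored as integers between 0 and 255, rescaling to 0-1 amounts to dividing by 255. A minimal sketch, using a small random matrix as a stand-in for the image arrays:

```r
# Pixel intensities range over 0-255; dividing by 255 rescales to [0, 1].
# A small random matrix stands in for the 28x28 image arrays.
imgs <- matrix(sample(0:255, 28 * 28, replace = TRUE), nrow = 28)
imgs_scaled <- imgs / 255
range(imgs_scaled)  # all values now lie in [0, 1]
```

In the tutorial the same division is applied to `train_images` and `test_images` before fitting.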
- one hidden layer with 128 nodes with (rectified) linear activation
- final layer with 10 nodes and softmax (multiclass logistic) activation
Why 10 nodes in the last layer? Why 128 nodes in the hidden layer?
```r
model_fashion_mnist <- keras_model_sequential()
model_fashion_mnist |>
  # flatten the image data into a long vector
  layer_flatten(input_shape = c(28, 28)) |>
  # hidden layer with 128 units
  layer_dense(units = 128, activation = 'relu') |>
  # output layer for 10 categories
  layer_dense(units = 10, activation = 'softmax')
```
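The layer sizes fix the number of trainable parameters: each dense layer has inputs × units weights plus one bias per unit. A quick arithmetic check for this architecture:

```r
# Parameters in a dense layer: inputs * units (weights) + units (biases)
hidden <- 784 * 128 + 128   # flattened 28x28 input into 128 units
output <- 128 * 10 + 10     # 128 hidden units into 10 classes
hidden + output             # total trainable parameters: 101770
```

This matches the count `summary()` reports for the model, and shows that almost all of the parameters sit in the first layer.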
Set the optimizer to be adam, the loss function to be sparse_categorical_crossentropy, and accuracy as the metric. What other optimizers could be used? What is sparse_categorical_crossentropy?
There are 10 classes, so we need 10 nodes in the final layer.
The choice of 128 nodes in the hidden layer is arbitrary. It means that we are reducing the dimension down from 784 to 128 at this point.
Sparse categorical cross-entropy is a variant of the categorical cross-entropy loss function used when the labels are supplied as integer class indices rather than one-hot encoded (binary) vectors. Each label is a single index value rather than a row of a binary matrix.
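The two label formats give exactly the same loss; only the bookkeeping differs. A base-R sketch for a single observation over three classes (1-based indexing here, whereas keras labels are 0-based):

```r
# Cross-entropy for one observation: -log(predicted prob of the true class)
probs <- c(0.1, 0.7, 0.2)     # softmax output over 3 classes
label_sparse <- 2             # true class as an integer index (1-based)
label_onehot <- c(0, 1, 0)    # same label, one-hot encoded

loss_sparse <- -log(probs[label_sparse])
loss_onehot <- -sum(label_onehot * log(probs))
all.equal(loss_sparse, loss_onehot)  # identical losses, different label formats
```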
https://keras.io/api/optimizers/ has a list of optimizers available.
Check with other people in class. Do you get the same result? If not, why would this be?
Solution
loss accuracy
0.5517961 0.8203000
Each person has started the optimizer with a different random seed, since we didn’t set one. You could try to set the seed using tensorflow::set_random_seed(), and have your neighbour do the same, to check if you get the same result. You will need to clean your environment before attempting this because if you fit the model again it will update the current one rather than starting afresh.
There are several classes that have some confusion with other classes, particularly 6 with 0, 2, 4. But other classes are most often confused with at least one other. Classes 1, 5, 7, 8, 9 are rarely confused.
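A confusion matrix of this kind can be computed with base R's `table()`. A sketch with short hypothetical label vectors (not the Fashion MNIST results):

```r
# Hypothetical true and predicted labels for a 3-class toy problem.
# Rows are truth, columns are predictions; off-diagonal cells show confusion.
truth <- c(0, 0, 1, 1, 2, 2, 2, 0)
pred  <- c(0, 2, 1, 1, 2, 0, 2, 0)
cm <- table(truth = truth, pred = pred)
cm
```

Reading along a row shows which classes a given article gets mistaken for, which is how the confusions noted above (e.g. 6 with 0, 2, 4) are identified.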
8. Compute metrics
Compute the accuracy of the model on the test set. How does this compare with the accuracy reported when you fitted the model?
Is the model equally accurate on all classes? If not, which class(es) is(are) poorly fitted?
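Overall accuracy is just the proportion of correct predictions, and `tapply()` breaks that down by the true class. A sketch with hypothetical labels (substitute the test-set predictions and `test_labels` in practice):

```r
# Hypothetical labels; replace with the model's test-set predictions.
truth <- c(0, 0, 1, 1, 2, 2, 2, 0)
pred  <- c(0, 2, 1, 1, 2, 0, 2, 0)
mean(pred == truth)                 # overall accuracy
tapply(pred == truth, truth, mean)  # accuracy within each true class
```

A class whose within-class accuracy sits well below the overall figure is one the model fits poorly.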
This section is motivated by the examples in Cook and Laa (2024). Focus on the test data to investigate the fit, and lack of fit.
PCA can be used to reduce the dimension down from 784 to a small number of PCs, to examine the nature of differences between the classes. Compute the scree plot to decide on a reasonable number of PCs to examine in a tour. Plot the first two statically. Explain how the class structure matches any clustering.
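The mechanics can be sketched with `prcomp()` on simulated data standing in for the flattened 784-variable test images; the component variances give the scree plot, and the "elbow" suggests how many PCs to carry into the tour:

```r
# Simulated stand-in for the flattened image matrix (200 obs, 20 vars).
set.seed(1)
X <- matrix(rnorm(200 * 20), ncol = 20)
pca <- prcomp(X, scale. = TRUE)
screeplot(pca, type = "lines")            # look for the elbow
head(summary(pca)$importance[2, ])        # proportion of variance per PC
```

For the real data, the scores in `pca$x` (here called `images_pc`) are what get passed to `animate_xy()` below.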
```r
animate_xy(images_pc[, 1:5], col = test_tags, cex = 0.2, palette = "Dynamic")
```
Solution
There isn't much separation between classes in the PCs: there are some differences between classes, but with substantial overlap. The classes look less separated than the confusion matrix would suggest.
UMAP can also be used to understand the class structure. Make a 2D UMAP representation and explain how the class structure matches cluster structure.
There are multiple well-separated clusters in the representation. Mostly these are mixtures of several classes. Only one cluster mostly matches an article, Trouser.
Interestingly, the nodes in the hidden layer can be thought of as 128 new variables, each a linear combination of the original 784 variables passed through the ReLU activation. This is too many to visualise, but we can again use PCA to reduce the dimension and make plots.
```r
animate_xy(activations_pc[, 1:5], col = test_tags, cex = 0.2, palette = "Dynamic")
```
Solution
There is substantial separation between classes in the PCs of these new variables. It now looks plausible that the classes are distinguishable, as the confusion matrix suggests.
Similarly, we can generate a 2D representation of these new variables using UMAP.
There is a lot of clustering in this view, but it mostly doesn’t match the classes. Trouser is the only class that appears to be primarily in one cluster.
The last task is to use what was learned from the confusion matrix to examine the uncertainty in predictions from the predictive probabilities. Because there are 10 classes, these fall in a 9D simplex. Each vertex is the spot where the model is completely certain about the prediction. Points along an edge indicate confusion only between two classes, and points on a triangular face indicate confusion between three classes. The code below creates the visualisation of the predictive probabilities, focusing on four of the 10 classes to make it a little simpler to digest.
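The geometry is easy to check in base R: any vector of 10 predictive probabilities is non-negative and sums to 1, which is exactly the definition of a point in the 9D simplex. A small sketch with random probability vectors standing in for model output:

```r
# Random probability vectors: non-negative values normalised to sum to 1,
# i.e. points inside the 9D simplex spanned by the 10 class vertices.
set.seed(1)
p <- matrix(rexp(5 * 10), ncol = 10)
p <- p / rowSums(p)
rowSums(p)               # every row sums to 1

# A vertex of the simplex: complete certainty about one class.
vertex <- c(1, rep(0, 9))
```

Points near `vertex` are confident predictions; points spread along an edge between two vertices are the two-class confusions described above.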