---
title: "ETC3250/5250 Tutorial 1 Solution"
subtitle: "Introduction to tidymodels"
author: "prepared by Professor Di Cook"
date: "Week 1"
output:
  html_document:
    after_body: tutorial-footer.html
    css: tutorial.css
---
```{r, echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  error = FALSE,
  collapse = TRUE,
  comment = "#",
  echo = TRUE,
  fig.height = 4,
  fig.width = 8,
  out.width = "100%",
  fig.align = "center",
  cache = FALSE
)
library(emo)
```
## `r emo::ji("gear")` Exercise 1
The `nrc` data contains information collected on Statistics graduate programs in the USA. There are several ranking variables, along with indicators describing each department's research activity, student support and diversity, which summarise many individual variables such as number of publications, student entrance scores and demographics. You can learn more about this data [here](https://en.wikipedia.org/wiki/United_States_National_Research_Council_rankings).
The goal here is to follow the tidymodels approach to fit a model for rank against indicators of research.
### 1
Load the libraries to complete the exercises.
```{r}
# Load libraries
library(tidyverse)
library(tidymodels)
library(broom)
library(dotwhisker)
library(patchwork)
```
### 2
Read the data, simplify the names and select the relevant variables. You will want `rank = R.Rankings.5th.Percentile`, `research = Research.Activity.5th.Percentile`, `student = Student.Support.Outcomes.5th.Percentile` and `diversity = Diversity.5th.Percentile`.
```{r}
# Read the data
nrc <- read_csv("https://iml.numbat.space/data/nrc.csv")
# Simplify names of and select variables to use
nrc <- nrc %>%
  mutate(rank = R.Rankings.5th.Percentile,
         research = Research.Activity.5th.Percentile,
         student = Student.Support.Outcomes.5th.Percentile,
         diversity = Diversity.5th.Percentile) %>%
  select(rank, research, student, diversity)
```
### 3
Make a plot of the observed response against predictors. What do you learn about the relationship between these variables?
```{r fig.height=3}
# Make some plots of data
a1 <- ggplot(nrc, aes(x = research, y = rank)) +
  geom_point() +
  geom_smooth(se = FALSE)
a2 <- ggplot(nrc, aes(x = student, y = rank)) +
  geom_point() +
  geom_smooth(se = FALSE)
a3 <- ggplot(nrc, aes(x = diversity, y = rank)) +
  geom_point() +
  geom_smooth(se = FALSE)
a1 + a2 + a3
```
*You can see that the relationship between rank and research is very strong. There is very little relationship with either of the other predictors.*
### 4
Set up the model. While it is unnecessary to set the mode for a linear regression since it can only be regression, we continue to do it in these labs to be explicit. The specification doesn't perform any calculations by itself. It is just a specification of what we want to do.
```{r}
# Set up and fit model
lm_mod <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
lm_mod
```
### 5
Fit the model. Once we have the specification we can fit it by supplying a formula expression and the data we want to fit the model on. The formula is written in the form `y ~ x`, where `y` is the name of the response and `x` is the name of the predictors. The names used in the formula should match the names of the variables in the data set passed to the `data` argument.
```{r}
lm_fit <-
  lm_mod %>%
  fit(rank ~ research + student + diversity,
      data = nrc)
lm_fit
```
The result of this fit is a [parsnip](https://parsnip.tidymodels.org) model object. This object contains the underlying fit as well as some parsnip-specific information. If we want to look at the underlying fit object we can access it and summarise it with
```{r}
lm_fit %>%
  pluck("fit") %>%
  summary()
```
### 6
Report the coefficients of the model fit. We can use functions from the broom package to extract key information from the model objects in tidy formats.
The `tidy()` function returns the parameter estimates of an `lm` object. Explain the relationship between the predictors and the response variable. Is the interpretation of `research` that a higher value of research indicates a higher value of rank? This doesn't make sense; why?
```{r}
tidy(lm_fit)
```
*The coefficient for research is* `r tidy(lm_fit)[2,2] %>% pull()` *which says that as the value of research increases by 1, the rank increases by about a half.*
*Wait! Shouldn't a higher research score indicate a smaller rank, because low values of rank indicate better performance? The research variable is also a rank, so low values mean lots of research and high values mean less research. It is a rank based on other variables collected relating to research activity, such as number of publications and number of citations. Making plots of these variables against research rank should show that they have a negative association.*
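This could be checked by plotting the underlying research measures in the full data file against the research rank. The sketch below is not evaluated, because the column names `Average.Publications` and `Average.Citations` are assumptions about what `nrc.csv` contains; check `names(nrc_full)` and substitute the actual variable names before running it.
```{r, eval = FALSE}
# Sketch only: Average.Publications is an assumed column name --
# inspect names(nrc_full) and substitute the real variable
nrc_full <- read_csv("https://iml.numbat.space/data/nrc.csv")
ggplot(nrc_full, aes(x = Average.Publications,
                     y = Research.Activity.5th.Percentile)) +
  geom_point()
```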
### 7
Make a dot and whisker plot of the coefficients, to visualise the significance of the different variables. Explain what you learn about the importance of the three explanatory variables for predicting the response.
```{r}
tidy(lm_fit) %>%
  dwplot(dot_args = list(size = 2, color = "black"),
         whisker_args = list(color = "black"),
         vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))
```
*The only important variable contributing to the model is research, because the confidence intervals for the other two overlap with 0, and the hypothesis test measuring difference from 0 is only significant for research. The coefficient for research is positive, meaning that the higher the research score the higher the rank.*
### 8
Report the fit statistics, using `broom::glance()`. What do you learn about the strength of the fit?
```{r}
glance(lm_fit)
```
*The model explains about half the variation in rank, because $R^2$ is 0.485. That is, it is a moderately well-fitting model.*
### 9
Explore the model fit visually. Plot the predicted values against observed, residuals against fitted, and predicted against each of the predictors. Summarise what you learn about the model fit.
```{r}
# Plot the fit
nrc_all <- augment(lm_fit, nrc)
p1 <- ggplot(nrc_all, aes(x=.pred, y=rank)) + geom_point()
p2 <- ggplot(nrc_all, aes(x=.pred, y=.resid)) + geom_point()
p1 + p2
```
*Observed vs predicted shows the fit is reasonably good. There are two outliers, revealed more clearly by the residual plot. These are two programs that have poor rankings but are predicted to be much better.*
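The outlying observations can be pulled out of the augmented data by sorting on the absolute residuals. Note that the program names were dropped by the earlier `select()`, so only the modelling variables are available to display.
```{r}
# Show the two observations with the largest absolute residuals
nrc_all %>%
  arrange(desc(abs(.resid))) %>%
  slice_head(n = 2)
```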
```{r fig.height=3}
p3 <- ggplot(nrc_all, aes(x=research, y=.pred)) + geom_point()
p4 <- ggplot(nrc_all, aes(x=student, y=.pred)) + geom_point()
p5 <- ggplot(nrc_all, aes(x=diversity, y=.pred)) + geom_point()
p3 + p4 + p5
```
*There is a very strong relationship between research and predicted values, which further supports that the model is primarily using research as the predictor.*
### 10
Generate a grid of new data values to predict, with all combinations of `research = c(10, 40, 70)`, `student = c(10, 40, 70)`, `diversity = c(10, 40, 70)`. Predict these values, as point and confidence intervals.
```{r}
# Predict new data
new_points <- expand.grid(research = c(10, 40, 70),
                          student = c(10, 40, 70),
                          diversity = c(10, 40, 70))
new_points
mean_pred <- predict(lm_fit, new_data = new_points)
mean_pred
conf_int_pred <- predict(lm_fit,
                         new_data = new_points,
                         type = "conf_int")
conf_int_pred
new_points <- augment(lm_fit, new_points)
```
### 11
Make a plot of predicted values vs research for the observed data and the new data, with new data coloured differently. How do the predicted values compare?
```{r}
# Plot predicted data, with observed data
ggplot() +
  geom_point(data = nrc_all, aes(x = research, y = .pred)) +
  geom_point(data = new_points, aes(x = research, y = .pred), colour = "red")
```
*The new points are at just three values of research, so you can see the vertical stripes here. The predicted ranks for these points are more varied than for the observed data. It's likely that this is because the combinations of values in the new data differ from what exists in the observed data. It would be a good idea to plot the predictors against each other to confirm that this is true.*
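One way to check this, using only the variables already loaded, is to plot the pairs of predictors for the observed data; sparse regions in these plots correspond to combinations that the new grid covers but the observed data does not.
```{r fig.height=3}
# Pairwise scatterplots of the predictors in the observed data
b1 <- ggplot(nrc, aes(x = research, y = student)) + geom_point()
b2 <- ggplot(nrc, aes(x = research, y = diversity)) + geom_point()
b3 <- ggplot(nrc, aes(x = student, y = diversity)) + geom_point()
b1 + b2 + b3
```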
##### © Copyright 2022 Monash University