Professor Di Cook

Econometrics and Business Statistics

Monash University"
date: "Week 10 (a)"
output:
  xaringan::moon_reader:
    css: ["kunoichi", "ninjutsu", "mystyle.css", "libs/animate.css"]
    lib_dir: libs
    nature:
      ratio: '16:9'
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options:
  chunk_output_type: console
header-includes:
  - \usepackage{xcolor}
---

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(tidy = FALSE,
                      message = FALSE,
                      warning = FALSE,
                      echo = FALSE,
                      out.width = "100%",
                      fig.width = 8,
                      fig.height = 6,
                      fig.align = "center",
                      fig.retina = 4)
options(htmltools.dir.version = FALSE)
library(magick)
```

class: middle center

```{r echo=TRUE}
library(statquotes)
search_quotes(search = "Holdane", fuzzy = TRUE)
```

---
class: middle center

```{r echo=TRUE}
statquote(source = "Box")
```

---
layout: true
class: shuriken-full white

.blade1.bg-green[.content[
.white.font_large[Know your data.] `r set.seed(1);emo::ji("airplane")`

Quantitative or qualitative response? Predictors all quantitative? Do you have independent observations?
]]

.blade2.bg-purple[.content[
.white.font_large[Plot your data.] `r set.seed(1);emo::ji("painting")`

Is there a relationship between response and predictors? Is the relationship linear? Are boundaries linear? Is variability heterogeneous? Are groups distinct? Are there unusual observations?
]]

.blade3.bg-deep-orange[.content[
.white.font_large[Check for missing values.] `r set.seed(1);emo::ji("tool")`

Do some variables have too many missings to use them? Do some observations have too many missings to use them? What would be a useful imputation method to fix sporadic missing values?
]]

.blade4.bg-pink[.content[
.white.font_large[Fit a versatile model.] `r set.seed(1);emo::ji("computer")`

Compute and plot model diagnostics. Where doesn't the model do well? How can it be refined?
]]

---
class: hide-blade2 hide-blade3 hide-blade4 hide-hole

---
class: hide-blade3 hide-blade4 hide-hole
count: false

---
class: hide-blade4 hide-hole
count: false

---
class: hide-hole
count: false

---
count: false

---
layout: false

## ROC for classification

The .orange[ROC curve] is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds. It is a common method for comparing classification models.

Below: ROC curve for the LDA classifier on the training set of the `credit` data.

If the classifier returns a prediction between 0 and 1, interpret it as the probability of a positive. Then threshold (split the data) at different values, e.g. 0.1, 0.2, 0.3, 0.4, 0.5, ... Compute the confusion table for each split, record the sensitivity and specificity, and plot the resulting pairs.

---
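The thresholding procedure can be sketched in a few lines of base R. This is a minimal illustration, not the `credit` data: the probabilities `probs` and 0/1 labels `truth` are simulated placeholders.

```r
# Simulated predicted probabilities and true labels (1 = positive);
# illustrative stand-ins for a real classifier's output
set.seed(1)
truth <- rbinom(100, 1, 0.5)
probs <- ifelse(truth == 1, rbeta(100, 3, 2), rbeta(100, 2, 3))

# Sweep thresholds, computing sensitivity and specificity at each
thresholds <- seq(0, 1, by = 0.1)
roc <- t(sapply(thresholds, function(th) {
  pred <- as.numeric(probs >= th)
  sens <- sum(pred == 1 & truth == 1) / sum(truth == 1)  # true positive rate
  spec <- sum(pred == 0 & truth == 0) / sum(truth == 0)  # true negative rate
  c(fpr = 1 - spec, tpr = sens)
}))

# Plot the resulting (1 - specificity, sensitivity) pairs
plot(roc[, "fpr"], roc[, "tpr"], type = "b",
     xlab = "1 - specificity", ylab = "sensitivity")
```

At threshold 0 every case is called positive (sensitivity 1, specificity 0); at threshold 1 every case is called negative, tracing the curve between the two corners.

---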

|           |    | true          |               |
|:----------|:---|:--------------|:--------------|
|           |    | C1 (positive) | C2 (negative) |
| predicted | C1 | a             | b             |
|           | C2 | c             | d             |
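Sensitivity and specificity follow directly from the table counts. A minimal sketch, using illustrative values for a, b, c, d:

```r
# Confusion table counts (illustrative placeholder values)
a <- 40  # true C1, predicted C1 (true positives)
b <- 10  # true C2, predicted C1 (false positives)
c <- 5   # true C1, predicted C2 (false negatives)
d <- 45  # true C2, predicted C2 (true negatives)

sensitivity <- a / (a + c)  # proportion of true positives detected
specificity <- d / (b + d)  # proportion of true negatives detected
</imports>
```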

---

- Working with .orange[standardised variables] helps, because the magnitude of a coefficient can then be interpreted directly as importance.
- The .orange[permutation] approach used in random forests is useful more broadly: compare the magnitude of coefficients between models built on the original and the permuted variable.
- The .orange[effect of one predictor on the response] can depend on its relationship with the other predictors. This is called multicollinearity in regression.

---
layout: false

# Bigger picture

.orange[All possible model fits] to housing data with 7 variables, from [Wickham et al (2015) Removing the Blindfold](http://onlinelibrary.wiley.com/doi/10.1002/sam.11271/abstract)

---
class: split-60
layout: false

.column[.pad10px[
]]
.column[.pad10px[
.font_small[Three typical estimates for bedrooms: big positive, close to 0, big negative.]

.font_small[Models with big .orange[positive coefficients] for bedrooms tend to have .orange[weaker fits]. They tend to occur with models that have no livingArea contribution, more negative coefficients for zoneRM, and no air con.]

.font_small[Models with big .orange[negative coefficients] on bedrooms tend to have .orange[stronger fits]. All contrast with livingArea (high positive coefficients).]

.font_small[If bedrooms contribute to the model, bathrooms do not.]
]]

---

## Model choice - robustness of conclusions

Whatever way you model the data, the .orange[interpretations should be consistent].

- Bias can explain differences in predictions between models; flexible vs inflexible models provide a spectrum of what the data predicts.
- Broad changes in a model when some cases or some variables are left out should evoke suspicion (your "spidey sense").
- Model fit statistics are a measure of predictive power. A weak model can still be useful if there is a large cost involved.

---
layout: false

# `r set.seed(2020); emo::ji("technologist")` Made by a human with a computer

### Slides at [https://iml.numbat.space](https://iml.numbat.space).
### Code and data at [https://github.com/numbats/iml](https://github.com/numbats/iml).

### Created using [R Markdown](https://rmarkdown.rstudio.com) with flair by [**xaringan**](https://github.com/yihui/xaringan), and [**kunoichi** (female ninja) style](https://github.com/emitanaka/ninja-theme).

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.