Read a fun explanation here by [Harriet Mason](https://numbat.space/post/permutation_variable_importance/permutationvariableimportance/)

---
class: fade-row2 fade-row3 fade-row4
count: false

---
class: fade-row3 fade-row4
count: false

---
class: fade-row4
count: false

---
count: false

---
layout: false

# Vote Matrix

- .monash-orange2[Proportion of trees] in which the case is predicted to be each class; ranges between 0 and 1.
- Can be used to .monash-orange2[identify troublesome] cases.
- Used with plots of the actual data, can help determine whether it is the record itself that is the problem, or whether the method is biased.
- Helps understand the difference in accuracy of prediction for different classes.

---
layout: false

# Proximities

- Measure how close each pair of observations lands in the forest.
- Run both in- and out-of-bag cases down each tree, and increase the proximity value of cases $i, j$ by 1 each time they are in the same terminal node.
- Normalize by dividing by $B$.

---
class: split-two

.column[.pad50px[

# Example - Olive oil data

Distinguish the region where oils were produced by their fatty acid signature. Important in quality control and in determining fraudulent marketing.

**Areas in the south:**

1. North-Apulia

2. Calabria

3. South-Apulia

4. Sicily

]]

.column[.content.vmiddle.center[

]]

---
class: split-two

.column[.pad50px[

# Example - Olive oil data

Classifying the olive oils in the south of Italy is a difficult classification task.

```{r}
olive <- read_csv("http://ggobi.org/book/data/olive.csv") %>%
  rename(name = `...1`)
olive <- olive %>%
  filter(region == 1)
```

```{r eval=FALSE}
library(tourr)
library(RColorBrewer)
# drop eicosenoic, all low for south
animate_xy(olive[,4:10], axes="off", col=olive$area)
# Drop Sicily
animate_xy(olive[olive$area!=4,4:10], axes="off",
           col=olive$area[olive$area!=4])
animate_xy(olive[,c(5, 7, 8)], axes="off", col=olive$area)
animate_xy(olive[olive$area!=4,c(5, 7, 8)], axes="off",
           col=olive$area[olive$area!=4])
```

```{r eval=FALSE}
# create animation
library(plotly)
library(htmltools)
set.seed(20190411)
bases <- save_history(olive[,4:10], grand_tour(2),
                      start=matrix(c(1,0,0,1,0,0,0,0,0,0,0,0,0,0),
                                   ncol=2, byrow=TRUE),
                      max = 15)
# Re-set start bc seems to go awry
bases[,,1] <- matrix(c(1,0,0,1,0,0,0,0,0,0,0,0,0,0),
                     ncol=2, byrow=TRUE)
tour_path <- interpolate(bases, 0.1)
d <- dim(tour_path)
olive_std <- tourr::rescale(olive[,4:10])
mydat <- NULL
for (i in 1:d[3]) {
  fp <- as.matrix(olive_std) %*% matrix(tour_path[,,i], ncol=2)
  fp <- tourr::center(fp)
  colnames(fp) <- c("d1", "d2")
  mydat <- rbind(mydat, cbind(fp, rep(i+10, nrow(fp))))
}
colnames(mydat)[3] <- "indx"
df <- as_tibble(mydat)
df <- df %>%
  mutate(area = factor(rep(olive$area, d[3])))
p <- ggplot() +
  geom_point(data = df,
             aes(x = d1, y = d2, colour = area, frame = indx),
             size = 1) +
  scale_colour_brewer("", palette = "Dark2") +
  theme_void() +
  coord_fixed() +
  theme(legend.position = "none")
pg <- ggplotly(p, width = 400, height = 400) %>%
  animation_opts(200, redraw = FALSE,
                 easing = "linear", transition = 0)
save_html(pg, file = "olive1.html")
```
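The vote matrix and proximities described earlier can be pulled straight from a fitted forest. A minimal sketch using the `randomForest` package on the built-in `iris` data (a stand-in for illustration; the slides use the olive oil data):

```{r eval=FALSE}
library(randomForest)
set.seed(2021)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500, proximity = TRUE)
# Vote matrix: proportion of OOB trees voting for each class,
# so each row sums to 1
head(rf$votes)
# Proximity: fraction of trees in which a pair of cases lands
# in the same terminal node; 1 on the diagonal
rf$proximity[1:5, 1:5]
```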

Examining the vote matrix allows us to see which samples the algorithm had trouble classifying.

.monash-orange2[Look at rows 3 and 5. How confident would you be in the classifications of these two observations?]

]]

.column[.content.vmiddle[

```{r}
options(digits=4)
olive_rf$fit$votes %>%
  as_tibble() %>%
  slice(1:10)
```

]]

---
class: split-50
layout: false

.column[.pad10px[

```{r out.width="100%", fig.width=6, fig.height=6}
vt <- data.frame(olive_rf$fit$votes)
vt$area <- olive_tr$area
ggscatmat(vt, columns=1:4, col="area") +
  scale_colour_brewer("", palette="Dark2")
```

]]

.column[.top50px[

```{r out.width="100%", fig.width=6, fig.height=6}
# Helmert matrix: projects the d vote proportions
# (which sum to 1) into d-1 dimensions
f.helmert <- function(d) {
  helmert <- rep(1/sqrt(d), d)
  for (i in 1:(d-1)) {
    x <- rep(1/sqrt(i*(i+1)), i)
    x <- c(x, -i/sqrt(i*(i+1)))
    x <- c(x, rep(0, d - i - 1))
    helmert <- rbind(helmert, x)
  }
  rownames(helmert) <- paste("V", 1:d, sep="")
  return(helmert)
}
proj <- t(f.helmert(4)[-1,])
vtp <- as.matrix(vt[,-5]) %*% proj
vtp <- data.frame(vtp, area=vt$area)
ggscatmat(vtp, columns=1:3, col="area") +
  scale_colour_brewer("", palette="Dark2")
```

```{r eval=FALSE}
library(tourr)
library(RColorBrewer)
quartz() # macOS graphics device; use x11() or windows() elsewhere
pal <- brewer.pal(4, "Dark2")
col <- pal[as.numeric(vtp[, 4])]
animate_xy(vtp[,1:3], col=col, axes = "bottomleft")
```

]]

---
# From Random Forests to Boosting

Whereas random forests build an ensemble of .monash-blue2[deep independent trees], .monash-orange2[boosted trees] build an ensemble of .monash-orange2[shallow trees in sequence], with each tree learning from and improving on the previous one.

1. Set $\hat{f}(x) = 0$ and residuals $r_i = y_i$ for all $i$ in the training set.

2. For $b = 1, 2, \dots, B$, repeat:

a. Fit a tree $\hat{f}^b$ with $d$ splits ($d+1$ terminal nodes) to the current residuals.

b. Update $\hat{f}$ by adding a weighted new tree $\hat{f}(x) = \hat{f}(x)+\lambda\hat{f}^b(x)$.

c. Update the residuals, $r_i = r_i - \lambda\hat{f}^b(x_i)$.

3. Output the boosted model, $\hat{f}(x) = \sum_{b=1}^B\lambda\hat{f}^b(x)$.
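The loop above can be sketched directly in R. This is an illustrative implementation with `rpart` stumps ($d = 1$) on simulated data; the data, $B = 100$, and $\lambda = 0.1$ are assumptions for the sketch, not values from the slides:

```{r eval=FALSE}
library(rpart)

set.seed(2021)
n <- 200
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)

B <- 100       # number of trees
lambda <- 0.1  # shrinkage
fhat <- rep(0, n)  # 1. set f-hat(x) = 0 ...
r <- y             # ... and r_i = y_i

for (b in 1:B) {
  # a. fit a stump (d = 1 split) to the current residuals
  tree_b <- rpart(r ~ x, data = data.frame(x = x, r = r),
                  control = rpart.control(maxdepth = 1, cp = 0))
  pred <- predict(tree_b)
  # b. add a shrunken version of the new tree to f-hat
  fhat <- fhat + lambda * pred
  # c. update the residuals
  r <- r - lambda * pred
}

# training error shrinks as trees are added
mean((y - fhat)^2)
```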

Read a fun explanation of boosting here by [Harriet Mason](https://numbat.space/post/boosting/).

---
# Boosting a regression tree - watch this!

StatQuest by Josh Starmer

---
# Boosting a classification tree - watch this!

StatQuest by Josh Starmer
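As a complement to the video, a minimal sketch of boosted classification stumps, here with the `xgboost` package (the package choice, two-class `iris` data, and tuning values are assumptions for illustration, not from the slides):

```{r eval=FALSE}
library(xgboost)
# Two-class problem: setosa vs the rest, purely illustrative
X <- as.matrix(iris[, 1:4])
y <- as.numeric(iris$Species == "setosa")
fit <- xgboost(data = X, label = y, nrounds = 50,
               max_depth = 1, eta = 0.1,
               objective = "binary:logistic", verbose = 0)
# predict() returns probabilities for binary:logistic
pred <- as.numeric(predict(fit, X) > 0.5)
mean(pred == y)  # training accuracy
```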

---
# More resources

Cook & Swayne (2007) "Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and GGobi" has several videos illustrating techniques for exploring high-dimensional data in association with trees and forest classifiers:

- [Trees video](http://www.ggobi.org/book/chap-class/Trees.mov)
- [Forests video](http://www.ggobi.org/book/chap-class/Forests.mov)

---

```{r endslide, child="assets/endslide.Rmd"}
```