Professor Di Cook

Econometrics and Business Statistics

Monash University" date: "Week 2 (a)" output: xaringan::moon_reader: css: ["kunoichi", "ninjutsu", "mystyle.css"] lib_dir: libs nature: ratio: '16:9' highlightStyle: github highlightLines: true countIncrementalSlides: false --- ```{r setup, include=FALSE} library(knitr) knitr::opts_chunk$set(tidy = FALSE, message = FALSE, warning = FALSE, echo = FALSE, fig.retina = 4) options(htmltools.dir.version = FALSE) ``` ## Multiple Regression

.tip[ $$Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \cdots + \beta_pX_{p,i} + e_i$$] - Each $X_{j,i}$ is .orange[numerical] and is called a .orange[predictor]. - The coefficients $\beta_1,\dots,\beta_p$ measure the .orange[effect] of each predictor after taking account of the effect of all other predictors in the model. - Predictors may be .orange[transforms] of other predictors. e.g., $X_2=X_1^2$. --- ## Multiple Regression - The model describes a .orange[**line**, plane or hyperplane] in the predictor space.

.orange[Ordinary least squares] is the simplest way to fit the model. Geometrically, this is the sum of the squared distances, parallel to the axis of the dependent variable, between each observed data point and the corresponding point on the regression surface – the .orange[smaller the sum] of differences, the .orange[better] the model fits the data. --- ## Model Diagnostics

$R^2$ is the proportion of variation explained by the model, and measures the goodness of the fit, close to 1 the model explains most of the variability in $Y$, close to 0 it explains very little. $$R^2 = 1 - \frac{RSS}{TSS}$$ where $RSS = \sum_{i=1}^n (y_i-\hat{y})^2$ (read: Residual Sum of Squares), and $TSS=\sum_{i=1}^n (y_i-\bar{y})^2$ (read: Total Sum of Squares). --- ## Model Diagnostics

.orange[Residual Standard Error (RSE)] is an estimate of the standard deviation of $\varepsilon$. This is meaningful with the assumption that $\varepsilon \sim N(0, \sigma^2)$. $$RSE = \sqrt{\frac{1}{n-p-1}RSS}$$ --- ## Model Diagnostics

F statistic tests whether any predictor explains response, by testing $H_0: \beta_1=\beta_2=...=\beta_p=0$ vs $H_A:$ at least one is not 0 --- ## Think about ...

- Is at least one of the predictors useful in predicting the response? - Do all the predictors help to explain $Y$, or is only a subset of the predictors useful? - How well does the model fit the data? - Given a set of predictor values, what response value should we predict and how accurate is our prediction? --- ## Example - Wages Wage and other data for a group of 3000 male workers in the Mid-Atlantic region. .orange[Goal:] .green[Predict wage based on worker characteristics.] ```{r} library(tidyverse) library(ISLR) glimpse(Wage) ``` --- ## Take a look What do the following pairwise comparisons of the variables `year`, `age` and, `education` against `wage` show us? ```{r fig.height=5, fig.width=12, out.width="100%"} library(GGally) ggduo(Wage, columnsX=c(1,2,5), columnsY=11) ``` --- ## Take a look If we examine `logwage` instead of `wage` as response variable - what changes? ```{r fig.height=5, fig.width=12, out.width="100%"} ggduo(Wage, columnsX=c(1,2,5), columnsY=10) ``` --- class: split-two .column[.pad50px[ ## Take a look

- Examine the predictors. Ideally values are spread out without any associations. ]] .column[.content.vmiddle.center[ ```{r fig.height=8, fig.width=8, out.width="90%"} ggpairs(Wage, columns=c(1,2,5)) ``` ]] --- ## Model for wage data

.tip[ $Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + e$ ] where $Y=$ `logwage`, $X_1=$ `year`, $X_2=$ `age`, and $X_3=$ `education`. --- ## Fitting the model in R ```{r results='hide'} fit <- lm(logwage~year+age+education, data=Wage) summary(fit) ``` ``` lm(formula = logwage ~ year + age + education, data = Wage) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.745e+01 5.469e+00 -3.191 0.00143 year 1.078e-02 2.727e-03 3.952 7.93e-05 age 5.509e-03 4.813e-04 11.447 < 2e-16 education2. HS Grad 1.202e-01 2.086e-02 5.762 9.18e-09 education3. Some College 2.440e-01 2.195e-02 11.115 < 2e-16 education4. College Grad 3.680e-01 2.178e-02 16.894 < 2e-16 education5. Advanced Degree 5.411e-01 2.362e-02 22.909 < 2e-16 Residual standard error: 0.3023 on 2993 degrees of freedom Multiple R-squared: 0.2631, Adjusted R-squared: 0.2616 F-statistic: 178.1 on 6 and 2993 DF, p-value: < 2.2e-16 ``` --- ## Interpretation - The ideal scenario is when the predictors are uncorrelated. - Each coefficient can be interpreted and tested separately. - Correlations amongst predictors cause problems. - The variance of all coefficients tends to increase, sometimes dramatically. - Interpretations become hazardous -- when $X_j$ changes, everything else changes. - Predictions still work provided new $X$ values are within the range of training $X$ values. - Claims of causality should be avoided for observational data. --- ## Interactions - An interaction occurs when the one variable changes the effect of a second variable. (e.g., spending on radio advertising increases the effectiveness of TV advertising). - To model an interaction, include the product $X_1X_2$ in the model in addition to $X_1$ and $X_2$. - **Hierarchy principle**: If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant. (This is because the interactions are almost impossible to interpret without the main effects.) --- ## Interactions .font_tiny[(Chapter3/3.7.pdf)] --- ## Residuals

- If a plot of the residuals vs any predictor in the model shows a pattern, then the .orange[relationship is nonlinear.] - If a plot of the residuals vs any predictor **not** in the model shows a pattern, then .orange[the predictor should be added to the model.] - If a plot of the residuals vs fitted values shows a pattern, then there is .orange[heteroscedasticity in the errors]. (Could try a transformation.) --- ## Residuals

### Created using [R Markdown](https://rmarkdown.rstudio.com) with flair by [**xaringan**](https://github.com/yihui/xaringan), and [**kunoichi** (female ninja) style](https://github.com/emitanaka/ninja-theme).

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.