Department of Econometrics and Business Statistics
Welcome! Meet the teaching team
Chief examiner: Professor Dianne Cook
Communication: All questions should be posted on the Discussion forum. Matters of a private nature can be addressed to etc3250.clayton-x@monash.edu or through a private message on the edstem forum. Emails should never be sent directly to tutors or the instructor.
Tutors:
Patrick: 3rd year PhD student working on computer vision for reading residual plots
Harriet: 2nd year PhD student working on visualisation of uncertainty
Jayani: 2nd year PhD student working on methods for understanding how non-linear dimension reduction warps your data
Krisanat: MBAt graduate, aspiring to be a PhD student in 2025
What this course is about
select and develop appropriate models for clustering, prediction or classification.
estimate and simulate from a variety of statistical models.
measure the uncertainty of a prediction or classification using resampling methods.
apply business analytic tools to produce innovative solutions in finance, marketing, economics and related areas.
manage very large data sets in a modern software environment.
explain and interpret the analyses undertaken clearly and effectively.
complete the weekly learning quiz to check your understanding
read the relevant sections of the resource material
run the code from lectures in the qmd files
Begin assessments early; when they are posted, map out a plan to complete them on time
Ask questions
Machine learning is a big, big area. This semester is like the tip of the iceberg, there’s a lot more, and interesting methods and problems, than what we can cover. Take this as a challenge to get you started, and become hungry to learn more!
Types of problems
Framing the problem
Supervised classification: categorical \(y_i\) is available for all \(x_i\)
Unsupervised learning: \(y_i\) unavailable for all \(x_i\)
What type of problem is this? (1/3)
Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at least in restaurants in the United States) are a major component of pay.
In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990. The restaurant, located in a suburban shopping mall, was part of a national chain and served a varied menu. In observance of local law the restaurant offered seating in a non-smoking section to patrons who requested it. Each record includes a day and time, and taken together, they show the server’s work schedule.
What is \(y\)? What is \(x\)?
What type of problem is this? (2/3)
Every person monitored their email for a week and recorded information about each email message; for example, whether it was spam, and what day of the week and time of day the email arrived. We want to use this information to build a spam filter, a classifier that will catch spam with high probability but will never classify good email as spam.
What is \(y\)? What is \(x\)?
What type of problem is this? (3/3)
A health insurance company collected the following information about households:
Total number of doctor visits per year
Total household size
Total number of hospital visits per year
Average age of household members
Total number of gym memberships
Use of physiotherapy and chiropractic services
Total number of optometrist visits
The health insurance company wants to provide a small range of products, containing different bundles of services and for different levels of cover, to market to customers.
What is \(y\)? What is \(x\)?
Math and computing
Data: math
\(n\) number of observations or sample points
\(p\) number of variables or the dimension of the data
\(d (\leq p)\) is used to denote the number of variables in a lower dimensional space, usually by taking a projection.
\(A\) is a \(p\times d\) orthonormal basis, \(A^\top A=I_d\) ( \(A'A=I_d\) ).
The projection of \({\mathbf x_i}\) onto \(A\) is \(A^\top{\mathbf x}_i\).
proj
[,1] [,2]
[1,] 0.71 0
[2,] 0.71 0
[3,] 0.00 1
sum(proj[,1]^2)
[1] 1
sum(proj[,2]^2)
[1] 1
sum(proj[,1]*proj[,2])
[1] 0
proj would be considered an orthonormal projection matrix.
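These checks can be reproduced directly; a minimal sketch in base R, constructing the proj matrix above and projecting a made-up point:

```r
# Orthonormal basis A (p = 3, d = 2), matching proj above
proj <- matrix(c(1/sqrt(2), 1/sqrt(2), 0,
                 0,         0,         1),
               nrow = 3, ncol = 2)

# A^T A = I_d: columns have length 1 and are orthogonal
round(t(proj) %*% proj, 2)

# Project a point x in R^3 into the 2-d space: A^T x
x <- c(1, 2, 3)
t(proj) %*% x
```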
Conceptual framework
Accuracy vs interpretability
Predictive accuracy
The primary purpose is to be able to predict \(\widehat{Y}\) for new data. And we’d like to do that well! That is, accurately.
Interpretability
Almost equally important is that we want to understand the relationship between \({\mathbf X}\) and \(Y\). The simpler model that is (almost) as accurate is the one we choose, always.
Training vs test splits
When data are reused for multiple tasks, instead of carefully spent from the finite data budget, certain risks increase, such as the risk of accentuating bias or compounding effects from methodological errors. — Julia Silge
Training set: Used to fit the model, might be also broken into a validation set for frequent assessment of fit.
Test set: Used purely to assess a final model’s performance on future data.
Compute \(\widehat{y}\) from the training data, \(\{(y_i, {\mathbf x}_i)\}_{i = 1}^n\), and the error rate (fraction of misclassifications) on these same observations to get the Training Error Rate.
A better estimate of future accuracy is obtained using test data to get the Test Error Rate.
Training error will usually be smaller than test error. When it is much smaller, it indicates that the model is too well fitted to the training data to be accurate on future data (over-fitted).
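A minimal sketch of this split-and-compare workflow, using simulated two-class data with rsample for the split (the data and model here are made up for illustration):

```r
library(rsample)
set.seed(1130)

# Simulated two-class data
df <- data.frame(x = rnorm(200))
df$y <- factor(ifelse(df$x + rnorm(200, sd = 0.5) > 0, "A", "B"))

# Spend the data budget once: 2/3 training, 1/3 test
split <- initial_split(df, prop = 2/3)
tr <- training(split)
ts <- testing(split)

fit <- glm(y ~ x, data = tr, family = binomial())
pred_class <- function(d)
  factor(ifelse(predict(fit, d, type = "response") > 0.5, "B", "A"),
         levels = levels(df$y))

mean(pred_class(tr) != tr$y)  # training error rate
mean(pred_class(ts) != ts$y)  # test error rate, usually larger
```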
Confusion (misclassification) matrix
              predicted
              1    0
true     1    a    b
         0    c    d
Consider 1=positive (P), 0=negative (N).
True positive (TP): a
True negative (TN): d
False positive (FP): c (Type I error)
False negative (FN): b (Type II error)
Sensitivity, recall, hit rate, or true positive rate (TPR): TP/P = a/(a+b)
Specificity, selectivity or true negative rate (TNR): TN/N = d/(c+d)
Prevalence: P/(P+N) = (a+b)/(a+b+c+d)
Accuracy: (TP+TN)/(P+N) = (a+d)/(a+b+c+d)
Balanced accuracy: (TPR + TNR)/2 (the average of the per-class accuracies)
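All of these metrics can be computed directly from the four cell counts; a sketch using the a, b, c, d labels above (counts are hypothetical):

```r
a <- 9; b <- 3; c <- 5; d <- 10   # TP, FN, FP, TN
TPR <- a / (a + b)                # sensitivity / recall
TNR <- d / (c + d)                # specificity
prevalence <- (a + b) / (a + b + c + d)
accuracy   <- (a + d) / (a + b + c + d)
balanced   <- (TPR + TNR) / 2
c(TPR = TPR, TNR = TNR, prevalence = prevalence,
  accuracy = accuracy, balanced = balanced)
```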
Confusion (misclassification) matrix: computing
Two classes
# Write out the confusion matrix in standard form
cm <- a2 |>
  count(y, pred) |>
  group_by(y) |>
  mutate(cl_acc = n[pred == y] / sum(n))
cm |>
  pivot_wider(names_from = pred, values_from = n) |>
  select(y, bilby, quokka, cl_acc)
# A tibble: 2 × 4
# Groups: y [2]
y bilby quokka cl_acc
<fct> <int> <int> <dbl>
1 bilby 9 3 0.75
2 quokka 5 10 0.667
accuracy(a2, y, pred) |>pull(.estimate)
[1] 0.7
bal_accuracy(a2, y, pred) |>pull(.estimate)
[1] 0.71
sens(a2, y, pred) |>pull(.estimate)
[1] 0.75
specificity(a2, y, pred) |>pull(.estimate)
[1] 0.67
More than two classes
# Write out the confusion matrix in standard form
cm3 <- a3 |>
  count(y, pred) |>
  group_by(y) |>
  mutate(cl_acc = n[pred == y] / sum(n))
cm3 |>
  pivot_wider(names_from = pred, values_from = n, values_fill = 0) |>
  select(y, bilby, quokka, numbat, cl_acc)
ROC curve: the balance of getting it right, without predicting everything as positive.
Need predictive probabilities, probability of being each class.
a2 |>slice_head(n=3)
# A tibble: 3 × 4
y pred bilby quokka
<fct> <fct> <dbl> <dbl>
1 bilby bilby 0.9 0.1
2 bilby bilby 0.8 0.2
3 bilby bilby 0.9 0.1
roc_curve(a2, y, bilby) |>autoplot()
Parametric vs non-parametric
Parametric methods
Assume that the model takes a specific form
Fitting then is a matter of estimating the parameters of the model
Generally considered to be less flexible
If assumptions are wrong, performance likely to be poor
Non-parametric methods
No specific assumptions
Allow the data to specify the model form, without being too rough or wiggly
More flexible
Generally needs more observations, and not too many variables
Easier to over-fit
Black line is true boundary.
Grids (right) show boundaries for two different models.
Reducible vs irreducible error
If the model form is incorrect, part of the error (solid circles) arises from the wrong shape, and is thus reducible. Irreducible error means that we have the right model, and the remaining mistakes (solid circles) are random noise.
Flexible vs inflexible
Parametric models tend to be less flexible, but non-parametric models can be more or less flexible depending on parameter settings.
Bias vs variance
Bias is the error introduced by approximating a complicated problem with a simpler model.
For example, linear regression assumes a linear relationship and perhaps the relationship is not exactly linear.
In general, but not always, the more flexible a method is, the less bias it will have because it can fit a complex shape better.
Variance refers to how much your estimate would change if you had different training data. It measures how much your model depends on the particular data you have, to the neglect of future data.
In general, the more flexible a method is, the more variance it has.
The size of the training data can impact the variance.
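One way to see the variance side of the tradeoff: refit a flexible and an inflexible model on many training samples drawn from the same process, and compare how much their predictions move around. A simulation sketch (the true signal, sample sizes and smoothing df are all made up):

```r
set.seed(312)
x_grid <- seq(0.05, 0.95, by = 0.1)

pred_at <- function(df, flexible) {
  if (flexible) {
    fit <- smooth.spline(df$x, df$y, df = 15)  # wiggly fit
    predict(fit, x_grid)$y
  } else {
    fit <- lm(y ~ x, data = df)                # rigid linear fit
    predict(fit, data.frame(x = x_grid))
  }
}

# Refit both models on 50 independent training samples
preds <- replicate(50, {
  x <- runif(100)
  df <- data.frame(x, y = sin(2 * pi * x) + rnorm(100, sd = 0.3))
  rbind(pred_at(df, TRUE), pred_at(df, FALSE))
}, simplify = "array")

# Average variance of predictions across training samples
mean(apply(preds[1, , ], 1, var))  # flexible: larger
mean(apply(preds[2, , ], 1, var))  # inflexible: smaller
```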
Bias
High bias: when you impose too many assumptions with a parametric model, or use an inadequate non-parametric model, such as not letting an algorithm converge fully.
Low bias: when the model closely captures the true shape, with a parametric model or a flexible non-parametric model.
Variance
Low variance: this fit will be virtually identical even if we had a different training sample.
High variance: likely to get a very different model if a different training set is used.
Bias-variance tradeoff
Goal: Without knowing what the true structure is, fit the signal and ignore the noise. Be flexible but not too flexible.
Images 2.16, 2.15 from ISLR
Trade-off between accuracy and interpretability
Diagnosing the fit
Compute and examine the usual diagnostics; some methods have more:
fit statistics: accuracy, sensitivity, specificity
errors/misclassifications
variable importance
plot residuals, examine the misclassifications
check test set is similar to training
Go beyond … Look at the data and the model together!
Creating new variables to get better fits is a special skill! Sometimes automated by the method. All are transformations of the original variables. (See tidymodels steps.)
scaling, centering, sphering (step_pca)
log or square root or box-cox transformation (step_log)
ratio of values (step_ratio)
polynomials or splines: \(x_1^2, x_1^3\) (step_ns)
dummy variables: categorical predictors expanded into multiple new binary variables (step_dummy)
Convolutional Neural Networks: neural networks but with pre-processing of images to combine values of neighbouring pixels; flattening of images
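In tidymodels, transformations like these are declared as recipe steps; a sketch on a built-in dataset (the choice of variables and steps is illustrative only):

```r
library(recipes)

# mtcars with cyl as a categorical predictor
df <- within(mtcars, cyl <- factor(cyl))

rec <- recipe(mpg ~ disp + hp + cyl, data = df) |>
  step_log(disp) |>                          # log transform
  step_ns(hp, deg_free = 3) |>               # natural spline basis
  step_dummy(cyl) |>                         # binary indicator variables
  step_normalize(all_numeric_predictors())   # centre and scale

# Estimate the transformations and apply them to the training data
prep(rec) |> bake(new_data = NULL) |> head()
```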
The big picture
Know your data
Categorical response or no response
Types of predictors: quantitative, categorical
Independent observations
Do you need to handle missing values?
Are there anomalous observations?
Plot your data
What are the shapes (distribution and variance)?
Are there gaps or separations (centres)?
Fit a model or two
Compute fit statistics
Plot the model
Examine parameter estimates
Diagnostics
Which is the better model?
Is there a simpler model?
Are the errors reducible or systematic?
Are you confident that your bias is low and variance is low?