Week 3: Re-sampling and regularisation
We will cover:
After making that split, we would use these methods on the training sample:
A set of \(n\) observations is randomly split into a training set (blue, containing observations 7, 22, 13, …) and a test set (yellow, containing all observations not in the training set).
(Chapter5/5.1.pdf)
With tidymodels, the function initial_split() creates the indexes of observations to be allocated to the training or test samples. To generate these samples, use the training() and testing() functions.
[1] "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B"
[1] "A" "A" "B" "A" "B" "A" "B" "A"
[1] "A" "B" "B" "B"
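A split like this can be sketched as follows. The data frame d, its columns id and cl, the seed, and prop = 0.7 are all assumptions made for illustration; the rows drawn depend on the seed, so the class counts in each set will vary.

```r
library(rsample)  # part of tidymodels

# Hypothetical toy data: 6 observations of class A and 6 of class B
d <- data.frame(id = 1:12, cl = rep(c("A", "B"), each = 6))

set.seed(1156)  # arbitrary seed, only to make the split reproducible
d_split <- initial_split(d, prop = 0.7)  # records which rows go where

d_train <- training(d_split)  # rows allocated to the training sample
d_test  <- testing(d_split)   # all remaining rows

d$cl        # full sample
d_train$cl  # training sample
d_test$cl   # test sample
```

Because the split is purely random here, nothing forces the two classes to appear in the same proportions in both sets.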
How do you ensure that 0.70 of each class is allocated to the training set?
Stratify the sampling
[1] "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B"
[1] "A" "A" "A" "A" "B" "B" "B" "B"
[1] "A" "A" "B" "B"
Now the test set has 2 A’s and 2 B’s. This is best practice!
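A minimal sketch of a stratified split, assuming the same hypothetical toy data (data frame d with class column cl) and prop = 0.7. The strata argument of initial_split() asks for the sampling to be done within each class.

```r
library(rsample)

# Hypothetical toy data: 6 observations of class A and 6 of class B
d <- data.frame(id = 1:12, cl = rep(c("A", "B"), each = 6))

set.seed(1156)  # arbitrary seed
# Sample separately within each level of cl
d_split <- initial_split(d, prop = 0.7, strata = cl)

d_train <- training(d_split)
d_test  <- testing(d_split)

table(d_train$cl)  # class counts in the training set
table(d_test$cl)   # class counts in the test set
```

The seed only changes which observations are chosen within each class, not how many.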
Not stratifying can cause major problems with unbalanced samples.
[1] "A" "A" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
[1] "B" "B" "A" "B" "B" "A" "B" "B"
[1] "B" "B" "B" "B"
The test set is missing one entire class!
Always stratify the split by sub-groups, especially the classes of the response variable, and possibly by other variables too.
[1] "A" "B" "B" "B" "B" "B" "B" "B"
[1] "A" "B" "B" "B"
Now there is an A in the test set!
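The unbalanced case can be sketched the same way. The data frame d_unb (2 A's and 10 B's), the seed, and prop = 0.7 are assumptions for illustration.

```r
library(rsample)

# Hypothetical unbalanced data: 2 observations of class A, 10 of class B
d_unb <- data.frame(id = 1:12, cl = c(rep("A", 2), rep("B", 10)))

set.seed(1156)  # arbitrary seed
# Stratifying on cl samples within each class, so the rare class
# is split between the two sets rather than left to chance
d_unb_split <- initial_split(d_unb, prop = 0.7, strata = cl)

d_unb_train <- training(d_unb_split)
d_unb_test  <- testing(d_unb_split)

d_unb_train$cl
d_unb_test$cl
```

Dropping strata = cl from the call gives the purely random split, where the rare class can end up entirely in one set.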
Check the class proportions of the response by computing counts and proportions in each class, and tabulating or plotting the result.
It’s good if there are similar numbers of each class in both sets.
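One way to tabulate the class proportions, sketched with dplyr on hypothetical toy data. The data frame d, the label column set, and the result name class_tab are all illustrative.

```r
library(rsample)
library(dplyr)

# Hypothetical toy data: 6 observations of class A and 6 of class B
d <- data.frame(id = 1:12, cl = rep(c("A", "B"), each = 6))

set.seed(1156)  # arbitrary seed
d_split <- initial_split(d, prop = 0.7, strata = cl)

# Label each row with the set it belongs to, then count classes per set
class_tab <- bind_rows(
  mutate(training(d_split), set = "train"),
  mutate(testing(d_split), set = "test")
) |>
  count(set, cl) |>
  group_by(set) |>
  mutate(prop = n / sum(n)) |>  # proportion of each class within a set
  ungroup()

class_tab
```

The prop column can be compared across the two values of set, or passed to a bar chart.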
Make a training/test indicator variable and plot the predictors by set. The training and test sets need to have similar distributions.
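A rough sketch of this check, using simulated data. The predictors x1 and x2, the data frame d_sim, and the indicator column set are hypothetical, and a density plot is only one of several reasonable displays.

```r
library(rsample)
library(dplyr)
library(ggplot2)

# Simulated data with two hypothetical predictors and a class response
set.seed(1156)  # arbitrary seed
d_sim <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  cl = rep(c("A", "B"), each = 50)
)

d_sim_split <- initial_split(d_sim, prop = 0.7, strata = cl)

# Stack the two sets with an indicator of which set each row is in
d_all <- bind_rows(
  mutate(training(d_sim_split), set = "train"),
  mutate(testing(d_sim_split), set = "test")
)

# Overlaid densities of one predictor; repeat for the others
p <- ggplot(d_all, aes(x = x1, fill = set)) +
  geom_density(alpha = 0.5)
p
```

If the two densities differ markedly, the test set is not representative of the training set on that predictor.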
On the response, the training and test sets have similar proportions of each class, so the split looks good, BUT it’s not