Week 2: Visualising your data and models
In this week we will cover:
We plot the model on the data to assess whether it fits or is a misfit!
Doing this in high dimensions is considered difficult!
So it is common to only plot the data-in-the-model-space.
Predictive probabilities are an aspect of the model, and are useful to plot. What do we learn here?
But it doesn’t tell you why there is a difference.
The model is displayed as a grid of predicted points in the original variable space. The data is overlaid using text labels. What do you learn?
One model has a linear boundary, and the other has a highly non-linear boundary, which matches the class clusters better. Also …
Start simply! Make static plots that organise the variables on a page.
Plot all the pairs of variables. When laid out in a matrix format this is called a scatterplot matrix.
Here, we see linear association, clumping and clustering, potentially some outliers.
There is an outlier in the data on the right, like the one on the left, but it is hidden in a combination of variables. It’s not visible in any pair of variables.
The aspect ratio for scatterplots needs to be equal, that is, square!
When you make a scatterplot of two variables from a multivariate data set, most software renders it with an unequal aspect ratio, as a rectangle. You need to override this and force a square aspect ratio. Why?
Because an unequal aspect ratio adversely affects the perception of correlation and association between variables.
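A minimal sketch of forcing a square aspect ratio in ggplot2, using simulated data (the variable names are placeholders, not from the course data):

library(ggplot2)

# Simulated correlated data, for illustration only
set.seed(1)
d <- data.frame(x1 = rnorm(100))
d$x2 <- d$x1 + rnorm(100, sd = 0.5)

# Force a 1:1 aspect ratio so the perception of correlation
# is not distorted by the shape of the plotting region
ggplot(d, aes(x = x1, y = x2)) +
  geom_point() +
  theme(aspect.ratio = 1)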
Parallel coordinate plots are side-by-side dotplots with values from a row connected with a line.
Examine the direction and orientation of lines to perceive multivariate relationships.
Crossing lines indicate negative association. Lines with same slope indicate positive association. Outliers have a different up/down pattern to other points. Groups of lines with same pattern indicate clustering.
But the advantage is that you can pack a lot of variables into a single page.
library(plotly)

# Interactive parallel coordinate plot of the four measurements
# (fl, bl, bm, bd: flipper length, bill length, body mass, bill depth)
p_pcp <- p_tidy |>
  na.omit() |>
  plot_ly(type = 'parcoords',
          line = list(),
          dimensions = list(
            list(range = c(172, 231),
                 label = 'fl', values = ~fl),
            list(range = c(32, 60),
                 label = 'bl', values = ~bl),
            list(range = c(2700, 6300),
                 label = 'bm', values = ~bm),
            list(range = c(13, 22),
                 label = 'bd', values = ~bd)
          )
  )
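To display the interactive parallel coordinate plot (in the RStudio Viewer, for example), print the object:

p_pcp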
Increasing dimension adds an additional orthogonal axis.
If you want more high-dimensional shapes there is an R package, geozoo, which will generate cubes, spheres, simplices, Möbius strips, tori, the Boy surface, Klein bottles, cones, various polytopes, …
And read or watch Flatland: A Romance of Many Dimensions (1884) by Edwin Abbott.
Data
\[\begin{eqnarray*} X_{~n\times p} = [X_{~1}~X_{~2}~\dots~X_{~p}]_{~n\times p} = \left[ \begin{array}{cccc} x_{~11} & x_{~12} & \dots & x_{~1p} \\ x_{~21} & x_{~22} & \dots & x_{~2p}\\ \vdots & \vdots & & \vdots \\ x_{~n1} & x_{~n2} & \dots & x_{~np} \end{array} \right]_{~n\times p} \end{eqnarray*}\]
Projection
\[\begin{eqnarray*} A_{~p\times d} = \left[ \begin{array}{cccc} a_{~11} & a_{~12} & \dots & a_{~1d} \\ a_{~21} & a_{~22} & \dots & a_{~2d}\\ \vdots & \vdots & & \vdots \\ a_{~p1} & a_{~p2} & \dots & a_{~pd} \end{array} \right]_{~p\times d} \end{eqnarray*}\]
Projected data
\[\begin{eqnarray*} Y_{~n\times d} = XA = \left[ \begin{array}{cccc} y_{~11} & y_{~12} & \dots & y_{~1d} \\ y_{~21} & y_{~22} & \dots & y_{~2d}\\ \vdots & \vdots & & \vdots \\ y_{~n1} & y_{~n2} & \dots & y_{~nd} \end{array} \right]_{~n\times d} \end{eqnarray*}\]
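A minimal sketch of this matrix computation in R, with simulated data and an illustrative orthonormal projection matrix (all names here are assumptions for the example):

# Simulated data: n = 100 observations, p = 4 variables
set.seed(1)
X <- matrix(rnorm(100 * 4), ncol = 4)

# A 4 x 2 projection matrix A with orthonormal columns
A <- cbind(c(1, 1, 0, 0) / sqrt(2),
           c(0, 0, 1, 1) / sqrt(2))

# Projected data Y = XA is n x d (here 100 x 2)
Y <- X %*% A
dim(Y)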
Data is 2D: \(~~p=2\)
Projection is 1D: \(~~d=1\)\[\begin{eqnarray*} A_{~2\times 1} = \left[ \begin{array}{c} a_{~11} \\ a_{~21}\\ \end{array} \right]_{~2\times 1} \end{eqnarray*}\]
Notice that the values of \(A\) change between \(-1\) and \(1\), with all possible values being shown during the tour.
\[\begin{eqnarray*} A = \left[ \begin{array}{c} 1 \\ 0\\ \end{array} \right] ~~~~~~~~~~~~~~~~ A = \left[ \begin{array}{c} 0.7 \\ 0.7\\ \end{array} \right] ~~~~~~~~~~~~~~~~ A = \left[ \begin{array}{c} 0.7 \\ -0.7\\ \end{array} \right] \end{eqnarray*}\]
Watching the 1D shadows, we can see:
What does the 2D data look like? Can you sketch it?
The 2D data
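A minimal sketch of computing 1D shadows like those above (the 0.7 values correspond approximately to \(1/\sqrt{2}\), so each projection vector has unit length); the 2D data here is simulated for illustration:

# Simulated 2D data
set.seed(2)
X <- cbind(x1 = rnorm(200), x2 = rnorm(200))

# Three 1D projection vectors, each normalised to unit length
A1 <- matrix(c(1, 0), ncol = 1)
A2 <- matrix(c(0.7, 0.7), ncol = 1) / sqrt(0.7^2 + 0.7^2)
A3 <- matrix(c(0.7, -0.7), ncol = 1) / sqrt(0.7^2 + 0.7^2)

# Each projection is an n x 1 "shadow" of the 2D data
y1 <- X %*% A1
y2 <- X %*% A2
y3 <- X %*% A3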
Data is 3D: \(p=3\)
Projection is 2D: \(d=2\)
\[\begin{eqnarray*} A_{~3\times 2} = \left[ \begin{array}{cc} a_{~11} & a_{~12} \\ a_{~21} & a_{~22}\\ a_{~31} & a_{~32}\\ \end{array} \right]_{~3\times 2} \end{eqnarray*}\]
Notice that the values of \(A\) change between \(-1\) and \(1\), with all possible values being shown during the tour.
See:
Data is 4D: \(p=4\)
Projection is 2D: \(d=2\)
\[\begin{eqnarray*} A_{~4\times 2} = \left[ \begin{array}{cc} a_{~11} & a_{~12} \\ a_{~21} & a_{~22}\\ a_{~31} & a_{~32}\\ a_{~41} & a_{~42}\\ \end{array} \right]_{~4\times 2} \end{eqnarray*}\]
How many clusters do you see?
Avoid misinterpretation …
… see the bigger picture!
Image: Sketchplanations.
This is a basic tour, which will run in your RStudio plot window.
This data has a class variable, species.
Use this to colour the points. You can also specifically guide the tour’s choice of projections using a projection pursuit index (a hedged sketch of these options is given below).
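A hedged sketch using the tourr package, assuming p_tidy holds the four measurement columns (fl, bl, bm, bd) and a species factor, as in the parallel coordinate plot code above:

library(tourr)

# Drop missing values before touring
p_sub <- na.omit(p_tidy[, c("fl", "bl", "bm", "bd", "species")])

# Basic grand tour, shown in the RStudio plot window
animate_xy(p_sub[, 1:4])

# Colour the points by the class variable, species
animate_xy(p_sub[, 1:4], col = p_sub$species)

# Guide the choice of projections with a projection pursuit
# index, e.g. the holes index
animate_xy(p_sub[, 1:4],
           tour_path = guided_tour(holes()),
           col = p_sub$species)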
For this 2D data, sketch a line or a direction that if you squashed the data into it would provide most of the information.
What about this data?
Principal component analysis (PCA) produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated. It is an unsupervised learning method.
Use it when:
The first principal component is a new variable created from a linear combination
\[z_1 = \phi_{11}x_1 + \phi_{21} x_2 + \dots + \phi_{p1} x_p\]
of the original \(x_1, x_2, \dots, x_p\) that has the largest variance. The elements \(\phi_{11},\dots,\phi_{p1}\) are the loadings of the first principal component and are constrained by:
\[ \displaystyle\sum_{j=1}^p \phi^2_{j1} = 1 \]
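A small sketch verifying this constraint on the loadings returned by prcomp (mtcars is used purely for illustration):

# PCA on standardised variables
pca <- prcomp(mtcars, scale. = TRUE)

# Loadings of the first principal component
phi1 <- pca$rotation[, 1]

# The squared loadings sum to 1 (up to floating point error)
sum(phi1^2)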
If you think of the first few PCs as a linear model fit, and the remaining ones as the error, it is like regression, except that the errors are orthogonal to the model.
(Image: Figure 6.15, Chapter 6)
PCA can be thought of as fitting a \(p\)-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. The new variables produced by principal components correspond to rotating and scaling the ellipsoid into a sphere: this spheres the data.
Suppose we have an \(n\times p\) data set \(X = [x_{ij}]\).
\[ \mathop{\text{maximize}}_{\phi_{11},\dots,\phi_{p1}} \frac{1}{n}\sum_{i=1}^n \left(\sum_{j=1}^p \phi_{j1}x_{ij}\right)^{\!\!\!2} \text{ subject to } \sum_{j=1}^p \phi^2_{j1} = 1 \]
Repeat the optimisation to estimate \(\phi_{jk}\) for \(k = 2, 3, \dots\), with the additional constraint that \(\sum_{j=1}^p \phi_{jk}\phi_{jk'} = 0\) for all \(k' < k\) (each new direction is orthogonal to the previous ones).
The principal components can also be obtained from the singular value decomposition of the (centred) data matrix,
\[X = U\Lambda V^T\]
where the columns of \(V\) are the principal component loadings. It is always possible to decompose a matrix in this way.
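A minimal sketch comparing prcomp with an explicit SVD, on an illustrative data set (mtcars):

# Centre and scale the data
X <- scale(as.matrix(mtcars))

# PCA via prcomp
pca <- prcomp(X)

# The same decomposition via the SVD, X = U Lambda V^T
sv <- svd(X)

# The columns of V match the PCA loadings (up to sign)
cbind(svd = sv$v[, 1], prcomp = pca$rotation[, 1])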
Remember, PCA is trying to summarise the variance in the data.
Total variance (TV) in data (assuming variables centered at 0):
\[ \text{TV} = \sum_{j=1}^p \text{Var}(x_j) = \sum_{j=1}^p \frac{1}{n}\sum_{i=1}^n x_{ij}^2 \]
If variables are standardised, TV=number of variables.
Variance explained by the \(m\)th PC: \(V_m = \text{Var}(z_m) = \frac{1}{n}\sum_{i=1}^n z_{im}^2\)
\[ \text{TV} = \sum_{m=1}^M V_m \text{ where }M=\min(n-1,p). \]
PCA is a useful dimension reduction technique for large datasets, but deciding how many dimensions to keep often isn’t clear.
How do we know how many principal components to choose?
Proportion of variance explained:
\[\text{PVE}_m = \frac{V_m}{TV}\]
Choosing the number of PCs that adequately summarises the variation in \(X\) is achieved by examining the cumulative proportion of variance explained.
Cumulative proportion of variance explained:
\[\text{CPVE}_k = \sum_{m=1}^k\frac{V_m}{TV}\]
Scree plot: a plot of the variance explained by each component versus the component number.
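A minimal sketch of computing these quantities and a scree plot from a prcomp fit (mtcars for illustration only):

pca <- prcomp(mtcars, scale. = TRUE)

# Variance explained by each PC
v <- pca$sdev^2

# Proportion and cumulative proportion of variance explained
pve  <- v / sum(v)
cpve <- cumsum(pve)

# Scree plot: variance explained versus component number
plot(v, type = "b", xlab = "Principal component",
     ylab = "Variance explained")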
The data on national track records for women (as at 1984).
Rows: 55
Columns: 8
$ m100 <dbl> 12, 11, 11, 11, 11, 11, 12, 11, 12, 12, 1…
$ m200 <dbl> 23, 22, 23, 23, 23, 23, 24, 22, 25, 24, 2…
$ m400 <dbl> 54, 51, 51, 52, 53, 53, 55, 50, 55, 55, 5…
$ m800 <dbl> 2.1, 2.0, 2.0, 2.0, 2.2, 2.1, 2.2, 2.0, 2…
$ m1500 <dbl> 4.4, 4.1, 4.2, 4.1, 4.6, 4.5, 4.5, 4.1, 4…
$ m3000 <dbl> 9.8, 9.1, 9.3, 8.9, 9.8, 9.8, 9.5, 8.8, 9…
$ marathon <dbl> 179, 152, 159, 158, 170, 169, 191, 149, 1…
$ country <chr> "argentin", "australi", "austria", "belgi…
Source: Johnson and Wichern, Applied Multivariate Statistical Analysis.
What do you learn?
Standard deviations (1, .., p=7):
[1] 2.41 0.81 0.55 0.35 0.23 0.20 0.15
Rotation (n x k) = (7 x 7):
PC1 PC2 PC3 PC4 PC5 PC6 PC7
m100 0.37 0.49 -0.286 0.319 0.231 0.6198 0.052
m200 0.37 0.54 -0.230 -0.083 0.041 -0.7108 -0.109
m400 0.38 0.25 0.515 -0.347 -0.572 0.1909 0.208
m800 0.38 -0.16 0.585 -0.042 0.620 -0.0191 -0.315
m1500 0.39 -0.36 0.013 0.430 0.030 -0.2312 0.693
m3000 0.39 -0.35 -0.153 0.363 -0.463 0.0093 -0.598
marathon 0.37 -0.37 -0.484 -0.672 0.131 0.1423 0.070
Summary of the principal components:
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 |
|---|---|---|---|---|---|---|---|
| Variance | 5.81 | 0.65 | 0.30 | 0.13 | 0.05 | 0.04 | 0.02 |
| Proportion | 0.83 | 0.09 | 0.04 | 0.02 | 0.01 | 0.01 | 0.00 |
| Cum. prop | 0.83 | 0.92 | 0.97 | 0.98 | 0.99 | 1.00 | 1.00 |
The increase in variance explained is large until \(k=3\) PCs, and then tapers off. A choice of 3 PCs would explain 97% of the total variance.
Scree plot: Where is the elbow?
At \(k=2\); thus the scree plot suggests that 2 PCs would be sufficient to explain the variability.
Visualise model using a biplot: Plot the principal component scores, and also the contribution of the original variables to the principal component.
A biplot is like a single projection from a tour.
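A minimal sketch using base R’s biplot on the fitted PCA object for the track data (track_std_pca, created in the code shown further below):

# Scores of the 55 countries on PC1 and PC2, with arrows showing
# the contribution of each original event variable
biplot(track_std_pca, cex = 0.7)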
Three countries stand out as outliers: wsamoa, cookis and dpkorea. PCA, because it is computed using the variance in the data, can be affected by outliers. It may be better to remove these countries and re-run the PCA.

library(dplyr)

# Standardise each numeric variable to mean 0 and standard deviation 1
track_std <- track |>
  mutate_if(is.numeric, function(x) (x -
    mean(x, na.rm = TRUE)) /
    sd(x, na.rm = TRUE))

# PCA on the standardised times (already scaled, so scale = FALSE)
track_std_pca <- prcomp(track_std[, 1:7],
                        scale = FALSE,
                        retx = TRUE)
library(mulgar)   # for pca_model()
library(tourr)    # for animate_xy() and render_gif()

# Represent the fitted 2D PCA model as a grid of points and edges
# back in the original 7-dimensional variable space
track_model <- pca_model(track_std_pca, d = 2, s = 2)

# Combine the model points with the standardised data, so both
# are shown together in the tour
track_all <- rbind(track_model$points, track_std[, 1:7])

# Interactive tour in the R plot window
animate_xy(track_all,
           edges = track_model$edges,
           edges.col = "#E7950F",
           edges.width = 3,
           axes = "off")

# Save the same tour as an animated gif
render_gif(track_all,
           grand_tour(),
           display_xy(edges = track_model$edges,
                      edges.col = "#E7950F",
                      edges.width = 3,
                      axes = "off"),
           gif_file = "gifs/track_model.gif",
           frames = 500,
           width = 400,
           height = 400,
           loop = FALSE)
Mostly captures the variance in the data. Seems to slightly miss the non-linear relationship.
🤭
Sometimes the lowest PCs show the interesting patterns, like non-linear relationships, or clusters.
Find a low-dimensional layout of points that approximates the distances between points in high dimensions, with the purpose of providing a useful representation that reveals high-dimensional patterns, like clusters.
Multidimensional scaling (MDS) is the original approach:
\[ \mbox{Stress}_D(x_1, ..., x_n) = \left(\sum_{i, j=1; i\neq j}^n (d_{ij} - d_k(i,j))^2\right)^{1/2} \] where \(D\) is an \(n\times n\) matrix of distances \((d_{ij})\) between all pairs of points, and \(d_k(i,j)\) is the distance between the points in the low-dimensional space.
PCA is a special case of MDS. The result from PCA is a linear projection, but generally MDS can provide some non-linear transformation.
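A minimal sketch of classical MDS using base R’s cmdscale, applied to Euclidean distances on an illustrative data set (mtcars, used here only as an example):

# Classical (metric) MDS: a 2D layout that approximates
# the pairwise Euclidean distances
D <- dist(scale(mtcars))
mds <- cmdscale(D, k = 2)
plot(mds, xlab = "MDS 1", ylab = "MDS 2")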
Many variations have been developed, such as t-SNE and UMAP.
NLDR can be useful but it can also make some misleading representations.
Figures: UMAP 2D representation; tour animation.
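A hedged sketch of one popular NLDR method, UMAP, via the uwot package; the data and parameter values here are illustrative assumptions, not the settings used for the figures above:

library(uwot)

# UMAP 2D embedding; the layout depends strongly on the choice
# of n_neighbors and min_dist, and on the random seed
set.seed(3)
emb <- umap(scale(mtcars), n_neighbors = 15, min_dist = 0.1)
plot(emb, xlab = "UMAP 1", ylab = "UMAP 2")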
ETC3250/5250 Lecture 2 | iml.numbat.space