class: middle center hide-slide-number monash-bg-gray80

.info-box.w-50.bg-white[
These slides are best viewed in Chrome or Firefox and occasionally need to be refreshed if elements do not load properly. See <a href="lecture-10b.pdf">here for the PDF <i class="fas fa-file-pdf"></i></a>.
]

<br>

.white[Press the **right arrow** to progress to the next slide!]

---
class: title-slide
count: false
background-image: url("images/bg-02.png")

# .monash-blue[ETC3250/5250: Introduction to Machine Learning]

<h1 class="monash-blue" style="font-size: 30pt!important;"></h1>

<br>

<h2 style="font-weight:900!important;">Assessing clustering results</h2>

.bottom_abs.width100[

Lecturer: *Professor Di Cook*

Department of Econometrics and Business Statistics

<i class="fas fa-envelope"></i> ETC3250.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 10b

<br>

]

---
class: transition

# Where cluster algorithms can be tripped up

---
class: split-50
layout: false

.column[.pad10px[

# Inlier-outlier observations

<img src="images/lecture-10b/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />

]]
.column[.pad50px[

<br><br>

.monash-orange2[Nuisance cases]

"Hansel and Gretel data"

Points that lie between the major clusters of data. This affects some linkage methods, e.g. single linkage, which tends to "chain" through the data, grouping everything together.

]]

---

<img src="images/lecture-10b/unnamed-chunk-4-1.png" width="90%" style="display: block; margin: auto;" />

---

<img src="images/lecture-10b/unnamed-chunk-5-1.png" width="90%" style="display: block; margin: auto;" />

---
class: split-50
layout: false

.column[.pad10px[

# Nuisance variables

<img src="images/lecture-10b/unnamed-chunk-6-1.png" width="90%" style="display: block; margin: auto;" />

]]
.column[.pad50px[

<br>
<br>
<br>

Variables that do not contribute to the cluster structure but are still included in the distance calculations. Here, `x2` is a nuisance variable.
]]

---

<img src="images/lecture-10b/unnamed-chunk-7-1.png" width="90%" style="display: block; margin: auto;" />

---

<img src="images/lecture-10b/unnamed-chunk-8-1.png" width="90%" style="display: block; margin: auto;" />

---
class: split-two

.column[.pad50px[

## Example - flea data

6 variables, 74 cases. Three very clear clusters. The data has a mix of nuisance variables and nuisance observations.

<img src="images/lecture-10b/unnamed-chunk-11-1.png" width="50%" style="display: block; margin: auto;" />

Above is a 2D projection of the data found by projection pursuit, using the LDA index with the true class.

]]
.column[.content.vmiddle[

<img src="images/lecture-10b/unnamed-chunk-12-1.png" width="100%" style="display: block; margin: auto;" />

]]

---

<img src="images/lecture-10b/unnamed-chunk-13-1.png" width="90%" style="display: block; margin: auto;" />

---

Cluster solutions plotted in the 2D projection found by projection pursuit, using the LDA index with the true class.

<img src="images/lecture-10b/unnamed-chunk-14-1.png" width="80%" style="display: block; margin: auto;" />

---
class: middle

<center>
.info-box[Note that if clustering is conducted on the 2D projection, where the clusters are well separated, all linkage methods produce the three true clusters. If we could do dimension reduction to remove nuisance variables prior to clustering, everything would be much easier. But this is hard, because the projection above was found using the true class labels, which are exactly what is missing in a clustering problem.]
</center>

---

## Dendrogram in `\(p\)`-space

<br>

The dendrogram can be examined in the high-dimensional data space using the tour. You need to

1. Add points to the data to mark the places where the leaves (and groups of leaves) join. These are the .monash-orange2[nodes] in the dendrogram.
2. Create a data set of .monash-orange2[edges], indicating which points should be connected.
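---

The two steps on the previous slide can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used to build these slides. It assumes that each node is placed at the mean of the observations it contains, and that merges are given as index pairs where an index below `n` is an original observation and index `n + k` is the node created by merge `k`. The function name `dendrogram_in_data_space` and the toy data are hypothetical.

```python
from statistics import fmean


def dendrogram_in_data_space(points, merges):
    """Return (nodes, edges) for drawing a dendrogram in the data space.

    points: list of p-dimensional observations (the leaves).
    merges: list of (a, b) index pairs, in merge order. Indices below
            len(points) are leaves; index len(points) + k is the node
            created by merge k (a convention assumed for this sketch).
    """
    p = len(points[0])
    nodes = [list(pt) for pt in points]            # leaves come first
    members = [[i] for i in range(len(points))]    # leaves under each node
    edges = []
    for a, b in merges:
        new_id = len(nodes)
        group = members[a] + members[b]
        # Step 1: add a point where the two groups join
        # (assumed here to be the mean of all leaves in the merged group).
        centre = [fmean(points[i][j] for i in group) for j in range(p)]
        nodes.append(centre)
        members.append(group)
        # Step 2: record which points should be connected.
        edges.append((a, new_id))
        edges.append((b, new_id))
    return nodes, edges


# Toy example: three observations in 2D, merged in two steps.
points = [[0.0, 0.0], [2.0, 0.0], [10.0, 4.0]]
merges = [(0, 1), (3, 2)]   # first join leaves 0 and 1, then join that node with leaf 2
nodes, edges = dendrogram_in_data_space(points, merges)
```

The `nodes` list holds the original observations followed by the added join points, and `edges` lists the pairs of rows to connect with line segments when the combined data is shown in a tour.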
---

Dendrogram in `\(p\)`-dimensions (Ward's and average linkage)

.pull-left[
<iframe src="https://iml.numbat.space/lectures/cluster_ward.html" width="400" height="500" scrolling="yes" seamless="seamless" frameBorder="0"> </iframe>
]
.pull-right[
<iframe src="https://iml.numbat.space/lectures/cluster_average.html" width="400" height="500" scrolling="yes" seamless="seamless" frameBorder="0"> </iframe>
]

---
class: transition middle

# Comparing cluster solutions

---
class: split-two

.column[.pad50px[

# Confusion table

Ward's linkage solution in columns, and average linkage solution in rows.

<br>

<table class="table table-striped" style="font-size: 24px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> cl_av </th>
   <th style="text-align:right;"> 1 </th>
   <th style="text-align:right;"> 2 </th>
   <th style="text-align:right;"> 3 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 19 </td>
   <td style="text-align:right;"> 19 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 31 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
</tbody>
</table>

<br>

The solutions agree on a group of 31 observations but name it differently: Ward's linkage labels the group "3", and average linkage labels it "2".

]]
.column[

<br><br>

Re-number the labels.
Change average "2" to "3" (and average "3" to "2").

<br>

<table class="table table-striped" style="font-size: 24px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> cl_av </th>
   <th style="text-align:right;"> 1 </th>
   <th style="text-align:right;"> 2 </th>
   <th style="text-align:right;"> 3 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 19 </td>
   <td style="text-align:right;"> 19 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 31 </td>
  </tr>
</tbody>
</table>

<br>

Agreement can now be read off the main diagonal, as in a confusion matrix for labelled classes. The methods agree on 19+3+31 = 53 out of 74 observations, 71.6%.

]

---
class: transition middle

# Summarising a clustering

---

# Clustering summaries

Once you have cluster labels, the data can be treated like data encountered in supervised classification:

- Report the means, standard deviations and sample sizes of the clusters
- Identify important variables, e.g. using random forests
- Use dimension reduction, with LDA (or PCA), to examine the clusters
- Plot the data using colour for the cluster label, with a tour, parallel coordinates or a scatterplot matrix

---
class: split-two

.column[.pad50px[

# Example

Summary statistics by cluster, .monash-orange2[in original units].
<br> <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;color: white !important;background-color: #505050 !important;"> cl5 </th> <th style="text-align:left;color: white !important;background-color: #505050 !important;"> stat </th> <th style="text-align:right;color: white !important;background-color: #505050 !important;"> aede2 </th> <th style="text-align:right;color: white !important;background-color: #505050 !important;"> tars1 </th> <th style="text-align:right;color: white !important;background-color: #505050 !important;"> aede3 </th> <th style="text-align:right;color: white !important;background-color: #505050 !important;"> aede1 </th> <th style="text-align:right;color: white !important;background-color: #505050 !important;"> tars2 </th> <th style="text-align:right;color: white !important;background-color: #505050 !important;"> head </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: #F0F0F0 !important;"> 1 </td> <td style="text-align:left;background-color: #F0F0F0 !important;"> m </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 14.10 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 183.10 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 104.86 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 146.19 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 129.62 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 51.24 </td> </tr> <tr> <td style="text-align:left;background-color: #FFFFFF !important;"> 1 </td> <td style="text-align:left;background-color: #FFFFFF !important;"> s </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 0.89 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 12.14 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 6.18 </td> <td 
style="text-align:right;background-color: #FFFFFF !important;"> 5.63 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 7.16 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 2.23 </td> </tr> <tr> <td style="text-align:left;background-color: #FFFFFF !important;"> 1 </td> <td style="text-align:left;background-color: #FFFFFF !important;"> n </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 21.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 21.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 21.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 21.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 21.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 21.00 </td> </tr> <tr> <td style="text-align:left;background-color: #F0F0F0 !important;"> 2 </td> <td style="text-align:left;background-color: #F0F0F0 !important;"> m </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 10.09 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 138.23 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 106.59 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 138.27 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 125.09 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 51.59 </td> </tr> <tr> <td style="text-align:left;background-color: #FFFFFF !important;"> 2 </td> <td style="text-align:left;background-color: #FFFFFF !important;"> s </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 0.97 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 9.34 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 5.85 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 4.14 </td> <td 
style="text-align:right;background-color: #FFFFFF !important;"> 8.55 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 2.84 </td> </tr> <tr> <td style="text-align:left;background-color: #FFFFFF !important;"> 2 </td> <td style="text-align:left;background-color: #FFFFFF !important;"> n </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 22.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 22.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 22.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 22.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 22.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 22.00 </td> </tr> <tr> <td style="text-align:left;background-color: #F0F0F0 !important;"> 3 </td> <td style="text-align:left;background-color: #F0F0F0 !important;"> m </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 14.29 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 201.00 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 81.00 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 124.65 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 119.32 </td> <td style="text-align:right;background-color: #F0F0F0 !important;"> 48.87 </td> </tr> <tr> <td style="text-align:left;background-color: #FFFFFF !important;"> 3 </td> <td style="text-align:left;background-color: #FFFFFF !important;"> s </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 1.10 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 14.90 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 8.93 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 4.62 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 6.65 </td> <td 
style="text-align:right;background-color: #FFFFFF !important;"> 2.35 </td> </tr> <tr> <td style="text-align:left;background-color: #FFFFFF !important;"> 3 </td> <td style="text-align:left;background-color: #FFFFFF !important;"> n </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 31.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 31.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 31.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 31.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 31.00 </td> <td style="text-align:right;background-color: #FFFFFF !important;"> 31.00 </td> </tr> </tbody> </table> ]] .column[ <img src="images/lecture-10b/unnamed-chunk-21-1.png" width="80%" style="display: block; margin: auto;" /> ] --- background-size: cover class: title-slide background-image: url("images/bg-02.png") <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. .bottom_abs.width100[ Lecturer: *Professor Di Cook* Department of Econometrics and Business Statistics <i class="fas fa-envelope"></i> ETC3250.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 10b <br> ]