This assignment will assess your understanding of these topics:
basic math and calculations for ML
ML concepts
visualisation
dimension reduction
Instructions
This is an open book assignment, and you are allowed to use any resources that you find helpful. However, every resource used needs to be appropriately cited. You can add the citations to the end of the report; the particular style is not important. Lack of citation of resources will incur a reduction of up to 50% in the final score.
You are encouraged to use Generative AI, so that you become accustomed to where it is helpful and where it is problematic on topics related to machine learning. You are expected to include the full script of your conversation at the end of your report.
This is an individual assignment. You are expected to complete it individually, which means that only the tutors or instructors can be consulted. You are not permitted to discuss the questions or answers with other people, including students in this unit, or to post questions to help sites. You can either send a message to the class email address or send a private message to the teaching team on the discussion forum ED.
You need to follow the rules detailed at Maintain academic integrity information for students. If you are concerned about possible violations, you can report these to the chief examiner.
For any reason, but especially if there is suspicion of violation of academic integrity, the chief examiner can request that you attend an oral exam to explain any of your answers, or to answer related questions on the assignment. Your score will be adjusted based on answers provided during the oral exam.
The assignment needs to be turned in to Moodle as (1) a Quarto (.qmd) file and (2) an html file. That is, two files need to be submitted, ideally zipped into a single file. No other formats will be marked. It is expected that knitting the qmd will produce the html file submitted. If the qmd file does not knit, then the score for the assignment will be reduced by 25%.
R code should be hidden in the final report, unless it is specifically requested.
A skeleton assignment file zip is provided to get you started and help you understand what to turn in.
the variable y is the true class, pred1 are class predictions for model 1, and pred2 are class predictions for model 2. The columns bilby1, quokka1 are predictive probabilities for model 1, and similarly bilby2, quokka2 are predictive probabilities for model 2.
Compute the accuracy and balanced accuracy for each model.
The class predictions were made by predicting an observation to be a bilby when its predictive probability of bilby was 0.5 or above. Compute sensitivity and 1-specificity if (i) 0.3 and (ii) 0.4 were used instead of 0.5.
Make the ROC curves for both models, where bilby is considered the positive class, and explain which is the better of the two.
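As a reminder of the definitions involved, here is a minimal sketch in base R. It is an illustration rather than a model answer. It uses a small made-up data frame `toy` in place of the assignment data. Only the column names `y` and `bilby1` follow the description above, and the helper `rates()` is a hypothetical name introduced here.

```r
# Made-up stand-in for the assignment data (values are NOT real).
toy <- data.frame(
  y = factor(c("bilby", "bilby", "bilby",
               "quokka", "quokka", "quokka", "quokka", "quokka"),
             levels = c("bilby", "quokka")),
  bilby1 = c(0.90, 0.60, 0.35, 0.45, 0.20, 0.10, 0.32, 0.05)
)

# Hypothetical helper: metrics with bilby as the positive class, predicting
# bilby when the predictive probability is at or above the threshold.
rates <- function(y, prob, threshold) {
  pred_pos <- prob >= threshold
  is_pos <- y == "bilby"
  sens <- sum(pred_pos & is_pos) / sum(is_pos)     # true positive rate
  spec <- sum(!pred_pos & !is_pos) / sum(!is_pos)  # true negative rate
  c(accuracy = mean(pred_pos == is_pos),
    balanced_accuracy = (sens + spec) / 2,
    sensitivity = sens,
    one_minus_spec = 1 - spec)
}

r05 <- rates(toy$y, toy$bilby1, 0.5)
r03 <- rates(toy$y, toy$bilby1, 0.3)
print(rbind(`0.5` = r05, `0.3` = r03))

# ROC curve: sweep the threshold from "predict nothing as bilby" (Inf)
# to "predict everything as bilby" (-Inf).
thresholds <- c(Inf, sort(unique(toy$bilby1), decreasing = TRUE), -Inf)
roc <- t(sapply(thresholds, function(th) rates(toy$y, toy$bilby1, th)))
plot(roc[, "one_minus_spec"], roc[, "sensitivity"], type = "l",
     xlab = "1 - specificity", ylab = "Sensitivity", asp = 1)
abline(0, 1, lty = 2)  # reference line for random guessing
```

The same function can be applied to `bilby2` for the second model. Drawing both curves on one set of axes makes it easier to compare them.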
3. Visualisation (8pts)
In the mulgar package, there is a data set called c7. It has 6 variables. Your job is to explore the data structure and report on it as accurately as possible. In particular, comment on:
number and shape of clusters
dimensionality/linear relationships
outliers
You will want to use a scatterplot matrix, a tour, and a 2D view provided by UMAP, to view different aspects of the data.
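The three views mentioned above could be produced along the following lines. This is a sketch under some assumptions: it assumes the mulgar, tourr and uwot packages are installed, that `c7` can be accessed as `mulgar::c7`, and that base `pairs()` is an acceptable scatterplot matrix. The UMAP settings shown are illustrative choices, not recommended values.

```r
# Assumes the mulgar, tourr and uwot packages are installed.
c7_df <- as.data.frame(mulgar::c7)

# Keep numeric columns only, as a matrix (the source says there are 6 variables).
X <- as.matrix(c7_df[, sapply(c7_df, is.numeric), drop = FALSE])

# Scatterplot matrix: all pairwise views of the variables.
pairs(X, pch = 16, cex = 0.4)

# Grand tour: an animation, so only run it in an interactive session.
if (interactive()) {
  tourr::animate_xy(X)
}

# 2D UMAP embedding; n_neighbors and min_dist are illustrative settings.
set.seed(1)
emb <- uwot::umap(X, n_neighbors = 15, min_dist = 0.1)
plot(emb, pch = 16, cex = 0.4, xlab = "UMAP 1", ylab = "UMAP 2", asp = 1)
```

Because UMAP is nonlinear and depends on its settings and random seed, it is worth checking whether any structure seen in the embedding also appears in the tour.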
4. Dimension reduction (15pts)
The cricketdata package allows you to download player statistics from matches across the globe. The statistics for Australian women have been collected and made available in the file auswt20.csv. Your task is to conduct a principal component analysis of this data, and explain what is learned. Particularly, your answer needs these components:
Summary of the PCA
Biplot of the first two PCs, and an explanation of the structure and variable contributions
Decision on an appropriate number of PCs to use, supported by the proportion of total variance and a scree plot.
Interpretations of the PCs
Discussion of the main patterns discovered by PCA, including notes about particular players.
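The components listed above map onto a few base R calls, sketched below. The data frame `toy` is simulated and its column names are hypothetical, because the columns of auswt20.csv are not described here. For the assignment, the data would instead be read with something like `read.csv("auswt20.csv")`. Standardising the variables (`scale. = TRUE`) is an assumption that is usually sensible when variables are on different scales.

```r
# Simulated stand-in data; column names are hypothetical, values are NOT real.
set.seed(2024)
n <- 50
batting <- rnorm(n)
bowling <- rnorm(n)
toy <- data.frame(
  runs        = 300 + 100 * batting + rnorm(n, sd = 20),
  balls_faced = 280 +  90 * batting + rnorm(n, sd = 20),
  wickets     =  15 +   6 * bowling + rnorm(n, sd = 2),
  overs       =  40 +  15 * bowling + rnorm(n, sd = 4)
)

# PCA on the numeric columns, centred and scaled to unit variance.
num <- toy[, sapply(toy, is.numeric), drop = FALSE]
pca <- prcomp(num, center = TRUE, scale. = TRUE)
print(summary(pca))  # standard deviations and proportions of variance

# Proportion of total variance per PC, and the cumulative proportion.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(prop_var), 3))

# Scree plot: look for the point where the curve flattens out.
plot(prop_var, type = "b", xlab = "Principal component",
     ylab = "Proportion of total variance")

# Biplot of the first two PCs, and the loadings used to interpret them.
biplot(pca, scale = 0)
print(round(pca$rotation[, 1:2], 2))
```

The loadings in `pca$rotation` show how each variable contributes to each PC, and the scores in `pca$x` can be used to pick out individual players.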
If you are unfamiliar with women's cricket, the Wikipedia page might be good to read.