Purpose

This project gives you the opportunity to practise your supervised learning skills. The competition format lets you objectively compare how well your model performs against other models.

This is also a place where you can exercise your creativity. You will want to explore the data: how the distribution varies between classes, which drawings are hard to classify, and why. It's an opportunity to practise your data exploration skills and to present information about the data effectively.

Because it is a team project, this is also a chance for you to work effectively with others to tackle a practical problem.

Task

Pictionary is a great game to play with friends and family. Each player takes turns sketching some object or action, and the other players try to guess what is being drawn. The task here is a bit easier: you are provided with hand-drawn sketches of 6 items (“banana”, “boomerang”, “cactus”, “crab”, “flip flops” and “kangaroo”), and you are expected to build a model to predict which object each sketch shows.

Objective

The objective of this contest is as follows:

Predict what object has been sketched. Variable to be predicted: “word” as “banana”, “boomerang”, “cactus”, “crab”, “flip flops” or “kangaroo”.

Data

  • See the “Data” tab to download the training and test files
    • sketches_train.rda contains the full training set, approximately 50% of the observations
    • sketches_test.rda contains the test set that you need to predict. It has all of the same variables as the training set except for the word variable
    • sample_predictions.csv shows the format of the file that you need to upload to Kaggle with your predictions. The outcomes provided in this file are random.
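
An .rda file restores R objects under the names they were saved with, so it helps to check what load() actually brings into your session. The following is a minimal, self-contained sketch using a toy data frame (the object name toy and its columns are made up for illustration and are not the project data); the same checks can be applied after loading the real files.

```r
# Toy stand-in for the project data (hypothetical columns, for illustration only)
toy <- data.frame(
  x1   = c(0, 1, 0),
  word = factor(c("banana", "crab", "banana")),
  id   = 1:3
)

# Save it to a temporary .rda file, then remove it from the session
tmp <- tempfile(fileext = ".rda")
save(toy, file = tmp)
rm(toy)

# load() returns the names of the objects it restored
loaded <- load(tmp)
loaded

# Basic checks: dimensions and number of observations per class
dim(toy)
table(toy$word)
```

With the real training file, the same pattern shows which object name to use in your model code and how many sketches there are of each word.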

Making predictions

This sample code will build a random forest model, predict the test data, and write out the predictions in the format required for submission to Kaggle.

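library(tidyverse)
library(randomForest)

# Load the training and test data
load("data/sketches_train.rda")
load("data/sketches_test.rda")

# Fit a random forest to the training data, leaving out column 786
sketches_rf <- randomForest(word~., data=sketches[,-786])

# Predict the word for each test sketch
sketches_test$word <- predict(sketches_rf, newdata=sketches_test)

# Keep only the columns needed for submission, named as Kaggle expects
predictions <- sketches_test %>%
  select(id, word) %>%
  rename(Id=id, Category=word)

# Write the submission file (the file name is passed by position,
# because the path argument is deprecated in readr 1.4.0 and later)
write_csv(predictions, "predictions_01_05_2020.csv")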

Evaluation Criteria

The Kaggle criterion CategorizationAccuracy is used to assess your predictions. It is the proportion of correct predictions. Using the code above will give a value of about 0.89. This is your benchmark to improve upon.
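
As a minimal sketch of this metric, the proportion of correct predictions can be computed in base R by comparing predicted and true labels. The two vectors below are made up for illustration and are not project data.

```r
# Hypothetical true labels and predictions (made up for illustration)
truth <- c("banana", "crab", "cactus", "crab", "kangaroo")
pred  <- c("banana", "crab", "crab",   "crab", "kangaroo")

# Categorization accuracy: the proportion of predictions that match the truth
accuracy <- mean(pred == truth)
accuracy  # 4 of the 5 predictions match, so this is 0.8
```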

Tasks

  1. Your first task is to create a Kaggle account (using your Monash email address). Your username needs to be visible and recognisable, and should consist of the first part of your Monash email address + your team name. This is necessary to match your submissions to the class grade sheet. Without this, your submissions will not count towards your project score.
  2. Upload your first predictions. Make it a really bad one. This sets your baseline score really low, so you have nowhere to go but up!
  3. Submissions need to be made as an individual. Your final team score will be the best score of all team members.
  4. Your final work will be with a team of 3-4 class members.
  5. Do some basic exploration of the dataset.
  6. Build your first model. Predict the test set, and upload your predictions to Kaggle.
  7. Try, and try again, to improve your model. You can submit one prediction per day.
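
Because submissions are limited to one per day, it can help to estimate accuracy locally before uploading. The following is a minimal sketch of a hold-out split, using the built-in iris data as a stand-in for the training sketches; the 80/20 split, the seed, and the object names are illustrative assumptions, not part of the project setup.

```r
library(randomForest)

set.seed(2020)  # arbitrary seed, so the split is reproducible

# iris is a stand-in here; in the project this would be the training sketches
n <- nrow(iris)
holdout_idx <- sample(n, size = round(0.2 * n))
train_part <- iris[-holdout_idx, ]
valid_part <- iris[holdout_idx, ]

# Fit on the training part only, then predict the held-out rows
fit <- randomForest(Species ~ ., data = train_part)
valid_pred <- predict(fit, newdata = valid_part)

# Proportion of correct predictions on the held-out rows
valid_accuracy <- mean(valid_pred == valid_part$Species)
valid_accuracy
```

The held-out accuracy is only an estimate of the Kaggle score, but it lets you compare candidate models without using up a daily submission.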

Project report and presentation

The data analysis report can be a maximum of 5 pages, and must abide by the section structure described below.

The introduction will describe the data set and motivate the problem. It should be brief.

The methods section describes the models and methods you have used, including a justification of your choices. You should also present your model fitting, diagnostics, etc.

The results section includes, for example, graphs and tables, as well as a discussion of the results.

The concluding section includes a summary of the findings.

You should clearly explain what you have done, using figures to supplement your explanation. Your figures must be of proper size with labeled, readable axes. In general, you should take pride in making your report readable and clear. You will be graded both on statistical content and quality of presentation.

Finally, each team will make a presentation of their work for the class, record it using Zoom, and upload it to Moodle. The video should be 5 minutes or less. All team members must participate by speaking in the presentation. Scores will be given by other members of the class; that is, the audience for your talk will be your classmates. You must evaluate all of the presentations, or lose the peer review points.

Grading

Deadlines

Do not wait until the last minute. Late submissions will not be accepted.