This project gives you the opportunity to practise your supervised learning skills. The competition format lets you objectively compare how well your model performs against other models.
This is also a place where you can exercise your creativity. You will want to explore the data: how the distribution varies between classes, which drawings are hard to classify, and why. It's an opportunity to practise your data exploration skills and to present information about the data effectively.
Because it is a team project, this is a chance for you to work effectively with others to tackle a practical problem.
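As a starting point for exploration, here is a minimal sketch in base R. It uses a small made-up data frame (`toy`) in place of the real training data, and it assumes one sketch per row with a `word` label and numeric pixel columns laid out as a square image. The helper `to_matrix()` and the column names `V1` to `V4` are hypothetical, for illustration only.

```r
# Toy stand-in for the training data: a label column plus pixel columns.
# Assumption: the real data hold one sketch per row in a similar layout.
set.seed(1)
toy <- data.frame(
  word = factor(rep(c("banana", "cactus", "crab"), times = c(4, 3, 5))),
  V1 = runif(12), V2 = runif(12), V3 = runif(12), V4 = runif(12)
)

# How many sketches are there in each class?
class_counts <- table(toy$word)
print(class_counts)

# Mean pixel intensity per class: a crude summary of how much "ink" each class uses.
pixel_cols <- setdiff(names(toy), "word")
toy$ink <- rowMeans(toy[, pixel_cols])
ink_by_class <- tapply(toy$ink, toy$word, mean)
print(round(ink_by_class, 2))

# Hypothetical helper: reshape a vector of pixels into a square matrix
# so that a single sketch can be drawn, for example with image().
to_matrix <- function(pixels) {
  side <- sqrt(length(pixels))
  stopifnot(side == floor(side))  # needs a square number of pixels
  matrix(as.numeric(pixels), nrow = side, byrow = TRUE)
}
m <- to_matrix(unlist(toy[1, pixel_cols]))
```

With the real data, the same steps would apply to the loaded training set, and `image(m)` could be used to look at individual sketches that the model gets wrong.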
Pictionary is a great game to play with friends and family. Players take turns sketching an object or action, and the other players try to guess what is being drawn. The task here is a bit easier: you are provided with hand-drawn sketches of six items (“banana”, “boomerang”, “cactus”, “crab”, “flip flops”, “kangaroo”), and you are expected to build a model that predicts which object each sketch shows.
The objective of this contest is as follows:
Predict what object has been sketched. Variable to be predicted: “word” as “banana”, “boomerang”, “cactus”, “crab”, “flip flops” or “kangaroo”.
sketches_train.rda contains the full training set, approximately 50% of observations.
sketches_test.rda contains the test set that you need to predict; it has all of the same variables as the training set except for the
sample_predictions.csv shows the format of the file that you need to upload to Kaggle with your predictions. The outcome column provided in this file contains random outcomes.
This sample code will build a random forest model, predict the test data, and write out the predictions in the format required for submission to Kaggle.
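library(tidyverse)
library(randomForest)

# Load the training and test sets
load("data/sketches_train.rda")
load("data/sketches_test.rda")

# Fit a random forest for word, leaving out column 786
sketches_rf <- randomForest(word~., data=sketches[,-786])

# Predict the class of each test sketch
sketches_test$word <- predict(sketches_rf, newdata=sketches_test)

# Keep the columns needed for submission and write them to a CSV file
predictions <- sketches_test %>%
  select(id, word) %>%
  rename(Id=id, Category=word)
write_csv(predictions, "predictions_01_05_2020.csv")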
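Because the test labels are not available to you, it can help to hold back part of the training set and check your model on it before submitting. The following is a minimal sketch of a stratified hold-out split in base R; the helper `holdout_split()` and its arguments are hypothetical, and the labels used here are made up.

```r
# Hypothetical helper: split row indices into training and validation parts,
# keeping roughly the same class proportions in both (a stratified split).
holdout_split <- function(labels, prop = 0.8, seed = 1) {
  set.seed(seed)
  idx <- seq_along(labels)
  train_idx <- unlist(lapply(split(idx, labels), function(i) {
    # Sample positions rather than values, so classes of size 1 are handled safely
    i[sample.int(length(i), size = floor(prop * length(i)))]
  }))
  train_idx <- sort(unname(train_idx))
  list(train = train_idx, valid = setdiff(idx, train_idx))
}

# Toy labels standing in for the word column of the training data
labels <- factor(rep(c("banana", "crab"), times = c(10, 5)))
parts <- holdout_split(labels)
print(lengths(parts))

# With the real data (object and column names as in the sample code),
# the split could be used along these lines:
# parts <- holdout_split(sketches$word)
# fit <- randomForest(word ~ ., data = sketches[parts$train, -786])
# mean(predict(fit, newdata = sketches[parts$valid, ]) == sketches$word[parts$valid])
```

The proportion held back (20% here) is an arbitrary choice; cross-validation would be a natural alternative if model fitting is fast enough.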
The Kaggle criteria
CategorizationAccuracy is used to assess your predictions. It is the proportion of correct predictions. Using the code above will give a value of about 0.89. This is your benchmark to improve upon.
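To make the metric concrete, here is a small worked example in base R with made-up labels. The vectors `truth` and `predicted` are hypothetical and are not taken from the competition data.

```r
# Made-up true and predicted labels for eight sketches
truth <- c("banana", "crab", "cactus", "kangaroo",
           "crab", "banana", "boomerang", "flip flops")
predicted <- c("banana", "crab", "crab", "kangaroo",
               "crab", "boomerang", "boomerang", "flip flops")

# Accuracy is the proportion of predictions that match the true label
accuracy <- mean(predicted == truth)
print(accuracy)

# A cross-tabulation shows which classes are confused with which
confusion <- table(truth = truth, predicted = predicted)
print(confusion)
```

Six of the eight predictions match, so the accuracy here is 0.75. The cross-tabulation is often more informative than the single number, because it shows where the errors fall.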
The data analysis report can be a maximum of 5 pages, and must abide by the section structure described below.
The introduction will describe the data set and motivate the problem. It should be brief.
This section describes the models and methods you have used, including a justification of your choices. You should also present your model fitting, diagnostics, etc.
This includes, for example, graphs and tables, as well as a discussion of the results.
This includes a summary of the findings.
You should clearly explain what you have done, using figures to supplement your explanation. Your figures must be of proper size with labeled, readable axes. In general, you should take pride in making your report readable and clear. You will be graded both on statistical content and quality of presentation.
Finally, each team will make a presentation of their work for the class, record it using Zoom, and upload it to Moodle. The video should be 5 minutes or less. All team members must participate by speaking in the presentation. Scores will be given by other members of the class; that is, the audience for your talk will be your classmates. You must evaluate all of the presentations, or lose the peer review points.
Do not wait until the last minute. Late submissions will not be accepted.