These are the notes I took while reading An Introduction to Statistical Learning and Applied Predictive Modeling. Some of the notes also came from other sources, but the majority are from these two books. Topics covered:

Data transformation
Resampling techniques
Regression models
Smoothing
Neural networks
Support vector machines
K-nearest neighbors
Trees
Random forests
Gradient boosting trees
Cubist
Measuring performance in classification models
Linear classification models
Latent Dirichlet allocation
Last night I tried my hand at a Quora challenge that classifies user-submitted answers into ‘good’ and ‘bad.’ All the information is anonymized, including the variable names, but you can tell from their values what some of them may represent. For example, some appear to be count data or summary statistics derived from counts, and, given that many of the values are 0 and the distributions are heavily right-skewed, they seem to measure things like the writers’ reputations, the number of upvotes an answer received, or the follow-up comments.
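Since the challenge data can't be shown here, below is a minimal sketch of how one might check for and transform heavily right-skewed count features. The column names ("feat_1", "feat_2"), the toy values, and the choice of a log1p transform are my own illustration, not anything from the actual Quora dataset.

```python
# A minimal sketch (not from the original challenge): inspecting and
# transforming right-skewed, zero-heavy count features. Column names
# and values here are hypothetical placeholders.
import numpy as np
import pandas as pd

# Hypothetical anonymized features with many zeros and long right tails.
df = pd.DataFrame({
    "feat_1": [0, 0, 0, 1, 2, 0, 57, 0, 3, 0],
    "feat_2": [0, 1, 0, 0, 0, 12, 0, 0, 245, 0],
})

# Skewness well above 0 suggests a long right tail.
print(df.skew())

# log1p handles the zeros gracefully (log1p(0) == 0) and compresses
# the tail, which often helps models that assume roughly symmetric inputs.
df_transformed = df.apply(np.log1p)
print(df_transformed.skew())
```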
This weekend, I participated in a Kaggle not-for-prize competition that uses data from Reddit’s Random Acts of Pizza forum to analyze and predict the outcome of a request for pizza, and it was heaps of fun (I always wanted to say that)! Compared with other Kaggle competitions I had tried before, I found this one a bit easier: the dataset is not very large (~5,000 records) and is hence perfect for experimenting with models, and, more importantly, the competition is based on real research by a couple of Stanford researchers, which gave me plenty of guidance on how to proceed.
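To make the task concrete, here is a minimal baseline sketch for this kind of binary prediction problem. The feature names, the toy data, and the choice of logistic regression are my own assumptions for illustration, not the approach taken in the competition or in the Stanford paper.

```python
# A minimal baseline sketch for a binary "did the request get pizza?"
# task. The feature names and values below are hypothetical stand-ins,
# not the actual competition fields.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the ~5,000-record training set.
train = pd.DataFrame({
    "request_length": [120, 45, 300, 80, 200, 60],
    "account_age_days": [400, 10, 900, 35, 120, 5],
    "got_pizza": [1, 0, 1, 0, 1, 0],
})

X = train[["request_length", "account_age_days"]]
y = train["got_pizza"]

# Scale the features, then fit a plain logistic regression as a baseline.
model = make_pipeline(StandardScaler(), LogisticRegression())

# AUC is a common metric for this kind of imbalanced binary outcome.
scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
print(scores.mean())
```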