Background

Earlier this year, I decided to learn French, something I’ve been thinking about for a long time. Foreign language learning has always held something magical for me: I had a great time learning English at school (my mother tongue is Mandarin Chinese), so much so that I would devote all my time to it and ignore my other subjects (not recommended). So when I signed up for a beginner’s class at my local Alliance Française and started taking classes regularly, it felt like a homecoming.
The Internet is truly full of free and fascinating datasets! I found the Yelp Dataset Challenge the other day, which includes, among other things, over 1 million reviews (most of them recent) along with their 5-star ratings - excellent text mining material! To enter the competition itself (which ends on 12/31/14) you have to be a current student (which I’m not), but everyone is welcome to play around with the data.
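If you want to poke at it too, here is a minimal sketch of how I might read the review file into R. The filename and the stars/text fields are assumptions based on how the Yelp data is usually distributed (newline-delimited JSON), so check your own download.

```r
# A sketch of loading the Yelp reviews into R, assuming the file is the
# newline-delimited JSON "yelp_academic_dataset_review.json" with "stars"
# and "text" fields (verify the exact filename and fields in your download).
library(jsonlite)

reviews <- stream_in(file("yelp_academic_dataset_review.json"))

nrow(reviews)                      # how many reviews did we get?
table(reviews$stars)               # distribution of the 5-star ratings
substr(reviews$text[1], 1, 200)    # peek at one review's text
```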
These are the notes I took while reading An Introduction to Statistical Learning and Applied Predictive Modeling. Some of the notes also came from other sources, but the majority are from these two books. Topics covered (a quick sketch of the caret-style workflow many of these plug into follows the list):

- Data transformation
- Resampling techniques
- Regression models
- Smoothing
- Neural networks
- Support vector machines
- K-nearest neighbors
- Trees
- Random forests
- Gradient boosting trees
- Cubist
- Measuring performance in classification models
- Linear classification models
- Latent Dirichlet allocation
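Since Applied Predictive Modeling leans heavily on caret, here is a small illustrative sketch of that workflow: cross-validation as the resampling scheme, a random forest as the model, and a confusion matrix for measuring classification performance. It uses a built-in toy dataset and is not code from the notes themselves.

```r
# Illustrative caret workflow (not code from the notes): 5-fold cross-validation
# as the resampling scheme, a random forest model, and a resampled confusion
# matrix as the classification performance measure.
library(caret)

data(iris)
set.seed(123)

ctrl <- trainControl(method = "cv", number = 5)     # resampling: 5-fold CV

fit <- train(Species ~ ., data = iris,
             method = "rf",                         # random forest
             preProcess = c("center", "scale"),     # a basic data transformation
             trControl = ctrl)

fit                    # cross-validated accuracy and kappa
confusionMatrix(fit)   # confusion matrix aggregated over the held-out folds
```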
Last night I tried my hand at a Quora challenge that classifies user-submitted answers into ‘good’ and ‘bad.’ All the information is anonymized, including the variable names, but you can tell from their values what some of them may represent. For example, some appear to be count data, or summary statistics based on counts, and, given that many of the values are 0 and the distributions are heavily right-skewed, they seem to be some measure of the writers’ reputations, the number of upvotes an answer received, or the number of follow-up comments.
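To make the “heavily right-skewed count data” point concrete, here is the kind of preprocessing I have in mind: a log1p transform on zero-heavy, skewed features followed by a simple logistic-regression baseline. Everything below is simulated; the column names are hypothetical stand-ins for the anonymized Quora variables.

```r
# Sketch: dealing with zero-heavy, right-skewed features before classification.
# f1/f2/target are simulated stand-ins, not the actual anonymized variables.
set.seed(42)
df <- data.frame(
  f1     = rpois(1000, lambda = 0.5),   # count-like feature, mostly zeros
  f2     = rexp(1000, rate = 2),        # continuous, heavily right-skewed
  target = rbinom(1000, 1, 0.3)         # 'good' vs. 'bad' label
)

# log1p keeps zeros at zero while compressing the long right tail
df$f1_log <- log1p(df$f1)
df$f2_log <- log1p(df$f2)

# A simple logistic-regression baseline on the transformed features
fit <- glm(target ~ f1_log + f2_log, data = df, family = binomial)
summary(fit)
```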
Lately, I’ve become very interested in text mining and topic modeling, and have played around with popular algorithms like LDA. So far, though, my projects have all centered on what I can learn from a giant chunk of text, and they usually stopped after I extracted some revealing and, if I’m lucky, thought-provoking topics. In other words, what I’ve been doing so far is all inference and no prediction.
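For reference, this is roughly what that inference-only workflow looks like in R with tm and topicmodels; the toy documents are placeholders. The last line hints at the bridge to prediction: the per-document topic proportions could become features in a downstream predictive model instead of being the end of the analysis.

```r
# Minimal LDA sketch with tm + topicmodels; the four toy documents are placeholders.
library(tm)
library(topicmodels)

docs <- c("the cat sat on the mat",
          "dogs and cats are pets",
          "stock markets fell sharply today",
          "investors worry about the markets")

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
dtm    <- DocumentTermMatrix(corpus)

lda <- LDA(dtm, k = 2, control = list(seed = 123))

terms(lda, 5)            # top terms per topic -- the inference part
posterior(lda)$topics    # per-document topic proportions, which could serve
                         # as features in a downstream predictive model
```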
If you are not from China or living there, you are probably not familiar with the term 八零后, or post-80s, but if you are, like me, I think you’ll agree that it is one of the most widely used and abused terms in modern China. Quite literally, it refers to Chinese people born in the 1980s (me included). The reason it has drawn so much more attention and exposure than, say, 九零后 (post-90s) or 零零后 (post-00s) is, I think, that our generation has seen and been through so many things that were never experienced by prior generations and are simply taken as norms by later ones.
On Tuesday last week, I attended a data visualization meetup organized by Data Science LA, and the topic was the most recent Eyeo Festival. Of all the talks Amelia shared with us, the one that impressed and inspired me the most was Nicholas Felton’s personal data projects. In case you are not familiar with him, he publishes an annual report each year documenting the personal data projects and experiments he conducted over the course of that year.
This weekend, I participated in a not-for-prize Kaggle competition that uses data from Reddit’s Random Acts of Pizza forum to analyze and predict the outcome of a request for pizza, and it was heaps of fun (I always wanted to say that)! Compared with other Kaggle competitions I had tried before, I found this one a bit easier: the dataset is not very large (~5,000 records) and is hence perfect for experimenting with models, and, more importantly, the competition is based on real research by a couple of Stanford researchers, which gave me plenty of guidance on how to proceed.
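As a rough sketch of a baseline, this is the sort of thing one can start from: load the competition’s train.json with jsonlite, make a couple of simple text features, and fit a logistic regression. The field names are as I recall them from the competition data, so verify them against the actual file.

```r
# Rough baseline sketch for the Random Acts of Pizza data. The field names
# (request_text, requester_received_pizza) are as I recall them from the
# competition JSON -- verify against the actual file before relying on this.
library(jsonlite)

train <- fromJSON("train.json")

# A couple of simple hand-made features from the request text
train$text_length <- nchar(train$request_text)
train$says_thanks <- grepl("thank", tolower(train$request_text))

# Logistic regression on the binary outcome (did the requester get pizza?)
train$got_pizza <- as.integer(train$requester_received_pizza)
fit <- glm(got_pizza ~ text_length + says_thanks, data = train, family = binomial)
summary(fit)
```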
Ever felt your Twitter feed has so much going on that you don’t have time to read it all, let alone digest it? I certainly have, even though I only follow about 20 people. Whenever I open the app, I am “bombarded” by all the new tweets and, even after scrolling through every one of them (as I feel obligated to), I don’t feel I have actually taken in any new information. How nice would it be if someone handpicked and highlighted all the useful information for us?
Leaflet is a popular JavaScript library for making interactive maps. Don’t know how to code in JS? No problem: thanks to Ramnath Vaidyanathan, you can now use rCharts to do it in R! And now that we have R Shiny, it seems only natural to combine the two and build Shiny apps for interactive maps. If that doesn’t motivate you, take a look at these cool examples (example1, example2), then roll up your sleeves and make one yourself!
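Here is a minimal sketch of the rCharts + Shiny combination. The Leaflet$new()/setView()/marker() calls follow the rCharts Leaflet examples; the renderMap()/mapOutput() Shiny bindings are as I recall them from rCharts’ Shiny helpers, and depending on your rCharts version you may need renderChart2()/showOutput() instead.

```r
# A minimal rCharts + Shiny interactive map sketch. renderMap()/mapOutput()
# are the Shiny bindings as I recall them; some rCharts versions use
# renderChart2()/showOutput() instead.
library(shiny)
library(rCharts)   # not on CRAN: devtools::install_github("ramnathv/rCharts")

ui <- fluidPage(
  titlePanel("A tiny interactive map"),
  mapOutput("myMap")
)

server <- function(input, output) {
  output$myMap <- renderMap({
    map <- Leaflet$new()
    map$setView(c(51.505, -0.09), zoom = 13)                    # lat/lng and zoom
    map$marker(c(51.5, -0.09), bindPopup = "Hi, I am a popup")  # a clickable marker
    map
  })
}

shinyApp(ui = ui, server = server)
```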