Background
Earlier this year, I decided to learn French, something I’ve been thinking about for a long time. Foreign language learning has always been something magical to me: I had a great time learning English when I was at school (my mother tongue is Mandarin Chinese), so much so that I would devote all my time to it and ignore my other subjects (not recommended). Hence, when I signed up for a beginner’s class at my local Alliance Française and started taking classes regularly, it felt like a homecoming to me.
Last night I tried my hand at a Quora challenge that classifies user-submitted answers into ‘good’ and ‘bad.’ All the information is anonymized, including the variable names, but you can tell by looking at their values what some of them may represent. For example, some appear to be count data or summary statistics based on counts, and, given that many of the values are 0 and the distributions are heavily right-skewed, they seem to be some measure of the writers’ reputations, the number of upvotes an answer received, or the number of follow-up comments.
Lately, I’ve become very interested in text mining and topic modeling, and have played around with some popular algorithms like LDA. However, so far my projects have all been centered around what I can learn from a large body of text, and they have usually stopped after I extracted some revealing and, if I’m lucky, thought-provoking topics from it. In other words, what I’ve been doing so far is all inference and no prediction.
If you are not from China or living there, you are probably not familiar with the term 八零后, or post-80s, but if you are, like me, I think you’ll agree that this is probably one of the most widely used and abused terms in modern China. Quite literally, it refers to Chinese people who were born in the 1980s (me included). The reason it has gained so much attention and exposure compared to, say, 九零后 (post-90s) or 零零后 (post-00s), I think, is that our generation has seen and been through so many things that prior generations never saw or experienced and that later ones simply take as the norm.
On Tuesday last week, I attended a data visualization meetup organized by Data Science LA, and the topic was the most recent Eyeo Festival. Of all the talks that Amelia shared with us, what impressed and inspired me the most was Nicholas Felton’s personal data projects. In case you are not familiar with him, he publishes an annual report that documents the personal data projects and experiments he conducted throughout the year.
This weekend, I participated in a Kaggle not-for-prize competition that uses data obtained from Reddit’s Random Acts of Pizza forum to analyze and predict the outcome of a request for pizza, and it was heaps of fun (I always wanted to say that)! Compared with other Kaggle competitions I had tried before, I found this one a bit easier because the dataset is not very large (~5,000 records) and is hence perfect for experimenting with models, and, more importantly, the competition is based on real research done by a couple of Stanford researchers, which gave me a lot of guidance on how to proceed.
Ever felt that your Twitter feed has so much going on that you don’t have time to read it all, let alone digest it? I certainly do, even though I only follow like 20 people. Whenever I open the app, I am “bombarded” by all the new tweets and, even after scrolling through all of them (as I feel obligated to), I don’t feel I have actually taken any new information in. How nice would it be if someone handpicked and highlighted all the useful information for us?
Leaflet is a popular JavaScript library for making interactive maps. Don’t know how to code in JS? No problem: thanks to Ramnath Vaidyanathan, you can now use rCharts to do it in R! Now that we have R Shiny, it just seems natural to combine the two to make Shiny apps for interactive maps. If that doesn’t motivate you, take a look at these cool examples, then roll up your sleeves and make one yourself (example1, example2)!
This week I started taking a Coursera class called Introduction to Data Science, taught by Bill Howe from the University of Washington. Although there has only been one lesson so far, my experience has been quite positive, particularly due to the interesting programming assignment, which is to use Twitter’s live stream data to analyze tweet sentiment. If you are interested and want to try it yourself, you can read the very helpful instructions here and clone the Git repository here (I believe you can access it without signing up for the class, but the course is free anyway).
After numerous times of finding out about a concert too late and ending up either paying a premium or not being able to go, I finally decided to do something about it, and this is what I came up with.
https://runzemc.shinyapps.io/pitchfork/
This R Shiny app I made pulls data from Pitchfork automatically every time it is opened and shows the upcoming shows per city and per artist (indie artists, to be precise).
R Shiny is an R package designed to make it easy to create and deploy pretty web apps, all in the nifty RStudio. Right now, it may not be able to make web apps as sophisticated or aesthetically pleasing as those built with d3.js, but, by leveraging R’s powerhouse analytical capability, I believe it has great potential. One possible application I can think of is education. Take this k-means app, for instance: I wish I had had a chance to play with this interactive app when learning about the algorithm myself.
This weekend I decided to learn more about Twitter and its handy API. The subject of my analysis is Artsy, a fine-art website that provides a Pandora-like service. The things I was curious to find out are where their followers are from, what their Twitter activities are like, what other interests they have, and, specifically, what kind of stereotype clusters they fall into because, you know, it’s important and I didn’t have anything better to do.