The Internet is truly full of free and fascinating datasets! The other day I found the Yelp Dataset Challenge, which includes, among other things, over 1 million reviews (most of them recent) along with their respective 5-star ratings: excellent text mining material! To enter the competition (which ends on 12/31/14) you have to be a current student (which I’m not), but everyone is welcome to play around with the data. So, during my recent break, I tried my hand at it, and here is my first attempt :-)
In this first attempt, I used the reviews’ text alone to predict the ratings: I first computed each review’s sentiment score in a supervised fashion and then used the estimated scores to predict the 5-class outcome. The model currently achieves a mean absolute error of 0.66 and an accuracy of 0.48. In the future, I’m planning to incorporate more relevant information to improve its predictive power.
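For a rough idea of what a two-stage setup like this and the two metrics look like, here is a minimal pure-Python sketch. It is not the code from the notebooks. The word-averaging scorer, the function names (`fit_word_scores`, `predict_stars`) and the toy reviews are all assumptions made up for illustration, standing in for whatever supervised sentiment model is actually used.

```python
from collections import defaultdict


def fit_word_scores(reviews, stars):
    """Stage 1 (assumed scorer): score each word by the mean star rating
    of the training reviews that contain it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text, star in zip(reviews, stars):
        for word in set(text.lower().split()):
            totals[word] += star
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def predict_stars(text, word_scores, default=3.0):
    """Stage 2: average the known word scores into one sentiment score,
    then map it to the nearest star rating in 1..5."""
    scores = [word_scores[w] for w in text.lower().split() if w in word_scores]
    sentiment = sum(scores) / len(scores) if scores else default
    return min(5, max(1, round(sentiment)))


def mean_absolute_error(y_true, y_pred):
    """Average distance, in stars, between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of reviews whose predicted rating is exactly right."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data, invented purely for illustration (not from the Yelp dataset).
train_reviews = ["great food", "terrible service"]
train_stars = [5, 1]
word_scores = fit_word_scores(train_reviews, train_stars)

test_reviews = ["great", "terrible", "something else"]
test_stars = [5, 2, 3]
predicted = [predict_stars(r, word_scores) for r in test_reviews]

print("MAE:", mean_absolute_error(test_stars, predicted))
print("Accuracy:", accuracy(test_stars, predicted))
```

The two metrics complement each other: accuracy only counts exact hits, while the mean absolute error also tells you how far off the misses are, which matters when the classes are ordered like star ratings.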
The IPython notebooks are rendered by nbviewer here and the individual files can be accessed and viewed directly below: