Internet is truly full of free and fascinating datasets! I found this Yelp Dataset Challenge the other day that includes, among others, over 1 million reviews (most of which are recent) along with their respective 5-star ratings - excellent text mining material! Although to enter the competition (which ends on 12/31/14), you have to be a current student (which I’m not), but everyone is welcome to play around with the data.
This weekend, I participated in a Kaggle not-for-prize competition that uses data obtained from Reddit’s Random Acts of Pizza forum to analyze and predict the outcome of a request for pizza, and it was heaps of fun (I always wanted to say that)! Compared with other Kaggle competitions I had tried before, I found this one a bit easier because the dataset is not very large (~5,000 records) and is hence perfect for model experimenting, and, more importantly, the competition is based on a real research done by a couple Stanford researchers, which provides me with a lot of guidelines in how to proceed.