On Friday I came across this post asking people to try to do interesting things with the recently published 11-year data dump of BoingBoing blog posts.
For a long time I’ve been itching to use the Ruby classifier gem for something cool, and I figured this was my chance. I decided to create a quick one-off website that would let people enter a post title and see whether it matched (lexically) the kinds of things previously featured on BoingBoing.
The downloaded JSON representation of the data is 138.4 MB uncompressed. I’d originally considered using the body text of each post as additional fodder for the comparisons, but as I didn’t have much time to work on this, I settled for just the post titles.
To extract the titles from the JSON file I used the following script (which could easily be adapted by anyone wanting to pull other data points).
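The original extraction script isn’t reproduced here, but it might have looked something like this minimal sketch. The filename and the assumption that the dump is a JSON array of objects with a `title` key are guesses about the dump’s structure:

```ruby
require 'json'

# Pull every post title out of the BoingBoing JSON dump.
# NOTE: the "title" key and the array-of-posts layout are assumptions
# about the dump's structure -- adjust to match the actual data.
def extract_titles(json_text)
  JSON.parse(json_text).map { |post| post['title'] }.compact
end

# Usage (filenames are assumptions):
#   titles = extract_titles(File.read('boingboing.json'))
#   File.write('titles.txt', titles.join("\n"))
```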
Now, I had a bit of a dilemma. I had been planning on using the classifier gem to do a naive Bayesian classification of the data, but Bayes classification relies upon training data for each outcome you want to detect. I had data that could be used to train the classifier on what a BoingBoing post should look like, but the data dump didn’t include the surely hundreds of thousands of submissions that they received that were completely worthless.
I needed a source of post titles that would closely match the types of submissions that BoingBoing receives, and settled on scraping Reddit’s /new page for them. They’re a similar type of submission, and as they mostly haven’t been voted on yet, they’re pretty low quality; just the type of thing to be rejected by discriminating BoingBoing editors.
Reddit has a well-documented API that’s kinda sorta REST-like. An example: querying http://www.reddit.com/new/.json returns JSON (as you would expect). However, querying http://www.reddit.com/new/.xml returns an RSS feed.
I actually ended up being rather thankful for the RSS format as sorting out the JSON data (which was nested somewhat oddly within itself IMO) was becoming a real chore. So XML + Nokogiri to the rescue and I came up with this script to get a second “bad” set of training data.
With both sets of data I could now run the classifier, training it what were BoingBoing style posts and what weren’t.
Of note is the use of the Madeleine gem, which persists the training results (otherwise you would have to re-run the training on every classification request). I was somewhat shocked by how well this worked: I ended up checking the serialized classifier into source control and using it directly in production – no need to retrain!
The above script was then split into two methods on a “Classy”, non-ActiveRecord-backed model within the Rails application.
The final project is up at: http://www.boingable.com if you want to check it out.