Archive

Archive for the ‘Netflix Prize’ Category

First submission to Netflix Prize

December 3rd, 2008 deevis No comments

So, I got my algorithm working on the probe dataset and it is getting a respectable .7859.  I was very excited about this because only a .856 is needed on the actual qualifying dataset.

Turns out my score on the qualifying dataset is much worse: .9696.

Back to the drawing board.  I think I need to removed the Ratings being predicted in the probe dataset  from the training set.

Categories: Netflix Prize Tags:

Netflix Prize – Night 2

December 3rd, 2008 deevis No comments

So I’ve finally got all of the data loading directly into memory and fitting in 620MB ( down from an initial 7Gig using OO-principles and Hibernate and Spring and Dao’s and all that jazz… ).  The code I’m working with now is like C written in Java.  Lots of pointer arithmetic, blitting mutliple fields into 5 bytes, none of which is it’s own value.  I should clarify how this is working, actually.

A Rating has 3 pieces of information I’d like stored with it:

  1. movieId            15 bits needed
  2. customerId      22 bits needed
  3. rating                  3 bits needed

So, those are the 40 bits being munged into the 5 bytes.  Here’s how they fit in, precisely:

// movie bits        mmmmmmmm  mmmmmmm0  00000000  00000000  00000000
// customer bits     00000000  0000000c  cccccccc  cccccccc  ccccc000
// stars bits        00000000  00000000  00000000  00000000  00000sss

// munged              mmmmmmmm  mmmmmmmc  cccccccc  cccccccc  cccccsss

And these 5 bytes are all living in an external byte[] owned by a Movie.

When I left it was taking about 14 minutes to load the data into memory to work with.  I’m excited and I decide to start playing around with doing some predictions.

Pcm == Predicted Score of (c)ustomer for (m)ovie.

Ac == Average rating movies by (c)ustomer.

Am == Average of all ratings for a particular (m)ovie.

My first stab was to simply say that Pcm = (Ac + Am) / 2.  It was simple and fast and started spitting results out.  Well, after I waited the 14 minutes for the data to load into memory.  Gotta do something about this – soon.

So then I started playing around with trying to locate groups of customers who liked movies and groups who hated movies and trying to use these in part of the rating.  This was getting potentially complicated and I decided to keep each Movies Ratings sorted by customerId.  This made it very fast to determine (via binary search) whether a particular customer had seen a given Movie.  So, the prediction logic could now reasonably inquire about these sorts of things.  But everytime I change a little thing it’s another 14 minutes to see the tweak in action.  So, I shifted priorities and wrote a bit of code to Serialize the data to the file system in its new byte[] layout.

All the data now loads in 15 seconds ( sometimes as fast as 7! ).  So, now it’s time to start experimenting with the predictions.  And it’s time to call it a night.

My first RMSE this night, with the super-naive (Ac + Am)/2 algorithm seemed to be up in the 1.10 neighborhood.  The leaders have 0.86 and to win someone will need to get the score down to 0.85 ( they are close! ).

Categories: Netflix Prize Tags:

Digging into the Netflix Prize

December 3rd, 2008 deevis No comments

Over a year ago I remember taking a gander at the Netflix Prize and approaching it with my basic lightweight J2EE toolkit that I’ve come to know and love.  Basically this entails using Hibernate, Spring, C3P0 and MySQL to architect things.  I did all the basic stuff with Hibernate backed Pojo’s which knew how to create their logical DDL representations in MySQL and had Spring managing my services and daos and all was well.

Well not exactly.  You see, OO programming in Java turns out to be a bit of a memory hog.  All of the performance articles I see about Java vs C++ are concerned with raw processing speed and not so much with the memory footprint comparable solutions in each may incur.

So, the Netflix Prize basically makes a dataset with 17,700 movies, 480,000 customers and then 100,000,000 ratings of the movies by the customers.  So, I had my Rating object which held a reference to a Movie, a reference to a Customer and then had internals to represent the score associated with it.  In turn the Movie object had a Collection of Ratings and the Customer object did as well.  Long story short – it turned out that I would need about 7 Gigabytes to hold all of this in RAM.  Bleah.

I realized that I’d have to make changes and started down that road.  First step was to do away with having these objects be so interconnected.  I’d give only the Movie a list of Ratings and the Rating itself would house the movieId, customerId and score as primitives.   I knew space was at a premium and chose short for movieId, int for customerId, and byte for score.  This was 7 bytes per Rating.  This should only require 700MB for 100,000,000 of these 7 byte Ratings.  Not the case.  I managed to get a little bit more than 1/3 of the dataset loaded into 1 Gig.  So, all of these Objects are apparently still killing me.

I still need 3 Gig.  Bleah.

I want to say here that all of this is happening on the same night and I’m in some sort of programming nirvana.  I decide that a good next step will be to start programming more like a C programmer.  Instead of having the 3 primitives, I decided to force all the data into a byte[5] array.  Only 5 bytes per rating, but with the added cost of having to blit the actual values in and out of the underlying byte[].  I figured this would be a step in the right direction and get me a bit closer to the goal.

Wrong.  It actually got worse and needed 4 Gig of RAM.  Say what?!?  Yeah, it turns out that 100,000,000 byte[]’s carry some hefty overhead.  So, this isn’t working either.

Next idea – ok.  All these byte[]’s are bad, so let’s just put 5 byte instance variables on each Rating Object.  This had the intended result ( that I’d expected earlier ) and now a bit more than half the dataset will load in my 1 Gig.

Down to needing about 1.8 Gig from the initial 7 Gig.  But not good enough.  I should be able to squeeze this stuff into 500 MB or so with the 100 Million 5-byte Ratings.

So, now I decide that all of the Rating Objects are really, really bad.  So, instead of Movie’s having a Collection of Ratings, now they get a big byte[] and I use pointer arithmetic to move virtual Rating objects in and out of this underlying byte[].  Bingo!  Finally things work.

Now it all fits in 620MB, and it’s a lot  faster.  The entire dataset loads in about 14 minutes.

Time for bed on my first night on the Netflix Prize.  More to come.

Categories: Netflix Prize Tags: