Netflix Prize – Night 2

December 3rd, 2008 deevis

So I’ve finally got all of the data loading directly into memory and fitting in 620MB (down from an initial 7GB using OO principles and Hibernate and Spring and DAOs and all that jazz…).  The code I’m working with now is like C written in Java.  Lots of pointer arithmetic, blitting multiple fields into 5 bytes, none of which holds a whole value of its own.  I should clarify how this is working, actually.

A Rating has 3 pieces of information I’d like stored with it:

  1. movieId       15 bits needed
  2. customerId    22 bits needed
  3. rating         3 bits needed

So, those are the 40 bits being munged into the 5 bytes.  Here’s how they fit in, precisely:

// movie bits        mmmmmmmm  mmmmmmm0  00000000  00000000  00000000
// customer bits     00000000  0000000c  cccccccc  cccccccc  ccccc000
// stars bits        00000000  00000000  00000000  00000000  00000sss

// munged            mmmmmmmm  mmmmmmmc  cccccccc  cccccccc  cccccsss

And these 5 bytes are all living in an external byte[] owned by a Movie.
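For anyone who wants to see the munging spelled out, here is a minimal sketch of how that layout could be packed and unpacked in Java. It is not the actual code: the class name `PackedRating` and its method names are made up for illustration, and the most-significant-byte-first order is assumed from the diagram above.

```java
// Hypothetical sketch: names are illustrative, not the original code.
final class PackedRating {
    static final int RECORD_BYTES = 5;

    // Writes movieId (15 bits), customerId (22 bits) and stars (3 bits)
    // into buf[offset .. offset+4], most significant byte first.
    static void pack(byte[] buf, int offset, int movieId, int customerId, int stars) {
        long bits = ((long) (movieId & 0x7FFF) << 25)
                  | ((long) (customerId & 0x3FFFFF) << 3)
                  | (stars & 0x7);
        for (int i = 0; i < RECORD_BYTES; i++) {
            buf[offset + i] = (byte) (bits >>> (8 * (RECORD_BYTES - 1 - i)));
        }
    }

    // Reassembles the 40-bit value stored at buf[offset .. offset+4].
    static long read(byte[] buf, int offset) {
        long bits = 0;
        for (int i = 0; i < RECORD_BYTES; i++) {
            bits = (bits << 8) | (buf[offset + i] & 0xFF);
        }
        return bits;
    }

    static int movieId(byte[] buf, int offset) {
        return (int) (read(buf, offset) >>> 25) & 0x7FFF;
    }

    static int customerId(byte[] buf, int offset) {
        return (int) (read(buf, offset) >>> 3) & 0x3FFFFF;
    }

    static int stars(byte[] buf, int offset) {
        return (int) read(buf, offset) & 0x7;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[2 * RECORD_BYTES];
        // Store one rating in the second 5-byte slot, then read it back.
        pack(buf, RECORD_BYTES, 17770, 2649429, 5);
        System.out.println(movieId(buf, RECORD_BYTES) + " "
                + customerId(buf, RECORD_BYTES) + " "
                + stars(buf, RECORD_BYTES)); // prints 17770 2649429 5
    }
}
```

Going through a `long` keeps the shifts readable; the `& 0xFF` on each byte matters because Java bytes are signed.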

When I left off, it was taking about 14 minutes to load the data into memory to work with.  I was excited, so I decided to start playing around with doing some predictions.

Pcm == Predicted Score of (c)ustomer for (m)ovie.

Ac == Average rating of movies by (c)ustomer.

Am == Average of all ratings for a particular (m)ovie.

My first stab was to simply say that Pcm = (Ac + Am) / 2.  It was simple and fast and started spitting results out.  Well, after I waited the 14 minutes for the data to load into memory.  Gotta do something about this – soon.
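As a rough illustration of that first stab, here is a small Java sketch. The class `BaselinePredictor` and its methods are hypothetical names, and it works on plain `int[]` arrays of star ratings rather than on the packed bytes, purely to keep the example short.

```java
// Hypothetical sketch of the naive predictor Pcm = (Ac + Am) / 2.
final class BaselinePredictor {
    // Mean of a non-empty array of star ratings.
    static double average(int[] stars) {
        if (stars.length == 0) {
            throw new IllegalArgumentException("no ratings to average");
        }
        long sum = 0;
        for (int s : stars) {
            sum += s;
        }
        return (double) sum / stars.length;
    }

    // Pcm: the midpoint of the customer's average and the movie's average.
    static double predict(double customerAvg, double movieAvg) {
        return (customerAvg + movieAvg) / 2.0;
    }

    public static void main(String[] args) {
        double ac = average(new int[] {5, 4, 3});    // this customer's ratings
        double am = average(new int[] {2, 3, 3, 4}); // this movie's ratings
        System.out.println(predict(ac, am));         // prints 3.5
    }
}
```

What to predict for a customer or movie with no ratings at all is left open here; the sketch simply refuses to average an empty array.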

So then I started playing around with trying to locate groups of customers who liked movies and groups who hated movies, and trying to use these as part of the prediction.  This was getting potentially complicated, so I decided to keep each Movie’s Ratings sorted by customerId.  This made it very fast to determine (via binary search) whether a particular customer had seen a given Movie.  So, the prediction logic could now reasonably inquire about these sorts of things.  But every time I change a little thing it’s another 14 minutes to see the tweak in action.  So, I shifted priorities and wrote a bit of code to Serialize the data to the file system in its new byte[] layout.
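The binary search over a Movie’s sorted ratings might look something like the sketch below. It assumes the 5-byte layout described earlier (15 movie bits, 22 customer bits, 3 star bits, most significant byte first) and records sorted by ascending customerId; `RatingSearch`, `find` and the `build` helper are hypothetical names used only for this example.

```java
// Hypothetical sketch: binary search over 5-byte records sorted by customerId.
final class RatingSearch {
    static final int RECORD_BYTES = 5;

    // Extracts the 22 customer bits of record number `index`.
    static int customerIdAt(byte[] ratings, int index) {
        int o = index * RECORD_BYTES;
        return ((ratings[o + 1] & 0x01) << 21)
             | ((ratings[o + 2] & 0xFF) << 13)
             | ((ratings[o + 3] & 0xFF) << 5)
             | ((ratings[o + 4] & 0xFF) >>> 3);
    }

    // Extracts the 3 star bits of record number `index`.
    static int starsAt(byte[] ratings, int index) {
        return ratings[index * RECORD_BYTES + 4] & 0x07;
    }

    // Returns the record index for customerId, or -1 if that customer has no rating here.
    static int find(byte[] ratings, int customerId) {
        int lo = 0;
        int hi = ratings.length / RECORD_BYTES - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = customerIdAt(ratings, mid);
            if (c < customerId) {
                lo = mid + 1;
            } else if (c > customerId) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    // Demo helper: packs ratings for one movie; customerIds must already be ascending.
    static byte[] build(int movieId, int[] customerIds, int[] stars) {
        byte[] out = new byte[customerIds.length * RECORD_BYTES];
        for (int i = 0; i < customerIds.length; i++) {
            long bits = ((long) movieId << 25) | ((long) customerIds[i] << 3) | stars[i];
            for (int b = 0; b < RECORD_BYTES; b++) {
                out[i * RECORD_BYTES + b] = (byte) (bits >>> (8 * (RECORD_BYTES - 1 - b)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] ratings = build(3, new int[] {6, 7, 1000, 2649429}, new int[] {3, 4, 5, 1});
        int idx = find(ratings, 1000);
        System.out.println(idx + " -> " + starsAt(ratings, idx) + " stars"); // prints 2 -> 5 stars
        System.out.println(find(ratings, 8));                                // prints -1
    }
}
```

Reading the customerId straight out of the bytes means the search never has to allocate a Rating object, which fits the C-written-in-Java spirit of the rest of the code.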

All the data now loads in 15 seconds ( sometimes as fast as 7! ).  So, now it’s time to start experimenting with the predictions.  And it’s time to call it a night.
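The post doesn’t say which mechanism does the serializing, so the following is only one plausible way to dump and reload a packed byte[] quickly: a length prefix followed by the raw bytes, using plain `DataOutputStream` and `DataInputStream`. The class `RatingStore` and the file format are assumptions made for this sketch.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: writes a packed ratings array as [int length][raw bytes].
final class RatingStore {
    static void save(Path file, byte[] ratings) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(ratings.length);
            out.write(ratings);
        }
    }

    static byte[] load(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            byte[] ratings = new byte[in.readInt()];
            in.readFully(ratings); // blocks until the whole array has been read
            return ratings;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("ratings", ".bin");
        save(file, new byte[] {1, 2, 3, 4, 5});
        System.out.println(load(file).length); // prints 5
        Files.delete(file);
    }
}
```

Because the bytes on disk are already in the in-memory layout, loading is one bulk read with no parsing, which is where a speedup of this kind would come from.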

My first RMSE this night, with the super-naive (Ac + Am)/2 algorithm, seemed to be up in the 1.10 neighborhood.  The leaders are at about 0.86, and to win someone will need to get the score down to about 0.85 ( they are close! ).
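For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings. A minimal Java sketch of the calculation follows; the class name `Rmse` and the array-based signature are illustrative assumptions, not how the real scoring is wired up.

```java
// Hypothetical sketch: root mean squared error between predictions and actual stars.
final class Rmse {
    static double rmse(double[] predicted, int[] actual) {
        if (predicted.length != actual.length || actual.length == 0) {
            throw new IllegalArgumentException("need equally sized, non-empty arrays");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double err = predicted[i] - actual[i];
            sumSquares += err * err;
        }
        return Math.sqrt(sumSquares / actual.length);
    }

    public static void main(String[] args) {
        // Predicting 3.0 for a 4-star and a 2-star rating is off by 1 both times.
        System.out.println(rmse(new double[] {3.0, 3.0}, new int[] {4, 2})); // prints 1.0
    }
}
```

Since errors are squared before averaging, a few badly wrong predictions hurt the score more than many slightly wrong ones.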
