Over a year ago I remember taking a gander at the Netflix Prize and approaching it with my basic lightweight J2EE toolkit that I’ve come to know and love. Basically this entails using Hibernate, Spring, C3P0 and MySQL to architect things. I did all the basic stuff with Hibernate backed Pojo’s which knew how to create their logical DDL representations in MySQL and had Spring managing my services and daos and all was well.
Well not exactly. You see, OO programming in Java turns out to be a bit of a memory hog. All of the performance articles I see about Java vs C++ are concerned with raw processing speed and not so much with the memory footprint comparable solutions in each may incur.
So, the Netflix Prize basically makes a dataset with 17,700 movies, 480,000 customers and then 100,000,000 ratings of the movies by the customers. So, I had my Rating object which held a reference to a Movie, a reference to a Customer and then had internals to represent the score associated with it. In turn the Movie object had a Collection of Ratings and the Customer object did as well. Long story short – it turned out that I would need about 7 Gigabytes to hold all of this in RAM. Bleah.
I realized that I’d have to make changes and started down that road. First step was to do away with having these objects be so interconnected. I’d give only the Movie a list of Ratings and the Rating itself would house the movieId, customerId and score as primitives. I knew space was at a premium and chose short for movieId, int for customerId, and byte for score. This was 7 bytes per Rating. This should only require 700MB for 100,000,000 of these 7 byte Ratings. Not the case. I managed to get a little bit more than 1/3 of the dataset loaded into 1 Gig. So, all of these Objects are apparently still killing me.
I still need 3 Gig. Bleah.
I want to say here that all of this is happening on the same night and I’m in some sort of programming nirvana. I decide that a good next step will be to start programming more like a C programmer. Instead of having the 3 primitives, I decided to force all the data into a byte[5] array. Only 5 bytes per rating, but with the added cost of having to blit the actual values in and out of the underlying byte[]. I figured this would be a step in the right direction and get me a bit closer to the goal.
Wrong. It actually got worse and needed 4 Gig of RAM. Say what?!? Yeah, it turns out that 100,000,000 byte[]’s carry some hefty overhead. So, this isn’t working either.
Next idea – ok. All these byte[]’s are bad, so let’s just put 5 byte instance variables on each Rating Object. This had the intended result ( that I’d expected earlier ) and now a bit more than half the dataset will load in my 1 Gig.
Down to needing about 1.8 Gig from the initial 7 Gig. But not good enough. I should be able to squeeze this stuff into 500 MB or so with the 100 Million 5-byte Ratings.
So, now I decide that all of the Rating Objects are really, really bad. So, instead of Movie’s having a Collection of Ratings, now they get a big byte[] and I use pointer arithmetic to move virtual Rating objects in and out of this underlying byte[]. Bingo! Finally things work.
Now it all fits in 620MB, and it’s a lot faster. The entire dataset loads in about 14 minutes.
Time for bed on my first night on the Netflix Prize. More to come.