Prototype, JQuery and JQueryUI

February 18th, 2009 deevis No comments

We’ve switched over to using JQuery from Prototype at my current employer.  We’d previously used Prototype-1.5 for a long time and had a lot of dependencies on it.  When we first started using JQuery it was version 1.1.3.1.  There was no JQueryUI at that time.

So, we currently use JQuery-1.2.6 and JQueryUI-1.6rc2 and I really like them a lot.  Oh yeah, we still have Prototype in the mix and it’s version 1.6 ( which is lots better than 1.5 ).

I recently, however, had issues with attempting to port new code back to a previous install.  The new code was using JQueryUI, and the old code was on Prototype 1.5 and JQuery 1.1.3.1.  Well – it took me a while to figure this out ( and I mentioned it already above ), but there is NO JQueryUI version compatible with JQuery-1.1.3.1.  It wasn’t created until the 1.2.x branches of JQuery.

So – here’s the compatibility matrix for Prototype, JQuery and JQueryUI for any of you who care:

Prototype   JQuery    JQueryUI
1.5         1.1.3.1   none
1.6         1.2.6     1.5.3
1.6         1.3+      1.6rc6
1.6.0.2     1.3.2     1.7

Ciao.

First submission to Netflix Prize

December 3rd, 2008 deevis No comments

So, I got my algorithm working on the probe dataset and it is getting a respectable .7859.  I was very excited about this because only a .856 is needed on the actual qualifying dataset.

Turns out my score on the qualifying dataset is much worse: .9696.

Back to the drawing board.  I think I need to remove the Ratings being predicted in the probe dataset from the training set.

Categories: Netflix Prize Tags:

Netflix Prize – Night 2

December 3rd, 2008 deevis No comments

So I’ve finally got all of the data loading directly into memory and fitting in 620MB ( down from an initial 7Gig using OO-principles and Hibernate and Spring and Dao’s and all that jazz… ).  The code I’m working with now is like C written in Java.  Lots of pointer arithmetic, blitting multiple fields into 5 bytes, none of which is its own value.  I should clarify how this is working, actually.

A Rating has 3 pieces of information I’d like stored with it:

  1. movieId       15 bits needed
  2. customerId    22 bits needed
  3. rating         3 bits needed

So, those are the 40 bits being munged into the 5 bytes.  Here’s how they fit in, precisely:

// movie bits        mmmmmmmm  mmmmmmm0  00000000  00000000  00000000
// customer bits     00000000  0000000c  cccccccc  cccccccc  ccccc000
// stars bits        00000000  00000000  00000000  00000000  00000sss

// munged            mmmmmmmm  mmmmmmmc  cccccccc  cccccccc  cccccsss

And these 5 bytes are all living in an external byte[] owned by a Movie.
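Here is a minimal Java sketch of that packing, assuming the 40 bits are written big-endian at a record index inside the shared byte[].  The class and method names ( RatingCodec, put, get ) are made up for illustration and are not the actual code from this project:

```java
/** Illustrative sketch: packs (movieId, customerId, stars) into 5 bytes of a shared byte[]. */
final class RatingCodec {
    static final int BYTES_PER_RATING = 5;

    /** Writes one rating at record index i of buf, most significant byte first. */
    static void put(byte[] buf, int i, int movieId, int customerId, int stars) {
        // 15 movie bits | 22 customer bits | 3 star bits = 40 bits
        long packed = ((long) (movieId & 0x7FFF) << 25)
                    | ((long) (customerId & 0x3FFFFF) << 3)
                    | (stars & 0x7);
        int off = i * BYTES_PER_RATING;
        for (int b = 0; b < BYTES_PER_RATING; b++) {
            buf[off + b] = (byte) (packed >>> (8 * (BYTES_PER_RATING - 1 - b)));
        }
    }

    /** Reads the 40 packed bits of record i back into the low bits of a long. */
    static long get(byte[] buf, int i) {
        int off = i * BYTES_PER_RATING;
        long packed = 0;
        for (int b = 0; b < BYTES_PER_RATING; b++) {
            packed = (packed << 8) | (buf[off + b] & 0xFF);
        }
        return packed;
    }

    static int movieId(long packed)    { return (int) (packed >>> 25) & 0x7FFF; }
    static int customerId(long packed) { return (int) (packed >>> 3) & 0x3FFFFF; }
    static int stars(long packed)      { return (int) (packed & 0x7); }
}
```

Going through a long keeps the shifts easy to check against the diagram above; a tuned version would probably shift the bytes directly.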

When I left it was taking about 14 minutes to load the data into memory to work with.  I’m excited and I decide to start playing around with doing some predictions.

Pcm == Predicted Score of (c)ustomer for (m)ovie.

Ac == Average rating of movies by (c)ustomer.

Am == Average of all ratings for a particular (m)ovie.

My first stab was to simply say that Pcm = (Ac + Am) / 2.  It was simple and fast and started spitting results out.  Well, after I waited the 14 minutes for the data to load into memory.  Gotta do something about this – soon.
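As a sketch, that first predictor is only a couple of lines.  The names here ( BaselinePredictor, mean, predict ) are hypothetical, and the averages are assumed to be computed from plain arrays of star ratings:

```java
/** Illustrative sketch of the naive predictor Pcm = (Ac + Am) / 2. */
final class BaselinePredictor {
    /** Average of a non-empty array of star ratings. */
    static double mean(int[] stars) {
        long sum = 0;
        for (int s : stars) {
            sum += s;
        }
        return (double) sum / stars.length;
    }

    /** Pcm: the midpoint of the customer's average (Ac) and the movie's average (Am). */
    static double predict(double customerAvg, double movieAvg) {
        return (customerAvg + movieAvg) / 2.0;
    }
}
```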

So then I started playing around with trying to locate groups of customers who liked movies and groups who hated movies and trying to use these in part of the rating.  This was getting potentially complicated and I decided to keep each Movie’s Ratings sorted by customerId.  This made it very fast to determine (via binary search) whether a particular customer had seen a given Movie.  So, the prediction logic could now reasonably inquire about these sorts of things.  But every time I change a little thing it’s another 14 minutes to see the tweak in action.  So, I shifted priorities and wrote a bit of code to Serialize the data to the file system in its new byte[] layout.
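A binary search over those sorted 5-byte records might look roughly like this.  It is a sketch that assumes the 15/22/3 bit layout described earlier and a byte[] holding only one Movie’s ratings; RatingSearch and its methods are illustrative names, not the project’s code:

```java
/** Illustrative sketch: binary search by customerId over packed 5-byte ratings. */
final class RatingSearch {
    static final int REC = 5; // bytes per packed rating

    /** Packs one rating at record index i (15 movie bits, 22 customer bits, 3 star bits). */
    static void put(byte[] buf, int i, int movieId, int customerId, int stars) {
        long packed = ((long) (movieId & 0x7FFF) << 25)
                    | ((long) (customerId & 0x3FFFFF) << 3)
                    | (stars & 0x7);
        for (int b = 0; b < REC; b++) {
            buf[i * REC + b] = (byte) (packed >>> (8 * (REC - 1 - b)));
        }
    }

    /** Reads the 22-bit customerId of record i. */
    static int customerIdAt(byte[] buf, int i) {
        long packed = 0;
        for (int b = 0; b < REC; b++) {
            packed = (packed << 8) | (buf[i * REC + b] & 0xFF);
        }
        return (int) (packed >>> 3) & 0x3FFFFF;
    }

    /** Returns the record index for customerId, or -1 if absent; records must be sorted by customerId. */
    static int find(byte[] buf, int customerId) {
        int lo = 0;
        int hi = buf.length / REC - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = customerIdAt(buf, mid);
            if (c < customerId) {
                lo = mid + 1;
            } else if (c > customerId) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }
}
```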

All the data now loads in 15 seconds ( sometimes as fast as 7! ).  So, now it’s time to start experimenting with the predictions.  And it’s time to call it a night.
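For the curious, dumping per-movie byte[]’s to disk and reading them back can be sketched as below.  This assumes a simple length-prefixed file format of my own choosing for the example; RatingStore is a hypothetical name and the real on-disk layout may well differ:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

/** Illustrative sketch: saves and loads one packed byte[] per movie. */
final class RatingStore {
    /** Writes the movie count, then each movie's byte count followed by its raw bytes. */
    static void save(File file, byte[][] ratingsByMovie) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(ratingsByMovie.length);
            for (byte[] ratings : ratingsByMovie) {
                out.writeInt(ratings.length);
                out.write(ratings);
            }
        }
    }

    /** Reads back exactly what save() wrote. */
    static byte[][] load(File file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            byte[][] ratingsByMovie = new byte[in.readInt()][];
            for (int i = 0; i < ratingsByMovie.length; i++) {
                ratingsByMovie[i] = new byte[in.readInt()];
                in.readFully(ratingsByMovie[i]);
            }
            return ratingsByMovie;
        }
    }
}
```

Since the bytes on disk are already in their in-memory layout, loading is mostly bulk reads with no per-rating parsing.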

My first RMSE this night, with the super-naive (Ac + Am)/2 algorithm seemed to be up in the 1.10 neighborhood.  The leaders have 0.86 and to win someone will need to get the score down to 0.85 ( they are close! ).
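RMSE is the root mean squared error between predicted and actual ratings.  A small sketch of the calculation ( the class name Rmse is just for this example ):

```java
/** Illustrative sketch: root mean squared error between predictions and actual star ratings. */
final class Rmse {
    static double rmse(double[] predicted, int[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double error = predicted[i] - actual[i];
            sumOfSquares += error * error;
        }
        return Math.sqrt(sumOfSquares / predicted.length);
    }
}
```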

Categories: Netflix Prize Tags:

Digging into the Netflix Prize

December 3rd, 2008 deevis No comments

Over a year ago I remember taking a gander at the Netflix Prize and approaching it with my basic lightweight J2EE toolkit that I’ve come to know and love.  Basically this entails using Hibernate, Spring, C3P0 and MySQL to architect things.  I did all the basic stuff with Hibernate backed Pojo’s which knew how to create their logical DDL representations in MySQL and had Spring managing my services and daos and all was well.

Well not exactly.  You see, OO programming in Java turns out to be a bit of a memory hog.  All of the performance articles I see about Java vs C++ are concerned with raw processing speed and not so much with the memory footprint comparable solutions in each may incur.

So, the Netflix Prize basically makes available a dataset with 17,700 movies, 480,000 customers and then 100,000,000 ratings of the movies by the customers.  So, I had my Rating object which held a reference to a Movie, a reference to a Customer and then had internals to represent the score associated with it.  In turn the Movie object had a Collection of Ratings and the Customer object did as well.  Long story short – it turned out that I would need about 7 Gigabytes to hold all of this in RAM.  Bleah.

I realized that I’d have to make changes and started down that road.  First step was to do away with having these objects be so interconnected.  I’d give only the Movie a list of Ratings and the Rating itself would house the movieId, customerId and score as primitives.   I knew space was at a premium and chose short for movieId, int for customerId, and byte for score.  This was 7 bytes per Rating.  This should only require 700MB for 100,000,000 of these 7 byte Ratings.  Not the case.  I managed to get a little bit more than 1/3 of the dataset loaded into 1 Gig.  So, all of these Objects are apparently still killing me.

I still need 3 Gig.  Bleah.

I want to say here that all of this is happening on the same night and I’m in some sort of programming nirvana.  I decide that a good next step will be to start programming more like a C programmer.  Instead of having the 3 primitives, I decided to force all the data into a byte[5] array.  Only 5 bytes per rating, but with the added cost of having to blit the actual values in and out of the underlying byte[].  I figured this would be a step in the right direction and get me a bit closer to the goal.

Wrong.  It actually got worse and needed 4 Gig of RAM.  Say what?!?  Yeah, it turns out that 100,000,000 byte[]’s carry some hefty overhead.  So, this isn’t working either.

Next idea – ok.  All these byte[]’s are bad, so let’s just put 5 byte instance variables on each Rating Object.  This had the intended result ( that I’d expected earlier ) and now a bit more than half the dataset will load in my 1 Gig.

Down to needing about 1.8 Gig from the initial 7 Gig.  But not good enough.  I should be able to squeeze this stuff into 500 MB or so with the 100 Million 5-byte Ratings.
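A back-of-the-envelope estimate shows why the per-object bookkeeping dwarfs the 5 bytes of payload.  The sketch below assumes a 32-bit JVM with 8-byte object headers, 12-byte array headers, 4-byte references and 8-byte alignment.  Those sizes are assumptions, they vary by JVM and settings, and the estimate ignores the overhead of the Collection itself:

```java
/** Illustrative sketch: rough heap cost per Rating under assumed JVM object sizes. */
final class HeapEstimate {
    /** Rounds up to the assumed 8-byte object alignment. */
    static long align(long bytes) {
        return (bytes + 7) / 8 * 8;
    }

    /** One Rating object with fieldBytes of primitive fields, plus the reference pointing at it. */
    static long ratingObject(int header, int ref, int fieldBytes) {
        return align(header + fieldBytes) + ref;
    }

    /** One Rating object wrapping its own byte[payload], plus the reference pointing at the Rating. */
    static long ratingWithOwnArray(int header, int arrayHeader, int ref, int payload) {
        return align(header + ref) + align(arrayHeader + payload) + ref;
    }

    public static void main(String[] args) {
        long ratings = 100_000_000L;
        int header = 8, arrayHeader = 12, ref = 4; // assumed 32-bit JVM sizes
        System.out.println("Rating + byte[5]    : " + ratings * ratingWithOwnArray(header, arrayHeader, ref, 5) + " bytes");
        System.out.println("Rating with 5 bytes : " + ratings * ratingObject(header, ref, 5) + " bytes");
        System.out.println("flat byte[] records : " + ratings * 5 + " bytes");
    }
}
```

Under these assumptions a Rating that owns a byte[5] costs 44 bytes and a Rating with five byte fields costs 20 bytes, which is in the same ballpark as the 4 Gig and 1.8 Gig figures above.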

So, now I decide that all of the Rating Objects are really, really bad.  So, instead of Movie’s having a Collection of Ratings, now they get a big byte[] and I use pointer arithmetic to move virtual Rating objects in and out of this underlying byte[].  Bingo!  Finally things work.

Now it all fits in 620MB, and it’s a lot  faster.  The entire dataset loads in about 14 minutes.

Time for bed on my first night on the Netflix Prize.  More to come.

Categories: Netflix Prize Tags:

Does My Bus Look Big in This?

June 21st, 2008 deevis No comments

I just watched Martin Fowler and Jim Webber’s collective presentation on SOA and ESB.  I entered into the presentation hoping to learn about how they’d recommend I use Mule or ServiceMix.  Well, they didn’t tell me which one was better or how I should configure it or anything even close to that.  I don’t want to spoil it all for you, but I’ll say that they did tell me that SOA actually stands for “Same Old Atrocity” and that ESB stands for “Erroneous Spaghetti Box”.  Oh yeah, I almost forgot – Mule and ServiceMix, while not named explicitly, may have man-boobs.

Here are my notes from the presentation:

Does My Bus Look Big in This?
-----------------------------
Talk given by Jim Webber and Martin Fowler

Most integration tools are bloated lard asses because they’ve been on a rich
diet of BPM, Transformations, Security, Adapters, Rules Engine and GUI tools.

Integration software is a bridge built between two existing applications (silos)

SOA to the rescue!  ( not! )  It’s the same old stuff.  Same old atrocity -
if a single service ever needs to change then you’re screwed.

SOA != Service Oriented Architecture
SOA == Same Old Atrocity

ESB – this should fix things.  Just plumb your databases and services in
and you are suddenly Scalable and Governable.

ESB != Enterprise Service Bus
ESB == Erroneous Spaghetti Box

ESB is an Architectural Fantasy – at least as a magic bullet.

Mainstream SOA Today : it doesn’t run and it’s full of fat.

We need to think carefully, plan ahead, design interfaces wisely.

Agile Architecture:
1) Accept change as an inevitable part of the Software Process
2) The most important part of the Software Process is people – we need
to allow the people to be effective.

Continuous Integration
Automated Testing
Refactoring – disciplined way to design and evolve the software
Behaviour Driven Development

Grow the system, incrementally.  With a solid core that functions to begin
with and then add more layers of functionality deliberately over time.

Frameworks have gotten agile, too.  Spring, Hibernate, Rails, etc…

Http and the web – broke some of the rules.  All the links *DON’T HAVE TO WORK* – 404 errors are a good compromise

The web is great because the underlying network is so simple and dumb, but it
facilitates almost any higher level usage layered on top of it.

The web allows us to do things that we never considered in advance.

Great design isn’t considering all the cases and designing for them.  Great design is
when you can handle situations with your architecture that you’ve never thought of
or considered.

Guerilla SOA – just do it in the trenches but don’t bet the entire project on it.  Then
with each small victory, re-prioritise and keep delivering.

Web-based Services: The Web *is* middleware.  HTTP is a big, coordination framework.

The web is slim, trim middleware – not fat bloated like SOA and ESB.  The web is ubiquitous!

The web is incremental and, therefore, low risk.

So, we don’t need what middleware vendors are selling us.

SQUID?!?  What is it?

Squid is your ESB – a Big, Big Proxy Server.

Proprietary Middleware  vs  Web-centric techniques
--------------------------------------------------
Big, up-front design  vs  Evolutionary
Lengthy death-marches  vs  Constant delivery
Expensive  vs  Inexpensive
Risky  vs  Incremental
Enterprise Scale  vs  Internet Scale
Specialised  vs  Commoditised
Integration separate activity  vs  Integration by-product of delivering business value
Not very sensible  vs  Quite sensible

Apply the same ideas we’ve come up with in application development when attacking the larger
Enterprise issues.

Categories: Architecture Tags:

Spring Dynamic Modules for OSGi

June 17th, 2008 deevis 1 comment

OSGi – http://www.infoq.com/presentations/colyer-server-side-osgi

This is a very informative talk given by Adrian Colyer – CTO of Interface 21 – on the new OSGi support within Spring 2.5

------------------------------------------
OSGi == The Dynamic Module System for Java
In OSGi jars/modules are called “Bundles”
A module can declare what it provides and what it requires.
Modules can be installed, started, stopped, uninstalled, and updated *at runtime*
Bundles can publish services dynamically.
Publish/Find/Bind – Service Registry allows this
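
To make publish/find/bind concrete, here is a toy in-memory registry.  It is NOT the OSGi API ( the real thing lives on BundleContext and deals with service references, properties and events ); ToyServiceRegistry and its methods are invented purely to illustrate the idea that services can appear and disappear at runtime:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Toy illustration of publish/find/bind. Hypothetical class, not part of OSGi. */
final class ToyServiceRegistry {
    private final Map<Class<?>, Object> services = new ConcurrentHashMap<>();

    /** Publish: make an implementation available under its interface type. */
    <T> void publish(Class<T> type, T implementation) {
        services.put(type, implementation);
    }

    /** Withdraw a service, as happens when its bundle is stopped. */
    void unpublish(Class<?> type) {
        services.remove(type);
    }

    /** Find: look a service up by type; empty if nobody currently provides it. */
    <T> Optional<T> find(Class<T> type) {
        return Optional.ofNullable(type.cast(services.get(type)));
    }
}
```

A consumer "binds" by calling find() each time it needs the service rather than caching the result, since the provider may have gone away in the meantime.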

A Brief History of OSGi
- started in 1999, focus on embedded Java and networked devices
- 2003: extended support to mobile devices
- 2004: significant open source community adoption ( Eclipse plugins )
- 2006: OSGi moving into server-side Java

Current implementations:
- Eclipse Equinox
- Apache Felix
- Makewave Knopflerfish
- Prosyst mBedded Server Professional Edition

Paremus?!? Enterprise company using OSGi – who are they?

How does OSGi help me?
- Strong modularity
By default a bundle is a black box and classes are isolated from other bundles.
A bundle can export one or more packages.
Only exported packages are visible outside of the exporting bundle.
Modularity gives : Independent development, easier to maintain, faster development cycles.

- Versioning support
Versioning allows two Libraries A and B that you depend on to each use a different
version of Library C ( v1 and v2 ), even when v1 and v2 are incompatible. This is fine
so long as no classes from C are leaked back to your application through A and B.

** What about Singletons in Library C?

- Operational control (life cycle)
With OSGi console and/or JMX you can see all modules and their status
Get information on wiring
Install/Update/Stop/Uninstall new bundles
Activate bundles ( publishing services in the process )
Deactivate bundles ( unregistering services in the process )
All of the above happen without stopping or restarting the server.

Type-space contributions
Object-space contributions

OSGi WebApps can be written and deployed either as:
1) OSGi bundles in an OSGi-compliant container – but not many of these are open to this, yet.
2) As a more standard WebApp, but using the Embedded OSGi container ( and the ServletBridge )

Spring 2.5 is OSGi ready.

Class.forName becomes troublesome because the Class may not be exported out of its bundle.
Class.forName does its own caching of Classes, so the correct version of a class may not be found

OSGi doesn’t have a ContextClassLoader because:
1) it has no notion of “context”
2) it has no notion of “application”
** Solutions
1) Eclipse Equinox’s ContextFinder – this looks up the call stack for the most recent bundle
owned class and uses that bundle’s ClassLoader “which works in many, many situations”.
2) Spring Dynamic Modules: CCL Management

Web Applications
- OSGi HttpService: allows for programmatic configuration. ability to register Servlets/resources
under aliases.
- Equinox Http Registry Bundle: declarative configuration.

What does Spring Dynamic Modules for OSGi provide? www.springframework.org/osgi
- Bundle needs: instantiating, configuring, assembling, decorating
- Bundle blueprints
- expose bundle objects as services
- wire service references between bundles
- consistent/easy way to manage dynamics
– services may come and go
– broadcast operations
- test environment
- ContextClassLoader management
- Configuration Admin service integration
- Bundle lifecycle management

OsgiBundleXmlApplicationContext
- uses bundle context and classloader to load resources
- implements Spring’s resource abstraction for OSGi
– relative resource paths resolved to bundle entries
– “bundle:” prefix for explicit specification

Instead of a ContextLoaderListener there is org.sfw.osgi.extender bundle
- acts like “ContextLoaderListener”
- automatically creates Spring application context inside a bundle
when a bundle is started. No code or dependence on Spring APIs required!

Creating a bundle – from jar file to Spring bundle…
- start with a normal old jar: mymodule.jar
- add needed headers to META-INF/MANIFEST.MF
– Bundle-SymbolicName: org.xyz.myapp.mymodule
– Bundle-Version: 1.0
– Bundle-ManifestVersion: 2
- place configuration files in META-INF/spring

Exporting a Service:

<osgi:service id="myServiceOsgi" ref="myService"
interface="org.sfw.osgi.samples.ss.MyService"/>

Importing a Service

<osgi:reference id="aService"
interface="org.sfw.osgi.samples.ss.MyService"/>

What happens if…?
- there isn’t a matching service? Dampening…
- there are several matching services? osgi:set vs osgi:reference, perhaps.
- a matched service goes away at runtime? Dampening…
- new matching services become available at runtime? osgi:set vs osgi:reference, perhaps.

Relevant JSR’s
----------------------------------------------------------
JSR-291 Take OSGi and endorse it on Java Platform
JSR-277 Java Module System ( poor man’s osgi )
JSR-294 Language changes that support Java Module System

Categories: Architecture Tags:

Twitter Architectural Issues

June 14th, 2008 deevis No comments

disclaimer: First I want to give a huge load of thanks to the good folks over at infoQ for being so completely awesome.  They have an amazing site structure and great content along with it.  I felt like a kid in a candy store when I first stumbled onto their site from Spring’s home page.  That being said, I’m going to be making my way through 20+ articles/interviews they’ve done previously and responding in kind with summaries, gleaned factoids, and responses.

I just read infoQ’s Architecting Twitter article.

Concepts

Single-Instance Storage

Data Sharding

Flickr Architecture

Google Architecture

LiveJournal Architecture

Categories: Architecture Tags:

Collections Performance Benchmarks

March 25th, 2008 deevis 1 comment

I’ve been having fun of late pitting some Collections implementations against each other and getting some pretty peculiar results. Well, I guess the only thing peculiar about them is that Vector keeps outperforming ArrayList, even though Vector is synchronized and ArrayList in theory should be faster. One way that ArrayList can lose is when the two aren’t sized correctly upon construction and have to grow to accommodate their datasets. ArrayList grows by 50% with each resize, while Vector doubles. Having to resize is an expensive operation and this explains quite a bit. But even when I remove the resizing ( by sizing correctly upon construction ) ArrayList still gets pwned by Vector? Here are the results from a test with random 40-character Strings being built into a unique Collection where the Collections are sized correctly.

Collections Showdown - presized

Notice how Vector and LinkedList perform virtually exactly the same, but ArrayList is slower, and getting slower as the dataset size increases. Of course, HashSet is still the hands down winner and what you should use when working with Collections that don’t care about ordering. TreeSet is a close second, but bear in mind that TreeSet can perform poorly when your dataset has similar Strings.
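
To put a number on the growth difference, this sketch counts how many reallocations each policy needs to reach a given size.  It only simulates the policies ( 1.5x versus 2x ) rather than inspecting the real classes, and the exact ArrayList formula differs a little between JDK versions, so treat the counts as approximate.  GrowthCount is a made-up name:

```java
/** Illustrative sketch: counts resizes for a 1.5x growth policy versus a doubling policy. */
final class GrowthCount {
    /** Number of reallocations needed to grow from initialCapacity to at least targetSize. */
    static int resizes(int initialCapacity, int targetSize, boolean doubling) {
        long capacity = Math.max(1, initialCapacity);
        int count = 0;
        while (capacity < targetSize) {
            long grown = doubling ? capacity * 2 : capacity + (capacity >> 1);
            capacity = Math.max(grown, capacity + 1); // guarantee progress for tiny capacities
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int target = 1_000_000;
        System.out.println("1.5x growth: " + resizes(10, target, false) + " resizes");
        System.out.println("2x growth  : " + resizes(10, target, true) + " resizes");
    }
}
```

Each resize copies the whole backing array, so the 1.5x policy pays for noticeably more copies when the initial capacity is left at the default.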

So, now it’s time to dig a bit deeper into the entire ArrayList versus Vector thing. What if I build each of them up with integers from 1 to x before the test runs, and then as the test call contains() on each integer 1 to x? Is contains faster or slower for ArrayList?

ArrayList vs Vector - add

So, it looks like the calls to contains() are slower for ArrayList. What about the calls to add() with a presized Collection?

Calling add() x number of times - presized collections ( 2008-03-23, 09:33 pm )

Yet again, Vector is a little bit faster than ArrayList! I’ve looked through the Java source code and so help me I swear they are doing the same effective work – but Vector is synchronized and SHOULD BE SLOWER!
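
For anyone who wants to try this at home, here is a bare-bones timing sketch for add() and contains() on presized lists.  It is not the benchmark framework used for the charts; ListBench is a hypothetical name.  It repeats the measurement for several rounds because the first rounds mostly measure JIT warm-up, which is an easy way for a micro-benchmark to mislead:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

/** Illustrative sketch: crude add()/contains() timings for presized lists. */
final class ListBench {
    /** Nanoseconds taken to add the integers 0..n-1 to the list. */
    static long timeAdd(List<Integer> list, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return System.nanoTime() - start;
    }

    /** Nanoseconds taken to call contains() for each integer 0..n-1. */
    static long timeContains(List<Integer> list, int n) {
        int hits = 0;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            if (list.contains(i)) {
                hits++;
            }
        }
        long elapsed = System.nanoTime() - start;
        if (hits != n) {
            throw new IllegalStateException("list did not contain all values");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int n = 5_000;
        for (int round = 1; round <= 5; round++) { // early rounds are mostly warm-up
            List<Integer> arrayList = new ArrayList<>(n);
            List<Integer> vector = new Vector<>(n);
            long aAdd = timeAdd(arrayList, n);
            long vAdd = timeAdd(vector, n);
            long aHas = timeContains(arrayList, n);
            long vHas = timeContains(vector, n);
            System.out.printf("round %d  add: ArrayList %d ns, Vector %d ns  contains: ArrayList %d ns, Vector %d ns%n",
                    round, aAdd, vAdd, aHas, vHas);
        }
    }
}
```

Swapping the order in which the two lists are measured is another cheap sanity check, since whichever runs first tends to pay more of the warm-up cost.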

So, now I need to branch out and do some testing on other JDK’s and other OS’s. 1.5 and Linux, here I come!

Categories: Benchmarks, JDK Tags:

ArrayList vs Vector

March 18th, 2008 deevis 1 comment

I’ve done some more work on the ArrayList vs Vector performance issue.  The post below originally stated that Vector would/could outperform ArrayList even though it is synchronized while ArrayList is not.  Well, in digging deeper into the issue, I’ve determined that there is a problem with the benchmarking framework being used.  I ran a very simple test with ArrayList and Vector to look at performance of add() and contains() methods and have determined that:

ArrayList.add() is 13% faster than Vector.add()

ArrayList.contains() is 3% faster than Vector.contains()

Now I need to identify the problem with the benchmark framework.

I’ve been surprised of late in my Unique Collection Building benchmarks that TreeSet was outperformed by ArrayList.  One reason this could happen would be due to the nature of the dataset being processed, and the data set I had contained many items with the first 20 characters or so being the same.  You know, lots of entries starting with “/WEB-INF/javascript” perhaps.  This would have made the TreeSet’s comparisons in deciding ordering more costly.

So, now I run the test again but with Vector thrown into the mix and the Strings are all 40 characters long and very, very random.  The odds of differing after 2 or 3 characters are now very high.

TreeSet should do much better.

But how will Vector fare?

You probably know as well as I do that ArrayList is the faster, unsynchronized option to Vector ( much like using StringBuilder instead of StringBuffer ).  But how much faster is ArrayList?  Here are the results:

Building unique collection of random-generated 40 character strings - every element appears twice ( 2008-03-18, 01:15 am )

Categories: Uncategorized Tags:

More on Autoboxing

March 17th, 2008 deevis No comments

I recently did a benchmark on Autoboxing where I determined that Autoboxing cost about 15 nanoseconds. Well, then I got to thinking that I’d written that test to work with int/Integer datatypes. But what about the other types? Today I add Boolean, Float, Double and Long into the mix.

To cut to the chase, I’ll first tell you the cost of Autoboxing these types.

Boolean: 1/1 nanoseconds

Integer: 15/17 nanoseconds

Float: 19/19 nanoseconds

Double: 22/24 nanoseconds

Long: 30/40 nanoseconds

The two numbers are the results for Half and Full Boxing. Half Boxing is going one way from either primitive to Object or Object to primitive. Full Boxing is a full round trip going primitive to Object and back to primitive. So, for the Long results it takes 30 nanoseconds to Half Box (one way) while it takes 40 nanoseconds to Full Box ( round trip ).
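
In code, the two cases look like this.  This is a sketch and not the benchmark behind the numbers above; BoxingDemo is a made-up name.  Autoboxing compiles down to valueOf() calls and unboxing to calls like longValue(), and a modern JIT may optimise much of the allocation away, so timings from a loop like this are only a rough guide:

```java
/** Illustrative sketch: half boxing versus full boxing for long/Long. */
final class BoxingDemo {
    /** Half boxing: primitive to Object, one way (compiled as Long.valueOf(v)). */
    static Long halfBox(long v) {
        return v;
    }

    /** Full boxing: primitive to Object and back (valueOf followed by longValue). */
    static long fullBox(long v) {
        Long boxed = v;
        return boxed;
    }

    /** Very rough nanoseconds per full box; the JIT may remove some of the work. */
    static double nanosPerFullBox(int iterations) {
        long sum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sum += fullBox(1_000L + i);
        }
        long elapsed = System.nanoTime() - start;
        if (sum < 0) { // using sum keeps the loop from being discarded outright
            throw new IllegalStateException("unexpected overflow");
        }
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        nanosPerFullBox(1_000_000); // warm-up pass
        System.out.println(nanosPerFullBox(10_000_000) + " ns per full box (rough)");
    }
}
```

One thing worth knowing when reading the Boolean result: Boolean.valueOf() always hands back one of two shared instances, so boxing a boolean never allocates, which may be part of why it is so cheap.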

So, the underlying datatype’s size seems to directly impact the cost of the Autoboxing. I’ll be doing more tests with other types shortly. Here are the charts from the Boolean and Long trials.

Boolean Autoboxing Results

Integer Autoboxing Results

Float Autoboxing Results

Double Autoboxing Results

Long Autoboxing Results

Categories: Uncategorized Tags: