Building a unique collection of strings

I recently encountered some Java code that was building a unique list of String objects in order to prevent multiple processing of any of them. The code was using an ArrayList and would make a call to contains() to determine whether or not to add each String. I knew this wasn’t the optimal way of doing it and wanted to immediately change the implementation to use HashSet, but figured I’d benchmark it and learn exactly what I’m dealing with. In addition to benchmarking the ArrayList implementation and a HashSet implementation, I also threw a TreeSet and the new java.util.concurrent.ConcurrentSkipListSet into the mix. What do you think? Which will be the fastest between HashSet, TreeSet and ArrayList implementations?

ArrayList

[source:java]
public void runInternalTrial() throws Exception {
List list = new ArrayList();
int count = 0;
while ( count < getWorkload() ) {
for ( String s : data ) {
if ( ! list.contains( s )) {
list.add( s );
}
count++;
if ( count >= getWorkload() ) break;
}
}
}

[/source]

HashSet

[source:java]
public void runInternalTrial() throws Exception {
Set set = new HashSet();
int count = 0;
while ( count < getWorkload() ) {
for ( String s : data ) {
set.add( s );
count++;
if ( count >= getWorkload() ) break;
}
}
}

[/source]

Here are the results: ( on the time axis I’ve got the size of the data being processed )

Building a unique set of Strings

Of interest from the results shown above is that the ArrayList implementation is actually faster for the extremely small data sizes of 5 & 10. I wasn’t exactly sure how the TreeSet would perform, but I thought it might out perform the ArrayList. I guess this is generally going to be dependent on the nature of the data being processed. HashSet, sure enough, is the hands down winner. Even with the small sized datasets where ArrayList wins, the amounts are down in the hundreds of nanoseconds realm. Once the size of the dataset hits ( still very small ) size 15, then HashSet is already 1,500 nanoseconds faster.

The new ConcurrentSkipListSet, while it is completely blown away performance-wise by even ArrayList and TreeSet, shouldn’t necessarily be counted out. If the building of the unique list isn’t something that happens too often, then even the cost of an extra 30,000 nanoseconds isn’t going to be noticable. If you are in your service layer and are using the ConcurrentSkipListSet to maintain state that may be accessed/mutated by multiple threads, the guaranteed thread safety may well be worth this extra speed cost.

One Response to Building a unique collection of strings

  1. Pingback: Internet Marketing Email » Blog Archive » Javatech ยป Building a unique collection of strings