Supercharging your HashMaps

by David O'Meara

There is plenty of content around that discusses the relative performance of the Java Collections classes, mostly in terms of adding and retrieving items, along with some of the specifics under the covers such as the effect of the initial capacity. HashMaps and HashSets are special cases that make heavy use of the Object.hashCode() contract, and as such are particularly sensitive to the behaviour of that method.

This article was inspired by a couple of classes in the application I support at work. It has a couple of very large HashMaps: one containing several thousand properties and settings, and the other containing tens of thousands of translations. Since they are so pervasive, their performance has a direct impact on the application, and since they are so large, it was worth taking the time to see how they performed. From there I took a tangent into the specifics affecting HashMap performance, which became the article below.

Firstly, a recommendation: if you haven't read Joshua Bloch's Effective Java, you should. Much of the information below follows on from concepts in his book, but there is no replacement for the wealth of detail the book contains.

While looking at HashMaps, we will of course measure the cost of loading and retrieving data, but we'll also measure the time taken to iterate the contents, and examine the effect of the initial capacity at the same time. Having read the API, you are of course aware that the HashSet class is backed by a HashMap, which is why we cover one and not the other.

HashMaps With String Keys

Our HashMaps are keyed against either a String or Integer value, so I thought I would start with Strings.

Loading the Strings

I created three lists of random Strings containing 1,000, 10,000 and 100,000 items. The Strings were from 5 to 10 characters long and all upper case.
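
The generator itself isn't shown here, but a minimal sketch of how such keys might be produced (the method name and the use of java.util.Random are my own choices, not taken from the original test code) could look like this...

    private static final java.util.Random RANDOM = new java.util.Random();

    // Returns a random upper-case String between 5 and 10 characters long.
    private static String randomKey() {
        int length = 5 + RANDOM.nextInt(6); // 5 to 10 inclusive
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('A' + RANDOM.nextInt(26)));
        }
        return sb.toString();
    }

Note that random generation can occasionally produce duplicate keys, which a Map would silently collapse, but with key spaces this large the effect on the counts is negligible.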

The String values were loaded into an array, and the time taken to move them from the array into the HashMap was measured for a variety of initial capacities, using something like this...

    long start = System.nanoTime();
    for (String key : array ) {
        map.put(key, key);
    }
    System.out.println("Took " + (System.nanoTime() - start) / 1000000 + " ms");

Key Size   Initial Capacity   Load Time (ms)
1000       10                 4
1000       100                4
1000       1000               4
1000       10,000             4
1000       100,000            4
1000       1,000,000          4
10,000     10                 20
10,000     100                20
10,000     1000               20
10,000     10,000             20
10,000     100,000            20
10,000     1,000,000          20
100,000    10                 110
100,000    100                110
100,000    1000               110
100,000    10,000             110
100,000    100,000            110
100,000    1,000,000          110
Table 1 - String key loading times

These results are non-threatening and may even be a little faster than expected. You certainly wouldn't think twice about putting 100,000 records into a HashMap if you knew it was going to take a tenth of a second to load.

Retrieving the Strings

Now that we have the data loaded, it isn't much good unless we can get it back out. Since HashMaps are renowned for their ability to return data quickly, we will start with that. Note that to make sure the time spent selecting an item doesn't affect our results, we pick the first item from our initial list and look it up repeatedly, like this...

    String searchKey = array[0];
    long start = System.nanoTime();
    for (int i = 0; i < LOOP_SIZE; i++) {
        map.get(searchKey);
    }
    System.out.println("Took " + (System.nanoTime() - start) / 1000000 + " ms");

Key Size   Initial Capacity   Reads         Read Time (ms)
1000       10                 100 million   ~2400
1000       100                100 million   ~2400
1000       1000               100 million   ~2400
1000       10,000             100 million   ~2400
1000       100,000            100 million   ~2400
1000       1,000,000          100 million   ~2400
10,000     10                 100 million   ~2400
10,000     100                100 million   ~2400
10,000     1000               100 million   ~2400
10,000     10,000             100 million   ~2400
10,000     100,000            100 million   ~2400
10,000     1,000,000          100 million   ~2400
100,000    10                 100 million   ~2400
100,000    100                100 million   ~2400
100,000    1000               100 million   ~2400
100,000    10,000             100 million   ~2400
100,000    100,000            100 million   ~2400
100,000    1,000,000          100 million   ~2400
Table 2 - String key read times

Once again, I hope these results are non-threatening and pretty much as expected. Regardless of the capacity of the HashMap and the number of items it contains, the time taken to retrieve an item does not change. Note that 100 million reads is a fair amount of work for our HashMap to do: a day contains 24 x 60 x 60 x 1000 = 86.4 million milliseconds, so this is slightly more than one read for every millisecond in the day.

Iterating the Strings

The final piece of work to build our basic results is to test the time taken to iterate the keys. To get times that are large enough to measure, we scale the number of iterations inversely with the number of items, so each test walks the same total number of keys, like this...

    long start = System.nanoTime();
    for (int i = 0; i < ITERATE_COUNT; i++) {
        for (String string : map.keySet()) {
            // This operation should be 'free', and the compiler
            // should not be able to compile it out of the byte code.
            string.hashCode();
        }
    }
    System.out.println("Took " + (System.nanoTime() - start) / 1000000 + " ms");

Key Size   Initial Capacity   Iterations         Iteration Time (ms)
1000       10                 10,000 (x 1000)    ~510
1000       100                10,000 (x 1000)    ~510
1000       1000               10,000 (x 1000)    ~550
1000       10,000             10,000 (x 1000)    1,100
1000       100,000            10,000 (x 1000)    6,200
1000       1,000,000          10,000 (x 1000)    50,000
10,000     10                 1,000 (x 10,000)   ~520
10,000     100                1,000 (x 10,000)   ~520
10,000     1000               1,000 (x 10,000)   ~520
10,000     10,000             1,000 (x 10,000)   ~550
10,000     100,000            1,000 (x 10,000)   1,200
10,000     1,000,000          1,000 (x 10,000)
100,000    10                 100 (x 100,000)    1800
100,000    100                100 (x 100,000)    1800
100,000    1000               100 (x 100,000)    1800
100,000    10,000             100 (x 100,000)    1800
100,000    100,000            100 (x 100,000)    1800
100,000    1,000,000          100 (x 100,000)
Table 3 - String key iteration times

Summarising Strings as Keys

Now that we have our baseline results using Strings as keys, we can summarise: loading is fast even for 100,000 entries, retrieval time is constant regardless of the map's size or capacity, and iteration is cheap unless the initial capacity is far larger than the number of entries.

To be fair, the String class is a first-class citizen in Java, and benefits from skilled design and special treatment by the compiler. The biggest help is that the String class's hashCode() method is lazily calculated and then cached. The iteration test took advantage of this, since the hash code value had already been calculated and could be reused without being recalculated, but it is also critical to the results we have seen so far, as we are about to discover.

HashMaps With Exotic Keys

While Strings and Integers are common choices as keys, it is often necessary to use more complicated classes, and that complexity can feed directly into the behaviour of the HashMap that uses them. Without going overboard creating a class that is deliberately expensive to hash, we will create a class that contains an array of primitive ints and derives its hash code from the array contents.

The LargeHash Class

// This class is only designed to illustrate the cost of hashing.
// It is not for production use!

import java.util.Arrays;

public class LargeHash {

    private final int[] intValues;

    public LargeHash (final int intValue) {
        intValues = new int[1000];
        for (int i = 0; i < intValues.length; i++) {
            intValues[i] = intValue + i;
        }
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 17;
        result = prime * result + Arrays.hashCode(intValues);
        return result;
    }

    @Override
    public boolean equals (Object obj) {
        if (this == obj )
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        final LargeHash other = (LargeHash) obj;
        if (!Arrays.equals(intValues, other.intValues))
            return false;
        return true;
    }
}

The class is designed to be a HashMap key, and once an instance is created it has only limited functionality. The hashCode() and equals() methods were initially generated by Eclipse, and use methods from the Arrays helper class.

We will perform the same operations of loading, reading and iterating a HashMap containing a large number of these instances as keys, and the results are...

Key Size   Initial Capacity   Load Time (ms)   Reads         Read Time (ms)   Iterations        Iteration Time (ms)
1000       10                 5                100 million   275,000 *        10,000 (x 1000)   500
1000       100                5                100 million   280,000 *        10,000 (x 1000)   500
1000       1000               4                100 million   275,000 *        10,000 (x 1000)   500
1000       10,000             5                100 million   280,000 *        10,000 (x 1000)   1100
1000       100,000            5                100 million   275,000 *        10,000 (x 1000)   6200 **
1000       1,000,000          4                100 million   275,000 *        10,000 (x 1000)   50,000 **
10,000     10                 40               100 million   280,000 *        1000 (x 10,000)   500
10,000     100                40               100 million   280,000 *        1000 (x 10,000)   500
10,000     1000               40               100 million   280,000 *        1000 (x 10,000)   520
10,000     10,000             42               100 million   290,000 *        1000 (x 10,000)   520
10,000     100,000            43               100 million   290,000 *        1000 (x 10,000)   1100 **
10,000     1,000,000          42               100 million   280,000 *        1000 (x 10,000)   5800 **
100,000    10                 350              100 million   280,000 *        100 (x 100,000)   2100
100,000    100                350              100 million   280,000 *        100 (x 100,000)   2100
100,000    1000               350              100 million   280,000 *        100 (x 100,000)   2100
100,000    10,000             350              100 million   280,000 *        100 (x 100,000)   2100
100,000    100,000            330              100 million   280,000 *        100 (x 100,000)   2000 **
100,000    1,000,000          330              100 million   280,000 *        100 (x 100,000)   2400 **
* Values are extrapolated from repeated runs of 1 million reads
** See the afterthoughts...
Table 4 - Simple LargeHash times

The LargeHash Behaviour

Firstly, the load time was larger but still not really noticeable. It took perhaps twice as long, but twice 'not a hell of a lot' is still 'not much'. Interestingly, the API recommends setting the initial capacity to a realistic value to avoid the cost of rehashing, but that cost doesn't appear in the results. It is possible that the initial capacity made a difference with 100,000 items in the Map, but not enough to provide any conclusive results.

The read time was obviously worse, to the extent that I decided to extrapolate the results rather than wait several minutes for each run to complete. This shows that the hashCode() method is too important to HashMap performance to be taken lightly, and care should be taken when choosing or designing a class that will be used as a hashed key. There is nothing terribly wrong with the class itself, but when put under pressure it fails to perform and becomes a drag on the code.

Building a Responsible Key

The lesson to learn from the String class is that lazily calculating and caching the hash value can have a dramatic effect on read times, since the hash code is crucial to HashMap performance. This is particularly true for high-usage HashMaps like the ones in the introduction. It is important to note that the hash code cannot be cached if the instance's state can change in a way that alters the hash code, which implies that the key must first be made immutable before the hash code can be lazily cached.

The LargeHash class is almost immutable already: it doesn't need to take defensive copies of incoming data, and no internal data is exposed. This means we only need to make the class final to fulfil the immutability contract.
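
As a sketch, the two changes amount to the following (the field name 'hash' is my own choice, mirroring the field read by the method shown below)...

// Sketch: the final modifier plus a field to cache the hash code.
public final class LargeHash {

    private final int[] intValues;

    // Cached hash code; zero means 'not yet calculated', as in String.
    private int hash;

    // The constructor and equals() are unchanged;
    // the new hashCode() is shown below.
}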

There are a few options for lazily calculating and caching the hash code, but we will copy the approach used in the String class. Since no synchronization is used, there is a race in which the hash code may be calculated by more than one thread, but the immutable internal state protects us: every thread must calculate the same value. Once the value is set it is defined forever, so we no longer worry about the value changing, and we do not suffer the cost of synchronization that is not required.

It is also important to note that the local variable 'h' is used throughout the method to ensure thread safety; otherwise another thread could see the hash field in an intermediate state and return an incorrect value. The updated method looks like this...

    @Override
    // Style copied from String.hashCode().
    // Relies on the new field in LargeHash: 'private int hash;'
    // where zero means 'not yet calculated'.
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            final int prime = 31;
            h = 17;
            h = prime * h + Arrays.hashCode(intValues);
            hash = h;
        }
        return h;
    }

...and now has performance comparable to the String key while being all the more exotic in nature.

Key Size   Initial Capacity   Load Time (ms)   Reads         Read Time (ms)   Iterations        Iteration Time (ms)
1000       10                 5                100 million   2400             10,000 (x 1000)   500
1000       100                5                100 million   2400             10,000 (x 1000)   500
1000       1000               5                100 million   2400             10,000 (x 1000)   500
1000       10,000             4                100 million   2400             10,000 (x 1000)   1100
1000       100,000            5                100 million   2400             10,000 (x 1000)   6200 **
1000       1,000,000          5                100 million   2400             10,000 (x 1000)   50,000 **
10,000     10                 40               100 million   2400             1000 (x 10,000)   500
10,000     100                40               100 million   2400             1000 (x 10,000)   500
10,000     1000               42               100 million   2400             1000 (x 10,000)   500
10,000     10,000             40               100 million   2400             1000 (x 10,000)   500
10,000     100,000            40               100 million   2400             1000 (x 10,000)   1000 **
10,000     1,000,000          40               100 million   2400             1000 (x 10,000)   5800 **
100,000    10                 350              100 million   2400             100 (x 100,000)   2000
100,000    100                350              100 million   2400             100 (x 100,000)   2000
100,000    1000               440 *            100 million   2400             100 (x 100,000)   1800 *
100,000    10,000             350              100 million   2400             100 (x 100,000)   2000
100,000    100,000            350              100 million   2400             100 (x 100,000)   2000 **
100,000    1,000,000          350              100 million   2400             100 (x 100,000)   2400 **
* These values were a repeatable aberration
** See the afterthoughts...
Table 5 - Optimised LargeHash times

In Conclusion

The hashCode() method matters. In fact, I would extend that to say that all aspects of all contract methods matter. Joshua Bloch has plenty to say about the hashCode(), equals() and compareTo() contracts, and when you consider that these methods are the backbone of large parts of the Java programming language, it may not be sufficient just to get them right. In critical places, the default hashing approach may not be enough.

While the API warns that making the initial capacity too small can incur an overhead due to rehashing as the capacity increases, this was not apparent in the tables above. The cost that large capacities add to iteration is much easier to see. The data therefore suggests that HashMaps should be initialised to a smaller size, and definitely not made overly large.
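
As a rule of thumb (my suggestion, rather than something measured above), you can size a map so that the expected number of entries just fits without a rehash under the default load factor of 0.75, which avoids both rehashing and a wastefully large capacity...

    int expected = 10000; // however many entries you plan to hold
    // Default load factor is 0.75, so allow just enough headroom.
    Map<String, String> map = new HashMap<String, String>((int) (expected / 0.75f) + 1);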

Afterthoughts

Unusual Iteration Times

Tables 4 and 5 show that the iteration time increases as the capacity increases, and we expected this result. What was unexpected were the times that decreased when the Maps contained more entries.

I don't have much more to say about this than "gosh".

Actual Size

An interesting diversion is to look at the actual capacity of the HashMap. The internal size is always kept at a power of two, presumably for performance reasons, but this 'power of two' behaviour is enforced under the covers and certainly not exposed via the API, so as a rule you should ignore it in your own code. In particular, you shouldn't try to initialise HashMaps to an appropriate 'power of two' value: it is already done for you, and by relying on it you may be actively harming yourself if the particulars change in a later implementation.

The actual capacity is the smallest power of two that is both greater than the initial capacity and greater than the number of items divided by the load factor. The second condition is the reason that an initial capacity of 1000 gives an actual capacity of 1024 (i.e. 2^10), but once the map was 75% full (the default load factor is 0.75) the capacity was doubled to 2048 (i.e. 2^11), as seen in Table 6.
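
The rule as stated is easy to express in code; a minimal sketch (the method name is my own, and 0.75f is HashMap's default load factor)...

    // Smallest power of two satisfying both conditions described above.
    static int expectedCapacity(int initialCapacity, int itemCount) {
        final float loadFactor = 0.75f;
        int capacity = 1;
        while (capacity < initialCapacity || capacity * loadFactor < itemCount) {
            capacity <<= 1;
        }
        return capacity;
    }

    // expectedCapacity(1000, 0)    returns 1024 (2^10)
    // expectedCapacity(1000, 1000) returns 2048 (2^11), as in Table 6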

While the actual capacity is not exposed directly, there are several ways to inspect it; I found it easiest to attach a debugger and interrogate the HashMap instance.
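
If you would rather not attach a debugger, reflection can reach the private bucket array. This sketch relies on the field being named 'table', which is an implementation detail (and reflective access may be blocked outright on recent JVMs), so treat it as a diagnostic hack only...

    // Returns the length of the HashMap's internal bucket array.
    static int actualCapacity(HashMap<?, ?> map) throws Exception {
        java.lang.reflect.Field table = HashMap.class.getDeclaredField("table");
        table.setAccessible(true);
        Object[] buckets = (Object[]) table.get(map);
        return buckets == null ? 0 : buckets.length;
    }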

Key Size   Initial Capacity   Actual Capacity
1000       10                 2^11
1000       100                2^11
1000       1000               2^11
1000       10,000             2^14
1000       100,000            2^17
1000       1,000,000          2^20
10,000     10                 2^14
10,000     100                2^14
10,000     1000               2^14
10,000     10,000             2^14
10,000     100,000            2^17
10,000     1,000,000          2^20
100,000    10                 2^17
100,000    100                2^17
100,000    1000               2^17
100,000    10,000             2^17
100,000    100,000            2^17
100,000    1,000,000          2^20
Table 6 - Actual capacities in the previous tests

Iterating Map.Entry Sets

As a final aside, many articles recommend iterating a HashMap's Map.Entry set in preference to iterating the keys and then using each key to obtain its value. We already have the iteration times in the data above, and have also shown that the time to retrieve a value is constant, but we already knew that from the API.
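
For reference, the 'get then read' variant would look something like this (a sketch assuming the same map and ITERATE_COUNT as the earlier tests; its times in Table 7 were derived from Table 5 rather than run directly)...

    long start = System.nanoTime();
    for (int i = 0; i < ITERATE_COUNT; i++) {
        for (String key : map.keySet()) {
            // iterate the keys, then fetch each value with a second lookup
            map.get(key).hashCode();
        }
    }
    System.out.println("Took " + (System.nanoTime() - start) / 1000000 + " ms");

...while the Map.Entry version avoids the extra lookup...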

    long start = System.nanoTime();
    for (int i = 0; i < ITERATE_COUNT; i++) {
        for (Map.Entry<String,String> entry : map.entrySet()) {
            entry.getKey().hashCode();
        }
    }
    System.out.println("Took " + (System.nanoTime() - start) / 1000000 + " ms");

While potentially faster, the Map.Entry version is less readable and would therefore need to be quite a bit faster to warrant adoption, so let's run the tests one last time...

Key Size   Initial Capacity   Iterations         Get then Read * (ms)   Map.Entry (ms)
1000       10,000             10,000 (x 1000)    1340                   1330
1000       100,000            10,000 (x 1000)    6400                   6400
1000       1,000,000          10,000 (x 1000)    50,000                 50,000
10,000     10,000             1,000 (x 10,000)   740                    700
10,000     100,000            1,000 (x 10,000)   1200                   1200
10,000     1,000,000          1,000 (x 10,000)   6000                   5900
100,000    10,000             100 (x 100,000)    2200                   1900
100,000    100,000            100 (x 100,000)    2200                   1500
100,000    1,000,000          100 (x 100,000)    2600                   1900
* Times were calculated using the data from Table 5
Table 7 - Map.Entry iteration times

The Map.Entry set does show faster iteration times at the extreme end of the test data, but otherwise I'm not so convinced.

Code was formatted for output using the Java2Html plug-in for Eclipse, under MyEclipse 6.5.