Supercharging your HashMaps

by David O'Meara

There is plenty of content around that discusses the relative performance of the Java Collections classes, mostly in terms of adding and retrieving items, along with some of the specifics under the covers such as the effect of the initial capacity. HashMaps and HashSets are special cases that make heavy use of the Object.hashCode() contract, and as such are particularly sensitive to the behaviour of that method.

This article was inspired by a couple of classes in the application I support at work. It has a couple of very large HashMaps: one containing several thousand properties and settings, and the other containing tens of thousands of translations. Since they are so pervasive, their performance has a direct impact on the application, and since they are so large, it was worth taking the time to see how they performed. From there I took a tangent into the specifics affecting HashMap performance, which became the article below.

Firstly, a recommendation: if you haven't read Joshua Bloch's Effective Java, you should. Much of the information below follows on from concepts in his book, but there is no replacement for the wealth of detail the book contains.

While looking at HashMaps, we will of course measure the cost of loading and retrieving data, but we'll also measure the time taken to iterate the contents, and examine the effect of the initial capacity at the same time. Having read the API, you are of course aware that the HashSet class is backed by a HashMap, which is why we cover one and not the other.

HashMaps With String Keys

Our HashMaps are keyed against either a String or Integer value, so I thought I would start with Strings.

Loading the Strings

I created three lists of random Strings containing 1,000, 10,000 and 100,000 items. The Strings were from 5 to 10 characters long and all upper case.
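
The generator itself isn't shown here, but a minimal sketch of how such keys might be produced (the method name and the use of java.util.Random are my own choices, not taken from the original test code) could look like this...

    private static final java.util.Random RANDOM = new java.util.Random();

    // Returns a random upper-case String between 5 and 10 characters long.
    private static String randomKey() {
        int length = 5 + RANDOM.nextInt(6); // 5 to 10 inclusive
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('A' + RANDOM.nextInt(26)));
        }
        return sb.toString();
    }

Note that random generation can occasionally produce duplicate keys, which a Map would silently collapse, but with key spaces this large the effect on the counts is negligible.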

The String values were loaded into an array, and the time taken to move them from the array into the HashMap was measured for a variety of initial capacities, using something like this...

    long start = System.nanoTime();
    for (String key : array ) {
        map.put(key, key);
    }
    System.out.println("Took " + (System.nanoTime() - start) / 1000000 + " ms");

Key Size   Initial Capacity   Load Time (ms)
1000       10                 4
1000       100                4
1000       1000               4
1000       10,000             4
1000       100,000            4
1000       1,000,000          4
10,000     10                 20
10,000     100                20
10,000     1000               20
10,000     10,000             20
10,000     100,000            20
10,000     1,000,000          20
100,000    10                 110
100,000    100                110
100,000    1000               110
100,000    10,000             110
100,000    100,000            110
100,000    1,000,000          110
Table 1 - String key loading times

These results are non-threatening and may even be a little faster than expected. You certainly wouldn't think twice about putting 100,000 records into a HashMap if you knew it was going to take a tenth of a second to load.

Retrieving the Strings

Now that we have the data loaded, it isn't much good unless we can get it back out. Since HashMaps are renowned for their ability to return data quickly, we will start with that. Note that to make sure the time spent selecting an item doesn't affect our results, we pick the first item from our initial list and look it up repeatedly, like this...

    String searchKey = array[0];
    long start = System.nanoTime();
    for (int i = 0; i < LOOP_SIZE; i++) {
        map.get(searchKey);
    }
    System.out.println("Took " + (System.nanoTime() - start) / 1000000 + " ms");

Key Size   Initial Capacity   Reads         Read Time (ms)
1000       10                 100 million   ~2400
1000       100                100 million   ~2400
1000       1000               100 million   ~2400
1000       10,000             100 million   ~2400
1000       100,000            100 million   ~2400
1000       1,000,000          100 million   ~2400
10,000     10                 100 million   ~2400
10,000     100                100 million   ~2400
10,000     1000               100 million   ~2400
10,000     10,000             100 million   ~2400
10,000     100,000            100 million   ~2400
10,000     1,000,000          100 million   ~2400
100,000    10                 100 million   ~2400
100,000    100                100 million   ~2400
100,000    1000               100 million   ~2400
100,000    10,000             100 million   ~2400
100,000    100,000            100 million   ~2400
100,000    1,000,000          100 million   ~2400
Table 2 - String key read times

Once again, I hope these results are non-threatening and pretty much as expected. Regardless of the capacity of the HashMap and the number of items it contains, the time taken to retrieve an item does not change. Note that 100 million reads is a fair amount of work for our HashMap to do: a day contains 24 x 60 x 60 x 1000 = 86.4 million milliseconds, so this is slightly more than one read for every millisecond in the day.

Iterating the Strings

The final piece of work to build our basic results is to test the time taken to iterate the keys. To get times that are large enough to measure, we scale the number of iterations inversely with the number of items, so each test walks the same total number of keys, like this...

    long start = System.nanoTime();
    for (int i = 0; i < ITERATE_COUNT; i++) {
        for (String string : map.keySet()) {
            // This operation should be 'free', and the compiler
            // should not be able to compile it out of the byte code.
            string.hashCode();
        }
    }
    System.out.println("Took " + (System.nanoTime() - start) / 1000000 + " ms");

Key Size   Initial Capacity   Iterations         Iteration Time (ms)
1000       10                 10,000 (x 1000)    ~510
1000       100                10,000 (x 1000)    ~510
1000       1000               10,000 (x 1000)    ~550
1000       10,000             10,000 (x 1000)    1,100
1000       100,000            10,000 (x 1000)    6,200
1000       1,000,000          10,000 (x 1000)    50,000
10,000     10                 1,000 (x 10,000)   ~520
10,000     100                1,000 (x 10,000)   ~520
10,000     1000               1,000 (x 10,000)   ~520
10,000     10,000             1,000 (x 10,000)   ~550
10,000     100,000            1,000 (x 10,000)   1,200
10,000     1,000,000          1,000 (x 10,000)
100,000    10                 100 (x 100,000)    1800
100,000    100                100 (x 100,000)    1800
100,000    1000               100 (x 100,000)    1800
100,000    10,000             100 (x 100,000)    1800
100,000    100,000            100 (x 100,000)    1800
100,000    1,000,000          100 (x 100,000)
Table 3 - String key iteration times

Summarising Strings as Keys

Now that we have our baseline results using Strings as keys, we can summarise: loading is fast even for 100,000 entries, retrieval time is constant regardless of the map's size or capacity, and iteration is cheap unless the initial capacity is far larger than the number of entries.

To be fair, the String class is a first-class citizen in Java, and benefits from skilled design and special treatment by the compiler. The biggest help is that the String class's hashCode() method is lazily calculated and then cached. The iteration test took advantage of this, since the hash code value had already been calculated and could be reused without being recalculated, but it is also critical to the results we have seen so far, as we are about to discover.

HashMaps With Exotic Keys

While Strings and Integers are common choices as keys, it is often necessary to use more complicated classes, and that complexity can feed directly into the behaviour of the HashMap that uses them. Without going overboard creating a class that is deliberately expensive to hash, we will create a class that contains an array of primitive ints and derives its hash code from the array contents.

The LargeHash Class

// This class is only designed to illustrate the cost of hashing.
// It is not for production use!

import java.util.Arrays;

public class LargeHash {

    private final int[] intValues;

    public LargeHash (final int intValue) {
        intValues = new int[1000];
        for (int i = 0; i < intValues.length; i++) {
            intValues[i] = intValue + i;
        }
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 17;
        result = prime * result + Arrays.hashCode(intValues);
        return result;
    }

    @Override
    public boolean equals (Object obj) {
        if (this == obj )
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        final LargeHash other = (LargeHash) obj;
        if (!Arrays.equals(intValues, other.intValues))
            return false;
        return true;
    }
}

The class is designed to be a HashMap key, and once an instance is created it has only limited functionality. The hashCode() and equals() methods were initially generated by Eclipse, and use methods from the Arrays helper class.

We will perform the same operations of loading, reading and iterating a HashMap containing a large number of these instances as keys, and the results are...

Key Size   Initial Capacity   Load Time (ms)   Reads         Read Time (ms)   Iterations        Iteration Time (ms)
1000       10                 5                100 million   275,000 *        10,000 (x 1000)   500
1000       100                5                100 million   280,000 *        10,000 (x 1000)   500
1000       1000               4                100 million   275,000 *        10,000 (x 1000)   500
1000       10,000             5                100 million   280,000 *        10,000 (x 1000)   1100
1000       100,000            5                100 million   275,000 *        10,000 (x 1000)   6200 **
1000       1,000,000          4                100 million   275,000 *        10,000 (x 1000)   50,000 **
10,000     10                 40               100 million   280,000 *        1000 (x 10,000)   500
10,000     100                40               100 million   280,000 *        1000 (x 10,000)   500
10,000     1000               40               100 million   280,000 *        1000 (x 10,000)   520
10,000     10,000             42               100 million   290,000 *        1000 (x 10,000)   520
10,000     100,000            43               100 million   290,000 *        1000 (x 10,000)   1100 **
10,000     1,000,000          42               100 million   280,000 *        1000 (x 10,000)   5800 **
100,000    10                 350              100 million   280,000 *        100 (x 100,000)   2100
100,000    100                350              100 million   280,000 *        100 (x 100,000)   2100
100,000    1000               350              100 million   280,000 *        100 (x 100,000)   2100
100,000    10,000             350              100 million   280,000 *        100 (x 100,000)   2100
100,000    100,000            330              100 million   280,000 *        100 (x 100,000)   2000 **
100,000    1,000,000          330              100 million   280,000 *        100 (x 100,000)   2400 **
* Values are extrapolated from repeated runs of 1 million reads
** See the afterthoughts...
Table 4 - Simple LargeHash times

The LargeHash Behaviour

Firstly, the load time was larger but still not really noticeable. It took perhaps twice as long, but twice 'not a hell of a lot' is still 'not much'. Interestingly, the API recommends setting the initial capacity to a realistic value to avoid the cost of rehashing, but that cost doesn't appear in the results. It is possible that the initial capacity made a difference with 100,000 items in the Map, but not enough to provide any conclusive results.

The read time was obviously worse, to the extent that I decided to extrapolate the results rather than wait several minutes for each run to complete. This shows that the hashCode() method is too important to HashMap performance to be taken lightly, and care should be taken when choosing or designing a class that will be used as a hashed key. There is nothing terribly wrong with the class itself, but when put under pressure it fails to perform and becomes a drag on the code.

Building a Responsible Key

The lesson to learn from the String class is that lazily calculating and caching the hash value can have a dramatic effect on read times, since the hash code is crucial to HashMap performance. This is particularly true for high-usage HashMaps like the ones in the introduction. It is important to note that the hash code cannot be cached if the instance's state can change in a way that alters the hash code, which implies that the key must first be made immutable before the hash code can be lazily cached.

The LargeHash class is almost immutable already: it doesn't need to take defensive copies of incoming data, and no internal data is exposed. This means we only need to make the class final to fulfil the immutability contract.
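
As a sketch, the two changes amount to the following (the field name 'hash' is my own choice, mirroring the field read by the method shown below)...

// Sketch: the final modifier plus a field to cache the hash code.
public final class LargeHash {

    private final int[] intValues;

    // Cached hash code; zero means 'not yet calculated', as in String.
    private int hash;

    // The constructor and equals() are unchanged;
    // the new hashCode() is shown below.
}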

There are a few options for lazily calculating and caching the hash code, but we will copy the approach used in the String class. Since no synchronization is used, there is a race in which the hash code may be calculated by more than one thread, but the immutable internal state protects us: every thread must calculate the same value. Once the value is set it is defined forever, so we no longer worry about the value changing, and we do not suffer the cost of synchronization that is not required.

It is also important to note that the local variable 'h' is used throughout the method to ensure thread safety; otherwise another thread could see the hash field in an intermediate state and return an incorrect value. The updated method looks like this...

    @Override
    // Style copied from String.hashCode().
    // Relies on the new field in LargeHash: 'private int hash;'
    // where zero means 'not yet calculated'.
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            final int prime = 31;
            h = 17;
            h = prime * h + Arrays.hashCode(intValues);
            hash = h;
        }
        return h;
    }

...and now has performance comparable to the String key while being all the more exotic in nature.

Key Size   Initial Capacity   Load Time (ms)   Reads         Read Time (ms)   Iterations        Iteration Time (ms)
1000       10                 5                100 million   2400             10,000 (x 1000)   500
1000       100                5                100 million   2400             10,000 (x 1000)   500
1000       1000               5                100 million   2400             10,000 (x 1000)   500
1000       10,000             4                100 million   2400             10,000 (x 1000)   1100
1000       100,000            5                100 million   2400             10,000 (x 1000)   6200 **
1000       1,000,000          5                100 million   2400             10,000 (x 1000)   50,000 **
10,000     10                 40               100 million   2400             1000 (x 10,000)   500
10,000     100                40               100 million   2400             1000 (x 10,000)   500
10,000     1000               42               100 million   2400             1000 (x 10,000)   500
10,000     10,000             40               100 million   2400             1000 (x 10,000)   500
10,000     100,000            40               100 million   2400             1000 (x 10,000)   1000 **
10,000     1,000,000          40               100 million   2400             1000 (x 10,000)   5800 **
100,000    10                 350              100 million   2400             100 (x 100,000)   2000
100,000    100                350              100 million   2400             100 (x 100,000)   2000
100,000    1000               440 *            100 million   2400             100 (x 100,000)   1800 *
100,000    10,000             350              100 million   2400             100 (x 100,000)   2000
100,000    100,000            350              100 million   2400             100 (x 100,000)   2000 **
100,000    1,000,000          350              100 million   2400             100 (x 100,000)   2400 **
* These values were a repeatable aberration
** See the afterthoughts...
Table 5 - Optimised LargeHash times

In Conclusion

The hashCode() method matters. In fact, I would extend that to say that all aspects of all contract methods matter. Joshua Bloch has plenty to say about the hashCode(), equals() and compareTo() contracts, and when you consider that these methods are the backbone of large parts of the Java programming language, it may not be sufficient just to get them right. In critical places, the default hashing approach may not be enough.

While the API warns that making the initial capacity too small can incur an overhead due to rehashing as the capacity increases, this was not apparent in the tables above. The cost that large capacities add to iteration is much easier to see. The data therefore suggests that HashMaps should be initialised to a smaller size, and definitely not made overly large.
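
As a rule of thumb (my suggestion, rather than something measured above), you can size a map so that the expected number of entries just fits without a rehash under the default load factor of 0.75, which avoids both rehashing and a wastefully large capacity...

    int expected = 10000; // however many entries you plan to hold
    // Default load factor is 0.75, so allow just enough headroom.
    Map<String, String> map = new HashMap<String, String>((int) (expected / 0.75f) + 1);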

Afterthoughts

Unusual Iteration Times

Tables 4 and 5 show that the iteration time increases as the capacity increases, and we expected this result. What was unexpected were the times that decreased when the Maps contained more entries.

I don't have much more to say about this than "gosh".

Actual Size

An interesting diversion is to look at the actual capacity of the HashMap. The internal size is always kept at a power of two, presumably for performance reasons, but this 'power of two' behaviour is enforced under the covers and certainly not exposed via the API, so as a rule you should ignore it in your own code. In particular, you shouldn't try to initialise HashMaps to an appropriate 'power of two' value: it is already done for you, and by relying on it you may be actively harming yourself if the particulars change in a later implementation.

The actual capacity is the smallest power of two that is both greater than the initial capacity and greater than the number of items divided by the load factor. The second condition is the reason that an initial capacity of 1000 gives an actual capacity of 1024 (i.e. 2^10), but once the map was 75% full (the default load factor is 0.75) the capacity was doubled to 2048 (i.e. 2^11), as seen in Table 6.
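
The rule as stated is easy to express in code; a minimal sketch (the method name is my own, and 0.75f is HashMap's default load factor)...

    // Smallest power of two satisfying both conditions described above.
    static int expectedCapacity(int initialCapacity, int itemCount) {
        final float loadFactor = 0.75f;
        int capacity = 1;
        while (capacity < initialCapacity || capacity * loadFactor < itemCount) {
            capacity <<= 1;
        }
        return capacity;
    }

    // expectedCapacity(1000, 0)    returns 1024 (2^10)
    // expectedCapacity(1000, 1000) returns 2048 (2^11), as in Table 6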

While the actual capacity is not exposed directly, there are several ways to inspect it; I found it easiest to attach a debugger and interrogate the HashMap instance.
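
If you would rather not attach a debugger, reflection can reach the private bucket array. This sketch relies on the field being named 'table', which is an implementation detail (and reflective access may be blocked outright on recent JVMs), so treat it as a diagnostic hack only...

    // Returns the length of the HashMap's internal bucket array.
    static int actualCapacity(HashMap<?, ?> map) throws Exception {
        java.lang.reflect.Field table = HashMap.class.getDeclaredField("table");
        table.setAccessible(true);
        Object[] buckets = (Object[]) table.get(map);
        return buckets == null ? 0 : buckets.length;
    }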

Key Size   Initial Capacity   Actual Capacity
1000       10                 2^11
1000       100                2^11
1000       1000               2^11
1000       10,000             2^14
1000       100,000            2^17
1000       1,000,000          2^20
10,000     10                 2^14
10,000     100                2^14
10,000     1000               2^14
10,000     10,000             2^14
10,000     100,000            2^17
10,000     1,000,000          2^20
100,000    10                 2^17
100,000    100                2^17
100,000    1000               2^17
100,000    10,000             2^17
100,000    100,000            2^17
100,000    1,000,000          2^20
Table 6 - Actual capacities in the previous tests

Iterating Map.Entry Sets

As a final aside, many articles recommend iterating a HashMap's Map.Entry set in preference to iterating the keys and then using each key to obtain its value. We already have the iteration times in the data above, and have also shown that the time to retrieve a value is constant, but we already knew that from the API.
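
For reference, the 'get then read' variant would look something like this (a sketch assuming the same map and ITERATE_COUNT as the earlier tests; its times in Table 7 were derived from Table 5 rather than run directly)...

    long start = System.nanoTime();
    for (int i = 0; i < ITERATE_COUNT; i++) {
        for (String key : map.keySet()) {
            // iterate the keys, then fetch each value with a second lookup
            map.get(key).hashCode();
        }
    }
    System.out.println("Took " + (System.nanoTime() - start) / 1000000 + " ms");

...while the Map.Entry version avoids the extra lookup...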

    long start = System.nanoTime();
    for (int i = 0; i < ITERATE_COUNT; i++) {
        for (Map.Entry<String,String> entry : map.entrySet()) {
            entry.getKey().hashCode();
        }
    }
    System.out.println("Took " + (System.nanoTime() - start) / 1000000 + " ms");

While potentially faster, the Map.Entry version is less readable and would therefore need to be quite a bit faster to warrant adoption, so let's run the tests one last time...

Key Size   Initial Capacity   Iterations         Get then Read * (ms)   Map.Entry (ms)
1000       10,000             10,000 (x 1000)    1340                   1330
1000       100,000            10,000 (x 1000)    6400                   6400
1000       1,000,000          10,000 (x 1000)    50,000                 50,000
10,000     10,000             1,000 (x 10,000)   740                    700
10,000     100,000            1,000 (x 10,000)   1200                   1200
10,000     1,000,000          1,000 (x 10,000)   6000                   5900
100,000    10,000             100 (x 100,000)    2200                   1900
100,000    100,000            100 (x 100,000)    2200                   1500
100,000    1,000,000          100 (x 100,000)    2600                   1900
* Times were calculated using the data from Table 5
Table 7 - Map.Entry iteration times

The Map.Entry set does show faster iteration times at the extreme end of the test data, but otherwise I'm not so convinced.

Code was formatted for output using the Java2Html plug-in for Eclipse, under MyEclipse 6.5.