Collections in Java
Part 2 - The Set Interface

by Thomas Paul

Last month we discussed the basics of the Collection interface and reviewed one of the children of this interface, the List interface This month we will look at the second child interface, the Set interface. There are three non-abstract implementations of the Set interface: TreeSet, HashSet, and LinkedHashSet (new to J2SE 1.4). Set objects do not allow duplicates and are not required to maintain their objects in the order they are added. HashSet will maintain items in an order designed for fast lookup. LinkedHashSet will maintain the items in the order they are added while also maintaining a hash for fast lookup. TreeSet will maintain items in a sorted order. A look at the performance difference of the three shows that HashSet is generally the most efficient.

Testing of Set objects using 50,000 repetitions.

Type

Contains

Iterate

Insert

Remove

TreeSet 6469 594 438 47
HashSet 5188 640 329 15
LinkedHashSet 5297 375 343 32


Figure 1: relationships between the classes and interfaces of the Set interface

The Set interface

The Set interface does not add any methods to the Collection interface which is its parent. Any class implementing the Set interface is not permitted to have duplicate entries. The formal definition of the Set interface requires that no two elements in the Set are equal [such that e1.equals(e2) is never true]. Since a Set can not have duplicates, the Set can contain no more than one null entry. A Set is not required to maintain the elements of the Set in the order they are added although the LinkedHashSet does maintain order.

Here are some of the key methods of the Set interface:

The HashSet class

The HashSet class is the simplest implementation of the Set interface. The HashSet does not add any additional methods beyond those found in the Set interface. The HashSet achieves good performance by using a hash to store the Object in the Set. The hash allows fast lookup. Any Object placed in a HashSet must implement the hashCode( ) and equals( ) methods. We will discuss more of this in a little while.

The HashSet does not guarantee the order of the items in the Set and allows one null element. Duplicates are not permitted. The add(Object o) will return false if you attempt to add an item already in the HashSet. The HashSet offers "constant time" performance. This means that adding items to the Set will not cause significant performance degradation. The performance of basic functions such as add, remove, etc is based on two factors which can be specified in the constructor of the HashSet, initial capacity and load factor.

Choosing an initial capacity that is too high can waste space as well as effect performance. (The constant time performance of iterating through a HashSet is based is based on the total of the number of entries plus the capacity.) The default initial capacity is 16. The capacity should never be smaller than you expect entries in the HashSet and should be at least large enough so that the load factor does not cause the HashSet to increase in size. Sun recommends that a HashSet be allocated with a capacity of twice the predicted number of entries.

Load factor is a measure of how full the HashSet is allowed to get before its capacity is increased. When the number of entries exceeds the product of the load factor and the current capacity, the capacity is doubled. As an example, if the initial capacity is 100 and the load factor is .75 then the HashSet will double in size to a capacity of 200 when the HashSet reaches 75 entries. Generally, the default load factor (.75) offers a good trade-off between time and space costs. Higher values decrease the space overhead, but increase the time it takes to look up an entry. Load factor is always specified as a number from 0.0 to 1.0.

As I mentioned above, classes that are to be stored in a HashSet must implement the hashCode() and equals() method. Here is an example of this implementation. We will use this same class for the examples that follow.

public class Employee {
	private String socSecNum;
	private String empNum;
	private String name;
	private int department;

	public String getName( ) {
		return name;
	}
	public String getEmpNum( ) {
		return empNum;
	}
	public int getDepartment( ) {
		return department;
	}
	public String getSocSecNum( ) {
		return socSecNum;
	}
	public boolean equals(Object obj) {
 		boolean retValue = false;
    	if (obj instanceof Employee) {
			Employee e  = (Employee)obj;
			retValue =  empNum.equals(e.getEmpNum( ));
		}
		return retValue;
	}
	public int hashCode( ) {
		return empNum.hashCode( );
	}
}


The LinkedHashSet Class

New to J2SE 1.4, the LinkedHashSet class is used in exactly the same way as the HashSet class. It has no additional methods and has constructors that take identical parameters as the HashSet. The only difference is that the LinkedHashSet maintains the order of the items added to the Set. It does this by maintaining a doubly linked list containing the hash and the original order of the items. According to Sun, the LinkedHashSet should run nearly as fast as the HashSet.

The TreeSet class

The TreeSet class guarantees that the Objects in the Set will be sorted in ascending order by either the Objects natural order or by a Comparator provided to the constructor of the TreeSet. Searching for an item in a TreeSet will be slower than in a HashSet because the hashing algorithm gives better performance than the compareTo() method which is used to locate items in a TreeSet.

TreeSet does add several new methods to the Set interface.

Sorting Collections

You have two choices for determining how to sort items in the TreeSet. Whichever option you choose must guarantee that items found to be equal are really equal. In one of our examples we will be adding items to the TreeSet based on name. This will only work if none of our employees have the same name! Remember, the TreeSet does not accept duplicate entries.

The first option is to only use Objects that implement the Comparable interface in the TreeSet. For your own classes, you will need to implement the Comparable interface if you wish to store them in a TreeSet. The Comparable interface has a single method:

int compareTo(Object o)

The compareTo() method will return a negative number, zero, or positive number depending on whether the Object is less than, equal to, or greater than the Object argument. It is recommended that Objects that implement the compareTo() should ensure that if the equals() method of the Object return true then the same comparison using compareTo( ) should return zero.

(x.compareTo(y)==0) == (x.equals(y))

An example of a class using the Comparable interface we would make these changes to our Employee class:

public class Employee implements Comparable {
	// The comparison can be done on any field we choose.
	// For this example we have decided to sort by empNum.
	public int compareTo(Object obj) {
		Employee e  = (Employee)obj;
		return empNum.compareTo(e.getEmpNum( ));
	}
}

The second option is provide a Comparator for the TreeSet to use. The advantage of the Comparator is that we can create multiple Comparators that can be used to sort our TreeSet in multiple ways. Using the example above, we could add two Comparators; one for name and a different one for socSecNum.

public class Employee implements Comparable {
	public static final Comparator SOC_SEC_ORDER  = new Comparator( ) {
		public int compare(Object o1, Object o2) {
			Employee emp1 = (Employee)o1;
			Employee emp2 = (Employee)o2;
			return emp1.getSocSecNum( ).compareTo(emp2. getSocSecNum( );
		}
	};
	public static final Comparator NAME_ORDER  = new Comparator( ) {
		public int compare(Object o1, Object o2) {
			Employee emp1 = (Employee)o1;
			Employee emp2 = (Employee)o2;
			return emp1.getName( ).compareTo(emp2. getName( );
		}
	};
}

Now in a class that uses a TreeSet with Employee objects I can use this constructor for my TreeSet to create a TreeSet that will keep my Employees in name order:

TreeSet ts = new TreeSet(Employee.NAME_ORDER);

The same options used for TreeSet can be used for any Collection. A Collection can be sorted by using the static Sort method of the Collections class. (Notice the extra "s" at then end of "Collections".) So to sort an ArrayList of Employee objects by Social Security number, we would do the following:

Collections.sort(empArrayList, Employee.SOCIAL_SECURITY_ORDER);

Since Employee implements the Comparable interface, to sort by employee number we can simply do:

Collections.sort(empArrayList);

Conclusion

We have taken a quick look at the main components of the Set interface and also looked at how we can sort items in a TreeSet or in any Collection. As we have seen, which Set object you wish to use will depend greatly upon how you intend on using it in your production environment. Performance is less a concern in choosing among Set objects. The main concern is the functionality you require. If your only requirement is fast lookup, then you can use HashSet. If you need to be able to retrieve your items in the same order that you added them to the Set, then you would use the LinkedHashSet. And finally, if you need the items in your Set to be maintained in a specific sort order, then you would use the TreeSet.

Next month we will take a tour of the Map interface.