Hash tables

Before C++11, hash tables could be used in C++ only with some tricks. The only associative container available in the STL used to be the std::map (along with its sibling, the multimap). This class, however, is implemented as a binary search tree, and does not provide a way to customize a hash function. What hurt me most was that you didn’t have the great performances offered by a true hash map: almost constant time insert, delete, and search (depending on how good is the hash function). You insert in O(log N) time. You find in O(log N) time. Too bad.

Truth is, hash tables were available through some tr1 extension, which I have not had the chance to use ever in my short career. It is thus with exultance that I saluted the introduction of hash tables with the more recent standard.

In this post, I want to group the common usage of this data structure, and in the process I want to keep in one place the list of operations I use the most.

unordered_map?!

I understand historical reasons (sometimes), but can you find a less appropriate name for a hash table? Since C++11, the engineer uses a hash table by including a header called unordered_map, because the hash table’s name in C++ is unordered_map. (There are other containers based on hash functions, but they follow a similar logic.)

An unordered_map is declared as follows ([unord.map.syn]):

1
2
3
4
5
6
template <class Key,
class T,
class Hash = hash<Key>,
class Pred = std::equal_to<Key>,
class Allocator = std::allocator<std::pair<const Key, T>>>
class unordered_map;

What you always need to declare is the first two types, namely, the type of the key (the unique identifier you use to access the hashed data), and the type of the data; the other template types are optional, and usually I go with the default, unless I need some particular behavior. For example, hash is a standard function available in <functional>, already implemented for all the integral types and for some special types (Josuttis), and it is designed so as to satisfy the requirements described in the standard ([unord.hash]).

A practical usage in a declaration would be something like:

1
unordered_map<string, int> ages;

This object would be used to map strings to int values, for instance in an association Name-Age. The representation of this container is a collection of std::pair objects, accessible in O(1) time by applying the hash function to the Key. Pairs corresponding to the same hash value, though rare, are grouped in linked lists.

This map has the usual caveat as the std::map, that is, accessing an
element with operator[] would create the entry if it did not exist already, using the default type constructor of the type T if nothing else is provided — in this case, the int would simply be initialized with 0. The snippet:

1
2
3
4
5
unordered_map<string, int> ages;
if (ages["Clint"] == 41)
cout << "Clint has 41 years\n";
for (auto a : ages)
cout << a.first << ": " << a.second << '\n';

would inadvertently produce the output

Clint: 0

because the test would create a pair Clint, 0; then would fail because 0 is not 41, and finally it would candidly produce the unexpected result shown. When I need to access an element, I use the find method to check if it is there already, and then do the work:

1
2
if (ages.find("Clint") != ages.end() && ages["Clint"] == 41)
...

The interface

The most common operations I perform on a has table are insertion and retrieval; less often, I remove elements.

Insert

There are more ways to insert elements; what I find the cleanest is:

1
ages.insert({"Clint", 83})

This uses the uniform initialization syntax and the fact that an element of the map is actually a std::pair. As seen before, we could use the operator[] to perform the same operation:

1
ages["Clint"] = 83;

This is actually closer to what happens in other languages, like JavaScript.

Access

To actually look for an element, you can use the find() method, as shown above, which returns an iterator. An alternative would be to use the method at(). The problem with the latter is that it throws an exception when the element is not in the map. Since, as a rule in my life, I do not use exceptions if not for exceptional situations, I prefer to use this only if I actually expect the element to be there, and to be then surprised by its absence; which I would not, because I’d surround the statement containing the access by at() with a try-catch block. The exception thrown is the out_of_range.

1
2
3
4
5
try {
age = ages.at("Clint");
} catch (std::out_of_range& ex) {
cerr << "No such element: Clint\n";
}

Remove

Removing elements from a hash table can be done with the erase() and the clear() methods. The first is supposed to remove the elements by value, by position, or by range. The first snippet removes the element whose Key is Terence:

1
2
3
4
5
6
7
unordered_map<string, int> ages;
ages.insert({"Clint", 83});
ages.insert({"Terence", 74});
ages.insert({"Kevin", 59});

// Remove by value
ages.erase("Terence");

Using the position would require us to provide an iterator. This is something I rarely do, unless after a find():

1
2
auto it = ages.find("Kevin");
ages.erase(it);

Finally, providing a range is something I have not done yet; it makes more sense if you are using an unordered_multimap, because in that case you can have more values corresponding to the same key, and the find() actually returns a pair of iterators defining a range.

One nice thing about erase() is that it is required never to invalidate iterators in the container.

The fine details

The implementation of the STL hash table involves some design decisions. Some of them are mandated by the standard; others are delegated to the library implementers.

An implementation stores the (key, element) pairs in so-called buckets. The destination bucket is determined by the hashed value of the key. So if we have 10 buckets, and hash(“Clint”) is 7, then the pair (“Clint”, 83) goes in the seventh bucket (or the eight, if you start counting from 0). The number of buckets in the table has a direct influence on the probability of hash collisions, that occur when two keys hash to the same position. The load factor is the ratio between number of elements and number of buckets. Higher load factors imply higher probabilities of having collisions: it would be like having a house with a parking lot for 2 cars and throwing parties with tens of invitees.

The programmer can control part of the behavior of the hash table. We can set the max load factor, which is the number of elements that the container tolerates without triggering an automatic rehash; and we can manually rehash the table.

Rehashing means changing the number of buckets in the has table. It is controllable through the methods rehash(), which takes the desired minimum size of each bucket; and by the method reserve(), which is more intuitive because you feed it the number of buckets directly. You have no control, however, on how many buckets are added or removed when rehashing (the growth factor).

1
ages.reserve(10);

Automatic rehashing is performed by the container when the load factor exceeds a set value. You can’t control how few elements have to be in the container to trigger an automatic rehash (that can be imagined as the minimum load factor).

Caveat on Key and T

There are some restrictions of which we must be aware when we complicate things and use Key and/or T types which are not primitive or provided by libraries.

  1. Key must be a moveable and copyable type; it also has to be const, because it is used to access the position of the element.
  2. T must be comparable with equal

Summary

At Yahoo!, someone (I honestly can’t remember who, but I’ve read this in one of Gayle Laakmann McDowell’s books) once said: the three most used data structures in our code are, in order, hash tables, hash tables, and hash tables. They have numerous applications, and using them effectively (well, using them at all) can make the difference between a good solution and a bad idea.

In this post, I have described a somehow distilled workflow to use this powerful data structure with the unordered_map provided by the STL.