Implement std::multiset

In this post, I talk about the std::multiset and about some design decisions I needed when trying to implement it. The full implementation is on GitHub.

Multisets and multimaps

There are a few data structures in the standard library that I never use. In some cases, it’s because I never had the right use case no my hands, in others (embarrassingly enough) because I am too lazy to read up about one that I’m not already familiar with. But lately I’ve found myself needing something to solve a problem, and I discovered that I could get some great help by just using a data structure I’d never heard of: the std::multiset (or its cousin, the std::multimap).

I had read about it before. On the surface, it is just like a std::set, but it let you store the keys multiple times. Why on Earth would you need that, and in particular, which cases are not served as well by just the std::set?

There’s something interesting, however, about this structure: It keeps its keys sorted. Consider the following situation: You want to store the nodes of a binary tree by level (root being at level 1, leaves at level O(log(N)) + 1 if we have N nodes and a balanced tree). Then you want to print the sum of the values of the nodes at each level, sorted by (ascending) level.

You could scan the tree and use a map level-to-nodes to accomplish this. However, you know that a std::map would not keep the order of the keys.

A better way to do it is by using a multimap! You’d do something like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
typedef int level_t;

struct node {
int value;
node* left;
node* right;
};

void printLevelSums(const int num_levels) {
std::multimap<level_t, node> m;

// Store each node using the level as a key. Note that we have multiple
// keys with the same value.
// ...

for (int i = 1; i < num_levels; ++i) {
auto range = m.equal_range(i)
int sum = 0;
for (auto it = range.first; it != range.second; ++it) {
sum += it->value;
}
std::cout << sum << std::endl;
}
}

Implementing a multiset

The requirements on the basic operations of a multiset, expressed in terms of time complexity, are the following:

1
2
3
4
5
6
7
8
9
10
11
12
13
// O(log(size)):
void insert(const Key& t);
bool contains(const Key& t) const;
iterator find(const Key& t);

// Returns the number of elements with key `key`.
// O(log(size of the container)) plus linear in the number of the elements
// found.
unsigned count(const Key& key) const;

// Removes all elements with the key equivalent to key. Return number of
// elements erased. O(log(c.size()) + c.count(k)).
unsigned erase(const Key& key);

We see that the insertion and search are not constant like they would if using a hash, but logarithmic. The count operation reveals how many elements having the same key are in the multiset.

One way of implementing this is by using a binary search tree, where each node contains a vector<Key>. We would then have something like:

1
2
3
4
5
6
7
template <typename Key>
class multiset {
typedef node<Key> BinarySearchTree;
BinarySearchTree backend_;

// ...
};

Back-end: the node

The strategy I used was to just implement a BST with the right data and then forwarding the operations to the node. You all know what a node/BST looks like, so I’ll just write a few words about bizarre operations I implemented. Here’s the complete interface of the node/BST structure (I decided to use the node as a recursive definition of a BST):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
// A node is a tree.
template <typename Key>
struct node {
private:
vector<Key>* data_; // never null if the object exists. // TODO:
// use optional<vector<Key>>

void getsize(unsigned& cnt) const;
void vectorize(vector<Key*>& v) const;
std::pair<node<Key>*, is_left_child_t> findParent(const node<Key>* n);

public:
node<Key>* left_;
node<Key>* right_;

node() : data_(new vector<Key>), left_(nullptr), right_(nullptr) {}
node(const Key& k)
: data_(new vector<Key>), left_(nullptr), right_(nullptr) {
data_->push_back(k);
}

~node();

Key value() const;
void insert(const Key& k);
node<Key>* find(const Key& k) const;
bool contains(const Key& k) const;
void print() const;
unsigned count(const Key& k) const;
unsigned size() const;
unsigned nodesize() const;

node<Key>* findSuccessor() const;

unsigned erase(const Key& k);
vector<Key*> vectorize() const;
};
  • vectorize: this method transforms the multiset into a vector of the keys, without repeating them. It is useful for the implementation of the iterator. I’ll talk about the iterators in a separate section.
  • nodesize: The number of items associated with a single Key. size just returns the number of stored keys.
  • findParent: I don’t store the node’s parent in my implementation. It would have simplified things, but it’s a trade-off between space and complexity, or space and time (I need to find the parent when I need it, which takes O(log(N)) time, so I can spare a pointer in each node, that would allow me to find the parent in O(1) if I know the address of the node in question. Oh well.)

Iterators

The iterator on a multiset should offer what you usually expect from such a class: a begin() and an end() method, plus some pre- and post-increment or another equivalent of a next() method to go to the next node. To do it, I’ve used a vector of Key pointers and an index into it. When I build an iterator from a node, I can process the sub-tree starting at that node. Since the multiset contains only Key values, my iterator needs to hide the fact that I’m using a tree in the back-end, which is why the de-referencing operator returns a Key itself, and not a node.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
class iterator {
vector<Key*> to_process_;
int cur_;

public:
iterator(node<Key>* n) : cur_(n ? 0 : -1) {
if (n) {
to_process_ = n->vectorize();
}
}

// pre-increment:
iterator& operator++() {
if (cur_ == -1) {
return *this;
}

++cur_;
if (cur_ == to_process_.size()) {
cur_ = -1;
}
return *this;
}

Key& operator*() {
if (cur_ == -1 || cur_ == to_process_.size()) {
throw OutOfRange();
}
return *to_process_[cur_];
}

bool operator==(const iterator& it) const {
if (cur_ == -1 && it.cur_ == -1) {
return true;
} else if (it.cur_ == -1) {
return false;
} else if (cur_ >= to_process_.size() || cur_ == -1) {
return false;
}

if (to_process_[cur_] && it.to_process_[cur_]) {
return *(to_process_[cur_]) == *(it.to_process_[cur_]);
}
return false;
}

bool operator!=(const iterator& it) const { return !operator==(it); }

};

Conclusions

A few considerations:

Needless to say, there’s a lot of room for improvement. It took me a few days to implement this basic version, and I realize now that it might not necessarily do what I wanted to do (for instance, the iterator doesn’t allow you to iterate over all the keys, like the std::multiset does, but only on the non-repeating ones).

Once more, I see that implementing on paper to have a full vision of what you’re going to do is of paramount importance: The farther down the way to the full implementation you spot errors, the longer the corrections take, and the more they can impact other areas which you have lost track of in the meanwhile. With more experience, you can usually keep a bigger plan in your head.