Fill in more of the unordered container documentation.

[SVN r3042]
Daniel James 2006-07-01 22:33:29 +00:00
parent e9e503be3f
commit f6222b10e2
4 changed files with 168 additions and 62 deletions

View File

@ -1,28 +1,26 @@
[section:buckets The Data Structure]
The containers are made up of a number of 'buckets', each of which can contain
any number of elements. For example, the following diagram shows an [classref
boost::unordered_set unordered_set] with 7 buckets containing 5 elements, `A`,
`B`, `C`, `D` and `E` (this is just for illustration; in practice containers
will have more buckets).
[$../diagrams/buckets.png]
In order to decide which bucket to place an element in, the container applies
`Hash` to the element's key (for `unordered_set` and `unordered_multiset` the
key is the whole element, but it is still referred to as the key so that the
same terminology can be used for sets and maps). This gives a `std::size_t`.
`std::size_t` has a much greater range of values than the number of buckets, so
the container applies another transformation to that value to choose a bucket
to place the element in.
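The exact transformation isn't specified - as a purely illustrative sketch, a
typical scheme just takes the modulus of the bucket count (the `pick_bucket`
function here is hypothetical, not part of the library):

    #include <boost/functional/hash.hpp>
    #include <string>

    std::size_t pick_bucket(std::string const& key, std::size_t bucket_count)
    {
        boost::hash<std::string> hash_function; // the container's Hash
        std::size_t hash = hash_function(key);  // apply it to the key
        return hash % bucket_count;             // e.g. take the modulus
    }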
If at a later date the container wants to find an element it just has to apply
the same process to the element's key to discover which bucket it is in. This
means that you only have to look at the elements within a single bucket. If the
hash function has worked well the elements will be evenly distributed amongst
the buckets.
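This can be seen through the bucket interface described below. A small
illustrative sketch (the `inspect` function is hypothetical):

    #include <boost/unordered_set.hpp>
    #include <string>

    void inspect(boost::unordered_set<std::string> const& words,
        std::string const& key)
    {
        // The bucket that would contain the key...
        std::size_t n = words.bucket(key);

        // ...and a walk over just that bucket's elements.
        for(boost::unordered_set<std::string>::const_local_iterator
            it = words.begin(n), end = words.end(n); it != end; ++it)
        {
            // Only the elements in this bucket are compared with the key.
        }
    }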
You can see in the diagram that `A` & `D` have been placed in the same bucket.
This means that when looking in this bucket, up to 2 comparisons have to be
@ -44,6 +42,10 @@ fast we try to keep these to a minimum.
[``size_type bucket_size(size_type n) const``]
[The number of elements in bucket `n`.]
]
[
[``size_type bucket(key_type const& k) const``]
[Returns the index of the bucket which would contain `k`.]
]
[
[``
local_iterator begin(size_type n);
@ -65,36 +67,34 @@ The standard gives you two methods to influence the bucket count. First you can
specify the minimum number of buckets in the constructor, and later, by calling
`rehash`.
The other method is the `max_load_factor` member function. The 'load factor'
is the average number of elements per bucket, and `max_load_factor` can be used
to give a /hint/ of a value that the load factor should be kept below. The
draft standard doesn't actually require the container to pay much attention
to this value. The only time the load factor is /required/ to be less than the
maximum is following a call to `rehash`. But most implementations will probably
try to keep the number of elements below the maximum load factor, and set the
maximum load factor to something the same as or near to your hint - unless your
hint is unreasonably small.
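A brief sketch of these members in use:

    #include <boost/unordered_set.hpp>

    void example()
    {
        boost::unordered_set<int> x(1024); // start with at least 1024 buckets
        x.max_load_factor(0.5f);           // hint: keep the load factor below 0.5
        x.rehash(4096);                    // later, ask for at least 4096 buckets
    }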
It is not specified anywhere how other member functions affect the bucket
count, but most implementations will invalidate the iterators whenever they
change the bucket count - which is only allowed when an `insert` causes the
load factor to become greater than or equal to the maximum. It is, however,
possible to implement the containers such that the iterators are never
invalidated.
In a similar manner to using `reserve` for `vector`s, it can be a good idea
to call `rehash` before inserting a large number of elements. This will get
the expensive rehashing out of the way and let you store iterators, safe in
the knowledge that they won't be invalidated. If you are inserting `n`
elements into container `x`, you could first call:

    x.rehash((x.size() + n) / x.max_load_factor() + 1);

(TODO: This might not be right. I'm not sure what is allowed for
std::unordered_set and std::unordered_map when insert is called with enough
elements to exceed the maximum, but the maximum isn't exceeded because
the elements are already in the container)

(TODO: Ah, I forgot about local iterators - rehashing must invalidate ranges
made up of local iterators, right?)
This all sounds quite gloomy, but it's not that bad. Most implementations
will probably respect the maximum load factor hint. This implementation
certainly does.
[blurb Note: `rehash`'s argument is the number of buckets, not the number of
elements, which is why the new size is divided by the maximum load factor. The
`+ 1` is required because the container is allowed to resize when the load
factor is equal to the maximum load factor.]
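Putting the formula together with the note above, a sketch of the whole
pattern (the `add_elements` helper is purely illustrative):

    #include <boost/unordered_set.hpp>
    #include <vector>

    void add_elements(boost::unordered_set<int>& x,
        std::vector<int> const& data)
    {
        // Grow the bucket count up front, so the inserts themselves
        // shouldn't trigger a rehash.
        x.rehash((x.size() + data.size()) / x.max_load_factor() + 1);
        x.insert(data.begin(), data.end());
    }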
[table Methods for Controlling Bucket Size
[[Method] [Description]]
@ -119,20 +119,14 @@ certainly does.
]
[/ I'm not at all happy with this section. So I've commented it out.]
[/ h2 Rehash Techniques]
[/If the container has a load factor much smaller than the maximum, `rehash`
might decrease the number of buckets, reducing the memory usage. This isn't
guaranteed by the standard but this implementation will do it.
When inserting many elements, it is a good idea to first call `rehash` to
make sure you have enough buckets. This will get the expensive rehashing out
of the way and let you store iterators, safe in the knowledge that they
won't be invalidated. If you are inserting `n` elements into container `x`,
you could first call:
x.rehash((x.size() + n) / x.max_load_factor() + 1);
If you want to stop the table from ever rehashing due to an insert, you can
set the maximum load factor to infinity (or perhaps a load factor that it'll
never reach - say `x.max_size()`). As you can only give a 'hint' for the maximum
@ -144,6 +138,6 @@ maybe the implementation should cope with that).
If you do this and want to make the container rehash, `rehash` will still work.
But be careful that you only ever call it with a sufficient number of buckets
- otherwise it's very likely that the container will decrease the bucket
count to an overly small amount.]
[endsect]

View File

@ -1,8 +1,9 @@
[section:comparison Comparison with Associative Containers]
* The elements in an unordered container are organised into buckets, in an
unpredictable order. There are member functions to access these buckets,
which were described earlier.
* The unordered associative containers don't support any comparison operators.
* Instead of being parameterized by an ordering relation `Compare`,
the unordered associative containers are parameterized by a function object
`Hash` and an equivalence relation `Pred`. The member types and accessor

View File

@ -18,6 +18,118 @@ but not the equality predicate, while if you were to change the behaviour
of the equality predicate you would have to change the hash function to match
it.
For example, if you wanted to use the
[@http://www.isthe.com/chongo/tech/comp/fnv/ FNV-1 hash] you could write:
``[classref boost::unordered_set]``<std::string, hash::fnv_1> words;
An example implementation of FNV-1 and some other hash functions are supplied
in the examples directory.
Alternatively, you might wish to use a different equality function. If so, make
sure you use a hash function that matches it. For example, a
case-insensitive dictionary:
    #include <boost/algorithm/string/predicate.hpp>
    #include <boost/functional/hash.hpp>
    #include <boost/unordered_map.hpp>
    #include <cctype>
    #include <functional>
    #include <string>

    struct iequal_to
        : std::binary_function<std::string, std::string, bool>
    {
        bool operator()(std::string const& x,
            std::string const& y) const
        {
            return boost::algorithm::iequals(x, y);
        }
    };

    struct ihash
        : std::unary_function<std::string, std::size_t>
    {
        std::size_t operator()(std::string const& x) const
        {
            std::size_t seed = 0;

            for(std::string::const_iterator it = x.begin();
                it != x.end(); ++it)
            {
                boost::hash_combine(seed, std::tolower(*it));
            }

            return seed;
        }
    };

    struct word_info {
        // ...
    };

    boost::unordered_map<std::string, word_info, ihash, iequal_to>
        idictionary;
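With those in place, lookups ignore case. A hypothetical usage sketch
(assuming `word_info` is default constructible):

    #include <cassert>

    void example()
    {
        boost::unordered_map<std::string, word_info, ihash, iequal_to>
            idictionary;

        idictionary["muffin"] = word_info();

        // The difference in case doesn't matter - this finds the
        // entry inserted above.
        assert(idictionary.find("MUFFIN") != idictionary.end());
    }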
[h2 Custom Types]
Similarly, a custom hash function can be used for custom types:
    struct point {
        int x;
        int y;
    };

    bool operator==(point const& p1, point const& p2)
    {
        return p1.x == p2.x && p1.y == p2.y;
    }

    struct point_hash
        : std::unary_function<point, std::size_t>
    {
        std::size_t operator()(point const& p) const
        {
            std::size_t seed = 0;
            boost::hash_combine(seed, p.x);
            boost::hash_combine(seed, p.y);
            return seed;
        }
    };

    boost::unordered_multiset<point, point_hash, std::equal_to<point> >
        points;
Customizing Boost.Hash is probably a better solution, though:
    struct point {
        int x;
        int y;
    };

    bool operator==(point const& p1, point const& p2)
    {
        return p1.x == p2.x && p1.y == p2.y;
    }

    std::size_t hash_value(point const& p)
    {
        std::size_t seed = 0;
        boost::hash_combine(seed, p.x);
        boost::hash_combine(seed, p.y);
        return seed;
    }

    // Now the default functions work.
    boost::unordered_multiset<point> points;
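A quick usage sketch, to show that the default function objects now find
equivalent points:

    #include <cassert>

    void example()
    {
        boost::unordered_multiset<point> points;

        point p = { 1, 2 };
        points.insert(p);
        points.insert(p);             // a multiset can hold duplicates

        assert(points.count(p) == 2); // found via hash_value and operator==
    }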
See the Boost.Hash documentation for more detail on how to do this. Remember
that it relies on extensions to the draft standard - so it won't work on other
implementations of the unordered associative containers.
[table Methods for accessing the hash and equality functions
[[Method] [Description]]
[
[``hasher hash_function() const``]
[Returns the container's hash function.]
]
[
[``key_equal key_eq() const``]
[Returns the container's key equality function.]
]
]
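For example, for a [classref boost::unordered_set] of strings the defaults
returned are `boost::hash<std::string>` and `std::equal_to<std::string>` - a
small sketch:

    #include <boost/functional/hash.hpp>
    #include <boost/unordered_set.hpp>
    #include <functional>
    #include <string>

    void example(boost::unordered_set<std::string> const& s)
    {
        // Copies of the container's function objects.
        boost::hash<std::string> h = s.hash_function();
        std::equal_to<std::string> eq = s.key_eq();

        std::size_t hash = h("key");   // hashes exactly as the container would
        bool equal = eq("key", "key"); // compares exactly as the container would
    }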
[endsect]

View File

@ -20,9 +20,8 @@ on average. The worst case complexity is linear, but that occurs rarely and
with some care, can be avoided.
Also, the existing containers require a 'less than' comparison object
to order their elements. For some data types this is impossible to implement
or isn't practical. For a hash table you need an equality function
and a hash function for the key.
So the __tr1__ introduced the unordered associative containers, which are