document why axis_type::size returns signed integer

This commit is contained in:
Hans Dembinski 2018-12-05 10:57:46 +01:00
parent d6c7e81515
commit e0945a2dea


@@ -34,11 +34,11 @@ Histograms store axis objects and a storage object. A one-dimensional histogram
To understand the need for multi-dimensional histograms, think of point coordinates. If all points that you consider lie on a line, you need only one value to describe the point. If all points lie in a plane, you need two values to describe the position. Three values are needed for a point in space. A histogram puts a discrete grid over the line, the plane or the space, and counts how many points lie in each cell of the grid. To approximate a point distribution on a line, a 1d-histogram is sufficient. To do the same in 3d-space, one needs a 3d-histogram.
]
This library supports different axis types, so that the user can customize how the mapping is done exactly, see [link histogram.rationale.structure.axis_types axis types]. Users can furthermore chose between two ways of storing axis types in the histogram.
This library supports different axis types, so that the user can customize exactly how the mapping is done, see [link histogram.rationale.structure.axis_types axis types]. Users can furthermore choose between two ways of storing axis types in the histogram.
When the histogram host class is configured to store axis types in a `std::tuple`, we obtain a static histogram. The number and types of the axes are known at compile-time. To access a particular axis, you need to use a compile-time number as index. A static histogram is always faster (see [link histogram.benchmarks benchmark]), because there are no type checks and conversions happening at run-time, and because the compiler can inline more code. Furthermore, many user errors are caught at compile-time rather than run-time.
The static histogram is generally preferable, but cannot be used when the axis configuration is only known at run-time. This is the case, for example, when histograms are created at run-time from Python. Therefore, axis types can also be stored in a variant-like `boost::histogram::axis::any` type, which can be put in a `std::vector`. When the histogram host class is configured to store axis types like this, we obtain a dynamic histogram. The dynamic histogram is a single type that can store arbitrary sequences of axes types, which are generated at runtime. The polymorphic behavior of the generic `boost::histogram::axis::any` type has a run-time cost, however.
The static histogram is generally preferable, but cannot be used when the axis configuration is only known at run-time. This is the case, for example, when histograms are created at run-time from Python. Therefore, axis types can also be stored in a variant-like `boost::histogram::axis::any` type, which can be put in a `std::vector`. When the histogram host class is configured to store axis types like this, we obtain a dynamic histogram. The dynamic histogram is a single type that can store arbitrary sequences of axis types, which are generated at run-time. The polymorphic behavior of the generic `boost::histogram::axis::any` type has a run-time cost, however.
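The difference can be made concrete with a minimal, library-independent sketch of the two storage strategies; the toy types `axis_a` and `axis_b` are hypothetical stand-ins for real axis types.
```
#include <cstdio>
#include <tuple>
#include <variant>
#include <vector>

struct axis_a { int size() const { return 10; } };
struct axis_b { int size() const { return 20; } };

int main() {
  // static: number and types of axes are fixed at compile-time;
  // access uses a compile-time index and needs no run-time dispatch
  std::tuple<axis_a, axis_b> static_axes;
  int n0 = std::get<0>(static_axes).size();

  // dynamic: an arbitrary run-time sequence of axes; every access
  // must resolve the type currently held by the variant
  std::vector<std::variant<axis_a, axis_b>> dynamic_axes;
  dynamic_axes.push_back(axis_a{});
  dynamic_axes.push_back(axis_b{});
  int n1 = std::visit([](const auto& a) { return a.size(); },
                      dynamic_axes[1]);
  std::printf("%d %d\n", n0, n1);
}
```
The `std::visit` call is where the run-time cost of the dynamic variant arises.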
[note
The design decision to store axis types in the variant-like type `boost::histogram::axis::any` has several advantages over forms of run-time polymorphism based on vtables. Firstly, it guarantees that axis objects are local in memory, which reduces cache misses when the histogram iterates over axis objects in a tight loop, which it often does. Secondly, each axis may accept a different value type. Classic polymorphism with vtables assumes that all overloads provided by derived classes share the same method signature, but that is not the case here. One axis may convert numbers to indices, another strings. The method signatures are different and so classic run-time polymorphism does not work, but variants do.
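To illustrate the point about differing method signatures, here is a library-independent sketch with two hypothetical axis types, using `std::variant` in place of `boost::histogram::axis::any`:
```
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// two toy axis types with different value types; a classic virtual
// base class would have to fix one signature for index(), which
// cannot cover both double and std::string arguments
struct numeric_axis {
  int index(double x) const { return x < 0 ? -1 : 0; }
};

struct category_axis {
  std::vector<std::string> categories;
  int index(const std::string& s) const {
    for (std::size_t i = 0; i < categories.size(); ++i)
      if (categories[i] == s) return static_cast<int>(i);
    return -1; // not found: treated like an out-of-range value
  }
};

// the variant stores either axis by value (local in memory) and the
// caller dispatches on the held type instead of a virtual interface
using any_axis = std::variant<numeric_axis, category_axis>;

int demo(const any_axis& a) {
  if (auto* na = std::get_if<numeric_axis>(&a)) return na->index(3.1);
  return std::get<category_axis>(a).index("red");
}
```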
@@ -62,7 +62,7 @@ An axis defines which range of input values is mapped to which bin. The logic is
A storage type stores the actual bin counters. It uses a one-dimensional index for lookup, computed by the histogram host from the indices generated from all its axes. The storage needs to know nothing about axes. Users can integrate their own storage classes with the library, by providing a class compatible with the [link histogram.concepts.storage storage concept].
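As a sketch of this index computation (not the library's actual implementation), the linear storage index can be derived from the per-axis indices with row-major stride arithmetic:
```
#include <cstddef>
#include <vector>

// flatten an index tuple (i0, i1, ...) into a single storage index,
// given the number of bins along each axis
std::size_t linear_index(const std::vector<std::size_t>& idx,
                         const std::vector<std::size_t>& sizes) {
  std::size_t index = 0;
  std::size_t stride = 1;
  for (std::size_t d = 0; d < idx.size(); ++d) {
    index += idx[d] * stride; // advance by idx[d] cells of current stride
    stride *= sizes[d];       // the next axis moves in larger steps
  }
  return index;
}
```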
The buildin storage types are optimised for fast look-up of the random-access variety and use dense (aka contiguous) storage in memory. Bin lookup is often happening in a tight loop. [classref boost::histogram::array_storage] implements a simple storage based on a heap-allocated array of a static counter type. That could be the end of story, but there are some issues with this approach. It is not convenient, because the user has to decide what type to use to hold the bin counts and it is not an obvious choice. An integer type needs to be large enough to avoid counter overflow, but only a fraction of the bits are used if it is too large. Using an integral type that is too large is a waste of memory. This is still a concern today since the performance of modern CPUs depends on effective utilization of the CPU cache, which is small. Using floating point numbers instead of integers is also dangerous. They don't overflow, but cap the bin count when the bits in the mantissa are used up.
The builtin storage types are optimized for fast look-up of the random-access variety and use dense (aka contiguous) storage in memory. Bin lookup often happens in a tight loop. [classref boost::histogram::array_storage] implements a simple storage based on a heap-allocated array of a static counter type. That could be the end of the story, but there are some issues with this approach. It is not convenient, because the user has to decide which type to use to hold the bin counts, and the choice is not obvious. An integer type needs to be large enough to avoid counter overflow, but only a fraction of the bits are used if it is too large. Using an integral type that is too large is a waste of memory. This is still a concern today, since the performance of modern CPUs depends on effective utilization of the CPU cache, which is small. Using floating point numbers instead of integers is also dangerous. They don't overflow, but cap the bin count when the bits in the mantissa are used up.
The standard storage used in the library is [classref boost::histogram::adaptive_storage]. It solves these issues with dynamic counter type management, based on the following insight. The [@https://en.wikipedia.org/wiki/Curse_of_dimensionality curse of dimensionality] makes the total number of bins grow very fast as the dimension of the histogram grows. However, having many bins also reduces the number of counts per bin, since the input values are spread over many more bins.
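The following much-simplified sketch illustrates the idea behind such dynamic counter management; the real [classref boost::histogram::adaptive_storage] supports a wider range of counter types and handles the transitions more efficiently:
```
#include <cstddef>
#include <cstdint>
#include <limits>
#include <variant>
#include <vector>

// toy adaptive counter array: starts with 8-bit counters and promotes
// the whole array to 16-bit when a counter would otherwise overflow
struct toy_adaptive_array {
  std::variant<std::vector<std::uint8_t>,
               std::vector<std::uint16_t>> buf;

  explicit toy_adaptive_array(std::size_t n)
      : buf(std::vector<std::uint8_t>(n, 0)) {}

  void increment(std::size_t i) {
    if (auto* v8 = std::get_if<std::vector<std::uint8_t>>(&buf)) {
      if ((*v8)[i] < std::numeric_limits<std::uint8_t>::max()) {
        ++(*v8)[i];
        return;
      }
      // promote: copy all counters into a wider type, then increment
      std::vector<std::uint16_t> v16(v8->begin(), v8->end());
      ++v16[i];
      buf = std::move(v16);
      return;
    }
    ++std::get<std::vector<std::uint16_t>>(buf)[i];
  }
};
```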
@@ -100,7 +100,7 @@ Under- and overflow bins are useful in one-dimensional histograms, and nearly es
* Diagnosis: Unexpected extreme values show up in the extra bins, which otherwise may be overlooked.
* Ability to reduce histograms: In multi-dimensional histograms, an out-of-range value along one axis may be paired with an in-range value along another axis. If under- and overflow bins are missing, such a value pair is lost completely. If you apply a `reduce` operation on a histogram, which removes somes axes by summing all counts along that dimension, this would lead to distortions of the histogram along the remaining axes. When under- and overflow bins are present, the `reduce` operation always produces a sub-histogram identical to one obtained if it was filled from scratch with the original data.
* Ability to reduce histograms: In multi-dimensional histograms, an out-of-range value along one axis may be paired with an in-range value along another axis. If under- and overflow bins are missing, such a value pair is lost completely. If you apply a `reduce` operation on a histogram, which removes some axes by summing all counts along that dimension, this would lead to distortions of the histogram along the remaining axes. When under- and overflow bins are present, the `reduce` operation always produces a sub-histogram identical to one obtained if it was filled from scratch with the original data.
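The following library-independent sketch illustrates the reduction argument with hypothetical counts:
```
#include <cstdio>

int main() {
  // toy 2d histogram with 2 regular bins per axis plus flow bins:
  // index 0 = underflow, 1..2 = regular bins, 3 = overflow
  int counts[4][4] = {{0, 1, 0, 0},
                      {2, 5, 3, 1},  // cell (1, 3) pairs an in-range
                      {0, 4, 6, 2},  // value with an overflow value
                      {0, 0, 1, 0}};

  // reduce to the first axis: sum over all cells of the second axis
  for (int i = 0; i < 4; ++i) {
    int with_flow = 0, without_flow = 0;
    for (int j = 0; j < 4; ++j) with_flow += counts[i][j];
    for (int j = 1; j < 3; ++j) without_flow += counts[i][j];
    // without flow bins, pairs like (1, 3) are lost and the reduced
    // histogram is distorted along the remaining axis
    std::printf("axis index %d: %d vs %d\n", i - 1, with_flow,
                without_flow);
  }
}
```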
[endsect]
@@ -108,22 +108,22 @@ Under- and overflow bins are useful in one-dimensional histograms, and nearly es
Once a histogram is filled, the bin counter can be accessed with the `at(...)` method. The standard counter type has a `value()` method to return the count ['k]. It also offers a `variance()` method, which returns an estimate ['v] of the [@https://en.wikipedia.org/wiki/Variance variance] of that count.
If the input values for the histogram come from a [@https://en.wikipedia.org/wiki/Stochastic_process stochastic process], the variance provides useful additional information. Examples for a stochastic process are a physics experiment or a random person filling out a questionaire[footnote The choices of the person are most likely not random, but if we pick a random person from a group, we randomly sample from a pool of opinions]. The variance ['v] is the square of the [@https://en.wikipedia.org/wiki/Standard_deviation standard deviation]. The standard deviation is a number that tells us how much we can expect the observed value to fluctuate if we or someone else would repeat our experiment with new random input.
If the input values for the histogram come from a [@https://en.wikipedia.org/wiki/Stochastic_process stochastic process], the variance provides useful additional information. Examples for a stochastic process are a physics experiment or a random person filling out a questionnaire [footnote The choices of the person are most likely not random, but if we pick a random person from a group, we randomly sample from a pool of opinions]. The variance ['v] is the square of the [@https://en.wikipedia.org/wiki/Standard_deviation standard deviation]. The standard deviation is a number that tells us how much we can expect the observed value to fluctuate if we or someone else would repeat our experiment with new random input.
Variance estimates are useful in many ways:
* Error bars: Drawing an [@https://en.wikipedia.org/wiki/Error_bar error bar] over the interval ['(k - sqrt(v), k + sqrt(v))] is a simple visualisation of the expected random scatter of the bin value ['k], if the histogram was cleared and filled again with another independent sample of the same size (e.g. by repeating the physics experiment or asking more people to fill a questionaire). If you compare the result with a fitted model (see next item), about 2/3 of the error bars should overlap with the model, if the model is correct.
* Error bars: Drawing an [@https://en.wikipedia.org/wiki/Error_bar error bar] over the interval ['(k - sqrt(v), k + sqrt(v))] is a simple visualization of the expected random scatter of the bin value ['k], if the histogram was cleared and filled again with another independent sample of the same size (e.g. by repeating the physics experiment or asking more people to fill a questionnaire). If you compare the result with a fitted model (see next item), about 2/3 of the error bars should overlap with the model, if the model is correct.
* Least-squares fitting: Often you have a model of the expected number of counts ['lambda] per bin, which is a function of parameters with unknown values. A simple method to find good (sometimes the best) estimates for those parameter values is to vary them until the sum of squared residuals ['(k - lambda)^2/v] is minimized. This is the [@https://en.wikipedia.org/wiki/Least_squares method of least squares], in which both the bin values ['k] and variance estimates ['v] enter.
* Pull distributions: If you have two histograms filled with the same number of samples and you want to know whether they are in agreement, you can compare the so-called pull distribution. It is formed by subtracting the counts and dividing by the square root of their variances ['(k1 - k2)/sqrt(v1 + v2)]. If the histograms are identical, the pull distribution randomly scatters around zero, and about 2/3 of the values are in the interval ['[ -1, 1]].
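As a sketch with made-up numbers, the pull computation looks like this in code, using the plug-in variance estimate ['v = k] discussed below:
```
#include <cmath>
#include <cstdio>

int main() {
  // two bin counts from independent histograms and their variance
  // estimates (here simply v = k)
  double k1 = 100, v1 = 100;
  double k2 = 110, v2 = 110;

  // if both histograms sample the same distribution, pulls scatter
  // around zero and about 2/3 of them fall into [-1, 1]
  double pull = (k1 - k2) / std::sqrt(v1 + v2);
  std::printf("pull = %.3f\n", pull);
}
```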
Why return the variance ['v] and not the standard deviation ['s = sqrt(v)]? The reason is the additivity of variances. [@https://en.wikipedia.org/wiki/Variance#Properties Variances of independent samples can be added] like normal numbers ['v3 = v1 + v2]. This is not true for standard deviations, where the addition law is more complex ['s3 = sqrt(s1^2 + s2^2)]. In that sense, the variance is more straight-forward to use during data processing. It is also more efficient for the same reason. The user can take the square-root at the end of the processing obtain the standard deviation as needed.
Why return the variance ['v] and not the standard deviation ['s = sqrt(v)]? The reason is that variances can be trivially added. [@https://en.wikipedia.org/wiki/Variance#Properties Variances of independent samples can be added] like normal numbers ['v3 = v1 + v2]. This is not true for standard deviations, where the addition law is more complex ['s3 = sqrt(s1^2 + s2^2)]. In that sense, the variance is more straightforward to use during data processing. It is also more efficient for the same reason. The user can take the square-root at the end of the processing to obtain the standard deviation as needed.
How is the variance estimate ['v] computed? If we know the expected number of counts ['lambda] per bin, we could compute the variance as ['v = lambda], because counts in a histogram follow the [@https://en.wikipedia.org/wiki/Poisson_distribution Poisson distribution]
[footnote
The Poisson distribution is correct as far as the counts ['k] themselves are of interest. If the fractions per bin ['p = k / N] are of interest, where ['N] is the total number of counts, then the correct distribution to describe the fractions is the [@https://en.wikipedia.org/wiki/Multinomial_distribution multinomial distribution].
]. After filling a histogram, we do not know the expected number of counts ['lambda] for any particular bin, but we know the observed count ['k], which is not too far from ['lambda]. We therefore might be tempted to just replace ['lambda] with ['k] in the formula ['v = lambda = k]. This is in fact the so-called non-parameteric estimate for the variance based on the [@https://en.wikipedia.org/wiki/Plug-in_principle plug-in principle]. It is the best (and only) estimate for the variance, if we know nothing more about the underlying stochastic process which generated the inputs (or want to feign ignorance about it).
]. After filling a histogram, we do not know the expected number of counts ['lambda] for any particular bin, but we know the observed count ['k], which is not too far from ['lambda]. We therefore might be tempted to just replace ['lambda] with ['k] in the formula ['v = lambda = k]. This is in fact the so-called non-parametric estimate for the variance based on the [@https://en.wikipedia.org/wiki/Plug-in_principle plug-in principle]. It is the best (and only) estimate for the variance, if we know nothing more about the underlying stochastic process which generated the inputs (or want to feign ignorance about it).
Now, if the number returned by the `variance()` method is just the same as the number returned by the `value()` method, why bother with adding a `variance()` method? There is a reason, which becomes apparent when histograms are filled with weights, as discussed next.
@@ -156,7 +156,7 @@ Serialization is implemented using [@boost:/libs/serialization/index.html Boost.
[section Comparison to Boost.Accumulators]
Boost.Histogram has a minor overlap with [@boost:/libs/accumulators/index.html Boost.Accumulators], but the scopes are rather different. The statistical accumulators `density` and `weighted_density` in Boost.Accumulators generate one-dimensional histograms. The axis range and the bin widths are determined automatically from a cached sample of initial values. They cannot be used for multi-dimensional data. Boost.Histogram focusses on multi-dimensional data and gives the user full control of how the binning should be done for each dimension.
Boost.Histogram has a minor overlap with [@boost:/libs/accumulators/index.html Boost.Accumulators], but the scopes are rather different. The statistical accumulators `density` and `weighted_density` in Boost.Accumulators generate one-dimensional histograms. The axis range and the bin widths are determined automatically from a cached sample of initial values. They cannot be used for multi-dimensional data. Boost.Histogram focuses on multi-dimensional data and gives the user full control of how the binning should be done for each dimension.
Automatic binning is not an option for Boost.Histogram, because it does not scale well to many dimensions. Because of the Curse of Dimensionality, a prohibitive number of samples would need to be collected.
@@ -176,11 +176,11 @@ Recommendation:
[section Why is Boost.Histogram not built on top of Boost.MultiArray?]
Boost.MultiArray implements a multi-dimensional array, it also converts an index tuple into a global index that is used to access an element in the array. Boost.Histogram and Boost.MultiArray share this functionality, but Boost.Histogram cannot use Boost.MultiArray as a backend. Boost.MultiArray does not allow to change the element type dynamically, like it is needed to implement the adaptive storage mentioned further up. Using a variant type as the element type of a Boost.MultiArray would not work, because it creates this wasteful layout:
Boost.MultiArray implements a multi-dimensional array; it also converts an index tuple into a global index that is used to access an element in the array. Boost.Histogram and Boost.MultiArray share this functionality, but Boost.Histogram cannot use Boost.MultiArray as a back-end. Boost.MultiArray does not allow changing the element type dynamically, as is needed to implement the adaptive storage mentioned further up. Using a variant type as the element type of a Boost.MultiArray would not work, because it creates this wasteful layout:
`[type-index 1][value 1][type-index 2][value 2]...`
A type index is stored for each cell. Moreover, the variant is always as large as the largest type in the union, so there is no way to safe memory by using a smaller type when the bin count is low, as it is done by the adaptive storage. The adaptive storage uses only one type-index for the whole array and allocates a homogenous array of values of the same type that exactly matches their sizes, creating the following layout:
A type index is stored for each cell. Moreover, the variant is always as large as the largest type in the union, so there is no way to save memory by using a smaller type when the bin count is low, as the adaptive storage does. The adaptive storage uses only one type index for the whole array and allocates a homogeneous array of values of a single type that exactly matches the required size, creating the following layout:
`[type-index][value 1][value 2][value 3]...`
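The memory cost of the per-cell variant can be demonstrated directly; this is a sketch, and the exact sizes are implementation-defined:
```
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <variant>
#include <vector>

int main() {
  constexpr std::size_t n = 1000;

  // variant per cell: every element carries a type index and is
  // padded to the size of the largest alternative
  std::vector<std::variant<std::uint8_t, std::uint64_t>> per_cell(n);

  // one type index for the whole array: elements are tightly packed
  std::variant<std::vector<std::uint8_t>,
               std::vector<std::uint64_t>> per_array{
      std::vector<std::uint8_t>(n)};

  std::printf("per cell : %zu bytes\n", n * sizeof(per_cell[0]));
  std::printf("per array: %zu bytes\n",
              std::get<std::vector<std::uint8_t>>(per_array).size());
}
```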
@@ -188,4 +188,20 @@ There is only one type index and the number of allocated bytes for the array can
[endsect]
[endsect]
[section Why does `axis_type::size()` return a signed integer?]
The standard library returns a container size as an unsigned integer, because container sizes cannot be negative. The `size()` method of the histogram class follows this rule, but the `size()` method of the builtin axis types does not. Why?
It is necessary to avoid serious bugs in user code. An axis type represents a mapping from values to bin indices. If an input value falls below the first bin, which has the index 0, the axis returns the index -1, the index of the underflow bin. The axis index therefore has to be represented by a signed integer.
Now [@https://www.youtube.com/watch?v=wvtFGa6XJDU something awful happens when you compare -1 with an unsigned integer]: `-1 < 1u == false`. This follows from the integer conversion rules in C++: the signed value is converted to the unsigned type before the comparison. If `axis_type::size()` returned an unsigned integer, the following innocent-looking loop over an axis, which is meant to include the underflow and overflow bins, would have a serious bug in it:
```
// axis is an instance of a builtin axis type
for (int i = -1; i <= axis.size(); ++i) {
  // if size() returned an unsigned integer, i would be converted to a
  // huge unsigned value and the loop body would never execute
}
```
Therefore it must return a signed integer.
[endsect]
[endsect]