[section Rationale]
This library is based on a decade of experience working with big data in the fields of particle physics and astroparticle physics.
The design is guided by these principles:
* "Do one thing and do it well", Doug McIlroy
* The [@https://www.python.org/dev/peps/pep-0020 Zen of Python] (also applies to other languages).
The library is built on advice from C++ experts such as Bjarne Stroustrup, Scott Meyers, Herb Sutter, Andrei Alexandrescu, and Chandler Carruth.
The library was written with two major goals in mind:
* Provide the same interface for one-dimensional and multi-dimensional histograms. This makes the interface easier to learn. A consistent multi-dimensional interface also makes it easier to change the code if a histogram is revised to have more axes. The one-dimensional case can be conveniently treated as a special case of the multi-dimensional one, as the sketch after this list illustrates. Thanks to template metaprogramming, the compiler can emit optimised code for the one-dimensional case.
* Completely hide the details of how the counting is done. Other implementations, notably those in the [@https://root.cern.ch ROOT framework], expose this detail and force the user to make a choice which is potentially dangerous. At best, the choice is merely inefficient, but it can lead to information loss in the form of overflowing or capped counters.
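The following minimal sketch illustrates the uniform interface. It assumes the factory function `make_static_histogram` and the `fill`/`value` methods referenced elsewhere in this documentation; treat the exact signatures as illustrative rather than authoritative.

    #include <boost/histogram.hpp> // umbrella header, assumed

    namespace bh = boost::histogram;

    int main() {
        // one-dimensional: 10 equal-width bins over [0, 1)
        auto h1 = bh::make_static_histogram(bh::axis::regular<>(10, 0.0, 1.0));
        h1.fill(0.23);
        auto c1 = h1.value(2); // count in the third bin

        // two-dimensional: same interface, one input value per axis
        auto h2 = bh::make_static_histogram(bh::axis::regular<>(10, 0.0, 1.0),
                                            bh::axis::regular<>(5, -1.0, 1.0));
        h2.fill(0.23, 0.5);
        auto c2 = h2.value(2, 3); // count in bin (2, 3)
    }
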
[section Structure]
The library consists of three orthogonal components:
* [link histogram.rationale.histogram_types histogram types]: Host classes which define the user interface and are responsible for holding axis objects. The two implementations differ in the way axis objects are stored.
* [link histogram.rationale.axis_types axis types]: Defines how value ranges are mapped to bins. Several axis types are provided which implement different specializations.
* [link histogram.rationale.storage_types storage types]: Manages memory to hold bin counts. The requirements for a storage differ from those of an STL container. Two implementations are provided.
[endsect]
[section:histogram_types Histogram types]
Histograms have a number of axes. An axis defines a mapping of ranges of input values to bin indices. The library supports several [link histogram.rationale.axis_types axis specializations]. Both the number of axes and their types may be known at compile time or only at runtime, depending on how the library is used.
Users can choose between two histogram implementations. The implementation is selected with the first template argument, see [classref boost::histogram::histogram]. The static implementation is faster (see [link histogram.benchmarks benchmark]), because the compiler is able to inline more code and some run-time bookkeeping and type casting are avoided. It is also able to catch many user errors at compile time rather than at runtime.
The static approach is impractical when the user wants to create histograms at runtime, for example, from Python or from a graphical user interface. The second implementation addresses this and allows one to set the number of axes and their types at runtime. The interface of the dynamic implementation is a superset of the static implementation. With the help of the included utility functions, it is possible to write code which works transparently with either implementation, allowing users to switch the implementation at any point.
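A hedged sketch of the difference, using the `make_static_histogram` and `make_dynamic_histogram` factory functions (names assumed from this version of the library):

    #include <boost/histogram.hpp> // umbrella header, assumed

    namespace bh = boost::histogram;

    int main() {
        // static: the number of axes and their types are fixed at compile time
        auto hs = bh::make_static_histogram(bh::axis::regular<>(10, 0.0, 1.0),
                                            bh::axis::integer<>(0, 5));

        // dynamic: axes are chosen at runtime, e.g. from user input;
        // the dynamic interface is a superset of the static one
        auto hd = bh::make_dynamic_histogram(bh::axis::regular<>(10, 0.0, 1.0),
                                             bh::axis::integer<>(0, 5));

        // code written against the common interface works with either
        hs.fill(0.5, 2);
        hd.fill(0.5, 2);
    }
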
[endsect]
[section:axis_types Axis types]
An axis defines which range of input values is mapped to which bin. The logic is encapsulated in an axis type. The library comes with five axis types, which implement different specializations.
* [classref boost::histogram::axis::regular] sorts real numbers into bins with equal width.
* [classref boost::histogram::axis::variable] sorts real numbers into bins with varying width.
* [classref boost::histogram::axis::circular] is a specialization of a regular axis for angles and other input that wraps around.
* [classref boost::histogram::axis::integer] is a specialization of a regular axis for a continuous range of integers.
* [classref boost::histogram::axis::category] is a specialization of an integer axis for categorical data, like "men" and "women".
Library users can create their own axis classes and use them with the library, by providing a class compatible with the axis concept.
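The sketch below constructs one axis of each type. The constructor signatures are assumptions based on the descriptions above; consult the reference for the authoritative forms.

    #include <boost/histogram.hpp> // umbrella header, assumed
    #include <string>

    namespace bh = boost::histogram;

    int main() {
        auto h = bh::make_static_histogram(
            bh::axis::regular<>(10, 0.0, 1.0),          // 10 equal-width bins over [0, 1)
            bh::axis::variable<>({0.0, 0.1, 0.5, 1.0}), // bin edges given explicitly
            bh::axis::circular<>(8),                    // 8 bins that wrap around, e.g. angles
            bh::axis::integer<>(0, 5),                  // one bin per integer in a contiguous range
            bh::axis::category<std::string>({"men", "women"}) // one bin per category
        );
    }
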
[endsect]
[section:storage_types Storage types]
Dense (aka contiguous) storage in memory is needed for fast bin lookup, which is random-access and may happen in a tight loop. All storage types therefore implement dense storage of bin counters. [classref boost::histogram::array_storage] implements a storage based on a heap-allocated array. That could be the end of the story, but this approach has several issues. For one, it is not convenient: the user has to decide which type should hold the bin counts, and the choice is not obvious. The integer type needs to be large enough to avoid counter overflow, but if it is too large and only a fraction of its bits are used, memory is wasted. Using floating point numbers is even more dangerous: they do not overflow, but silently cap the bin count once the bits in the mantissa are used up.
The standard storage used in the library is [classref boost::histogram::adaptive_storage], which solves these issues in an effective way, based on the following insight.
The [@https://en.wikipedia.org/wiki/Curse_of_dimensionality curse of dimensionality] becomes a problem when the number of histogram axes grows, since the number of bins grows exponentially. High-dimensional histograms can consume GBs of memory. However, having many bins typically reduces the number of counts per bin, since the input values are spread over many more bins now. This suggests an adaptive solution: start with a minimum amount of memory per bin by using the smallest integer type to hold a count. If the bin counter is about to overflow, switch to the next larger integer type.
We start with 1 byte per bin counter and then double the size as needed, until 8 bytes per bin are reached. When even that is not enough, we switch to the [@boost:/libs/multiprecision/index.html Boost.Multiprecision] type `cpp_int`, whose capacity is limited only by available memory. This approach is not only memory-conserving, but also ultimately safe, because bin counters cannot overflow.
And now comes the best part: this approach is often even faster, despite several overheads. We require dense storage, and so need to resize all bin counters to the new size whenever any one counter is about to overflow. A new memory block is allocated with a uniformly larger cell size, then the content of the old memory is copied, and finally the old memory block is deallocated. This is a slow operation, but it happens only O(log N) times for N bin increments. There is also a runtime overhead involved in addressing bin counters of varying size. Nevertheless, the benchmarks show that this approach is faster for larger histograms than using a fixed but large cell size, because the more compact size leads to better utilization of the CPU cache.
In a sense, [classref boost::histogram::adaptive_storage adaptive_storage] is the opposite of a `std::vector`, which keeps the size of the stored type constant, but grows to hold a larger number of elements. Here, the number of elements remains the same, but the storage grows to hold a uniform collection of larger and larger elements.
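The following toy sketch, in plain C++ and deliberately not the library's implementation, illustrates the core idea: promote all counters to a uniformly wider integer type just before any single counter would overflow.

    #include <cstdint>
    #include <limits>
    #include <vector>

    // toy illustration only: copy all counters into a uniformly wider type
    template <typename Big, typename Small>
    std::vector<Big> grow(const std::vector<Small>& counters) {
        return std::vector<Big>(counters.begin(), counters.end());
    }

    int main() {
        std::vector<std::uint8_t> counts(1000, 0);
        std::size_t i = 42; // bin to increment
        if (counts[i] == std::numeric_limits<std::uint8_t>::max()) {
            // the next increment would overflow: switch to 2 bytes per bin,
            // a slow operation that happens only O(log N) times for N increments
            auto wide = grow<std::uint16_t>(counts);
            ++wide[i]; // continue counting in the wider array
        } else {
            ++counts[i];
        }
    }
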
[endsect]
[section:uoflow Overflow and underflow bins]
The library supports extra bins that count values which fall below or above the range covered by the axis. These extra bins are called under- and overflow bins, respectively. The extra bins can be turned off individually for each axis at runtime to conserve memory, but are turned on by default. The extra bins do not interfere with normal bin counting. On an axis with `n` bins, the first bin has the index `0`, the last bin `n-1`, while the under- and overflow bins are accessible at the indices `-1` and `n`, respectively.
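A hedged sketch of this indexing convention, using the `fill` and `value` methods described in this documentation:

    #include <boost/histogram.hpp> // umbrella header, assumed

    namespace bh = boost::histogram;

    int main() {
        // axis with n = 10 bins covering [0, 1)
        auto h = bh::make_static_histogram(bh::axis::regular<>(10, 0.0, 1.0));
        h.fill(-0.5); // below range: lands in the underflow bin
        h.fill(0.05); // in range: lands in the first bin
        h.fill(2.0);  // above range: lands in the overflow bin
        auto under = h.value(-1); // underflow bin at index -1
        auto first = h.value(0);  // first regular bin at index 0
        auto last  = h.value(9);  // last regular bin at index n-1
        auto over  = h.value(10); // overflow bin at index n
    }
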
Under- and overflow bins are useful in one-dimensional histograms, and nearly essential in multi-dimensional histograms. Here are the advantages:
* No loss: The total sum over all bin counts is strictly equal to the number of times `fill(...)` was called. Even NaN values are counted; by convention, they end up in the underflow bin.
* Diagnosis: Unexpected extreme values show up in the extra bins, which otherwise might have been overlooked.
* Reducibility: In multi-dimensional histograms, an out-of-range value along one axis may be paired with an in-range value along another axis. If under- and overflow bins are missing, such a value pair is lost, which distorts the histogram even along the axis where the value was in range. When under- and overflow bins are present, it is possible to reduce the histogram (by summing along the other axes) to a lower-dimensional version and get the same result as if one had filled the lower-dimensional histogram directly.
[endsect]
[section:variance Variance estimates]
Once a histogram is filled, the count ['k] in a bin can be queried with the `value(...)` method. The histogram also offers a `variance(...)` method, which returns an estimate of the [@https://en.wikipedia.org/wiki/Variance variance] ['v] of that count.
If the input values for the histogram come from a [@https://en.wikipedia.org/wiki/Stochastic_process stochastic process], the variance provides useful additional information. Examples of stochastic processes are a physics experiment or a random person filling out a questionnaire[footnote The choices of the person are most likely not random, but if we pick a random person from a group, we randomly sample from a pool of opinions]. The variance ['v] is the square of the [@https://en.wikipedia.org/wiki/Standard_deviation standard deviation]. The standard deviation in a bin tells us how much we can expect the observed value to fluctuate if the experiment were repeated with new random input.
Variance estimates are useful in many ways:
* Error bars: Drawing a line (a so-called [@https://en.wikipedia.org/wiki/Error_bar error bar]) over the interval ['(k - sqrt(v), k + sqrt(v))] is a simple visualisation of the expected random scatter of the bin value ['k], if the histogram were cleared and filled again with another independent sample of the same size (e.g. by repeating the physics experiment or asking more people to fill out the questionnaire). If you compare the result with a fitted model (see next item), about 2/3 of the error bars should overlap with the model if the model is correct. A sketch of how to compute such intervals follows this list.
* Least-squares fitting: Often you have a model of the expected number of counts ['lambda] per bin, which is a function of parameters with unknown values. A simple method to find good (sometimes the best) estimates for those parameter values is to vary them until the sum of squared residuals ['(k - lambda)^2/v] is minimized. This is the [@https://en.wikipedia.org/wiki/Least_squares method of least squares], in which both the bin values ['k] and variance estimates ['v] enter.
* Pull distributions: If you have two histograms filled with the same number of samples and you want to know whether they are in agreement, you can inspect the so-called pull distribution. It is formed by subtracting the counts and dividing by the square root of the sum of their variances ['(k1 - k2)/sqrt(v1 + v2)]. If the histograms are identical, the pull distribution randomly scatters around zero, and about 2/3 of the values are in the interval ['[ -1, 1 ]].
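The following minimal sketch computes such error bar intervals; it assumes the histogram construction from the earlier sketches and the `value`/`variance` methods described above.

    #include <cmath>
    #include <cstdio>
    #include <boost/histogram.hpp> // umbrella header, assumed

    namespace bh = boost::histogram;

    int main() {
        auto h = bh::make_static_histogram(bh::axis::regular<>(10, 0.0, 1.0));
        // ... fill h with data ...
        for (int i = 0; i < 10; ++i) {
            const double k = h.value(i);
            const double s = std::sqrt(h.variance(i)); // standard deviation
            // error bar: the interval (k - sqrt(v), k + sqrt(v))
            std::printf("bin %d: %g +/- %g\n", i, k, s);
        }
    }
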
Why return the variance ['v] and not the standard deviation ['s = sqrt(v)]? There are two reasons:
* Additivity: [@https://en.wikipedia.org/wiki/Variance#Properties Variances of independent samples can be added] like normal numbers ['v3 = v1 + v2]. This is not true for standard deviations, where the addition law is more complex ['s3 = sqrt(s1^2 + s2^2)]. In that sense, the variance is easier to use during data processing.
* Efficiency: Computing the variance requires fewer instructions. Therefore we return the variance and let the user optionally take the square root to obtain the standard deviation as needed.
How is the variance estimate ['v] computed? If we know the expected number of counts ['lambda] per bin, we could compute the variance as ['v = lambda], because counts in a histogram follow the [@https://en.wikipedia.org/wiki/Poisson_distribution Poisson distribution]
[footnote
The Poisson distribution is correct as far as the counts ['k] themselves are of interest. If the fractions per bin ['p = k / N] are of interest, where ['N] is the total number of counts, then the correct distribution to describe the fractions is the [@https://en.wikipedia.org/wiki/Multinomial_distribution multinomial distribution].
]. After filling a histogram, we do not know the expected number of counts ['lambda] for any particular bin, but we know the observed count ['k], which is not too far from ['lambda]. We therefore might be tempted to just replace ['lambda] with ['k] in the formula ['v = lambda = k]. This is in fact the so-called non-parametric estimate for the variance based on the [@https://en.wikipedia.org/wiki/Plug-in_principle plug-in principle]. It is the best (and only) estimate for the variance if we know nothing more about the underlying stochastic process which generated the inputs (or want to feign ignorance about it).
Now, if the value returned by the method `variance(...)` is just the same as the value returned by `value(...)`, why bother with adding a `variance(...)` method, except perhaps for convenience? There is another reason, which becomes apparent if the histograms are filled with weighted counts, which is discussed next.
[endsect]
[section:weights Support of weights]
A histogram categorizes input values and increments a bin counter if an input value falls into the value range covered by that bin. The [classref boost::histogram::adaptive_storage standard storage] uses integer types to store these counts; see the [link histogram.rationale.storage_types storage section] for how integer overflow is avoided. However, sometimes histograms need to be filled with values that have a weight ['w] attached to them. In this case, the corresponding bin counter is not increased by one, but by the passed weight ['w].
[note
There are many use cases for weighted fills. Here is an example from particle physics. Let us say the value to be histogrammed is the recorded energy of some particle that hit a detector. Let us further say that the detector sometimes misses the particle if its energy is small (because it does not generate enough signal to be noticed above the noise level). If the efficiency of the detector is known as a function of the energy, one can correct for the expected loss by weighting each energy with the inverse of the efficiency, which compensates the loss on average.
]
When the [classref boost::histogram::adaptive_storage adaptive_storage] is used, histograms may also be filled with weighted values. The choice of using weighted fills can be made at run-time. If the call `fill(..., weight(x))` is used, two doubles per bin are stored (previous integer counts are automatically converted). The first double keeps track of the sum of weights. The second double keeps track of the sum of weights squared. The latter is the variance estimate in this case and is returned by a call to `variance(...)`.
[note
This variance estimate can be derived from the [@https://en.wikipedia.org/wiki/Variance#Properties mathematical properties of the variance]. Let us say a bin is filled ['k1] times with weight ['w1]. The sum of weights is then ['w1 k1]. It follows from the properties of the variance that ['Var(w1 k1) = w1^2 Var(k1)]. Using the reasoning from before, the estimated variance of ['k1] is ['k1], so that ['Var(w1 k1) = w1^2 Var(k1) = w1^2 k1]. Variances of independent samples are additive. If the bin is further filled ['k2] times with weight ['w2], the sum of weights is ['w1 k1 + w2 k2], with variance ['w1^2 k1 + w2^2 k2]. This also holds for ['k1 = k2 = 1]. Therefore, the variance of a sum of weights ['w[i]] is the sum of the squared weights ['w[i]^2]. In other words, to incrementally keep track of the variance of the sum of weights, one needs to keep track of the sum of squared weights.
]
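A sketch of the bookkeeping this implies: per bin, the storage effectively maintains the sum of weights and the sum of squared weights. The `weighted_counter` struct below is an illustration of the idea, not the library's internal type.

    // illustration of the per-bin bookkeeping for weighted fills
    struct weighted_counter {
        double sum_w = 0;  // sum of weights, returned by value(...)
        double sum_w2 = 0; // sum of squared weights, returned by variance(...)
        void add(double w) { sum_w += w; sum_w2 += w * w; }
    };

    int main() {
        weighted_counter bin;
        bin.add(1.0); // an unweighted fill is a weighted fill with w = 1
        bin.add(2.5); // value grows by 2.5, variance by 2.5 * 2.5
    }
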
[endsect]
[section Python support]
Python is a popular scripting language in the data science community. Thus, the library provides Python bindings. The histogram may be used as an interface between a complex simulation or data-storage system written in C++ and data-analysis/plotting in Python. Users are able to define the histogram in Python, let it be filled on the C++ side, and then get it back for further data analysis or plotting.
Data analysis in Python is Numpy-based, so Numpy support is included. If the number of dimensions is larger than one, this implementation is faster than the equivalent Numpy functions (while being more flexible), see [link histogram.benchmarks benchmark].
[note
The Python and C++ interfaces use consistent naming, but sometimes Python offers more elegant and pythonic ways of implementing things. Where possible, the more pythonic interface is used.
Properties: Getter/setter-like functions are wrapped as properties. Examples: `histogram.dim`, `regular_axis.bins`.
Axis access: The `histogram` object implements the sequence protocol and behaves like an immutable sequence of axes. Examples: `len(histogram)` returns the number of axes, `histogram[0]` accesses the first axis, `for axis in histogram: [...]`.
Keyword-based parameters: the member function call `fill(..., weight(x))` in C++ is translated into a Python member function call `fill(..., weight=x)`.
]
[endsect]
[section Serialization]
Serialization is implemented using [@boost:/libs/serialization/index.html Boost.Serialization]. Pickling in Python is implemented on top of the C++ serialization code. In the current implementation, the pickled stream is *not* portable, since it uses the binary archives of Boost.Serialization (`boost::archive::binary_oarchive` and `binary_iarchive`). It would be great to switch to a portable binary representation in the future, when that becomes available.
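A minimal sketch of saving a histogram with Boost.Serialization; the `boost/histogram/serialization.hpp` header providing the `serialize` overloads is an assumption.

    #include <fstream>
    #include <boost/archive/binary_oarchive.hpp>
    #include <boost/histogram/serialization.hpp> // assumed header with serialize() support

    template <typename Histogram>
    void save(const Histogram& h, const char* filename) {
        std::ofstream ofs(filename, std::ios::binary);
        boost::archive::binary_oarchive oa(ofs);
        oa << h; // non-portable binary stream, as noted above
    }
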
[endsect]
[endsect]