separate out rationale into extra section, making overview shorter

This commit is contained in:
Hans Dembinski 2019-08-23 08:59:50 +02:00
parent 93f4b4b1d8
commit f869fb16d7
4 changed files with 120 additions and 114 deletions

View File

@ -28,7 +28,7 @@ In the example above, the compiler knows the number of axes and their type at co
[guide_make_dynamic_histogram]
[note
When the axis types are known at compile-time, the histogram internally holds them in a `std::tuple`, which is very efficient and avoids a memory allocation from the heap. If they are only known at run-time, it stores the axes in a `std::vector`. The interface hides this difference, but not perfectly. There are a corner cases where the difference becomes apparent. The [link histogram.overview.rationale.structure.host rationale] has more details on this point.
When the axis types are known at compile-time, the histogram internally holds them in a `std::tuple`, which is very efficient and avoids a memory allocation from the heap. If they are only known at run-time, it stores the axes in a `std::vector`. The interface hides this difference, but not perfectly. There are a corner cases where the difference becomes apparent. The [link histogram.overview.structure.host overview] has more details on this point.
]
The factory function named [headerref boost/histogram/make_histogram.hpp make_histogram] uses the default storage type, which provides safe counting, is fast, and memory efficient. If you want to create a histogram with another storage type, use [headerref boost/histogram/make_histogram.hpp make_histogram_with]. To learn more about other storage types and how to create your own, have a look at the section [link histogram.guide.expert Advanced Usage].
@ -154,7 +154,7 @@ Users may write their own transforms and use them with the builtin [classref boo
[*Under- and overflow bins]
By default, under- and overflow bins are added automatically for each axis (except if adding them makes no sense). If you create an axis with 20 bins, the histogram will actually have 22 bins along that axis. The extra bins are very useful, as explained in the [link histogram.overview.rationale.uoflow rationale]. If the input cannot exceed the axis range, you can disable the extra bins to save memory. Example:
By default, under- and overflow bins are added automatically for each axis (except if adding them makes no sense). If you create an axis with 20 bins, the histogram will actually have 22 bins along that axis. The extra bins are very useful, as explained in the [link histogram.rationale.uoflow rationale]. If the input cannot exceed the axis range, you can disable the extra bins to save memory. Example:
[import ../examples/guide_axis_with_uoflow_off.cpp]
[guide_axis_with_uoflow_off]
@ -201,7 +201,7 @@ After the histogram has been filled, use [memberref boost::histogram::histogram:
[import ../examples/guide_fill_histogram.cpp]
[guide_fill_histogram]
For a histogram `hist`, the calls `hist(weight(w), ...)` and `hist(..., weight(w))` increment the bin counter by the value `w` instead, where `w` may be an integer or floating point number. The helper function [funcref boost::histogram::weight() weight()] marks this argument as a weight, so that it can be distinguished from the other inputs. It can be the first or last argument. You can freely mix calls with and without a weight. Calls without `weight` act like the weight is `1`. Why weighted increments are sometimes useful is explained [link histogram.overview.rationale.weights in the rationale].
For a histogram `hist`, the calls `hist(weight(w), ...)` and `hist(..., weight(w))` increment the bin counter by the value `w` instead, where `w` may be an integer or floating point number. The helper function [funcref boost::histogram::weight() weight()] marks this argument as a weight, so that it can be distinguished from the other inputs. It can be the first or last argument. You can freely mix calls with and without a weight. Calls without `weight` act like the weight is `1`. Why weighted increments are sometimes useful is explained [link histogram.rationale.weights in the rationale].
[note The default storage loses its no-overflow-guarantee when you pass floating point weights, but maintains it for integer weights.]

View File

@ -70,6 +70,14 @@ Boost.Histogram provides an easy-to-use, fast, and extensible multi-dimensional
The performance of this library is compared with other popular libraries.
]
]
[
[
[link histogram.rationale Rationale]
]
[
Thoughts behind the design for contributors and future maintainers of the library.
]
]
[
[
[link histogram.history Revision history]
@ -86,6 +94,7 @@ Boost.Histogram provides an easy-to-use, fast, and extensible multi-dimensional
[include benchmarks.qbk]
[include concepts.qbk]
[xinclude reference_pp.xml]
[include rationale.qbk]
[include changelog.qbk]
[*Acknowledgments]

View File

@ -23,18 +23,90 @@ Finally, the histogram can be configured to store an accumulator in each cell. A
[endsect]
[section:motivation Motivation]
[section:structure Structure]
C++ lacks a widely-used, free multi-dimensional histogram class. While it is easy to write a one-dimensional histogram, writing a general multi-dimensional histogram poses more of a challenge. If you add a few more features required by scientific professionals onto the wish-list, then the implementation becomes non-trivial and a well-tested library solution desirable.
The library consists of three orthogonal components:
The [@https://www.gnu.org/software/gsl GNU Scientific Library (GSL)] and the [@https://root.cern.ch ROOT framework] from CERN have histogram implementations. The GSL has histograms for one and two dimensions in C. The implementations are not customizable. ROOT has well-tested implementations of histograms, but they are not customizable and they are not easy to use correctly. ROOT also has new implementations in beta-stage similar to this one, but they cannot be used without the rest of ROOT, which is a huge library to install just to get histograms.
* [link histogram.overview.structure.host histogram host class]: The histogram host class defines the public user interface and holds axis objects (one for each dimension) and a storage object. The user can chose whether axis objects are stored in a static tuple, a vector, or a vector of variants.
The templated histogram class in this library has a minimalistic interface, which strives to be as elegant as the GSL implementations. In addition, it is very customizable and extensible through user-provided classes. A single implementation is used for one and multi-dimensional histograms. While being safe, customizable, and convenient, the histogram is also very fast. The static version, which has an axis configuration that is hard-coded at compile-time, is faster than any tested competitor.
* [link histogram.overview.structure.axis axis types]: Defines how input values are mapped to bins. Several axis types are provided which implement different specializations. Users can make their own axis types following the axis concept and use them with the library. A variant type is provided, which can hold one of several concrete axis types.
One of the central design goals was to provide an abstract interface to the internal bin counters. The internal counting mechanism is encapsulated in a storage class, which can be replaced at compile-time. The default storage uses an adaptive memory management which is safe to use, memory-efficient, and fast. The safety comes from the guarantee, that counts cannot overflow or be capped. This is a rare guarantee, hardly found in other libraries. In the standard configuration, the histogram /just works/ under any circumstance. Yet, users with unusual requirements can implement their own custom storage class or use an alternative builtin array-based storage.
* [link histogram.overview.structure.storage storage types]: Manages a collection of bin counters. The requirements for a storage differ from those of an STL container, it needs to follow the storage concept. Two implementations are provided.
[section:host Histogram host class]
Histograms store axis objects and a storage object. A one-dimensional histogram has one axis, a multi-dimensional histogram has several. When you pass an input tuple, say (v1, v2, v3), then the first axis will map v1 onto index i1, the second axis v2 onto i2, and so on, to generate the index tuple (i1, i2, i3). The histogram host class then converts these indices into a linear global index that is used to address bin counter in the storage object.
[note
To understand the need for multi-dimensional histograms, think of point coordinates. If all points that you consider lie on a line, you need only one value to describe the point. If all points lie in a plane, you need two values to describe the position. Three values are needed for a point in space. A histogram puts a discrete grid over the line, the plane or the space, and counts how many points lie in each cell of the grid. To approximate a point distribution on a line, a 1d-histogram is sufficient. To do the same in 3d-space, one needs a 3d-histogram.
]
This library supports different axis types, so that the user can customize how the mapping is done exactly, see [link histogram.overview.structure.axis axis types]. Users can furthermore chose between several ways of storing axis types in the histogram.
When the number and types of the axes are known at compile-time, the histogram host class stores axis types in a `std::tuple`. We call this a /static histogram/. To access a particular axis, one should use a compile-time number as index (a run-time index also works with some limitations). A static histogram is extremely fast (see [link histogram.benchmarks benchmark]), because there is no overhead and the compiler can inline code, unroll loops, and more. Also, many user errors are can be caught at compile-time rather than run-time.
Static histograms are the best kind, but cannot be used when histograms are to be created with an axis configuration that is only known at run-time. This is the case, for example, when histograms are created from Python or from a graphical user interface. Therefore also more dynamic kinds of histograms are supported.
There are two levels of dynamism. The histogram can hold instances of a single axis type in a `std::vector`. Now the number of axis instances per histogram can vary at run-time, but the axis type must be the same for all instances. We call this a /semi-dynamic histogram/.
If also the axis types need to vary at run-time, one can place `boost::histogram::axis::variant` type in a `std::vector`, which can hold one of a set of different concrete axis types. We call this a /dynamic histogram/. The dynamic histogram is a single type that can store arbitrary sequences of different axes types, which may be generated at run-time. The polymorphic behavior of the generic `boost::histogram::axis::variant` type has a run-time cost, however. Typically, the performance is reduced by a factor of two compared to a static histogram.
[note
The design decision to store axis types in the variant-like type `boost::histogram::axis::variant` has several advantages over forms of run-time polymorphism. Firstly, it guarantees that axis objects which belong to the same histogram are stored locally together in memory, which reduces cache misses when the histogram iterates over axis objects in a tight loop, which it often does. Secondly, each axis can accept a different value type in this scheme. Classic polymorphism with vtables requires that all classes share the same method call signatures, but we want different axis to allowed to accepted different types of arguments. Variants work well for this case.
]
[endsect]
[include rationale.qbk]
[section:axis Axis types]
An axis defines an injective mapping of (a range of) input values to a bin. The logic is encapsulated in an axis type. Users can create their own axis classes and use them with the library, by implementing the [link histogram.concepts.Axis [*Axis] concept]. The library comes with four builtin types, which implement different specializations.
* [classref boost::histogram::axis::regular] sorts real numbers into bins with equal width. The regular axis also supports monotonic transforms, which are applied when the input values are passed to the axis. This can be used to make a fast logarithmic axis, where the bins have equal width in the logarithm of the variable.
* [classref boost::histogram::axis::variable] sorts real numbers into bins with varying width.
* [classref boost::histogram::axis::integer] is a specialization of a regular axis for a range of integers with unit bin width. It is much faster than a regular axis.
* [classref boost::histogram::axis::category] is a bijective mapping of unique values onto bin indices and vice versa. This can be used with discrete categorical data, like "red", "green", "blue", for example.
Each builtin axis type has a few compile-time options, which change its behavior.
* All axis types can have an optional overflow bin. When the overflow bin is enabled and an input value is above the range covered by the axis, it is not discarded but counted in the overflow bin.
* All axis types except the category axis can have an optional underflow bin. When the underflow bin is enabled and an input value is below the range covered by the axis, it is not discarded but counted in the underflow bin.
* All axis types except the category axis can be circular, meaning that the axis range is periodic. This is useful for periodic data like polar angles.
* All axis types can optionally grow. When an input value is outside of the range of an axis which is configured to grow, the range of the axis is extended until the value is in range. This option is incompatible with the circular option, only either can be active.
[endsect]
[section:storage Storage types]
A storage type holds the actual cell values. It uses a one-dimensional index for cell lookup, computed by the histogram host from the indices generated by its axes. The storage needs to know nothing about axes. Users can integrate their own storage classes with the library, by implementing the [link histogram.concepts.Storage storage concept]. Standard containers can be used as storage backends, the library adapts them with the [classref boost::histogram::storage_adaptor].
Cell lookup is often happening in a tight loop and is random-access. A normal `std::vector` works well as a storage backend. Sometimes this is the best solution, but there are some caveats to this approach. The user has to decide which type should represents the cell counts and it is not an obvious choice. An integer type needs to be large enough to avoid counter overflow, but only a fraction of the bits are used if its capacity is too large. This is a waste of memory then. When memory is wasted, more cache misses occur and performance is degraded (see the benchmarks). The performance of modern CPUs depends a lot on effective utilization of the CPU cache, which is still small. Using floating point numbers instead of integers is also dangerous. They don't overflow, but cap the bin count when the bits in the mantissa are used up.
The default storage used in the library is [classref boost::histogram::unlimited_storage]. It solves these issues with a dynamic counter type management, based on the following insight. The [@https://en.wikipedia.org/wiki/Curse_of_dimensionality curse of dimensionality] makes the total number of bins grow very fast as the dimension of the histogram grows. However, having many bins also reduces the typical number of counts per bin, since the input values are spread over many more bins now. This means a small counter is often sufficient for high-dimensional histograms.
The default storage therefore starts with a minimum amount of memory per cell, it uses an 1 byte. If the count in any cell is about to overflow, all cells switch to the next larger integer type simultaneously. This goes on, the capacity per cell is always doubled when it is used up, until 8 byte per bin are reached. The following images illustrate this progression for a storage of 3 bin counters. A new memory block is always allocated for all counters, when the first one of them hits its capacity limit.
[$../storage_3_uint8.svg]
[$../storage_3_uint16.svg]
[$../storage_3_uint32.svg]
[$../storage_3_uint64.svg]
When even that is not enough, the default storage switches to a multiprecision type similar to those in [@boost:/libs/multiprecision/index.html Boost.Multiprecision], whose capacity is limited only by available memory. The following image is not to scale:
[$../storage_3_cpp_int.svg]
This approach is not only memory conserving, but also provides the strong guarantee that bin counters cannot overflow.
[note
The no-overflow-guarantee only applies when the histogram is not using weighted fills or if all weights are integral numbers. When floating point weights are used, the default storage switches to a double counter per cell to store the sum of such weights. A double cannot provide the no-overflow-guarantee.
]
The best part: this approach is even faster for a histogram with sufficient size despite the run-time overheads of handling the counter type dynamically. The benchmarks show that the gains from better cache usage outweigh the run-time overheads of dynamic dispatching to the right bin counter type and the occasional allocation costs. Doubling the size of the bin counters each time helps, because the allocations happen only O(logN) times for N increments.
[endsect]
[endsect]
[endsect]

View File

@ -7,9 +7,21 @@
[section:rationale Rationale]
[section:motivation Motivation]
C++ lacks a widely-used, free multi-dimensional histogram class. While it is easy to write a one-dimensional histogram, writing a general multi-dimensional histogram poses more of a challenge. If a few more features required by scientific professionals are added onto the wish-list, then the implementation becomes non-trivial and a well-tested library solution desirable.
The [@https://www.gnu.org/software/gsl GNU Scientific Library (GSL)] and the [@https://root.cern.ch ROOT framework] from CERN have histogram implementations. The GSL has histograms for one and two dimensions in C. The implementations are not customizable. ROOT has well-tested implementations of histograms, but they are not customizable and they are not easy to use correctly. ROOT also has new implementations in beta-stage similar to this one, but they are still less flexible, not easy to use, and they cannot be used without the rest of ROOT, which is a huge library to install just to get histograms.
The templated histogram class in this library has a minimal interface and focuses on the core task of creating histograms from input data. It is very customizable and extensible through user-provided classes. A single implementation is used for one and multi-dimensional histograms. While being safe, customizable, and convenient, the histogram is also very fast. The static version, which has an axis configuration that is hard-coded at compile-time, is faster than any tested competitor.
One of the central design goals was to hide the implementation details of the internal counters of the histogram. The internal counting mechanism is encapsulated in a storage class, which can be switched out. The default storage uses an adaptive memory management which is safe to use, memory-efficient, and fast. The safety comes from the guarantee, that counts cannot overflow or be capped. This is a rare guarantee, hardly found in other libraries. In the standard configuration, the histogram /just works/ under any circumstance. Yet, users with special requirements can implement their own custom storage class or use an alternative builtin array-based storage.
[endsect]
[section:guidelines Guidelines]
This library was written based on a decade of experience collected in working with big data, more precisely in the field of particle physics and astroparticle physics. The design is guided by advice from people like Bjarne Stroustrup, Scott Meyers, Herb Sutter, and Andrei Alexandrescu, and Chandler Carruth. I also like the [@https://www.python.org/dev/peps/pep-0020 Zen of Python] (also applies to other languages). I also borrowed ideas from the [@https://eigen.tuxfamily.org/ Eigen library].
This library was written based on a decade of experience collected in working with big data, more precisely in the field of particle physics and astroparticle physics. The design is guided by advice from people like Bjarne Stroustrup, Scott Meyers, Herb Sutter, and Andrei Alexandrescu, and Chandler Carruth. The [@https://www.python.org/dev/peps/pep-0020 Zen of Python] (also applies to other languages) was an inspiration and well as ideas from the [@https://eigen.tuxfamily.org/ Eigen library]. The feature set was designed to be a superset of what is offered by the [@https://root.cern.ch ROOT framework] and the [@https://www.gnu.org/software/gsl GNU Scientific Library (GSL)].
Design goals of the library:
@ -19,93 +31,7 @@ Design goals of the library:
* Hide the details of how the bin counters work. This design allows for interesting implementations, such as the default storage that provides a no-overflow-guarantee, which no other library offers.
* STL and Boost compatibility. The library should be compatible with STL algorithms and other Boost libraries, especially [@boost:/libs/accumulators/index.html Boost.Accumulators] and [@boost:/libs/range/index.html Boost.Range]. This potentiates what one can do with the library. Features provided by compatible libraries do not need to be explicitly implemented in this library.
[endsect]
[section:structure Structure]
The library consists of three orthogonal components:
* [link histogram.overview.rationale.structure.host histogram host class]: The histogram host class defines the public user interface and holds axis objects (one for each dimension) and a storage object. The user can chose whether axis objects are stored in a static tuple, a vector, or a vector of variants.
* [link histogram.overview.rationale.structure.axis axis types]: Defines how input values are mapped to bins. Several axis types are provided which implement different specializations. Users can make their own axis types following the axis concept and use them with the library. A variant type is provided, which can hold one of several concrete axis types.
* [link histogram.overview.rationale.structure.storage storage types]: Manages a collection of bin counters. The requirements for a storage differ from those of an STL container, it needs to follow the storage concept. Two implementations are provided.
[section:host Histogram host class]
Histograms store axis objects and a storage object. A one-dimensional histogram has one axis, a multi-dimensional histogram has several. When you pass an input tuple, say (v1, v2, v3), then the first axis will map v1 onto index i1, the second axis v2 onto i2, and so on, to generate the index tuple (i1, i2, i3). The histogram host class then converts these indices into a linear global index that is used to address bin counter in the storage object.
[note
To understand the need for multi-dimensional histograms, think of point coordinates. If all points that you consider lie on a line, you need only one value to describe the point. If all points lie in a plane, you need two values to describe the position. Three values are needed for a point in space. A histogram puts a discrete grid over the line, the plane or the space, and counts how many points lie in each cell of the grid. To approximate a point distribution on a line, a 1d-histogram is sufficient. To do the same in 3d-space, one needs a 3d-histogram.
]
This library supports different axis types, so that the user can customize how the mapping is done exactly, see [link histogram.overview.rationale.structure.axis axis types]. Users can furthermore chose between several ways of storing axis types in the histogram.
When the number and types of the axes are known at compile-time, the histogram host class stores axis types in a `std::tuple`. We call this a /static histogram/. To access a particular axis, one should use a compile-time number as index (a run-time index also works with some limitations). A static histogram is extremely fast (see [link histogram.benchmarks benchmark]), because there is no overhead and the compiler can inline code, unroll loops, and more. Also, many user errors are can be caught at compile-time rather than run-time.
Static histograms are the best kind, but cannot be used when histograms are to be created with an axis configuration that is only known at run-time. This is the case, for example, when histograms are created from Python or from a graphical user interface. Therefore also more dynamic kinds of histograms are supported.
There are two levels of dynamism. The histogram can hold instances of a single axis type in a `std::vector`. Now the number of axis instances per histogram can vary at run-time, but the axis type must be the same for all instances. We call this a /semi-dynamic histogram/.
If also the axis types need to vary at run-time, one can place `boost::histogram::axis::variant` type in a `std::vector`, which can hold one of a set of different concrete axis types. We call this a /dynamic histogram/. The dynamic histogram is a single type that can store arbitrary sequences of different axes types, which may be generated at run-time. The polymorphic behavior of the generic `boost::histogram::axis::variant` type has a run-time cost, however. Typically, the performance is reduced by a factor of two compared to a static histogram.
[note
The design decision to store axis types in the variant-like type `boost::histogram::axis::variant` has several advantages over forms of run-time polymorphism. Firstly, it guarantees that axis objects which belong to the same histogram are stored locally together in memory, which reduces cache misses when the histogram iterates over axis objects in a tight loop, which it often does. Secondly, each axis can accept a different value type in this scheme. Classic polymorphism with vtables requires that all classes share the same method call signatures, but we want different axis to allowed to accepted different types of arguments. Variants work well for this case.
]
[endsect]
[section:axis Axis types]
An axis defines an injective mapping of (a range of) input values to a bin. The logic is encapsulated in an axis type. Users can create their own axis classes and use them with the library, by implementing the [link histogram.concepts.Axis [*Axis] concept]. The library comes with four builtin types, which implement different specializations.
* [classref boost::histogram::axis::regular] sorts real numbers into bins with equal width. The regular axis also supports monotonic transforms, which are applied when the input values are passed to the axis. This can be used to make a fast logarithmic axis, where the bins have equal width in the logarithm of the variable.
* [classref boost::histogram::axis::variable] sorts real numbers into bins with varying width.
* [classref boost::histogram::axis::integer] is a specialization of a regular axis for a range of integers with unit bin width. It is much faster than a regular axis.
* [classref boost::histogram::axis::category] is a bijective mapping of unique values onto bin indices and vice versa. This can be used with discrete categorical data, like "red", "green", "blue", for example.
Each builtin axis type has a few compile-time options, which change its behavior.
* All axis types can have an optional overflow bin. When the overflow bin is enabled and an input value is above the range covered by the axis, it is not discarded but counted in the overflow bin.
* All axis types except the category axis can have an optional underflow bin. When the underflow bin is enabled and an input value is below the range covered by the axis, it is not discarded but counted in the underflow bin.
* All axis types except the category axis can be circular, meaning that the axis range is periodic. This is useful for periodic data like polar angles.
* All axis types can optionally grow. When an input value is outside of the range of an axis which is configured to grow, the range of the axis is extended until the value is in range. This option is incompatible with the circular option, only either can be active.
[endsect]
[section:storage Storage types]
A storage type holds the actual cell values. It uses a one-dimensional index for cell lookup, computed by the histogram host from the indices generated by its axes. The storage needs to know nothing about axes. Users can integrate their own storage classes with the library, by implementing the [link histogram.concepts.Storage storage concept]. Standard containers can be used as storage backends, the library adapts them with the [classref boost::histogram::storage_adaptor].
Cell lookup is often happening in a tight loop and is random-access. A normal `std::vector` works well as a storage backend. Sometimes this is the best solution, but there are some caveats to this approach. The user has to decide which type should represents the cell counts and it is not an obvious choice. An integer type needs to be large enough to avoid counter overflow, but only a fraction of the bits are used if its capacity is too large. This is a waste of memory then. When memory is wasted, more cache misses occur and performance is degraded (see the benchmarks). The performance of modern CPUs depends a lot on effective utilization of the CPU cache, which is still small. Using floating point numbers instead of integers is also dangerous. They don't overflow, but cap the bin count when the bits in the mantissa are used up.
The default storage used in the library is [classref boost::histogram::unlimited_storage]. It solves these issues with a dynamic counter type management, based on the following insight. The [@https://en.wikipedia.org/wiki/Curse_of_dimensionality curse of dimensionality] makes the total number of bins grow very fast as the dimension of the histogram grows. However, having many bins also reduces the typical number of counts per bin, since the input values are spread over many more bins now. This means a small counter is often sufficient for high-dimensional histograms.
The default storage therefore starts with a minimum amount of memory per cell, it uses an 1 byte. If the count in any cell is about to overflow, all cells switch to the next larger integer type simultaneously. This goes on, the capacity per cell is always doubled when it is used up, until 8 byte per bin are reached. The following images illustrate this progression for a storage of 3 bin counters. A new memory block is always allocated for all counters, when the first one of them hits its capacity limit.
[$../storage_3_uint8.svg]
[$../storage_3_uint16.svg]
[$../storage_3_uint32.svg]
[$../storage_3_uint64.svg]
When even that is not enough, the default storage switches to a multiprecision type similar to those in [@boost:/libs/multiprecision/index.html Boost.Multiprecision], whose capacity is limited only by available memory. The following image is not to scale:
[$../storage_3_cpp_int.svg]
This approach is not only memory conserving, but also provides the strong guarantee that bin counters cannot overflow.
[note
The no-overflow-guarantee only applies when the histogram is not using weighted fills or if all weights are integral numbers. When floating point weights are used, the default storage switches to a double counter per cell to store the sum of such weights. A double cannot provide the no-overflow-guarantee.
]
The best part: this approach is even faster for a histogram with sufficient size despite the run-time overheads of handling the counter type dynamically. The benchmarks show that the gains from better cache usage outweigh the run-time overheads of dynamic dispatching to the right bin counter type and the occasional allocation costs. Doubling the size of the bin counters each time helps, because the allocations happen only O(logN) times for N increments.
[endsect]
* Minimalism, STL and Boost compatibility. Focus the library on the task of creating histograms. Functionality on top of that (drawing, further processing...) should come from other libraries. This gives users maximum flexibility to mix and match libraries. The histogram provides iterators ranges that allow other libraries access to the histogram state. The library provides iterators to access its internal counters, making it compatible with STL algorithms and other Boost libraries. In addition, the library was made compatible with [@boost:/libs/accumulators/index.html Boost.Accumulators] and [@boost:/libs/range/index.html Boost.Range].
[endsect]
@ -135,7 +61,7 @@ The presence of the extra bins does not interfere with normal indexing. On an ax
The standard library returns a container size as an unsigned integer, because a container size cannot be negative. The `size()` method of the histogram class follows this rule, but the `size()` methods of axis types return a signed integral type. Why?
As explained in the [link histogram.overview.rationale.uoflow section about under- and overflow], a histogram axis may have an optional underflow bin, which is addressed by the index `-1`. It follows that the index type must be signed integer for all axis types.
As explained in the [link histogram.rationale.uoflow section about under- and overflow], a histogram axis may have an optional underflow bin, which is addressed by the index `-1`. It follows that the index type must be signed integer for all axis types.
The `size()` method of any axis returns the same signed integer type. The size of an axis cannot be negative, but this choice has two advantages. Firstly, the value returned by `size()` itself is guaranteed to be a valid index, which is good since it may address the overflow bin. Secondly, comparisons between an index and the value returned by `size()` are frequent. If `size()` returned an unsigned integral type, compilers would produce a warning for each comparisons, and rightly so. [@https://www.youtube.com/watch?v=wvtFGa6XJDU Something awful happens] on most machines when you compare `-1` with an unsigned integer, `-1 < 1u == false`, which causes a serious bug in the following innocent-looking loop:
```
@ -149,19 +75,19 @@ The advantages clearly override the disadvantages of this choice.
[endsect]
[section:real_index_type Continuous axis maps real-valued indices to values]
[section:real_index_type Continuous axis accepts real-valued cell index]
An axis has a method which converts an index into the equivalent value at that index. If the axis is continuous, there are many possible values in the interval between two adjacent integer indices. User often want to access the center of such an interval. The easiest and most efficient way to compute the center value is to accept real-valued indices in the conversion method. Then, the center of the first bin between index `i` and `i+1` is simply obtained by passing `i+0.5` to the method which computes the value.
Each axis has a method called `value(index_type)` which converts an index into the equivalent value at that index. If the axis is continuous, there are many possible values in the interval between two adjacent integer indices. User often want to access the center of such an interval. An easy and very efficient way to access the center value is for this method to accept real-valued indices. Then, the center of the first bin between index `i` and `i+1` is simply obtained by passing `i+0.5`.
This scheme is computationally efficient and intuitive. It is so useful that each continuous axis is required to accept a real-valued index. Library code relies on this to detect whether an axis is continuous or discrete. It introspects the argument type of the conversion function and checks whether it is a floating point or integral type, respectively.
This scheme is computationally efficient and intuitive. Each continuous axis is required to accept a real-valued index, in fact, internal library code relies uses this to detect whether an axis is continuous or discrete.
[endsect]
[section:variance Variance estimates]
[section:variance On variance estimates]
Once a histogram is filled, the bin counter can be accessed with the `at(...)` method. The standard counter type has a `value()` method to return the count ['k]. It also offers a `variance()` method, which returns an estimate ['v] of the [@https://en.wikipedia.org/wiki/Variance variance] of that count.
Once a histogram is filled, the bin counter can be accessed with the `at(...)` method. Some accumulators offer a `value()` method to return the cell value ['k] and a `variance()` method, which returns an estimate ['v] of the [@https://en.wikipedia.org/wiki/Variance variance] of that cell.
If the input values for the histogram come from a [@https://en.wikipedia.org/wiki/Stochastic_process stochastic process], the variance provides useful additional information. Examples for a stochastic process are a physics experiment or a random person filling out a questionnaire [footnote The choices of the person are most likely not random, but if we pick a random person from a group, we randomly sample from a pool of opinions]. The variance ['v] is the square of the [@https://en.wikipedia.org/wiki/Standard_deviation standard deviation]. The standard deviation is a number that tells us how much we can expect the observed value to fluctuate if we or someone else would repeat our experiment with new random input.
If the input values for the histogram come from a [@https://en.wikipedia.org/wiki/Stochastic_process stochastic process], the variance estimate provides useful additional information. Examples for a stochastic process are a physics experiment or a random person filling out a questionnaire [footnote The choices of the person are most likely not random, but if we pick a random person from a group, we randomly sample from a pool of opinions]. The variance ['v] is the square of the [@https://en.wikipedia.org/wiki/Standard_deviation standard deviation]. The standard deviation is a number that tells us how much we can expect the observed value to fluctuate if we or someone else would repeat our experiment with new random input.
Variance estimates are useful in many ways:
@ -171,33 +97,32 @@ Variance estimates are useful in many ways:
* Pull distributions: If you have two histograms filled with the same number of samples and you want to know whether they are in agreement, you can compare the so-called pull distribution. It is formed by subtracting the counts and dividing by the square root of their variances ['(k1 - k2)/sqrt(v1 + v2)]. If the histograms are identical, the pull distribution randomly scatters around zero, and about 2/3 of the values are in the interval ['[ -1, 1]].
Why return the variance ['v] and not the standard deviation ['s = sqrt(v)]? The reason is that variances can be trivially added. [@https://en.wikipedia.org/wiki/Variance#Properties Variances of independent samples can be added] like normal numbers ['v3 = v1 + v2]. This is not true for standard deviations, where the addition law is more complex ['s3 = sqrt(s1^2 + s2^2)]. In that sense, the variance is more straight-forward to use during data processing. It is also more efficient for the same reason. The user can take the square-root at the end of the processing obtain the standard deviation as needed.
Why return the variance ['v] and not the standard deviation ['s = sqrt(v)]? The reason is that variances can be trivially added and it is computationally more efficient to return the variance. [@https://en.wikipedia.org/wiki/Variance#Properties Variances of independent samples can be added] like normal numbers ['v3 = v1 + v2]. This is not true for standard deviations, where the addition law is more complex ['s3 = sqrt(s1^2 + s2^2)]. In that sense, the variance is more straight-forward to use during data processing. The user can take the square-root at the end of the processing obtain the standard deviation as needed.
How is the variance estimate ['v] computed? If we know the expected number of counts ['lambda] per bin, we could compute the variance as ['v = lambda], because counts in a histogram follow the [@https://en.wikipedia.org/wiki/Poisson_distribution Poisson distribution]
How is the variance estimate ['v] computed for a normal counting histogram? If we know the expected number of counts ['lambda] per bin, we could compute the variance as ['v = lambda], because counts in a histogram follow the [@https://en.wikipedia.org/wiki/Poisson_distribution Poisson distribution]
[footnote
The Poisson distribution is correct as far as the counts ['k] themselves are of interest. If the fractions per bin ['p = k / N] are of interest, where ['N] is the total number of counts, then the correct distribution to describe the fractions is the [@https://en.wikipedia.org/wiki/Multinomial_distribution multinomial distribution].
]. After filling a histogram, we do not know the expected number of counts ['lambda] for any particular bin, but we know the observed count ['k], which is not too far from ['lambda]. We therefore might be tempted to just replace ['lambda] with ['k] in the formula ['v = lambda = k]. This is in fact the so-called non-parametric estimate for the variance based on the [@https://en.wikipedia.org/wiki/Plug-in_principle plug-in principle]. It is the best (and only) estimate for the variance, if we know nothing more about the underlying stochastic process which generated the inputs (or want to feign ignorance about it).
Now, if the number returned by the `variance()` method is just the same as the number return by `value()` method, why bother with adding a `variance()` method? There is a reason, which becomes apparent if the histograms are filled with weights, which is discussed next.
[endsect]
[section:weights Support of weighted fills]
A histogram sorts input values into bins and increments a bin counter if an input value falls into the range covered by that bin. The [classref boost::histogram::unlimited_storage standard storage] uses integer types to store these counts, see the [link histogram.overview.rationale.structure.storage storage section] how integer overflow is avoided. However, sometimes histograms need to be filled with values that have a weight ['w] attached to them. In this case, the corresponding bin counter is not increased by one, but by the weight value ['w].
A histogram sorts input values into bins and increments a bin counter if an input value falls into the range covered by that bin. The [classref boost::histogram::unlimited_storage standard storage] uses integer types to store these counts, see the [link histogram.overview.structure.storage storage section] how integer overflow is avoided. However, sometimes histograms need to be filled with values that have a weight ['w] attached to them. In this case, the corresponding bin counter is not increased by one, but by the weight value ['w].
[note
There are several use-cases for weighted increments. The main use in particle physics is to adapt simulated data of an experiment to real data. Simulations are needed to determine various corrections and efficiencies, but a simulated experiment is almost never a perfect replica of the real experiment. In addition, simulations are expensive to do. So, when deviations in a simulated distribution of a variable are found, one typically does not rerun the simulations, but assigns weights to match the simulated distribution to the real one.
]
When the [classref boost::histogram::unlimited_storage unlimited_storage] is used, histograms may also be filled with weighted value tuples. The choice of using weighted fills can be made at run-time. If the call `operator()(weight(x), ...)` is used, two doubles per bin are stored (previous integer counts are automatically converted). The first double keeps track of the sum of weights. The second double keeps track of the sum of weights squared, which is the variance estimate in this case. The former is accessed with the `value()` method of the bin counter, and the latter with the `variance()` method.
[note
When the [classref boost::histogram::weight_storage weight_storage] is used, histograms may be filled with weighted value tuples. Two real numbers per bin are stored in this case. The first keeps track of the sum of weights. The second keeps track of the sum of weights squared, which is the variance estimate in this case. The former is accessed with the `value()` method of the bin counter, and the latter with the `variance()` method.
Why the sum of weights squared is the variance estimate can be derived from the [@https://en.wikipedia.org/wiki/Variance#Properties mathematical properties of the variance]. Let us say a bin is filled ['k1] times with a fixed weight ['w1]. The sum of weights is then ['w1 k1]. It then follows from the variance properties that ['Var(w1 k1) = w1^2 Var(k1)]. Using the reasoning from before, the estimated variance of ['k1] is ['k1], so that ['Var(w1 k1) = w1^2 Var(k1) = w1^2 k1]. Variances of independent samples are additive. If the bin is further filled ['k2] times with weight ['w2], the sum of weights is ['w1 k1 + w2 k2], with variance ['w1^2 k1 + w2^2 k2]. This also holds for ['k1 = k2 = 1]. Therefore, the sum of weights ['w[i]] has variance sum of ['w[i]^2]. In other words, to incrementally keep track of the variance of the sum of weights, we need to keep track of the sum of weights squared.
]
[endsect]
[section Python support]
Python is a popular scripting language in the data science community. Thus, the library must be designed to support Python bindings, which are developed separately. The histogram is usable as an interface between a complex simulation or data-storage system written in C++ and data-analysis/plotting in Python. Users are able to define a histogram in Python, let it be filled on the C++ side, and then get it back for further data analysis or plotting.
Python is a popular scripting language in the data science community. Thus, the library must be designed to support Python bindings, which are developed separately. The histogram should usable as an interface between a complex simulation or data-storage system written in C++ and data-analysis/plotting in Python. Users are able to define a histogram in Python, let it be filled on the C++ side, and then get it back for further data analysis or plotting.
This is a major reason why a purely static design was rejected, where the histogram must be fully configured at compile-time. While this generates more efficient code, it does not work with Python, which requires one to configure histograms at run-time without recompiling the code.
[endsect]
@ -213,7 +138,7 @@ The histogram class is a valid range and can be used with the [@boost:/libs/rang
[endsect]
[section Serialization]
[section Support of serialization]
Serialization is implemented using [@boost:/libs/serialization/index.html Boost.Serialization]. It would be great to have a portable binary archive with support for floating point data to store and retrieve histograms efficiently, which is currently not available. The library has to be open for other serialization libraries.