histogram/doc/rationale.qbk
2017-02-11 18:41:44 +00:00


[section Rationale]
I designed the histogram based on a decade of experience working with Big Data, more precisely in the fields of particle physics and astroparticle physics. In many ways, the [@https://root.cern.ch ROOT] histograms served as an example of *how not to do it*, and my annoyance with them led to this library.
All work should be guided by principles. Mine are:
* "Do one thing and do it well", Doug McIlroy
* The [@https://www.python.org/dev/peps/pep-0020 Zen of Python] (also applies to other languages).
I also follow the advice of popular C++ experts: Bjarne Stroustrup, Scott Meyers, Herb Sutter, Andrei Alexandrescu, and Chandler Carruth.
[section Language transparency]
Python is a great language for data analysis, so the histogram needs Python bindings. The histogram should be usable as an interface between a complex simulation or data-storage system written in C++ and data-analysis/plotting in Python: define the histogram in Python, let it be filled on the C++ side, and then get it back for further data analysis or plotting.
Data analysis in Python is Numpy-based, so Numpy support is a must.
The Python and C++ interfaces try to be consistent, but sometimes Python offers more elegant and pythonic ways of implementing things. Where possible, the more pythonic interface is used.
* Properties: getter/setter-like functions are wrapped as properties.
* Keyword-based parameters: the C++ member functions `histogram::fill` and `histogram::wfill` are wrapped by the single Python member function `histogram.increment`, which takes an optional keyword parameter `w` to pass a weight.
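
The wrapping pattern can be sketched in plain Python. This is only an illustration of the idea, not the actual binding code: the class `ToyHistogram`, its private methods `_fill` and `_wfill` (stand-ins for the two C++ member functions), and the property `sum` are hypothetical names invented for this example.

```python
class ToyHistogram:
    """Hypothetical Python-side wrapper; all names here are illustrative."""

    def __init__(self):
        self._sum = 0.0

    def _fill(self, x):
        # Stand-in for the C++ member function histogram::fill.
        self._sum += 1.0

    def _wfill(self, x, weight):
        # Stand-in for the C++ member function histogram::wfill.
        self._sum += weight

    def increment(self, x, w=None):
        """Single entry point: dispatch on the optional keyword `w`."""
        if w is None:
            self._fill(x)
        else:
            self._wfill(x, w)

    @property
    def sum(self):
        """Getter-like function exposed as a read-only property."""
        return self._sum


h = ToyHistogram()
h.increment(0.3)         # unweighted fill
h.increment(0.7, w=2.5)  # weighted fill
```

The caller sees one function and one attribute, while the two-function split stays hidden on the C++ side.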
[endsect]
[section Powerful binning strategies]
The histogram supports five different binning strategies, conveniently encapsulated in axis objects. There is the standard sorting of real-valued data into bins of equal or varying width, but also binning of angles or integer values.
Extra bins that count over- and underflow values are added by default. This feature can be turned off individually for each axis. The extra bins do not disturb normal bin counting. On an axis with `n` bins, the first bin has the index `0`, the last bin `n-1`, while the under- and overflow bins are accessible at `-1` and `n`, respectively.
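
The index convention can be illustrated with a minimal sketch of an equal-width axis. The class `ToyRegularAxis` and the helper `fill` are hypothetical and are not the library's axis objects. The storage layout, with the underflow cell placed first, is an assumption of this sketch.

```python
class ToyRegularAxis:
    """Hypothetical axis with n equal-width bins over [lower, upper)."""

    def __init__(self, n, lower, upper):
        self.n = n
        self.lower = lower
        self.upper = upper

    def index(self, x):
        """Return 0..n-1 for in-range values, -1 for underflow, n for overflow."""
        z = (x - self.lower) / (self.upper - self.lower)
        if z < 0.0:
            return -1
        if z >= 1.0:
            return self.n
        # Guard against floating-point rounding pushing the result up to n.
        return min(int(z * self.n), self.n - 1)


def fill(counts, axis, x):
    # counts holds n + 2 cells; shifting by one maps index -1 to cell 0.
    counts[axis.index(x) + 1] += 1


axis = ToyRegularAxis(4, 0.0, 2.0)
counts = [0] * (axis.n + 2)
for value in (-0.1, 0.0, 0.5, 1.99, 2.0, 5.0):
    fill(counts, axis, value)
```

In this sketch the normal bins keep the indices `0` to `n-1` whether or not the extra cells exist, which is why the extra bins do not disturb normal bin counting.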
[endsect]
[section Performance and memory-efficiency]
Dense storage in memory is a must for high performance. Unfortunately, the [@https://en.wikipedia.org/wiki/Curse_of_dimensionality curse of dimensionality] quickly becomes a problem as the number of dimensions grows, leading to histograms which consume large amounts (up to GBs) of memory.
Fortunately, having many dimensions typically reduces the number of counts per bin, since counts get spread over many dimensions. The histogram uses an adaptive count size per bin to be as memory-efficient as possible: it starts with the smallest integer size of 1 byte per bin and increases it as needed, up to 8 bytes. A `std::vector` grows in *length* as new elements are added, while the count storage grows in *depth*.
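
Growth in depth can be sketched with the standard `array` module. This is a toy model, not the library's storage: the class `ToyAdaptiveStorage` and the helper `_typecode_for` are hypothetical, and the doubling sequence of 1, 2, 4 and 8 bytes is an assumption of this sketch.

```python
from array import array


def _typecode_for(nbytes):
    """Pick an unsigned array typecode with the requested item size."""
    for code in "BHILQ":
        if array(code).itemsize == nbytes:
            return code
    raise ValueError("no unsigned typecode with %d bytes" % nbytes)


class ToyAdaptiveStorage:
    """Hypothetical count storage; all cells share one integer width."""

    def __init__(self, size):
        # Start with the smallest width: 1 byte per cell.
        self._cells = array(_typecode_for(1), [0] * size)

    @property
    def bytes_per_cell(self):
        return self._cells.itemsize

    def __getitem__(self, i):
        return self._cells[i]

    def increment(self, i):
        limit = (1 << (8 * self._cells.itemsize)) - 1
        if self._cells[i] == limit:
            self._grow()  # widen all cells before the count would overflow
        self._cells[i] += 1

    def _grow(self):
        wider = self._cells.itemsize * 2
        if wider > 8:
            raise OverflowError("count does not fit into 8 bytes")
        # Copy the counts into a new array with wider cells.
        self._cells = array(_typecode_for(wider), self._cells.tolist())


storage = ToyAdaptiveStorage(3)
for _ in range(256):
    storage.increment(0)
```

The number of cells never changes here. Only the width of each cell does, and it changes only once a count actually needs the extra range.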
[endsect]
[section Weighted counts and variance estimates]
A histogram categorizes and counts, so the natural choice for the data type of the counts is an integer. However, in particle physics, histograms are also often filled with weighted events, for example, to make sure that two histograms look the same in one variable, while the distribution of another, correlated variable is the subject of study.
The histogram can be filled with either weighted or unweighted counts. In the weighted case, the sum of weights is stored in a `double`. The histogram provides a variance estimate in both cases. In the unweighted case, the estimate is computed from the count itself, using Poisson statistics. In the weighted case, the sum of squared weights is stored alongside the sum of weights in a second `double`, and used to compute a variance estimate.
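
The two variance estimates can be written down in a few lines. The names `ToyWeightedBin` and `poisson_variance` are hypothetical and only illustrate the bookkeeping described above. They are not part of the library.

```python
def poisson_variance(count):
    """Unweighted case: for a Poisson-distributed count, the variance
    estimate equals the count itself."""
    return float(count)


class ToyWeightedBin:
    """Hypothetical bin holding the two sums kept in the weighted case."""

    def __init__(self):
        self.sum_w = 0.0   # sum of weights: the bin content
        self.sum_w2 = 0.0  # sum of squared weights: the variance estimate

    def fill(self, w):
        self.sum_w += w
        self.sum_w2 += w * w

    def value(self):
        return self.sum_w

    def variance(self):
        return self.sum_w2


weighted = ToyWeightedBin()
for w in (0.5, 2.0, 1.5):
    weighted.fill(w)
```

If every weight is `1`, the sum of squared weights equals the number of fills, so the weighted estimate reduces to the Poisson estimate for an unweighted count.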
[endsect]
[section Serialization]
Serialization is implemented using `boost::serialization`. Pickling in Python is implemented based on the C++ serialization code. In the current implementation, the pickled stream is *not* portable, since it uses `boost::archive::binary_archive`. It would be great to switch to a portable binary representation in the future, when that becomes available.
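
The delegation from pickling to a serialization routine can be sketched as follows. Everything here is hypothetical: `ToyCounts`, `serialize` and `deserialize` stand in for the real classes and the C++ serialization code. The fixed little-endian layout is a choice made for this sketch. A native binary archive makes no such guarantee, which is why its output is not portable.

```python
import struct


class ToyCounts:
    """Hypothetical container whose pickling reuses its serialization code."""

    def __init__(self, counts):
        self.counts = list(counts)

    def serialize(self):
        # Stand-in for the C++ serialization code: a length prefix followed
        # by the counts, each as an unsigned 64-bit little-endian integer.
        n = len(self.counts)
        return struct.pack("<Q%dQ" % n, n, *self.counts)

    @classmethod
    def deserialize(cls, blob):
        (n,) = struct.unpack_from("<Q", blob, 0)
        return cls(struct.unpack_from("<%dQ" % n, blob, 8))

    def __reduce__(self):
        # pickle stores the callable and its arguments and calls
        # deserialize(blob) when the object is loaded again.
        return (ToyCounts.deserialize, (self.serialize(),))


original = ToyCounts([3, 0, 7])
restored = ToyCounts.deserialize(original.serialize())
```

With `__reduce__` defined this way, `pickle.dumps` and `pickle.loads` run through the same byte stream as `serialize` and `deserialize`, so the format is maintained in one place.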
[endsect]
[endsect]