nicer entry page, lots of improvements to the overview

Hans Dembinski 2019-01-31 23:24:42 +01:00
parent 088fe2f690
commit 2c3450b429
11 changed files with 115 additions and 105 deletions


@ -2,7 +2,7 @@
**Fast multi-dimensional histogram with convenient interface for C++14 and Python**
Coded with :heart:, powered by the Boost community and the [Scikit-HEP Project](http://scikit-hep.org).
Coded with ❤. Powered by the Boost community and the [Scikit-HEP Project](http://scikit-hep.org).
Branch | Linux [1] and OSX [2] | Windows [3] | Coverage
------- | --------------------- |------------ | --------


@ -35,13 +35,6 @@ doxygen autodoc
<doxygen:param>WARN_IF_DOC_ERROR=YES
;
# exe speed
# : ../test/speed_cpp.cpp
# : <variant>release
# <c++-template-depth>256
# <cxxflags>"-std=c++11"
# ;
boostbook histogram
:
histogram.qbk
@ -50,7 +43,8 @@ boostbook histogram
<format>html:<xsl:param>boost.root=../..
<format>html:<xsl:param>boost.libraries=../../../../libs/libraries.htm
<xsl:param>generate.section.toc.level=3
<xsl:param>chunk.first.sections=1
<xsl:param>boost.mathjax=1
<xsl:param>chunk.first.sections=1
<xsl:param>generate.section.toc.level=1
<xsl:param>generate.toc="chapter nop section nop"
;


@ -1,7 +0,0 @@
[section Bibliography]
* [@https://root.cern.ch ROOT framework]
* [@https://en.wikipedia.org/wiki/Poisson_distribution Poisson distribution]
* [@https://en.wikipedia.org/wiki/Propagation_of_uncertainty Uncertainty propagation]
[endsect]


@ -1,35 +0,0 @@
[section:setup How to build and install]
[section:cmake With CMake]
If you build this library outside of the Boost distribution, you can use CMake.
[teletype]
``
git clone https://github.com/HDembinski/histogram.git
mkdir build && cd build
cmake ../histogram/build
make
``
Do `make test` to run the tests, or `ctest -V` for more output.
[endsect]
[section:b2 With Boost.Build]
If you want to build this library as part of the Boost distribution, do this.
[teletype]
``
git clone https://github.com/HDembinski/histogram.git
mv histogram $BOOST_ROOT/libs
cd $BOOST_ROOT
b2 libs/histogram/test # build and run the tests
``
Only the tests need building, the rest of the library is header only.
[endsect]
[endsect]


@ -1,9 +1,17 @@
[section Changelog]
[section:history Revision history]
[master]
[heading 4.0 (first boost release)]
* Removed Python bindings, will be developed in separate repository
github.com/hdembinski/histogram-python
* Made all axes optionally circular (except category axis)
* Removed circular axis (which is just a circular regular axis)
* Added indexed adaptor generator for convenient and fast iteration over histograms
* Support for axes that can grow in range
* Support for axes which accept multiple values (example: hexagonal binning)
* Support for profiles and more generally, arbitrary accumulators in each cell
* Added compatibility with Boost.Range, Boost.Units, and Boost.Accumulators
* Performance improvements
[heading 3.2 (not in boost)]


@ -1,4 +1,4 @@
[section Concepts]
[section:concepts Concepts]
Users can extend the library with new axis and storage types.


@ -2,16 +2,6 @@
This guide covers the basic and more advanced usage of the library. It is designed to make simple things simple, yet complex things possible. For a quick start, you don't need to read the complete user guide; have a look at the [link histogram.getting_started Getting started] section.
[section Introduction]
This library provides a templated [@https://en.wikipedia.org/wiki/Histogram histogram] class for multi-dimensional data. A histogram consists of a number of non-overlapping cells in data space, called *bins*. When a value is passed to the histogram, the corresponding bin that envelopes the value is found and an associated counter is incremented. In large data sets, keeping the bin counts in memory for analysis requires fewer resources than keeping the original value tuples. If the bins are small enough[footnote What small enough means has to be decided case by case.], they still represent the original information in the data distribution. A histogram is therefore a useful lossy compression. It is also often used as a simple estimator for the [@https://en.wikipedia.org/wiki/Probability_density_function probability density function] of the input data. More complex density estimators exist, but histograms are easy to reason about.
Input for the histogram can be one- or multi-dimensional. In the multi-dimensional case, the input consists of tuples of values which belong together, describing different aspects of the same entity. A point in space is an example. You need three coordinate values to describe a point. The entity here is the point, and to fully characterize a point distribution in space you need three values and therefore a three-dimensional (3d) histogram. The advantage of using a 3d histogram over three separate 1d histograms, one for each coordinate, is that the 3d histogram is able to capture more information. For example, you could have a point distribution that looks like a checker board in three dimensions (a checker cube): high and low densities are alternating along each coordinate. Then the 1d histograms for each separate coordinate would look like flat distributions, completely hiding the complex structure, while the 3d histogram would retain the structure for further analysis.
The term /histogram/ is usually strictly used for something with bins over discrete or continuous data. The histogram class can also process categorical variables and it even allows for non-consecutive bins if that is desired. There is no restriction to numbers as input. Any type can be fed into the histogram, if the user provides a specialized axis class that maps values of this type to a bin index. The only remaining restriction is that bins are non-overlapping, since there must be a unique mapping from input value to bin. The library is not able to automatically ensure this for user-provided axis classes, so the responsibility is on the implementer.
[endsect]
[section Create a histogram]
[section Static or dynamic histogram]


@ -1,7 +1,8 @@
[library Boost.Histogram
[quickbook 1.6]
[copyright 2016 - 2019 Hans Dembinski]
[purpose Histogram library]
[authors [Dembinski, Hans]]
[copyright 2016-2017 Hans Dembinski]
[id histogram]
[dirname histogram]
[license
@ -11,33 +12,73 @@
]
]
[section Abstract]
Boost.Histogram provides easy-to-use, very fast, and extensible multi-dimensional histograms and profiles.
This C++14 library provides an easy-to-use, multi-dimensional [@https://en.wikipedia.org/wiki/Histogram histogram] template class for your counting and statistics needs. It is very customizable through policy classes, but the default policies were carefully crafted so that most users won't need to customize anything. The histogram has a convenient interface, designed to work well for the one- and multi-dimensional cases. If the default policies are used, the histogram is guaranteed to be safe to use as a black box, memory efficient, and very fast. Safe means that bin counts *cannot overflow* or be capped at some large value, which is generally not guaranteed in other implementations.
[variablelist
[
[
[link histogram.overview Overview]
]
[
An overview of the features included in Boost.Histogram, the motivation and rationale.
]
]
[
[
[link histogram.getting_started Getting started]
]
[
See Boost.Histogram in action in a few heavily commented examples. Copy/paste from these examples to jump-start your project.
]
]
[
[
[link histogram.guide User guide]
]
[
A user guide that introduces all aspects of the library, starting simple and ending with the advanced features.
]
]
[
[
[link histogram.reference Reference]
]
[
Detailed class and function reference.
]
]
[
[
[link histogram.concepts Concepts]
]
[
Explains the axis and storage type concepts, the behavior and interface you need to implement when you make a new axis or storage type for Boost.Histogram.
]
]
[
[
[link histogram.history Revision history]
]
[
Log of Boost.Histogram changes.
]
]
]
The histogram class comes in two variants, which share a common interface. The *static* variant uses a maximum of compile-time information to provide maximum performance, at the cost of reduced runtime flexibility and potentially larger executables, if many different histograms are instantiated. The *dynamic* variant is a bit slower, but configurable at run-time, and does not increase the size of the executable if several different configurations are used. Optional serialization support is implemented with [@boost:/libs/serialization/index.html Boost.Serialization].
My goal is to submit this project to [@http://www.boost.org Boost], thus it uses the Boost directory structure and namespace. The code is released under the [@http://www.boost.org/LICENSE_1_0.txt Boost Software License].
[endsect]
[section Acknowledgments]
Klemens Morgenstern helped me to make this library boost-compliant, converting the documentation and adding Jamfiles, and provided code improvements.
Mateusz Loskot kindly agreed to be the Review Manager for this library and contributed various patches, for example, better Jamfiles.
Steven Watanabe provided an outstandingly detailed review of the documentation and code of the library.
[endsect]
[include motivation.qbk]
[include build.qbk]
[include overview.qbk]
[include getting_started.qbk]
[include guide.qbk]
[include benchmarks.qbk]
[include rationale.qbk]
[include concepts.qbk]
[xinclude autodoc.xml]
[include changelog.qbk]
[include bibliography.qbk]
[*Acknowledgments]
Klemens Morgenstern helped to make this library Boost-compliant, converting the documentation and adding Jamfiles, and provided code improvements.
Mateusz Loskot kindly agreed to fill the role of the Review Manager for this library and contributed various patches.
Steven Watanabe provided a very detailed review of the documentation and code of the library. Great reviews were submitted by Bjorn Reese, Jim Pivarski, Klemens Morgenstern, and Alex Hagen-Zanker. Comments and suggestions were provided by Andrea Bocci, degksi, Glen Fernandes, Gavin Lambert, Seth, and Mateusz Loskot.
The members of the [@http://www.scikit-hep.org Scikit-HEP project], in particular Henry Schreiner and Jim Pivarski, provided valuable feedback and input on the design of this library.


@ -1,14 +0,0 @@
[section:motivation Motivation]
Histograms are a basic tool in statistical analysis. When analysing large data sets, it is usually more convenient to work with a histogram of the input values. Histograms can compactly represent a data set of stochastic variables. If the histogram layout is chosen appropriately, any information present in the original can also be extracted from the histogram[footnote
Parameters of interest, like the center of a distribution, can be extracted from the histogram instead of the original data set; statistical models can be fitted to histograms to the same end.]; the information loss due to binning is then negligible. Processing a histogram is much faster than processing the original data, because the memory footprint of a histogram is much smaller. Next to data visualisation, this is the main reason to use histograms. In other words, a histogram is a lossy compression of statistical data.
C++ lacks a widely-used, free multi-dimensional histogram class. While it is easy to write a one-dimensional histogram, writing a general multi-dimensional histogram poses more of a challenge. If you add serialization and Python/Numpy bindings onto the wish-list, then the implementation becomes non-trivial and a well-tested library solution desirable.
The [@https://www.gnu.org/software/gsl GNU Scientific Library (GSL)] and the [@https://root.cern.ch ROOT framework] from CERN have histogram implementations. The GSL has histograms for one and two dimensions in C. The implementations are not customizable. You have to live with the trade-offs chosen by the implementors. ROOT has decade-old implementations of histograms which are not customizable and suffer from a few design flaws. It also has new better implementations in beta-stage similar to this one, but they cannot be used without the rest of ROOT, which is a huge highly non-modular library.
The histogram class template in this library has a minimalistic interface, which strives to be as elegant as the GSL implementations. In addition, it is very customizable and extensible through policy classes and in the way input values are binned. Thanks to variadic templates, the interface remains straight-forward for any number of dimensions. While being safe, customizable, and convenient, the histogram is also very performant. The static variant, which uses compile-time information wherever possible, is faster than any tested competitor.
A central design goal was to abstract away details of the internal counters. The internal counting mechanism is encapsulated in a storage policy, which can be replaced at compile-time. The default storage implements adaptive memory management, which is safe to use, memory-efficient, and fast. The safety comes from the guarantee that counts cannot overflow or be capped. This is a rare guarantee other libraries usually cannot give. In the standard configuration, the histogram *just works* under any circumstance. Yet, users with unusual requirements can implement their own custom storage policy or use an alternative builtin array-based storage.
[endsect]

doc/overview.qbk Normal file

@ -0,0 +1,33 @@
[section:overview Overview]
[section:introduction Introduction]
[@https://en.wikipedia.org/wiki/Histogram Histograms] are a basic tool in statistical analysis. A histogram consists of a number of non-overlapping cells in data space. When an input value is passed to the histogram, the corresponding cell that envelopes the value is found and an associated counter is incremented.
When analyzing a large low-dimensional data set, it is more convenient to work with a histogram of the input values than the original values. Keeping the cell counts in memory for analysis and/or processing the counts requires far fewer resources than keeping the original values in memory and processing them. Information present in the original can also be extracted from the histogram[footnote Parameters of interest, like the center of a distribution, can be extracted from the histogram instead of the original data set; likewise, statistical models can be fitted to histograms.]. Some information is lost in this way, but if the cells are small enough[footnote What small enough means has to be decided case by case.], the loss is often negligible. A histogram is a kind of lossy data-compression. It is also often used as a simple estimator for the [@https://en.wikipedia.org/wiki/Probability_density_function probability density function] of the input data. More complex density estimators exist, but histograms remain attractive because they are easy to reason about.
This library provides a histogram for multi-dimensional data. In the multi-dimensional case, the input consists of tuples of values which belong together and describe different aspects of the same entity. A point in space is a good example. You need three coordinate values to describe a point. The entity is the point, and to fully characterize a point distribution in space you need three values and therefore a three-dimensional histogram. A three-dimensional histogram collects more information than three separate one-dimensional histograms, one for each coordinate. For example, you could have a point distribution that looks like a checker board in three dimensions (a checker cube): high and low densities are alternating along each coordinate. Then, the one-dimensional histograms along each coordinate would look like flat distributions, completely hiding the complex structure, while the three-dimensional histogram would retain the structure for further analysis.
The term /histogram/ is usually strictly used for something with cells over discrete or continuous data. This histogram class can also process categorical variables and it even allows for non-consecutive cells if that is desired. There is no restriction to numbers as input. Any C++ type can be fed into the histogram, if the user provides a specialized axis class that maps values of this type to a cell index. The only remaining restriction is that cells are non-overlapping, since there must be a unique mapping from input value to cell. The library is not able to automatically ensure this for user-provided axis classes, so the responsibility is on the user.
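The checker cube argument can be checked with a few lines of plain C++. The following is a minimal sketch that does not use the library; the names `Hist3`, `Hist1`, `make_checker_cube`, and `project_x`, the grid size, and the counts 9 and 1 are assumptions made only for this illustration.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 4;  // cells per axis (arbitrary choice for this sketch)

using Hist3 = std::array<std::array<std::array<unsigned, N>, N>, N>;
using Hist1 = std::array<unsigned, N>;

// Build a 3d histogram with a checker-cube pattern: cells whose index sum
// is even hold 9 counts (high density), all other cells hold 1 count.
Hist3 make_checker_cube() {
  Hist3 h{};
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < N; ++j)
      for (std::size_t k = 0; k < N; ++k)
        h[i][j][k] = ((i + j + k) % 2 == 0) ? 9u : 1u;
  return h;
}

// Project onto the first axis by summing over the other two axes. This is
// what a separate 1d histogram of the first coordinate would contain.
Hist1 project_x(const Hist3& h) {
  Hist1 p{};
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < N; ++j)
      for (std::size_t k = 0; k < N; ++k)
        p[i] += h[i][j][k];
  return p;
}
```

Every slice along the first axis contains eight high and eight low cells, so each entry of the projection is 8 * 9 + 8 * 1 = 80. The projection is flat although neighboring cells of the 3d histogram differ by a factor of nine.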
Furthermore, the histogram can handle weighted input. Normally, the cell counter which is connected to an input tuple is incremented by one, but sometimes it is useful to increment by a weight, an integral or floating point number.
Finally, the histogram can be configured to store an accumulator in each cell. Arbitrary samples can be passed to this accumulator, which may compute the mean, variance, median, or other interesting statistics from the samples that are sorted into its cell. When the accumulator computes a mean, the result is called a /profile/.
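To make weighted filling and profiles concrete, here is a minimal one-dimensional sketch in plain C++. It is not the library's interface: `cell_index`, `WeightedHist1D`, `MeanAccumulator`, and `Profile1D` are hypothetical names, and values outside the axis range are simply ignored, which is a simplification chosen for this sketch.

```cpp
#include <cstddef>
#include <vector>

// Map x in [lo, hi) to one of n equal-width cells; return n if out of range.
std::size_t cell_index(double x, double lo, double hi, std::size_t n) {
  if (!(x >= lo && x < hi)) return n;
  std::size_t i = static_cast<std::size_t>((x - lo) / (hi - lo) * n);
  return i < n ? i : n - 1;  // guard against rounding at the upper edge
}

// Weighted histogram: each cell is incremented by a weight (default 1).
struct WeightedHist1D {
  double lo, hi;
  std::vector<double> cells;
  WeightedHist1D(std::size_t n, double lo_, double hi_)
      : lo(lo_), hi(hi_), cells(n, 0.0) {}
  void fill(double x, double weight = 1.0) {
    const std::size_t i = cell_index(x, lo, hi, cells.size());
    if (i < cells.size()) cells[i] += weight;
  }
};

// Accumulator that keeps a running mean of the samples it receives.
struct MeanAccumulator {
  std::size_t count = 0;
  double mean = 0.0;
  void operator()(double sample) {
    ++count;
    mean += (sample - mean) / static_cast<double>(count);
  }
};

// Profile: x selects the cell, the sample is passed to that cell's accumulator.
struct Profile1D {
  double lo, hi;
  std::vector<MeanAccumulator> cells;
  Profile1D(std::size_t n, double lo_, double hi_)
      : lo(lo_), hi(hi_), cells(n) {}
  void fill(double x, double sample) {
    const std::size_t i = cell_index(x, lo, hi, cells.size());
    if (i < cells.size()) cells[i](sample);
  }
};
```

In both cases the input value only decides which cell is addressed. What happens inside the cell (adding a weight or updating a mean) is independent of the axis, which is why the cell type can be swapped without touching the lookup.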
[endsect]
[section:motivation Motivation]
C++ lacks a widely-used, free multi-dimensional histogram class. While it is easy to write a one-dimensional histogram, writing a general multi-dimensional histogram poses more of a challenge. If you add a few more features required by scientific professionals onto the wish-list, then the implementation becomes non-trivial and a well-tested library solution desirable.
The [@https://www.gnu.org/software/gsl GNU Scientific Library (GSL)] and the [@https://root.cern.ch ROOT framework] from CERN have histogram implementations. The GSL has histograms for one and two dimensions in C. The implementations are not customizable. ROOT has well-tested implementations of histograms, but they are not customizable and they are not easy to use correctly. ROOT also has new implementations in beta-stage similar to this one, but they cannot be used without the rest of ROOT, which is a huge library to install just to get histograms.
The templated histogram class in this library has a minimalistic interface, which strives to be as elegant as the GSL implementations. In addition, it is very customizable and extensible through user-provided classes. A single implementation is used for one- and multi-dimensional histograms. While being safe, customizable, and convenient, the histogram is also very fast. The static version, which has an axis configuration that is hard-coded at compile-time, is faster than any tested competitor.
One of the central design goals was to provide an abstract interface to the internal bin counters. The internal counting mechanism is encapsulated in a storage class, which can be replaced at compile-time. The default storage uses adaptive memory management, which is safe to use, memory-efficient, and fast. The safety comes from the guarantee that counts cannot overflow or be capped. This is a rare guarantee, hardly found in other libraries. In the standard configuration, the histogram /just works/ under any circumstance. Yet, users with unusual requirements can implement their own custom storage class or use an alternative builtin array-based storage.
[endsect]
[include rationale.qbk]
[endsect]


@ -36,13 +36,13 @@ To understand the need for multi-dimensional histograms, think of point coordina
This library supports different axis types, so that the user can customize how the mapping is done exactly, see [link histogram.rationale.structure.axis_types axis types]. Users can furthermore choose between several ways of storing axis types in the histogram.
When the number and types of the axes are known at compile-time, the histogram host class stores axis types in a `std::tuple`. We call this a *static histogram*. To access a particular axis, one should use a compile-time number as index (a run-time index also works with some limitations). A static histogram is extremely fast (see [link histogram.benchmarks benchmark]), because there is no overhead and the compiler can inline code, unroll loops, and more. Also nice: many user errors can be caught at compile-time rather than run-time.
When the number and types of the axes are known at compile-time, the histogram host class stores axis types in a `std::tuple`. We call this a /static histogram/. To access a particular axis, one should use a compile-time number as index (a run-time index also works with some limitations). A static histogram is extremely fast (see [link histogram.benchmarks benchmark]), because there is no overhead and the compiler can inline code, unroll loops, and more. Also nice: many user errors can be caught at compile-time rather than run-time.
Static histograms are the best kind, but cannot be used when histograms are to be created with an axis configuration that is only known at run-time. This is the case, for example, when histograms are created at run-time from Python.
There are two levels of dynamism. Firstly, the histogram can hold instances of a single axis type in a `std::vector`. Now the number of axis instances per histogram can vary at run-time, but the axis type must be the same for all instances. We call this a *semi-dynamic histogram*.
There are two levels of dynamism. Firstly, the histogram can hold instances of a single axis type in a `std::vector`. Now the number of axis instances per histogram can vary at run-time, but the axis type must be the same for all instances. We call this a /semi-dynamic histogram/.
If the axis type should also vary, one can use the `boost::histogram::axis::variant` type, which can hold one of a set of different concrete axis types and can be placed in a `std::vector`. When the histogram is configured to store axis types like this, we obtain a *dynamic histogram*. The dynamic histogram is a single type that can store arbitrary sequences of different axis types, which may be generated at run-time. The polymorphic behavior of the generic `boost::histogram::axis::variant` type has a run-time cost, however. Typically, the performance is reduced by a factor of two compared to a static histogram.
If the axis type should also vary, one can use the `boost::histogram::axis::variant` type, which can hold one of a set of different concrete axis types and can be placed in a `std::vector`. When the histogram is configured to store axis types like this, we obtain a /dynamic histogram/. The dynamic histogram is a single type that can store arbitrary sequences of different axis types, which may be generated at run-time. The polymorphic behavior of the generic `boost::histogram::axis::variant` type has a run-time cost, however. Typically, the performance is reduced by a factor of two compared to a static histogram.
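The difference between the static and the dynamic axis storage can be sketched without the library. The example below is an illustration under assumptions: it uses `std::variant` from C++17 as a stand-in for `boost::histogram::axis::variant`, and `RegularAxis`, `CategoryAxis`, `AnyAxis`, `cell_count_static`, and `cell_count_dynamic` are hypothetical names, not library classes or functions.

```cpp
#include <cstddef>
#include <string>
#include <tuple>
#include <variant>
#include <vector>

// Two toy axis types that accept different value types.
struct RegularAxis {
  std::size_t n;
  double lo, hi;
  std::size_t size() const { return n; }
};

struct CategoryAxis {
  std::vector<std::string> labels;
  std::size_t size() const { return labels.size(); }
};

// Static storage: number and types of the axes are part of the tuple type,
// so the compiler can resolve and inline every size() call.
template <class... Axes>
std::size_t cell_count_static(const std::tuple<Axes...>& axes) {
  return std::apply(
      [](const auto&... a) { return (std::size_t{1} * ... * a.size()); },
      axes);
}

// Dynamic storage: a vector of variants can hold any sequence of axis types
// chosen at run-time; each access dispatches on the currently held type.
using AnyAxis = std::variant<RegularAxis, CategoryAxis>;

std::size_t cell_count_dynamic(const std::vector<AnyAxis>& axes) {
  std::size_t n = 1;
  for (const auto& a : axes)
    n *= std::visit([](const auto& x) { return x.size(); }, a);
  return n;
}
```

Both functions compute the same product of axis sizes. In the dynamic case the variants are stored contiguously inside the vector, and the per-axis dispatch in `std::visit` is the kind of run-time cost that the static variant avoids.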
[note
The design decision to store axis types in the variant-like type `boost::histogram::axis::variant` has several advantages over forms of run-time polymorphism. Firstly, it guarantees that axis objects which belong to the same histogram are stored locally together in memory, which reduces cache misses when the histogram iterates over axis objects in a tight loop, which it often does. Secondly, each axis can accept a different value type in this scheme. Classic polymorphism with vtables requires that all overloads provided by derived classes share the same method call signature, but that was found to be too restrictive for this library. In this library, the first axis of a histogram may convert numbers to indices, the second strings. The method signatures of different axis types are allowed to differ. Classic run-time polymorphism does not work, but variants do.