histogram/doc/guide.qbk
Hans Dembinski e742787353 improvements
2017-11-07 16:57:18 +01:00

[section User guide]
This guide describes how to create and work with histograms. The library is designed to make simple things simple, yet complex things possible. For a quick start, you don't need to read the complete user guide; have a look at the tutorial and the examples instead. This guide covers both basic and more advanced usage of the library.
[section C++ usage]
[section Create a histogram]
This library provides a class with a simple interface, which implements a general multi-dimensional histogram for multi-dimensional input values. The histogram class comes in two variants with a common interface; see the [link histogram.rationale.histogram_types rationale] for more information. Using [classref boost::histogram::histogram<Static,...>] is recommended whenever possible. You need [classref boost::histogram::histogram<Dynamic,...>] if:
* you need to create histogram configurations based on input you only have at runtime
* you want to interoperate with Python
Use the factory function [funcref boost::histogram::make_static_histogram] (or [funcref boost::histogram::make_dynamic_histogram], respectively) to make histograms with default options. The default options make sure that the histogram is safe to use, very fast, and memory efficient. If you are curious about changing these options, have a look at the expert section below.
[c++]``
#include <boost/histogram.hpp>
namespace bh = boost::histogram;
int main() {
  // create a 1d-histogram in default configuration which
  // covers the real line from -1 to 1 in 100 bins
  auto h = bh::make_static_histogram(bh::axis::regular<>(100, -1, 1));
  // do something with h
}
``
The function `make_static_histogram(...)` takes a variable number of axis objects as arguments. An axis object defines how input values are mapped to bins, which means that it defines the mapping function and the number of bins. If you provide one axis, the histogram is one-dimensional. If you provide two, it is two-dimensional, and so on.
The library comes with a number of built-in axis classes (you can write your own, too; see [link histogram.concepts.axis axis concept]). The [classref boost::histogram::axis::regular regular axis] should be your default choice, because it is easy to use and fast. If you have a contiguous range of integers, the [classref boost::histogram::axis::integer integer axis] is faster. If you have data which wraps around, like angles, use a [classref boost::histogram::axis::circular circular axis].
Check the class descriptions of [classref boost::histogram::axis::regular regular axis], [classref boost::histogram::axis::variable variable axis], [classref boost::histogram::axis::circular circular axis], [classref boost::histogram::axis::integer integer axis], and [classref boost::histogram::axis::category category axis] for advice. See the [link histogram.rationale.axis_types rationale about axis types] for more information.
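To make the idea of an axis as a mapping function more concrete, here is a minimal standalone sketch of how equal-width binning could map a value to a bin index. It does not use the library: `regular_index` is a hypothetical helper, and the formula is an assumption about equal-width binning in general, not a description of how the regular axis is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of the library: maps x to a bin index for
// n equal-width bins covering the semi-open interval [lo, hi).
// Returns -1 for values below lo and n for values at or above hi.
int regular_index(double x, int n, double lo, double hi) {
  assert(n > 0 && lo < hi);
  const double z = (x - lo) / (hi - lo);  // relative position, in [0, 1) if inside
  if (z < 0.0) return -1;
  if (z >= 1.0) return n;
  // clamp to n - 1 in case z * n rounds up to n in floating point
  return std::min(n - 1, static_cast<int>(std::floor(z * n)));
}
```

With the 100 bins from -1 to 1 used in the first example, `regular_index(0.0, 100, -1.0, 1.0)` returns 50 in this sketch, and the value 1.0 already falls outside the range because the interval is semi-open.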
In addition to the required parameters for an axis, you can provide an optional label as a string to any axis, which helps to remember what the axis is categorising. For example, if you have census data and want to investigate how yearly income correlates with age, you could do:
[c++]``
#include <boost/histogram.hpp>
namespace bh = boost::histogram;
int main() {
  // create a 2d-histogram in default configuration with an "age" axis
  // and an "income" axis
  auto h = bh::make_static_histogram(bh::axis::regular<>(20, 0, 100, "age in years"),
                                     bh::axis::regular<>(20, 0, 100, "yearly income in $1000"));
  // do something with h
}
``
Without the labels it would be difficult to remember which axis covers which quantity. Beware: for safety reasons, labels cannot be changed once the axis is created. Axis objects which differ in their label do not compare equal with `operator==`.
By default, under- and overflow bins are added automatically for each axis range. Therefore, if you create an axis with 20 bins, the histogram will actually have 22 bins in that dimension. The two extra bins are very useful and in most cases you want to have them. However, if you know for sure that the input is strictly covered by the axis, you can disable them and save memory:
[c++]``
#include <boost/histogram.hpp>
namespace bh = boost::histogram;
int main() {
  // create a 1d-histogram for dice throws with eye values from 1 to 6
  auto h = bh::make_static_histogram(bh::axis::integer<>(1, 7, "eyes", bh::axis::uoflow::off));
  // do something with h
}
``
Using a [classref boost::histogram::axis::integer integer axis] in this example is convenient, because the input values are integers and we want one bin for each eye value. The intervals in all axes are always semi-open: the upper edge is never included. That's why the upper end is 7 and not 6 here. This is similar to iterator ranges from `begin` to `end`, where `end` is also not included.
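The interplay of semi-open intervals with under- and overflow bins can be illustrated with a small standalone sketch. It does not use the library: `int_hist` is a hypothetical type, and placing the underflow bin first and the overflow bin last is an assumption made for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 1d histogram over the integers [lo, hi).
// Layout assumption (not the library's): counts[0] is the underflow bin,
// counts[1..n] are the regular bins, counts[n + 1] is the overflow bin.
struct int_hist {
  int lo, hi;
  std::vector<unsigned> counts;

  int_hist(int lo_, int hi_) : lo(lo_), hi(hi_), counts(checked_size(lo_, hi_), 0u) {}

  void fill(int x) {
    std::size_t slot;
    if (x < lo)
      slot = 0;  // below the range: underflow
    else if (x >= hi)
      slot = counts.size() - 1;  // hi itself is excluded: overflow
    else
      slot = static_cast<std::size_t>(x - lo) + 1;
    ++counts[slot];
  }

private:
  static std::size_t checked_size(int lo, int hi) {
    assert(lo < hi);
    return static_cast<std::size_t>(hi - lo) + 2;  // two extra bins
  }
};
```

In this sketch, `int_hist(1, 7)` holds 6 regular bins plus 2 extra bins, so 8 counters in total. Filling it with the value 7 increments the overflow bin, because the upper edge is not part of the range.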
[note The specialised [classref boost::histogram::axis::circular circular axis] never creates under- and overflow bins, because the axis is circular. The highest bin wraps around to the lowest bin and vice versa, so there is no need for extra bins.]
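Why a circular axis needs no extra bins can be seen in the following standalone sketch of wrap-around binning. It does not use the library: `circular_index` is a hypothetical helper that assumes the axis starts at zero and repeats with a given period.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of the library: bin index on a circular
// axis with n equal-width bins that starts at 0 and repeats with `period`.
int circular_index(double x, int n, double period) {
  assert(n > 0 && period > 0.0);
  double z = std::fmod(x, period) / period;  // in (-1, 1)
  if (z < 0.0) z += 1.0;                     // shift negative values into [0, 1]
  const int i = static_cast<int>(z * n);
  return i == n ? 0 : i;  // rounding may reach n, which wraps to bin 0
}
```

Every finite input lands in one of the `n` regular bins here. With 4 bins and a period of 360 degrees, the angles 90 and 450 both map to bin 1, and -90 maps to bin 3.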
When you work with [classref boost::histogram::histogram<Dynamic,...>], you can also create a histogram from a collection of axis objects that is assembled at run-time:
[c++]``
// also see examples/create_dynamic_histogram.cpp
#include <boost/histogram.hpp>
#include <vector>
namespace bh = boost::histogram;
int main() {
  using hist_type = bh::histogram<bh::Dynamic, bh::builtin_axes>;
  auto v = std::vector<hist_type::axis_type>();
  v.push_back(bh::axis::regular<>(100, -1, 1));
  v.push_back(bh::axis::integer<>(1, 6));
  auto h = hist_type(v.begin(), v.end());
  // do something with h
}
``
[note In all these examples, memory for bin counters is allocated lazily, because the default policy [classref boost::histogram::adaptive_storage] is used. Allocation is deferred to the first call to `fill(...)`, which is described in the next section. Therefore, memory allocation exceptions are not thrown when the histogram is created, but possibly later on the first fill.]
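The consequence of lazy allocation can be illustrated with a standalone sketch. It does not use the library and does not reproduce `adaptive_storage`: `lazy_counters` is a hypothetical class that only demonstrates deferring the allocation to the first increment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical storage, not part of the library: remembers the number of
// bins at construction, but allocates the counters on the first increment.
class lazy_counters {
public:
  explicit lazy_counters(std::size_t n) : size_(n) {}  // no allocation here

  void increase(std::size_t i) {
    assert(i < size_);
    if (data_.empty()) data_.resize(size_, 0u);  // may throw std::bad_alloc
    ++data_[i];
  }

  unsigned value(std::size_t i) const {
    assert(i < size_);
    return data_.empty() ? 0u : data_[i];  // unallocated bins read as zero
  }

  bool allocated() const { return !data_.empty(); }

private:
  std::size_t size_;
  std::vector<unsigned> data_;
};
```

In this sketch, the constructor cannot fail for lack of memory, while the first call to `increase` can. If you need to handle allocation failures, the first fill is therefore the place to expect them.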
[endsect]
[section Fill a histogram with data]
The histogram (either type) supports three kinds of fills.
* `fill(...)` initiates a normal fill, which increments an internal counter by one.
* `fill(..., count(n))` initiates a fill, which increments an internal counter by the integer number `n`.
* `fill(..., weight(x))` initiates a weighted fill, which increments an internal counter by a weight `x` (a real number) when a value is in the bin range.
Why weighted fills are sometimes useful is explained [link histogram.rationale.weights in the rationale]. This is mostly required in a scientific context. If you don't see the point, you can just ignore this type of call. In particular, do not use the form `fill(..., weight(x))` if you just want to avoid calling `fill(...)` repeatedly with the same arguments. Use `fill(..., count(n))` for that, because it is much more efficient. Apart from that, you are free to mix these calls in any order: you can start calling `fill(...)` and later switch to `fill(..., weight(x))` on the same histogram, or vice versa.
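How the three kinds of fills relate to each other can be sketched with a single standalone bin. This does not use the library: `weighted_bin` and its member names are hypothetical, and tracking a sum of weights together with a sum of squared weights is a common textbook approach that is assumed here, not a statement about the library's storage.

```cpp
#include <cassert>

// Hypothetical bin, not part of the library: tracks the sum of weights
// (the bin content) and the sum of squared weights (a variance estimate).
struct weighted_bin {
  double sum_w = 0.0;
  double sum_w2 = 0.0;

  void fill() { add(1.0, 1u); }                 // analogous to fill(...)
  void fill_count(unsigned n) { add(1.0, n); }  // analogous to fill(..., count(n))
  void fill_weight(double w) { add(w, 1u); }    // analogous to fill(..., weight(x))

private:
  // n entries with weight w each
  void add(double w, unsigned n) {
    sum_w += n * w;
    sum_w2 += n * w * w;
  }
};
```

In this sketch, `fill_count(3)` leaves the bin in the same state as three calls to `fill()`. A single `fill_weight(3.0)` yields the same content of 3, but a variance estimate of 9 instead of 3, so the two forms are not interchangeable under these assumptions.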
Here is an example which fills a 2d-histogram with 1000 pairs of normally distributed numbers taken from a generator:
[c++]``
// also see examples/example_2d.cpp
#include <boost/histogram.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
namespace br = boost::random;
namespace bh = boost::histogram;
int main() {
  br::mt19937 gen;
  br::normal_distribution<> norm;
  auto h = bh::make_static_histogram(
    bh::axis::regular<>(100, -5, 5, "x"),
    bh::axis::regular<>(100, -5, 5, "y")
  );
  for (int i = 0; i < 1000; ++i)
    h.fill(norm(gen), norm(gen));
  // h is now filled
}
``
Here is a second example which uses a weighted fill in a functional programming style. The input values are taken from a container:
[c++]``
// also see examples/create_dynamic_histogram.cpp
#include <boost/histogram.hpp>
#include <algorithm>
#include <vector>
namespace bh = boost::histogram;
int main() {
  auto h = bh::make_static_histogram(bh::axis::integer<>(0, 9));
  std::vector<int> v{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
  std::for_each(v.begin(), v.end(), [&h](int x) { h.fill(x, bh::weight(2.0)); });
  // h is now filled
}
``
[endsect]
[section Work with a filled histogram]
TODO: explain how to access values and variances, operators
The histogram provides a non-parametric variance estimate for the bin count, whether the fills were weighted or not.
Histograms can be added if they have the same signature. This is convenient if histograms are filled in parallel on a cluster and then merged (added).
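The merge step can be pictured as a bin-wise sum that is only allowed when both histograms use the same axis configuration. The following standalone sketch does not use the library: `simple_hist` and `same_axis` are hypothetical names, and throwing `std::invalid_argument` on a mismatch is an assumption made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical 1d histogram, not part of the library: n equal-width bins
// over [lo, hi), without extra bins.
struct simple_hist {
  int n;
  double lo, hi;
  std::vector<unsigned> counts;

  simple_hist(int n_, double lo_, double hi_)
      : n(n_), lo(lo_), hi(hi_), counts(static_cast<std::size_t>(n_), 0u) {
    assert(n_ > 0 && lo_ < hi_);
  }

  bool same_axis(const simple_hist& o) const {
    return n == o.n && lo == o.lo && hi == o.hi;
  }

  // bin-wise addition; refuses to merge histograms with different axes
  simple_hist& operator+=(const simple_hist& o) {
    if (!same_axis(o)) throw std::invalid_argument("axes differ");
    for (std::size_t i = 0; i < counts.size(); ++i) counts[i] += o.counts[i];
    return *this;
  }
};
```

Each worker could fill its own `simple_hist` and the results could then be combined with `+=` in any order, since bin-wise addition of counts is commutative and associative.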
The histogram can be serialized to disk for persistent storage from C++ and pickled in Python. It comes with NumPy support, too. The histogram can be fast-filled with NumPy arrays for maximum performance, and viewed as a NumPy array without copying its memory buffer.
[endsect]
[section Expert usage]
TODO
[endsect]
[endsect]
[section Python usage]
The C++ histogram has Python bindings, so you can create histograms in Python. It is a useful workflow to create and configure histograms in Python and then pass them to some C++ code which fills them at maximum speed. You rarely need to change the way the histogram is filled, but you likely want to iterate on the range and binning of the axis after seeing the data.
Here is a conceptual example:
[python]``
# also see examples/create_python_fill_cpp.py and examples/module_cpp_filler.cpp
import histogram as bh
import cpp_filler
h = bh.histogram(bh.axis.regular(100, -1, 1),
bh.axis.integer(0, 10))
cpp_filler.process(h) # histogram is filled with input values
# continue with statistical analysis of h
``
In Python, you can straightforwardly create a histogram object with a variable number of axis arguments. The histogram instance passes the language barrier without copying its internal (possibly large) data buffer, so this workflow is efficient.
You can also fill the histogram in Python. Looping over a collection in Python is very slow and should be avoided. If you pass the input values as NumPy arrays, filling is efficient and almost as fast as using C++; see the [link histogram.benchmarks benchmark]. Here is an example to illustrate:
[python]``
import histogram as bh
import numpy as np
h = bh.histogram(bh.axis.integer(0, 9))
# don't do this, it is very slow
for i in range(10):
    h.fill(i)
# do this instead, it is very fast
v = np.arange(10, dtype=float)
h.fill(v) # fills the histogram with each value in the array
``
`fill(...)` accepts any sequence that can be converted into a NumPy array with `dtype=float`. To get the best performance, avoid the conversion and work with such NumPy arrays directly.
[endsect]
[endsect]