histogram/doc/guide.qbk
2017-03-16 09:15:34 +01:00

177 lines
8.8 KiB
Plaintext

[section User guide]
Histograms are a basic tool in statistical analysis. They compactly represent a data set of one or several random variables with acceptable loss of information. It is often more convenient to work with a histogram of input values, rather than with the input values directly, which may consume a lot of memory or disc space and may be slow to process. Quantities of interest, like the mean, variance, or mode may be extracted from the histogram instead of the original data set, often with negligible loss in precision. You may think of a histogram as a lossy compression of statistical data.
[section Create a histogram]
This library provides a class with a simple interface, which implements a general multi-dimensional histogram for multi-dimensional input values. Actually, there are two histogram classes with nearly identical interfaces, see the [link histogram.rationale.histogram_types rationale] for more information. If you are unsure, pick the [classref boost::histogram::static_histogram static_histogram]. You need the [classref boost::histogram::dynamic_histogram dynamic_histogram] if...
* you want to interoperate with Python, or
* you need a very flexible way to create various histogram configurations at runtime.
To create histograms in default configuration, use the factory function [funcref boost::histogram::make_static_histogram] (or [funcref boost::histogram::make_dynamic_histogram], respectively). The default configuration makes sure that the histogram just works. It is fast, memory-efficient, and safe to use.
[c++]``
#include <boost/histogram/histogram.hpp>
namespace bh = boost::histogram;
int main() {
// create a 1d-histogram in default configuration which
// covers the real line from -1 to 1 in 100 bins
auto h = bh::make_static_histogram(bh::regular_axis<>(100, -1, 1));
// do something with h
}
``
The function `make_static_histogram(...)` takes a variable number of axis objects as arguments. Axis objects define how input values are mapped to bins. If you provide one axis, the histogram is one-dimensional. If you provide two, it is two-dimensional, and so on.
Which axis class should I use? Check the class descriptions of [classref boost::histogram::regular_axis regular_axis], [classref boost::histogram::variable_axis variable_axis], [classref boost::histogram::polar_axis polar_axis], [classref boost::histogram::integer_axis integer_axis], and [classref boost::histogram::category_axis category_axis] for advice. Also see the [link histogram.rationale.axis_types rationale about axis types] for more information. The [classref boost::histogram::regular_axis regular_axis] should be your default choice, because it is easy to use and provides fast mapping. If you have a continous range of integers, the [classref boost::histogram::integer_axis integer_axis] is faster.
Optionally, you can attach a label to an axis, to help you remember what the axis is categorising. For example, you have census data and you want to investigate how yearly income correlates with age, you could do:
[c++]``
#include <boost/histogram/histogram.hpp>
namespace bh = boost::histogram;
int main() {
// create a 2d-histogram in default configuration with an "age" axis
// and an "income" axis
auto h = bh::make_static_histogram(bh::regular_axis<>(20, 0, 100, "age in years"),
bh::regular_axis<>(20, 0, 100, "yearly income in $1000"));
// do something with h
}
``
Without the labels it would be difficult to remember which axis was covering which quantity. To avoid errors, labels cannot be changed once the axis is passed in the constructor of the histogram.
By default, axis objects add under- and overflow bins to the covered range. Therefore, if you create an axis with 20 bins, it will actually have 22 bins. These extra bins are often very useful and in most cases you want to have them. However, if you know for sure that the input is strictly bounded, you can disable the extra bins:
[c++]``
#include <boost/histogram/histogram.hpp>
namespace bh = boost::histogram;
int main() {
// create a 1d-histogram for dice throws, eyes are always between 1 and 6
auto h = bh::make_static_histogram(bh::integer_axis(1, 6, "eyes", false));
// do something with h
``
Using a [classref boost::histogram::integer_axis integer_axis] here is convenient, because the input values are integers and we want one bin for each eye value.
[note The specialised [classref boost::histogram::polar_axis polar_axis] never creates under- and overflow bins, because the axis is circular. The highest bin wrapps around to the lowest bin and vice versa, so there is no need for extra bins.]
The histogram has Python-bindings, and you can also create histograms in Python. In fact, it is quite useful workflow to create histograms in a flexible way in Python and then pass them to some C++ code which fills them. You rarely want to change the way the histogram is filled, but you probably want to interactively iterate on how the axes are defined. Here is an example:
[python]``
import histogram as bh
import complex_cpp_module
h = bh.histogram(bh.regular_axis(100, -1, 1))
complex_cpp_module.run_filling(h)
# continue with statistical analysis of h
``
You do not need the factory functions in Python, you can straight-forwardly create a histogram object with a variable number of axis arguments. The histogram object passes the language barrier without copying its internal (possibly large) data buffer, so this workflow is efficient.
When you work with the [classref boost::histogram::dynamic_histogram dynamic_histogram], you can also create a histogram from a runtime collection of axis objects:
[c++]``
#include <boost/histogram/histogram.hpp>
namespace bh = boost::histogram;
int main() {
auto v = std::vector<bh::dynamic_histogram<>::axis_type>();
v.push_back(bh::regular_axis<>(100, -1, 1));
v.push_back(bh::integer_axis(1, 6));
auto h = bh::dynamic_histogram<>(v.begin(), v.end());
// do something with h
}
``
[note In all these examples, memory for bin counters is allocated lazily. Allocation is deferred to the first call to `fill(...)` or `wfill(...)`, which are described in the next section. This means that memory allocation exceptions are possibly thrown later.]
[endsect]
[section Fill a histogram with data]
The histogram (either type) supports two kinds of fills.
* `fill(...)` initiates a normal fill, which increments a bin counter by one when a value is in the bin range
* `wfill(...)` initiates a weighted fill, which increment a bin counter by a weight when a value is in the bin range.
Arguments for supporting weighted fills are given [link histogram.rationale.weights in the rationale]. Do not use `wfill(...)` if all your weights are equal to 1. Use `fill(...)` in this case, because it is more efficient. You are free to mix the two calls.
Here is an example which fills a 2d-histogram with 1000 pairs of normal distributed numbers in C++:
[c++]``
#include <boost/histogram/histogram.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
int main() {
using namespace boost;
random::mt19937 gen;
random::normal_distribution<> norm();
auto h = histogram::make_static_histogram(
histogram::regular_axis<>(100, -5, 5, "x"),
histogram::regular_axis<>(100, -5, 5, "y")
);
for (int i = 0; i < 1000; ++i)
h.fill(norm(gen), norm(gen));
// h is now filled
}
``
Here is a second example which using wfill and a functional programming style:
[c++]``
#include <boost/histogram/histogram.hpp>
#include <algorithm>
int main() {
namespace bh = boost::histogram;
auto h = bh::make_static_histogram(bh::integer_axis(0, 9));
std::vector<int> v{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
std::for_each(v.begin(), v.end(), [&h](int x) { h.wfill(2.0, x); });
// h is now filled
}
``
You can also fill the histogram from Python. If you use pass the input values as numpy arrays, this is efficient. Looping over a collection in Python, however, is very slow and should be avoided:
[python]``
import histogram as bh
import numpy as np
h = bh.histogram(bh.integer_axis(0, 9))
# don't do this, it is very slow
for i in range(10):
h.fill(i)
# do this instead, it is very fast
v = np.arange(10)
h.fill(v) # fills the histogram with each value in the array
``
[endsect]
[section Work with a filled histogram]
TODO: explain how to access values and variances, operators
The histogram provides a non-parametric variance estimate for the bin count in either case.
Histograms can be added if they have the same signature. This is convenient if histograms are filled in parallel on a cluster and then merged (added).
The histogram can be serialized to disk for persistent storage from C++ and pickled in Python. It comes with Numpy support, too. The histogram can be fast-filled with Numpy arrays for maximum performance, and viewed as a Numpy array without copying its memory buffer.
[endsect]
[endsect]