mirror of
https://github.com/boostorg/histogram.git
synced 2025-05-11 21:24:14 +00:00
183 lines
9.4 KiB
Plaintext
183 lines
9.4 KiB
Plaintext
[section User guide]
|
|
|
|
Histograms are a basic tool in statistical analysis. They compactly represent a data set of one or several random variables with acceptable loss of information. It is often more convenient to work with a histogram of input values, rather than with the input values directly, which may consume a lot of memory or disc space and may be slow to process. Quantities of interest, like the mean, variance, or mode may be extracted from the histogram instead of the original data set, often with negligible loss in precision. You may think of a histogram as a lossy compression of statistical data.
|
|
|
|
[section Create a histogram]
|
|
|
|
This library provides a class with a simple interface, which implements a general multi-dimensional histogram for multi-dimensional input values. The histogram class comes in two variants with a common interface, see the [link histogram.rationale.histogram_types rationale] for more information. If you are unsure, pick the [classref boost::histogram::histogram<Static,...>]. You need [classref boost::histogram::histogram<Dynamic,...>] if...
|
|
|
|
* you want to interoperate with Python, or
|
|
|
|
* you need a simple way to create various histogram configurations at runtime.
|
|
|
|
To create histograms in default configuration, use the factory function [funcref boost::histogram::make_static_histogram] (or [funcref boost::histogram::make_dynamic_histogram], respectively). The default configuration makes sure that the histogram just works. It is fast, memory-efficient, and safe to use.
|
|
|
|
[c++]``
|
|
#include <boost/histogram.hpp>
|
|
|
|
namespace bh = boost::histogram;
|
|
|
|
int main() {
|
|
// create a 1d-histogram in default configuration which
|
|
// covers the real line from -1 to 1 in 100 bins
|
|
auto h = bh::make_static_histogram(bh::regular_axis<>(100, -1, 1));
|
|
// do something with h
|
|
}
|
|
``
|
|
|
|
The function `make_static_histogram(...)` takes a variable number of axis objects as arguments. Axis objects define how input values are mapped to bins. If you provide one axis, the histogram is one-dimensional. If you provide two, it is two-dimensional, and so on.
|
|
|
|
Which axis class should you use? The [classref boost::histogram::regular_axis regular_axis] should be your default choice, because it is easy to use and provides fast mapping. If you have a continous range of integers, the [classref boost::histogram::integer_axis integer_axis] is faster.
|
|
|
|
Check the class descriptions of [classref boost::histogram::regular_axis regular_axis], [classref boost::histogram::variable_axis variable_axis], [classref boost::histogram::circular_axis circular_axis], [classref boost::histogram::integer_axis integer_axis], and [classref boost::histogram::category_axis category_axis] for advice. See the [link histogram.rationale.axis_types rationale about axis types] for more information.
|
|
|
|
You can attach a label to any axis, which helps to remember what the axis is categorising. Example: you have census data and you want to investigate how yearly income correlates with age, you could do:
|
|
|
|
[c++]``
|
|
#include <boost/histogram.hpp>
|
|
|
|
namespace bh = boost::histogram;
|
|
|
|
int main() {
|
|
// create a 2d-histogram in default configuration with an "age" axis
|
|
// and an "income" axis
|
|
auto h = bh::make_static_histogram(bh::regular_axis<>(20, 0, 100, "age in years"),
|
|
bh::regular_axis<>(20, 0, 100, "yearly income in $1000"));
|
|
// do something with h
|
|
}
|
|
``
|
|
|
|
Without the labels it would be difficult to remember which axis was covering which quantity, as they or otherwise identical. To avoid errors, labels cannot be changed once the axis is passed in the constructor of the histogram. Axes objects which differ in their label do not compare equal with `operator==`.
|
|
|
|
By default, axis objects add under- and overflow bins to the covered range. Therefore, if you create an axis with 20 bins, it will actually get 22 bins. The two extra bins are very useful and in most cases you want to have them. However, if you know for sure that the input is strictly covered by the axis, you can disable the extra bins:
|
|
|
|
[c++]``
|
|
#include <boost/histogram.hpp>
|
|
|
|
namespace bh = boost::histogram;
|
|
|
|
int main() {
|
|
// create a 1d-histogram for dice throws, eyes are always between 1 and 6
|
|
auto h = bh::make_static_histogram(bh::integer_axis(1, 6, "eyes", false));
|
|
// do something with h
|
|
``
|
|
|
|
Using a [classref boost::histogram::integer_axis integer_axis] in this example is convenient, because the input values are integers and we want one bin for each eye value.
|
|
[note The specialised [classref boost::histogram::circular_axis circular_axis] never creates under- and overflow bins, because the axis is circular. The highest bin wrapps around to the lowest bin and vice versa, so there is no need for extra bins.]
|
|
|
|
Since the histogram has Python-bindings, and you can also create histograms in Python. In fact, it is quite a useful workflow to create histograms in a flexible way in Python and then pass them to some C++ code which fills them. You rarely need to change the way the histogram is filled, but you likely want to iterate the range and binning of the axis after seeing the data. Hybrid programming in C++ and Python fits the bill. Here is a conceptual example:
|
|
|
|
[python]``
|
|
|
|
import histogram as bh
|
|
import complex_cpp_module
|
|
|
|
h = bh.histogram(bh.regular_axis(100, -1, 1),
|
|
bh.integer_axis(0, 10))
|
|
|
|
complex_cpp_module.process(h) # histogram is filled with input values
|
|
|
|
# continue with statistical analysis of h
|
|
``
|
|
|
|
In Python, you can straight-forwardly create a histogram object with a variable number of axis arguments. The histogram instance passes the language barrier without copying its internal (possibly large) data buffer, so this workflow is efficient.
|
|
|
|
When you work with [classref boost::histogram::histogram<Dynamic,...>], you can also create a histogram from a run-time compiled collection of axis objects:
|
|
|
|
[c++]``
|
|
#include <boost/histogram.hpp>
|
|
#include <vector>
|
|
|
|
namespace bh = boost::histogram;
|
|
|
|
int main() {
|
|
using H = bh::histogram<bh::Dynamic, bh::builtin_axes>;
|
|
auto v = std::vector<H::axis_type>();
|
|
v.push_back(bh::regular_axis<>(100, -1, 1));
|
|
v.push_back(bh::integer_axis(1, 6));
|
|
auto h = H(v.begin(), v.end());
|
|
// do something with h
|
|
}
|
|
``
|
|
|
|
[note In all these examples, memory for bin counters is allocated lazily, because the default policy [classref boost::histogram::adaptive_storage<>] is used. Allocation is deferred to the first call to `fill(...)`, which are described in the next section. Therefore memory allocation exceptions are not thrown when the histogram is created, but possibly later.]
|
|
|
|
[endsect]
|
|
|
|
[section Fill a histogram with data]
|
|
|
|
The histogram (either type) supports two kinds of fills.
|
|
|
|
* `fill(...)` initiates a normal fill, which increments a bin counter by one when a value is in the bin range
|
|
|
|
* `fill(..., weight(x))` initiates a weighted fill, which increment a bin counter by a weight `x` (a real number) when a value is in the bin range.
|
|
|
|
The need to support weighted fills is explained [link histogram.rationale.weights in the rationale]. Do not use the form `fill(..., weight(x))` if all your weights are equal to 1. Use `fill(...)` in this case, because it is way more efficient. You are free to mix the two calls, meaning, you can start calling `fill(...)` and later switch to `fill(..., weight(x))` on the same histogram or vice versa.
|
|
|
|
Here is an example which fills a 2d-histogram with 1000 pairs of normal distributed numbers taken from a generator:
|
|
[c++]``
|
|
#include <boost/histogram.hpp>
|
|
#include <boost/random/mersenne_twister.hpp>
|
|
#include <boost/random/normal_distribution.hpp>
|
|
|
|
int main() {
|
|
using namespace boost;
|
|
random::mt19937 gen;
|
|
random::normal_distribution<> norm;
|
|
auto h = histogram::make_static_histogram(
|
|
histogram::regular_axis<>(100, -5, 5, "x"),
|
|
histogram::regular_axis<>(100, -5, 5, "y")
|
|
);
|
|
for (int i = 0; i < 1000; ++i)
|
|
h.fill(norm(gen), norm(gen));
|
|
// h is now filled
|
|
}
|
|
``
|
|
|
|
Here is a second example which using a weighted fill in a functional programming style. The input values are taken from a container:
|
|
[c++]``
|
|
#include <boost/histogram.hpp>
|
|
#include <algorithm>
|
|
#include <vector>
|
|
|
|
int main() {
|
|
namespace bh = boost::histogram;
|
|
auto h = bh::make_static_histogram(bh::integer_axis(0, 9));
|
|
std::vector<int> v{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
|
|
std::for_each(v.begin(), v.end(), [&h](int x) { h.fill(x, bh::weight(2.0)); });
|
|
// h is now filled
|
|
}
|
|
``
|
|
|
|
You can also fill the histogram in Python. If you pass the input values as numpy arrays, this is quite efficient and almost as fast as using C++, see the [link histogram.benchmarks benchmark]. Looping over a collection in Python, however, is very slow and should be avoided. Here is an example to illustrate:
|
|
[python]``
|
|
import histogram as bh
|
|
import numpy as np
|
|
|
|
h = bh.histogram(bh.integer_axis(0, 9))
|
|
|
|
# don't do this, it is very slow
|
|
for i in range(10):
|
|
h.fill(i)
|
|
|
|
# do this instead, it is very fast
|
|
v = np.arange(10)
|
|
h.fill(v) # fills the histogram with each value in the array
|
|
``
|
|
|
|
[endsect]
|
|
|
|
[section Work with a filled histogram]
|
|
|
|
TODO: explain how to access values and variances, operators
|
|
|
|
The histogram provides a non-parametric variance estimate for the bin count in either case.
|
|
|
|
Histograms can be added if they have the same signature. This is convenient if histograms are filled in parallel on a cluster and then merged (added).
|
|
|
|
The histogram can be serialized to disk for persistent storage from C++ and pickled in Python. It comes with Numpy support, too. The histogram can be fast-filled with Numpy arrays for maximum performance, and viewed as a Numpy array without copying its memory buffer.
|
|
|
|
[endsect]
|
|
|
|
[endsect]
|