update overview

Hans Dembinski 2019-11-21 00:51:05 +01:00
parent be058afcaf
commit 701d690279


@@ -11,15 +11,13 @@
[@https://en.wikipedia.org/wiki/Histogram Histograms] are a basic tool in statistical analysis. A histogram consists of a number of non-overlapping cells in data space. When an input value is passed to the histogram, the corresponding cell that envelops the value is found and an associated counter is incremented.
When analyzing a large low-dimensional data set, it is more convenient to work with a histogram of the input values than the original values. Keeping the cell counts in memory for analysis and/or processing the counts requires far fewer resources than keeping the original values in memory and processing them. Information present in the original can also be extracted from the histogram[footnote Parameters of interest, like the center of a distribution, can be extracted from the histogram instead of the original data set; likewise, statistical models can be fitted to histograms.]. Some information is lost in this way, but if the cells are small enough[footnote What small enough means has to be decided case by case.], the loss is often negligible. A histogram is a kind of lossy data-compression. It is also often used as a simple estimator for the [@https://en.wikipedia.org/wiki/Probability_density_function probability density function] of the input data. More complex density estimators exist, but histograms remain attractive because they are easy to reason about.
When analyzing a large low-dimensional data set, it is more convenient to work with a histogram of the input values than the original values. Keeping the cell counts in memory for analysis and/or processing the counts requires far fewer resources than keeping the original values in memory and processing them. Information present in the original can also be extracted from the histogram[footnote Parameters of interest, like the center of a distribution, can be extracted from the histogram instead of the original data set; likewise, statistical models can be fitted to histograms.]. Some information is lost in this way, but if the cells are small enough[footnote What small enough means has to be decided case by case.], the loss is negligible. A histogram is a kind of lossy data-compression. It is also often used as a simple estimator for the [@https://en.wikipedia.org/wiki/Probability_density_function probability density function] of the input data. More complex density estimators exist, but histograms remain attractive because they are easy to reason about.
This library provides a histogram for multi-dimensional data. In the multi-dimensional case, the input consists of tuples of values which belong together and describe different aspects of the same entity. A point in space is a good example. You need three coordinate values to describe a point. The entity is the point, and to fully characterize a point distribution in space you need three values and therefore a three-dimensional histogram. A three-dimensional histogram collects more information than three separate one-dimensional histograms, one for each coordinate. For example, you could have a point distribution that looks like a checker board in three dimensions (a checker cube): high and low densities are alternating along each coordinate. Then, the one-dimensional histograms along each coordinate would look like flat distributions, completely hiding the complex structure, while the three-dimensional histogram would retain the structure for further analysis.
This library provides a histogram for multi-dimensional data. In the multi-dimensional case, the input consists of tuples of values which belong together and describe different aspects of the same entity. For example, when you make a digital image with a camera, photons hit a pixel detector. The photon is the entity and it has two coordinate values where it hit the detector. The camera only counts how often a photon hit each cell, so it is a real-life example of making a two-dimensional histogram. A two-dimensional histogram collects more information than two separate one-dimensional histograms, one for each coordinate. For example, if the two-dimensional image looks like a checker board, with high and low densities alternating along each coordinate, then the one-dimensional histograms along each coordinate would look flat. There would be no hint that there is a complex structure in two dimensions.
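The checker board argument can be checked with a few lines of code. The following is a minimal sketch that uses only the C++ standard library and not this library's interface; `Grid2D`, `project_x`, `project_y`, and `make_checkerboard` are hypothetical names introduced for illustration, and the data space is assumed to be the unit square.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of the library: a two-dimensional histogram
// over the unit square [0, 1) x [0, 1) with N x N equally sized cells.
template <std::size_t N>
struct Grid2D {
  std::array<std::array<unsigned, N>, N> counts{};  // all cells start at zero

  // Find the cell that contains (x, y) and increment its counter.
  void fill(double x, double y) {
    assert(x >= 0.0 && x < 1.0 && y >= 0.0 && y < 1.0);
    const auto i = static_cast<std::size_t>(x * N);
    const auto j = static_cast<std::size_t>(y * N);
    ++counts[i][j];
  }

  // One-dimensional histogram of x alone: sum the counts over all y cells.
  std::array<unsigned, N> project_x() const {
    std::array<unsigned, N> p{};
    for (std::size_t i = 0; i < N; ++i)
      for (std::size_t j = 0; j < N; ++j) p[i] += counts[i][j];
    return p;
  }

  // One-dimensional histogram of y alone: sum the counts over all x cells.
  std::array<unsigned, N> project_y() const {
    std::array<unsigned, N> p{};
    for (std::size_t i = 0; i < N; ++i)
      for (std::size_t j = 0; j < N; ++j) p[j] += counts[i][j];
    return p;
  }
};

// Put one entry at the center of every cell whose indices have an even sum,
// which produces a checker board pattern of ones and zeros.
template <std::size_t N>
Grid2D<N> make_checkerboard() {
  Grid2D<N> h;
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < N; ++j)
      if ((i + j) % 2 == 0) h.fill((i + 0.5) / N, (j + 0.5) / N);
  return h;
}
```

For `make_checkerboard<4>()`, the two-dimensional counts alternate between 1 and 0, while every entry of `project_x()` and `project_y()` is 2. The projections are flat and carry no trace of the pattern.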
The term /histogram/ is usually strictly used for something with cells over discrete or continuous data. This histogram class can also process categorical variables and it even allows for non-consecutive cells if that is desired. There is no restriction to numbers as input. Any C++ type can be fed into the histogram, if the user provides a specialized axis class that maps values of this type to a cell index. The only remaining restriction is that cells are non-overlapping, since there must be a unique mapping from input value to cell. The library is not able to automatically ensure this for user-provided axis classes, so the responsibility is on the user.
The term /histogram/ is usually strictly used for something with cells over discrete or continuous data. This histogram class can also process categorical variables and it even allows for non-consecutive cells if that is desired. There is no restriction to numbers as input either. Any C++ type can be fed into the histogram, if the user provides a specialized axis class that maps values of this type to a cell index. The only remaining restriction is that cells are non-overlapping, since there must be a unique mapping from input value to cell. The library is not able to automatically ensure this for user-provided axis classes, so the responsibility is on the user.
Furthermore, the histogram can handle weighted input. Normally, the cell counter which is connected to an input tuple is incremented by one, but sometimes it is useful to increment by a weight, an integral or floating point number.
Finally, the histogram can be configured to store an accumulator in each cell. Arbitrary samples can be passed to this accumulator, which may compute the mean, variance, median, or other interesting statistics from the samples that are sorted into its cell. When the accumulator computes a mean, the result is called a /profile/.
Furthermore, the histogram can handle weighted input. Normally, the cell counter which is connected to an input tuple is incremented by one, but sometimes it is useful to increment by a weight, an integral or floating point number. Finally, the histogram can be configured to store any kind of accumulator in each cell. Arbitrary samples can be passed to this accumulator, which may compute the mean or other interesting quantities from the samples that are sorted into the cell. When the accumulator computes a mean, the result is called a /profile/. The feature set is informed by popular libraries for scientific computing, notably [@https://root.cern.ch CERN's ROOT framework] and the [@https://www.gnu.org/software/gsl GNU Scientific Library].
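To make the difference between a weighted fill and a profile concrete, here is a minimal sketch written against the C++ standard library only. It does not show this library's interface: `RegularAxis`, `MeanAccumulator`, `WeightedHistogram`, and `Profile` are hypothetical names, the axis is assumed to have equally wide cells, and out-of-range values are simply ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical axis: n equally wide cells that cover the interval [lo, hi).
struct RegularAxis {
  std::size_t n;
  double lo, hi;

  // Return the index of the cell that contains x, or n if x is out of range.
  std::size_t index(double x) const {
    if (!(x >= lo && x < hi)) return n;
    const auto i = static_cast<std::size_t>((x - lo) / (hi - lo) * n);
    return std::min(i, n - 1);  // guard against rounding at the upper edge
  }
};

// Hypothetical accumulator: running mean of the samples passed to it.
struct MeanAccumulator {
  std::size_t count = 0;
  double value = 0.0;

  void operator()(double sample) {
    ++count;
    value += (sample - value) / static_cast<double>(count);
  }
  double mean() const { return value; }
};

// Weighted histogram: each cell stores a sum of weights instead of a count.
struct WeightedHistogram {
  RegularAxis axis;
  std::vector<double> sums;

  explicit WeightedHistogram(RegularAxis a) : axis(a), sums(a.n, 0.0) {}

  // An unweighted fill is the special case of weight 1.
  void fill(double x, double weight = 1.0) {
    const std::size_t i = axis.index(x);
    if (i < axis.n) sums[i] += weight;
  }
};

// Profile: x selects the cell, the sample is passed to that cell's accumulator.
struct Profile {
  RegularAxis axis;
  std::vector<MeanAccumulator> cells;

  explicit Profile(RegularAxis a) : axis(a), cells(a.n) {}

  void fill(double x, double sample) {
    const std::size_t i = axis.index(x);
    if (i < axis.n) cells[i](sample);
  }
};
```

In the weighted histogram, the second argument changes how much the cell grows. In the profile, the second argument is data in its own right: the cell reports the mean of all samples whose `x` fell into it.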
[endsect]