improved handling of vairance-less storage policies

This commit is contained in:
Hans Dembinski 2017-11-10 20:01:55 +01:00
parent 0637deed4e
commit fa3ba7bca8
10 changed files with 130 additions and 42 deletions

View File

@ -111,7 +111,7 @@ We use an [classref boost::histogram::axis::integer integer axis] here, because
[endsect]
[section Fill a histogram with data]
[section Fill histogram]
After you created the histogram, you want to insert potentionally multi-dimensional data. This is done with the flexible `fill(...)` method, which you call in a handcrafted loop. The histogram supports three kinds of fills, as shown in the example:
@ -466,7 +466,7 @@ print "overflow [{0}, {1}): {2} +/- {3}".format(lo, up, h.value(5), h.variance(
[section:numpy Numpy support]
Looping over a large collection in Python is very slow and should be avoided. If the Python module was build with Numpy support, you can also pass the input values as Numpy arrays. This is strongly recommended for efficiency and convenience. The performance is almost as good as using C++ directly, and except in case of one-dimensional histograms better than Numpy's histogram functions (which are not as flexible and not object oriented, either), see the [link histogram.benchmarks benchmark]. Here is an example to illustrate:
Looping over a large collection in Python to fill a histogram is very slow and should be avoided. If the Python module was build with Numpy support, you can instead pass the input values as Numpy arrays. This is very fast and convenient. The performance is almost as good as using C++ directly, and usually better than Numpy's histogram functions, see the [link histogram.benchmarks benchmark]. Here is an example to illustrate:
[python]``
import histogram as hg
@ -484,9 +484,9 @@ y = 0.5 * np.random.randn(1000)
# fill histogram with numpy arrays, this is very fast
h.fill(x, y) # call looks the same as if x, y were values
# get representations of the bin edges as Numpy arrays, this representation
# differs from `list(h.axis(0))`, because it is optimised for compatibility
# with Numpy conventions, i.e. return values of numpy.histogram
# get representations of the bin edges as Numpy arrays; this representation
# differs from `list(h.axis(0))` since it was adapted to be straight-forward
# for those who have worked with Numpy's histogramming functions before
x = np.array(h.axis(0))
y = np.array(h.axis(1))
@ -523,9 +523,9 @@ except ImportError:
# [ 0 0 0 0 0 1 0 0 0 0 0 0]]
``
[note The `fill(...)` method accepts any sequence that can be converted into a Numpy array with `dtype=float`. To get the best performance, avoid the conversion and work with such numpy arrays of dtype `float` directly.]
[note The `fill(...)` method accepts not only Numpy arrays as arguments, but also any sequence that can be converted into a Numpy array with `dtype=float`. The best performance is achieved, however, if such conversions are avoided. Therefore, it is best to create numpy arrays with dtype `float` directly.]
The conversion of axes which represent a contiguous sequence of bins to Numpy arrays has been optimised for interoperability with other Numpy functions and matplotlib. For such an axis with `N` bins, a sequence of bin edges is generated of length `N+1`. The Numpy array is a sequence of all bin edges, including the upper edge of the last bin. The following code illustrates how the sequence is constructed.
The conversion of an axis to a Numpy array has been optimised for interoperability with other Numpy functions and matplotlib. For such an axis with `N` bins, a sequence of bin edges is generated of length `N+1`. The Numpy array is a sequence of all bin edges, including the upper edge of the last bin. The following code illustrates how the sequence is constructed.
[python]``
import histogram as hg
@ -547,13 +547,13 @@ print xedge2
# [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
``
Using a Numpy array to view the count matrix reveals an implementation detail on how under- and overflow bins are stored internally. For a 1-d histogram with 5 bins, the counter structure looks like this:
Using a Numpy array to view the count matrix reveals an implementation detail of how under- and overflow bins are stored internally. For a 1-d histogram with 5 bins, the counter structure looks like this:
bin0 bin1 bin2 bin3 bin4 overflow underflow
There are seven counters in total, five plus two. The overflow bin naturally follows in order after the normal bins. The position of the underflow bin may be surprising, but derives from the fact that Python counts negative indices from the end of a container, so that the index `-1` accesses the last counter in the sequence.
There are seven counters in total, five plus two. The overflow bin naturally follows in order after the normal bins. The position of the underflow bin may be surprising, but derives from the fact that Python counts negative indices from the end of a container. Since the index `-1` accesses the last counter in the sequence, the underflow bin is put there.
It is very easy to crop the extra counters by slicing each dimension with under-/overflow bins with the shorthand `:-2`, which just removes the last two bins.
It is very easy to crop the extra counters by slicing each dimension with under-/overflow bins with the shorthand `:-2` along each dimension, which just removes the last two bins.
[endsect]
@ -576,15 +576,27 @@ h4 = h3 * 2
print h4.value(0), h4.value(1)
# prints:
# 2.0 2.0
# prints: 2.0 2.0
# now save the histogram
with open("h4_saved.pkl", "w") as f:
cPickle.dump(h4, f)
with open("h4_saved.pkl", "r") as f:
h5 = cPickle.load(f)
print h4 == h5
# prints: true
``
[endsect]
[note Python has no concept of rvalue references and therefore cannot avoid creating temporary objects in arithmetic operations like C++ can. A statement `h3 = h1 + h2` is equivalent to `tmp = hg.histogram(h1); tmp += h2; h3 = hg.histogram(tmp)`, which amounts to two allocations, one in the first and one in the last copy-assignment, while only one allocation is really necessary.
To avoid creating superfluous temporaries, write your code entirely with operators `+=` and `*=`, this is always possible. In the example above, the optimised code would be: `h3 = hg.histogram(h1); h3 += h2`. It is actually quite puzzling why Python does not do such simple optimisations by itself.]
[warning Pickling in Python uses a binary representation which is *not* portable between different hardware platforms. It will only work on platforms with the same endianess and same floating point binary format. In practice, most computing clusters and most consumer hardware are x86 compatible nowadays. The binary_archive is portable between such hardware platforms. It is also portable between operating systems, like OSX, Windows, and Linux.]
[endsect]
[section Mixing Python and C++]
It is a efficient workflow to create and configure histograms in Python and then pass them to some C++ code which fills them at maximum speed. You rarely need to change the way the histogram is filled, but you likely want to iterate the range and binning of the axis after seeing the data. With [@boost:/libs/python/index.html Boost.Python] it is easy to set this up.
@ -644,14 +656,19 @@ for iy in range(len(h.axis(1))):
[section:expert Advanced usage]
The library was carefully designed to be customizable and extensible for users. Users can create new axis classes and use them with the histogram, or implemented a custom storage policy.
[section User-defined axis class]
TODO
[endsect]
[section User-defined storage policy]
Histograms can easily be created which use a different storage policy than the default `adaptive_storage`. A simple alternative is provided by the library, `array_storage`, which does not do dynamic memory management, and let's the user choose the counter type. As a consequence, it does
A simple alternat
TODO
[endsect]

View File

@ -561,7 +561,6 @@ public:
}
/// Returns the bin index for the passed argument.
/// Performs a range check.
inline int index(const value_type &x) const noexcept {
auto it = map_->left.find(x);
if (it == map_->left.end())
@ -569,7 +568,7 @@ public:
return it->second;
}
/// Returns the value for the bin index.
/// Returns the value for the bin index (performs a range check).
bin_type operator[](int idx) const {
auto it = map_->right.find(idx);
if (it == map_->right.end())

View File

@ -35,6 +35,23 @@ template <typename T, typename = decltype(std::declval<T &>().size(),
std::declval<T &>().value(0))>
struct is_storage {};
template <typename T>
struct has_variance_support
{
template<typename U, typename T::value_type (U::*)(std::size_t) const> struct SFINAE {};
template<typename U> static std::true_type Test(SFINAE<U, &U::variance>*);
template<typename U> static std::false_type Test(...);
using type = decltype(Test<T>(nullptr));
};
template <typename T>
using has_variance_support_t = typename has_variance_support<T>::type;
template <typename S>
using requires_variance_support = typename std::enable_if<
has_variance_support_t<S>::value, typename S::value_type
>::type;
template <typename T,
typename = decltype(*std::declval<T &>(), ++std::declval<T &>())>
struct is_iterator {};

View File

@ -9,6 +9,7 @@
#include <algorithm>
#include <boost/utility/string_view.hpp>
#include <boost/histogram/detail/meta.hpp>
#include <ostream>
#include <vector>
@ -16,6 +17,24 @@ namespace boost {
namespace histogram {
namespace detail {
template <typename Storage>
void storage_add_impl(std::true_type, Storage& result, const Storage& input,
std::size_t dst, std::size_t src) {
result.add(dst, input.value(src), input.variance(src));
}
template <typename Storage>
void storage_add_impl(std::false_type, Storage& result, const Storage& input,
std::size_t dst, std::size_t src) {
result.add(dst, input.value(src));
}
template <typename Storage>
void storage_add(Storage& result, const Storage& input,
std::size_t dst, std::size_t src) {
storage_add_impl(has_variance_support_t<Storage>(), result, input, dst, src);
}
inline void escape(std::ostream &os, const string_view s) {
os << '\'';
for (auto sit = s.begin(); sit != s.end(); ++sit) {

View File

@ -192,8 +192,9 @@ public:
return storage_.value(idx);
}
template <typename... Indices>
value_type variance(Indices &&... indices) const {
template <typename S = Storage, typename... Indices>
detail::requires_variance_support<S>
variance(Indices &&... indices) const {
if(dim() != sizeof...(indices))
throw std::invalid_argument("variance arguments does not match histogram dimension");
std::size_t idx = 0, stride = 1;
@ -204,8 +205,9 @@ public:
return storage_.variance(idx);
}
template <typename Iterator, typename = detail::is_iterator<Iterator>>
value_type variance(Iterator begin, Iterator end) const {
template <typename S = Storage, typename Iterator, typename = detail::is_iterator<Iterator>>
detail::requires_variance_support<S>
variance(Iterator begin, Iterator end) const {
if(dim() != std::distance(begin, end))
throw std::invalid_argument("variance iterator range does not match histogram dimension");
std::size_t idx = 0, stride = 1;
@ -416,8 +418,7 @@ private:
histogram h(axes.begin(), axes.end());
detail::index_mapper m(n, b);
do {
h.storage_.add(m.second, storage_.value(m.first),
storage_.variance(m.first));
detail::storage_add(h.storage_, storage_, m.second, m.first);
} while (m.next());
return h;
}

View File

@ -131,8 +131,9 @@ public:
return storage_.value(idx);
}
template <typename... Indices>
value_type variance(Indices &&... indices) const {
template <typename S=Storage, typename... Indices>
detail::requires_variance_support<S>
variance(Indices &&... indices) const {
static_assert(sizeof...(indices) == axes_size::value,
"number of arguments does not match histogram dimension");
std::size_t idx = 0, stride = 1;
@ -305,8 +306,7 @@ private:
for_each_axis(helper);
detail::index_mapper m(n, b);
do {
h.storage_.add(m.second, storage_.value(m.first),
storage_.variance(m.first));
detail::storage_add(h.storage_, storage_, m.second, m.first);
} while (m.next());
}

View File

@ -75,13 +75,8 @@ public:
std::size_t size() const noexcept { return size_; }
void increase(std::size_t i) noexcept { ++array_[i]; }
void add(std::size_t i, const value_type &n) noexcept { array_[i] += n; }
void add(std::size_t i, const value_type &n,
const value_type & /* variance */) noexcept {
array_[i] += n;
}
value_type value(std::size_t i) const noexcept { return array_[i]; }
value_type variance(std::size_t i) const noexcept { return array_[i]; }
template <typename U>
array_storage &operator+=(const array_storage<U> &rhs) noexcept {

View File

@ -11,16 +11,50 @@
namespace boost {
namespace histogram {
namespace detail {
template <typename S1, typename S2>
bool equal_impl(std::true_type, std::true_type, const S1& s1, const S2& s2)
{
for (decltype(s1.size()) i = 0, n = s1.size(); i < n; ++i)
if (s1.value(i) != s2.value(i) || s1.variance(i) != s2.variance(i))
return false;
return true;
}
template <typename S1, typename S2>
bool equal_impl(std::true_type, std::false_type, const S1& s1, const S2& s2)
{
for (decltype(s1.size()) i = 0, n = s1.size(); i < n; ++i)
if (s1.value(i) != s2.value(i) || s1.value(i) != s1.variance(i))
return false;
return true;
}
template <typename S1, typename S2>
bool equal_impl(std::false_type, std::true_type, const S1& s1, const S2& s2)
{
return equal_impl(std::true_type(), std::false_type(), s2, s1);
}
template <typename S1, typename S2>
bool equal_impl(std::false_type, std::false_type, const S1& s1, const S2& s2)
{
for (decltype(s1.size()) i = 0, n = s1.size(); i < n; ++i)
if (s1.value(i) != s2.value(i))
return false;
return true;
}
}
template <typename S1, typename S2, typename = detail::is_storage<S1>,
typename = detail::is_storage<S2>>
bool operator==(const S1 &s1, const S2 &s2) noexcept {
if (s1.size() != s2.size())
return false;
for (decltype(s1.size()) i = 0, n = s1.size(); i < n; ++i)
if (s1.value(i) != s2.value(i) || s1.variance(i) != s2.variance(i))
return false;
return true;
return detail::equal_impl(
detail::has_variance_support_t<S1>(),
detail::has_variance_support_t<S2>(),
s1, s2);
}
template <typename S1, typename S2, typename = detail::is_storage<S1>,

View File

@ -53,19 +53,13 @@ int main() {
a.increase(0);
a *= 3;
BOOST_TEST_EQ(a.value(0), 3.0);
BOOST_TEST_EQ(a.variance(0), 3.0);
BOOST_TEST_EQ(a.value(1), 0.0);
BOOST_TEST_EQ(a.variance(1), 0.0);
a.add(1, 2.0, 5.0); // 5.0 is intentionally ignored
a.add(1, 2.0);
BOOST_TEST_EQ(a.value(0), 3.0);
BOOST_TEST_EQ(a.variance(0), 3.0);
BOOST_TEST_EQ(a.value(1), 2.0);
BOOST_TEST_EQ(a.variance(1), 2.0);
a *= 3;
BOOST_TEST_EQ(a.value(0), 9.0);
BOOST_TEST_EQ(a.variance(0), 9.0);
BOOST_TEST_EQ(a.value(1), 6.0);
BOOST_TEST_EQ(a.variance(1), 6.0);
}
// copy

View File

@ -18,5 +18,17 @@ int main() {
using result = bhd::unique_sorted<numbers>;
BOOST_MPL_ASSERT((equal<result, expected, equal_to<_, _>>));
struct no_variance_method {
using value_type = int;
};
struct variance_method {
using value_type = int;
value_type variance(std::size_t) const;
};
BOOST_TEST_EQ(typename bhd::has_variance_support<no_variance_method>::type(), false);
BOOST_TEST_EQ(typename bhd::has_variance_support<variance_method>::type(), true);
return boost::report_errors();
}