/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */ /* vim: set ts=8 sts=2 et sw=2 tw=80: */ /* This Source Code Form is subject to the terms of the Mozilla Public * License, v. 2.0. If a copy of the MPL was not distributed with this * file, You can obtain one at http://mozilla.org/MPL/2.0/. */ #ifndef mozilla_FastBernoulliTrial_h #define mozilla_FastBernoulliTrial_h #include "mozilla/Assertions.h" #include "mozilla/XorShift128PlusRNG.h" #include #include namespace mozilla { /** * class FastBernoulliTrial: Efficient sampling with uniform probability * * When gathering statistics about a program's behavior, we may be observing * events that occur very frequently (e.g., function calls or memory * allocations) and we may be gathering information that is somewhat expensive * to produce (e.g., call stacks). Sampling all the events could have a * significant impact on the program's performance. * * Why not just sample every N'th event? This technique is called "systematic * sampling"; it's simple and efficient, and it's fine if we imagine a * patternless stream of events. But what if we're sampling allocations, and the * program happens to have a loop where each iteration does exactly N * allocations? You would end up sampling the same allocation every time through * the loop; the entire rest of the loop becomes invisible to your measurements! * More generally, if each iteration does M allocations, and M and N have any * common divisor at all, most allocation sites will never be sampled. If * they're both even, say, the odd-numbered allocations disappear from your * results. * * Ideally, we'd like each event to have some probability P of being sampled, * independent of its neighbors and of its position in the sequence. This is * called "Bernoulli sampling", and it doesn't suffer from any of the problems * mentioned above. * * One disadvantage of Bernoulli sampling is that you can't be sure exactly how * many samples you'll get: technically, it's possible that you might sample * none of them, or all of them. But if the number of events N is large, these * aren't likely outcomes; you can generally expect somewhere around P * N * events to be sampled. * * The other disadvantage of Bernoulli sampling is that you have to generate a * random number for every event, which can be slow. * * [significant pause] * * BUT NOT WITH THIS CLASS! FastBernoulliTrial lets you do true Bernoulli * sampling, while generating a fresh random number only when we do decide to * sample an event, not on every trial. When it decides not to sample, a call to * |FastBernoulliTrial::trial| is nothing but decrementing a counter and * comparing it to zero. So the lower your sampling probability is, the less * overhead FastBernoulliTrial imposes. * * Probabilities of 0 and 1 are handled efficiently. (In neither case need we * ever generate a random number at all.) * * The essential API: * * - FastBernoulliTrial(double P) * Construct an instance that selects events with probability P. * * - FastBernoulliTrial::trial() * Return true with probability P. Call this each time an event occurs, to * decide whether to sample it or not. * * - FastBernoulliTrial::trial(size_t n) * Equivalent to calling trial() |n| times, and returning true if any of those * calls do. However, like trial, this runs in fast constant time. * * What is this good for? In some applications, some events are "bigger" than * others. For example, large allocations are more significant than small * allocations. Perhaps we'd like to imagine that we're drawing allocations * from a stream of bytes, and performing a separate Bernoulli trial on every * byte from the stream. We can accomplish this by calling |t.trial(S)| for * the number of bytes S, and sampling the event if that returns true. * * Of course, this style of sampling needs to be paired with analysis and * presentation that makes the size of the event apparent, lest trials with * large values for |n| appear to be indistinguishable from those with small * values for |n|. */ class FastBernoulliTrial { /* * This comment should just read, "Generate skip counts with a geometric * distribution", and leave everyone to go look that up and see why it's the * right thing to do, if they don't know already. * * BUT IF YOU'RE CURIOUS, COMMENTS ARE FREE... * * Instead of generating a fresh random number for every trial, we can * randomly generate a count of how many times we should return false before * the next time we return true. We call this a "skip count". Once we've * returned true, we generate a fresh skip count, and begin counting down * again. * * Here's an awesome fact: by exercising a little care in the way we generate * skip counts, we can produce results indistinguishable from those we would * get "rolling the dice" afresh for every trial. * * In short, skip counts in Bernoulli trials of probability P obey a geometric * distribution. If a random variable X is uniformly distributed from [0..1), * then std::floor(std::log(X) / std::log(1-P)) has the appropriate geometric * distribution for the skip counts. * * Why that formula? * * Suppose we're to return |true| with some probability P, say, 0.3. Spread * all possible futures along a line segment of length 1. In portion P of * those cases, we'll return true on the next call to |trial|; the skip count * is 0. For the remaining portion 1-P of cases, the skip count is 1 or more. * * skip: 0 1 or more * |------------------^-----------------------------------------| * portion: 0.3 0.7 * P 1-P * * But the "1 or more" section of the line is subdivided the same way: *within * that section*, in portion P the second call to |trial()| returns true, and * in portion 1-P it returns false a second time; the skip count is two or * more. So we return true on the second call in proportion 0.7 * 0.3, and * skip at least the first two in proportion 0.7 * 0.7. * * skip: 0 1 2 or more * |------------------^------------^----------------------------| * portion: 0.3 0.7 * 0.3 0.7 * 0.7 * P (1-P)*P (1-P)^2 * * We can continue to subdivide: * * skip >= 0: |------------------------------------------------- (1-P)^0 --| * skip >= 1: | ------------------------------- (1-P)^1 --| * skip >= 2: | ------------------ (1-P)^2 --| * skip >= 3: | ^ ---------- (1-P)^3 --| * skip >= 4: | . --- (1-P)^4 --| * . * ^X, see below * * In other words, the likelihood of the next n calls to |trial| returning * false is (1-P)^n. The longer a run we require, the more the likelihood * drops. Further calls may return false too, but this is the probability * we'll skip at least n. * * This is interesting, because we can pick a point along this line segment * and see which skip count's range it falls within; the point X above, for * example, is within the ">= 2" range, but not within the ">= 3" range, so it * designates a skip count of 2. So if we pick points on the line at random * and use the skip counts they fall under, that will be indistinguishable * from generating a fresh random number between 0 and 1 for each trial and * comparing it to P. * * So to find the skip count for a point X, we must ask: To what whole power * must we raise 1-P such that we include X, but the next power would exclude * it? This is exactly std::floor(std::log(X) / std::log(1-P)). * * Our algorithm is then, simply: When constructed, compute an initial skip * count. Return false from |trial| that many times, and then compute a new * skip count. * * For a call to |trial(n)|, if the skip count is greater than n, return false * and subtract n from the skip count. If the skip count is less than n, * return true and compute a new skip count. Since each trial is independent, * it doesn't matter by how much n overshoots the skip count; we can actually * compute a new skip count at *any* time without affecting the distribution. * This is really beautiful. */ public: /** * Construct a fast Bernoulli trial generator. Calls to |trial()| return true * with probability |aProbability|. Use |aState0| and |aState1| to seed the * random number generator; both may not be zero. */ FastBernoulliTrial(double aProbability, uint64_t aState0, uint64_t aState1) : mProbability(0), mInvLogNotProbability(0), mGenerator(aState0, aState1), mSkipCount(0) { setProbability(aProbability); } /** * Return true with probability |mProbability|. Call this each time an event * occurs, to decide whether to sample it or not. The lower |mProbability| is, * the faster this function runs. */ bool trial() { if (mSkipCount) { mSkipCount--; return false; } return chooseSkipCount(); } /** * Equivalent to calling trial() |n| times, and returning true if any of those * calls do. However, like trial, this runs in fast constant time. * * What is this good for? In some applications, some events are "bigger" than * others. For example, large allocations are more significant than small * allocations. Perhaps we'd like to imagine that we're drawing allocations * from a stream of bytes, and performing a separate Bernoulli trial on every * byte from the stream. We can accomplish this by calling |t.trial(S)| for * the number of bytes S, and sampling the event if that returns true. * * Of course, this style of sampling needs to be paired with analysis and * presentation that makes the "size" of the event apparent, lest trials with * large values for |n| appear to be indistinguishable from those with small * values for |n|, despite being potentially much more likely to be sampled. */ bool trial(size_t aCount) { if (mSkipCount > aCount) { mSkipCount -= aCount; return false; } return chooseSkipCount(); } void setRandomState(uint64_t aState0, uint64_t aState1) { mGenerator.setState(aState0, aState1); } void setProbability(double aProbability) { MOZ_ASSERT(0 <= aProbability && aProbability <= 1); mProbability = aProbability; if (0 < mProbability && mProbability < 1) { /* * Let's look carefully at how this calculation plays out in floating- * point arithmetic. We'll assume IEEE, but the final C++ code we arrive * at would still be fine if our numbers were mathematically perfect. So, * while we've considered IEEE's edge cases, we haven't done anything that * should be actively bad when using other representations. * * (In the below, read comparisons as exact mathematical comparisons: when * we say something "equals 1", that means it's exactly equal to 1. We * treat approximation using intervals with open boundaries: saying a * value is in (0,1) doesn't specify how close to 0 or 1 the value gets. * When we use closed boundaries like [2**-53, 1], we're careful to ensure * the boundary values are actually representable.) * * - After the comparison above, we know mProbability is in (0,1). * * - The gaps below 1 are 2**-53, so that interval is (0, 1-2**-53]. * * - Because the floating-point gaps near 1 are wider than those near * zero, there are many small positive doubles ε such that 1-ε rounds to * exactly 1. However, 2**-53 can be represented exactly. So * 1-mProbability is in [2**-53, 1]. * * - log(1 - mProbability) is thus in (-37, 0]. * * That range includes zero, but when we use mInvLogNotProbability, it * would be helpful if we could trust that it's negative. So when log(1 * - mProbability) is 0, we'll just set mProbability to 0, so that * mInvLogNotProbability is not used in chooseSkipCount. * * - How much of the range of mProbability does this cause us to ignore? * The only value for which log returns 0 is exactly 1; the slope of log * at 1 is 1, so for small ε such that 1 - ε != 1, log(1 - ε) is -ε, * never 0. The gaps near one are larger than the gaps near zero, so if * 1 - ε wasn't 1, then -ε is representable. So if log(1 - mProbability) * isn't 0, then 1 - mProbability isn't 1, which means that mProbability * is at least 2**-53, as discussed earlier. This is a sampling * likelihood of roughly one in ten trillion, which is unlikely to be * distinguishable from zero in practice. * * So by forbidding zero, we've tightened our range to (-37, -2**-53]. * * - Finally, 1 / log(1 - mProbability) is in [-2**53, -1/37). This all * falls readily within the range of an IEEE double. * * ALL THAT HAVING BEEN SAID: here are the five lines of actual code: */ double logNotProbability = std::log(1 - mProbability); if (logNotProbability == 0.0) mProbability = 0.0; else mInvLogNotProbability = 1 / logNotProbability; } chooseSkipCount(); } private: /* The likelihood that any given call to |trial| should return true. */ double mProbability; /* * The value of 1/std::log(1 - mProbability), cached for repeated use. * * If mProbability is exactly 0 or exactly 1, we don't use this value. * Otherwise, we guarantee this value is in the range [-2**53, -1/37), i.e. * definitely negative, as required by chooseSkipCount. See setProbability for * the details. */ double mInvLogNotProbability; /* Our random number generator. */ non_crypto::XorShift128PlusRNG mGenerator; /* The number of times |trial| should return false before next returning true. */ size_t mSkipCount; /* * Choose the next skip count. This also returns the value that |trial| should * return, since we have to check for the extreme values for mProbability * anyway, and |trial| should never return true at all when mProbability is 0. */ bool chooseSkipCount() { /* * If the probability is 1.0, every call to |trial| returns true. Make sure * mSkipCount is 0. */ if (mProbability == 1.0) { mSkipCount = 0; return true; } /* * If the probabilility is zero, |trial| never returns true. Don't bother us * for a while. */ if (mProbability == 0.0) { mSkipCount = SIZE_MAX; return false; } /* * What sorts of values can this call to std::floor produce? * * Since mGenerator.nextDouble returns a value in [0, 1-2**-53], std::log * returns a value in the range [-infinity, -2**-53], all negative. Since * mInvLogNotProbability is negative (see its comments), the product is * positive and possibly infinite. std::floor returns +infinity unchanged. * So the result will always be positive. * * Converting a double to an integer that is out of range for that integer * is undefined behavior, so we must clamp our result to SIZE_MAX, to ensure * we get an acceptable value for mSkipCount. * * The clamp is written carefully. Note that if we had said: * * if (skipCount > double(SIZE_MAX)) * mSkipCount = SIZE_MAX; * else * mSkipCount = skipCount; * * that leads to undefined behavior 64-bit machines: SIZE_MAX coerced to * double can equal 2^64, so if skipCount equaled 2^64 converting it to * size_t would induce undefined behavior. * * Jakob Olesen cleverly suggested flipping the sense of the comparison to * skipCount < double(SIZE_MAX). The conversion will evaluate to 2^64 or * the double just below it: either way, skipCount is guaranteed to have a * value that's safely convertible to size_t. * * (On 32-bit machines, all size_t values can be represented exactly in * double, so all is well.) */ double skipCount = std::floor(std::log(mGenerator.nextDouble()) * mInvLogNotProbability); if (skipCount < double(SIZE_MAX)) mSkipCount = skipCount; else mSkipCount = SIZE_MAX; return true; } }; } /* namespace mozilla */ #endif /* mozilla_FastBernoulliTrial_h */