1 files changed, 304 insertions, 0 deletions
diff --git a/src/rocksdb/third-party/folly/folly/synchronization/DistributedMutex.h b/src/rocksdb/third-party/folly/folly/synchronization/DistributedMutex.h
new file mode 100644
index 000000000..7acf04f45
--- /dev/null
+++ b/src/rocksdb/third-party/folly/folly/synchronization/DistributedMutex.h
@@ -0,0 +1,304 @@
+//  Copyright (c) 2011-present, Facebook, Inc.  All rights reserved.
+//  This source code is licensed under both the GPLv2 (found in the
+//  COPYING file in the root directory) and Apache 2.0 License
+//  (found in the LICENSE.Apache file in the root directory).
+
+#pragma once
+
+#include <atomic>
+#include <chrono>
+#include <cstdint>
+
+namespace folly {
+namespace detail {
+namespace distributed_mutex {
+
+/**
+ * DistributedMutex is a small, exclusive-only mutex that distributes the
+ * bookkeeping required for mutual exclusion in the stacks of threads that are
+ * contending for it.  It has a mode that can combine critical sections when
+ * the mutex experiences contention; this allows the implementation to elide
+ * several expensive coherence and synchronization operations to boost
+ * throughput, surpassing even atomic instructions in some cases.  It has a
+ * smaller memory footprint than std::mutex, a similar level of fairness
+ * (better in some cases) and no dependencies on heap allocation.  It is the
+ * same width as a single pointer (8 bytes on most platforms), where on the
+ * other hand, std::mutex and pthread_mutex_t are both 40 bytes.  It is larger
+ * than some of the other smaller locks, but the wide majority of cases using
+ * the small locks are wasting the difference in alignment padding anyway
+ *
+ * Benchmark results are good - at the time of writing, in the contended case,
+ * for lock/unlock based critical sections, it is about 4-5x faster than the
+ * smaller locks and about ~2x faster than std::mutex.  When used in
+ * combinable mode, it is much faster than the alternatives, going more than
+ * 10x faster than the small locks, about 6x faster than std::mutex, 2-3x
+ * faster than flat combining and even faster than std::atomic<> in some
+ * cases, allowing more work with higher throughput.  In the uncontended case,
+ * it is a few cycles faster than folly::MicroLock but a bit slower than
+ * std::mutex.  DistributedMutex is also resistent to tail latency pathalogies
+ * unlike many of the other mutexes in use, which sleep for large time
+ * quantums to reduce spin churn, this causes elevated latencies for threads
+ * that enter the sleep cycle.  The tail latency of lock acquisition can go up
+ * to 10x lower because of a more deterministic scheduling algorithm that is
+ * managed almost entirely in userspace.  Detailed results comparing the
+ * throughput and latencies of different mutex implementations and atomics are
+ * at the bottom of folly/synchronization/test/SmallLocksBenchmark.cpp
+ *
+ * Theoretically, write locks promote concurrency when the critical sections
+ * are small as most of the work is done outside the lock.  And indeed,
+ * performant concurrent applications go through several pains to limit the
+ * amount of work they do while holding a lock.  However, most times, the
+ * synchronization and scheduling overhead of a write lock in the critical
+ * path is so high, that after a certain point, making critical sections
+ * smaller does not actually increase the concurrency of the application and
+ * throughput plateaus.  DistributedMutex moves this breaking point to the
+ * level of hardware atomic instructions, so applications keep getting
+ * concurrency even under very high contention.  It does this by reducing
+ * cache misses and contention in userspace and in the kernel by making each
+ * thread wait on a thread local node and futex.  When combined critical
+ * sections are used DistributedMutex leverages template metaprogramming to
+ * allow the mutex to make better synchronization decisions based on the
+ * layout of the input and output data.  This allows threads to keep working
+ * only on their own cache lines without requiring cache coherence operations
+ * when a mutex experiences heavy contention
+ *
+ * Non-timed mutex acquisitions are scheduled through intrusive LIFO
+ * contention chains.  Each thread starts by spinning for a short quantum and
+ * falls back to two phased sleeping.  Enqueue operations are lock free and
+ * are piggybacked off mutex acquisition attempts.  The LIFO behavior of a
+ * contention chain is good in the case where the mutex is held for a short
+ * amount of time, as the head of the chain is likely to not have slept on
+ * futex() after exhausting its spin quantum.  This allow us to avoid
+ * unnecessary traversal and syscalls in the fast path with a higher
+ * probability.  Even though the contention chains are LIFO, the mutex itself
+ * does not adhere to that scheduling policy globally.  During contention,
+ * threads that fail to lock the mutex form a LIFO chain on the central mutex
+ * state, this chain is broken when a wakeup is scheduled, and future enqueue
+ * operations form a new chain.  This makes the chains themselves LIFO, but
+ * preserves global fairness through a constant factor which is limited to the
+ * number of concurrent failed mutex acquisition attempts.  This binds the
+ * last in first out behavior to the number of contending threads and helps
+ * prevent starvation and latency outliers
+ *
+ * This strategy of waking up wakers one by one in a queue does not scale well
+ * when the number of threads goes past the number of cores.  At which point
+ * preemption causes elevated lock acquisition latencies.  DistributedMutex
+ * implements a hardware timestamp publishing heuristic to detect and adapt to
+ * preemption.
+ *
+ * DistributedMutex does not have the typical mutex API - it does not satisfy
+ * the Lockable concept.  It requires the user to maintain ephemeral bookkeeping
+ * and pass that bookkeeping around to unlock() calls.  The API overhead,
+ * however, comes for free when you wrap this mutex for usage with
+ * std::unique_lock, which is the recommended usage (std::lock_guard, in
+ * optimized mode, has no performance benefit over std::unique_lock, so has been
+ * omitted).  A benefit of this API is that it disallows incorrect usage where a
+ * thread unlocks a mutex that it does not own, thinking a mutex is functionally
+ * identical to a binary semaphore, which, unlike a mutex, is a suitable
+ * primitive for that usage
+ *
+ * Combined critical sections allow the implementation to elide several
+ * expensive operations during the lifetime of a critical section that cause
+ * slowdowns with regular lock/unlock based usage.  DistributedMutex resolves
+ * contention through combining up to a constant factor of 2 contention chains
+ * to prevent issues with fairness and latency outliers, so we retain the
+ * fairness benefits of the lock/unlock implementation with no noticeable
+ * regression when switching between the lock methods.  Despite the efficiency
+ * benefits, combined critical sections can only be used when the critical
+ * section does not depend on thread local state and does not introduce new
+ * dependencies between threads when the critical section gets combined.  For
+ * example, locking or unlocking an unrelated mutex in a combined critical
+ * section might lead to unexpected results or even undefined behavior.  This
+ * can happen if, for example, a different thread unlocks a mutex locked by
+ * the calling thread, leading to undefined behavior as the mutex might not
+ * allow locking and unlocking from unrelated threads (the posix and C++
+ * standard disallow this usage for their mutexes)
+ *
+ * Timed locking through DistributedMutex is implemented through a centralized
+ * algorithm.  The underlying contention-chains framework used in
+ * DistributedMutex is not abortable so we build abortability on the side.
+ * All waiters wait on the central mutex state, by setting and resetting bits
+ * within the pointer-length word.  Since pointer length atomic integers are
+ * incompatible with futex(FUTEX_WAIT) on most systems, a non-standard
+ * implementation of futex() is used, where wait queues are managed in
+ * user-space (see p1135r0 and folly::ParkingLot for more)
+ */
+template <
+    template <typename> class Atomic = std::atomic,
+    bool TimePublishing = true>
+class DistributedMutex {
+ public:
+  class DistributedMutexStateProxy;
+
+  /**
+   * DistributedMutex is only default constructible, it can neither be moved
+   * nor copied
+   */
+  DistributedMutex();
+  DistributedMutex(DistributedMutex&&) = delete;
+  DistributedMutex(const DistributedMutex&) = delete;
+  DistributedMutex& operator=(DistributedMutex&&) = delete;
+  DistributedMutex& operator=(const DistributedMutex&) = delete;
+
+  /**
+   * Acquires the mutex in exclusive mode
+   *
+   * This returns an ephemeral proxy that contains internal mutex state.  This
+   * must be kept around for the duration of the critical section and passed
+   * subsequently to unlock() as an rvalue
+   *
+   * The proxy has no public API and is intended to be for internal usage only
+   *
+   * There are three notable cases where this method causes undefined
+   * behavior:
+   *
+   *  - This is not a recursive mutex.  Trying to acquire the mutex twice from
+   *    the same thread without unlocking it results in undefined behavior
+   *  - Thread, coroutine or fiber migrations from within a critical section
+   *    are disallowed.  This is because the implementation requires owning the
+   *    stack frame through the execution of the critical section for both
+   *    lock/unlock or combined critical sections.  This also means that you
+   *    cannot allow another thread, fiber or coroutine to unlock the mutex
+   *  - This mutex cannot be used in a program compiled with segmented stacks,
+   *    there is currently no way to detect the presence of segmented stacks
+   *    at compile time or runtime, so we have no checks against this
+   */
+  DistributedMutexStateProxy lock();
+
+  /**
+   * Unlocks the mutex
+   *
+   * The proxy returned by lock must be passed to unlock as an rvalue.  No
+   * other option is possible here, since the proxy is only movable and not
+   * copyable
+   *
+   * It is undefined behavior to unlock from a thread that did not lock the
+   * mutex
+   */
+  void unlock(DistributedMutexStateProxy);
+
+  /**
+   * Try to acquire the mutex
+   *
+   * A non blocking version of the lock() function.  The returned object is
+   * contextually convertible to bool.  And has the value true when the mutex
+   * was successfully acquired, false otherwise
+   *
+   * This is allowed to return false spuriously, i.e. this is not guaranteed
+   * to return true even when the mutex is currently unlocked.  In the event
+   * of a failed acquisition, this does not impose any memory ordering
+   * constraints for other threads
+   */
+  DistributedMutexStateProxy try_lock();
+
+  /**
+   * Try to acquire the mutex, blocking for the given time
+   *
+   * Like try_lock(), this is allowed to fail spuriously and is not guaranteed
+   * to return false even when the mutex is currently unlocked.  But only
+   * after the given time has elapsed
+   *
+   * try_lock_for() accepts a duration to block for, and try_lock_until()
+   * accepts an absolute wall clock time point
+   */
+  template <typename Rep, typename Period>
+  DistributedMutexStateProxy try_lock_for(
+      const std::chrono::duration<Rep, Period>& duration);
+
+  /**
+   * Try to acquire the lock, blocking until the given deadline
+   *
+   * Other than the difference in the meaning of the second argument, the
+   * semantics of this function are identical to try_lock_for()
+   */
+  template <typename Clock, typename Duration>
+  DistributedMutexStateProxy try_lock_until(
+      const std::chrono::time_point<Clock, Duration>& deadline);
+
+  /**
+   * Execute a task as a combined critical section
+   *
+   * Unlike traditional lock and unlock methods, lock_combine() enqueues the
+   * passed task for execution on any arbitrary thread.  This allows the
+   * implementation to prevent cache line invalidations originating from
+   * expensive synchronization operations.  The thread holding the lock is
+   * allowed to execute the task before unlocking, thereby forming a "combined
+   * critical section".
+   *
+   * This idea is inspired by Flat Combining.  Flat Combining was introduced
+   * in the SPAA 2010 paper titled "Flat Combining and the
+   * Synchronization-Parallelism Tradeoff", by Danny Hendler, Itai Incze, Nir
+   * Shavit, and Moran Tzafrir -
+   * https://www.cs.bgu.ac.il/~hendlerd/papers/flat-combining.pdf.  The
+   * implementation used here is significantly different from that described
+   * in the paper.  The high-level goal of reducing the overhead of
+   * synchronization, however, is the same.
+   *
+   * Combined critical sections work best when kept simple.  Since the
+   * critical section might be executed on any arbitrary thread, relying on
+   * things like thread local state or mutex locking and unlocking might cause
+   * incorrectness.  Associativity is important.  For example
+   *
+   *    auto one = std::unique_lock{one_};
+   *    two_.lock_combine([&]() {
+   *      if (bar()) {
+   *        one.unlock();
+   *      }
+   *    });
+   *
+   * This has the potential to cause undefined behavior because mutexes are
+   * only meant to be acquired and released from the owning thread.  Similar
+   * errors can arise from a combined critical section introducing implicit
+   * dependencies based on the state of the combining thread.  For example
+   *
+   *    // thread 1
+   *    auto one = std::unique_lock{one_};
+   *    auto two = std::unique_lock{two_};
+   *
+   *    // thread 2
+   *    two_.lock_combine([&]() {
+   *      auto three = std::unique_lock{three_};
+   *    });
+   *
+   * Here, because we used a combined critical section, we have introduced a
+   * dependency from one -> three that might not obvious to the reader
+   *
+   * This function is exception-safe.  If the passed task throws an exception,
+   * it will be propagated to the caller, even if the task is running on
+   * another thread
+   *
+   * There are three notable cases where this method causes undefined
+   * behavior:
+   *
+   *  - This is not a recursive mutex.  Trying to acquire the mutex twice from
+   *    the same thread without unlocking it results in undefined behavior
+   *  - Thread, coroutine or fiber migrations from within a critical section
+   *    are disallowed.  This is because the implementation requires owning the
+   *    stack frame through the execution of the critical section for both
+   *    lock/unlock or combined critical sections.  This also means that you
+   *    cannot allow another thread, fiber or coroutine to unlock the mutex
+   *  - This mutex cannot be used in a program compiled with segmented stacks,
+   *    there is currently no way to detect the presence of segmented stacks
+   *    at compile time or runtime, so we have no checks against this
+   */
+  template <typename Task>
+  auto lock_combine(Task task) -> decltype(std::declval<const Task&>()());
+
+ private:
+  Atomic<std::uintptr_t> state_{0};
+};
+
+} // namespace distributed_mutex
+} // namespace detail
+
+/**
+ * Bring the default instantiation of DistributedMutex into the folly
+ * namespace without requiring any template arguments for public usage
+ */
+extern template class detail::distributed_mutex::DistributedMutex<>;
+using DistributedMutex = detail::distributed_mutex::DistributedMutex<>;
+
+} // namespace folly
+
+#include <folly/synchronization/DistributedMutex-inl.h>
+#include <folly/synchronization/DistributedMutexSpecializations.h>