Diffstat (limited to 'src/rocksdb/third-party/folly/folly/synchronization/DistributedMutex.h')
-rw-r--r--  src/rocksdb/third-party/folly/folly/synchronization/DistributedMutex.h  304
1 file changed, 304 insertions, 0 deletions
diff --git a/src/rocksdb/third-party/folly/folly/synchronization/DistributedMutex.h b/src/rocksdb/third-party/folly/folly/synchronization/DistributedMutex.h
new file mode 100644
index 000000000..7acf04f45
--- /dev/null
+++ b/src/rocksdb/third-party/folly/folly/synchronization/DistributedMutex.h
@@ -0,0 +1,304 @@
+// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
+// This source code is licensed under both the GPLv2 (found in the
+// COPYING file in the root directory) and Apache 2.0 License
+// (found in the LICENSE.Apache file in the root directory).
+
+#pragma once
+
+#include <atomic>
+#include <chrono>
+#include <cstdint>
+
+namespace folly {
+namespace detail {
+namespace distributed_mutex {
+
+/**
+ * DistributedMutex is a small, exclusive-only mutex that distributes the
+ * bookkeeping required for mutual exclusion in the stacks of threads that are
+ * contending for it. It has a mode that can combine critical sections when
+ * the mutex experiences contention; this allows the implementation to elide
+ * several expensive coherence and synchronization operations to boost
+ * throughput, surpassing even atomic instructions in some cases. It has a
+ * smaller memory footprint than std::mutex, a similar level of fairness
+ * (better in some cases) and no dependencies on heap allocation. It is the
+ * same width as a single pointer (8 bytes on most platforms), whereas
+ * std::mutex and pthread_mutex_t are both 40 bytes. It is larger than some
+ * of the other smaller locks, but the vast majority of cases using the small
+ * locks waste the difference in alignment padding anyway
+ *
+ * Benchmark results are good - at the time of writing, in the contended case,
+ * for lock/unlock based critical sections, it is about 4-5x faster than the
+ * smaller locks and about 2x faster than std::mutex. When used in
+ * combinable mode, it is much faster than the alternatives, going more than
+ * 10x faster than the small locks, about 6x faster than std::mutex, 2-3x
+ * faster than flat combining and even faster than std::atomic<> in some
+ * cases, allowing more work with higher throughput. In the uncontended case,
+ * it is a few cycles faster than folly::MicroLock but a bit slower than
+ * std::mutex. DistributedMutex is also resistant to tail latency
+ * pathologies, unlike many of the other mutexes in use, which sleep for
+ * large time quanta to reduce spin churn; this causes elevated latencies for
+ * threads that enter the sleep cycle. The tail latency of lock acquisition
+ * can be up to 10x lower because of a more deterministic scheduling
+ * algorithm that is managed almost entirely in userspace. Detailed results
+ * comparing the
+ * throughput and latencies of different mutex implementations and atomics are
+ * at the bottom of folly/synchronization/test/SmallLocksBenchmark.cpp
+ *
+ * Theoretically, write locks promote concurrency when the critical sections
+ * are small as most of the work is done outside the lock. And indeed,
+ * performant concurrent applications go through several pains to limit the
+ * amount of work they do while holding a lock. However, most times, the
+ * synchronization and scheduling overhead of a write lock in the critical
+ * path is so high, that after a certain point, making critical sections
+ * smaller does not actually increase the concurrency of the application and
+ * throughput plateaus. DistributedMutex moves this breaking point to the
+ * level of hardware atomic instructions, so applications keep getting
+ * concurrency even under very high contention. It does this by reducing
+ * cache misses and contention in userspace and in the kernel by making each
+ * thread wait on a thread local node and futex. When combined critical
+ * sections are used DistributedMutex leverages template metaprogramming to
+ * allow the mutex to make better synchronization decisions based on the
+ * layout of the input and output data. This allows threads to keep working
+ * only on their own cache lines without requiring cache coherence operations
+ * when a mutex experiences heavy contention
+ *
+ * Non-timed mutex acquisitions are scheduled through intrusive LIFO
+ * contention chains. Each thread starts by spinning for a short quantum and
+ * falls back to two phased sleeping. Enqueue operations are lock free and
+ * are piggybacked off mutex acquisition attempts. The LIFO behavior of a
+ * contention chain is good in the case where the mutex is held for a short
+ * amount of time, as the head of the chain is likely to not have slept on
+ * futex() after exhausting its spin quantum. This allows us to avoid
+ * unnecessary traversal and syscalls in the fast path with a higher
+ * probability. Even though the contention chains are LIFO, the mutex itself
+ * does not adhere to that scheduling policy globally. During contention,
+ * threads that fail to lock the mutex form a LIFO chain on the central mutex
+ * state; this chain is broken when a wakeup is scheduled, and future enqueue
+ * operations form a new chain. This makes the chains themselves LIFO, but
+ * preserves global fairness through a constant factor which is limited to the
+ * number of concurrent failed mutex acquisition attempts. This binds the
+ * last in first out behavior to the number of contending threads and helps
+ * prevent starvation and latency outliers
+ *
+ * This strategy of waking up waiters one by one in a queue does not scale
+ * well when the number of threads goes past the number of cores, at which
+ * point preemption causes elevated lock acquisition latencies.
+ * DistributedMutex
+ * implements a hardware timestamp publishing heuristic to detect and adapt to
+ * preemption.
+ *
+ * DistributedMutex does not have the typical mutex API - it does not satisfy
+ * the Lockable concept. It requires the user to maintain ephemeral bookkeeping
+ * and pass that bookkeeping around to unlock() calls. The API overhead,
+ * however, comes for free when you wrap this mutex for usage with
+ * std::unique_lock, which is the recommended usage (std::lock_guard, in
+ * optimized mode, has no performance benefit over std::unique_lock, so has been
+ * omitted). A benefit of this API is that it disallows incorrect usage where a
+ * thread unlocks a mutex that it does not own, thinking a mutex is functionally
+ * identical to a binary semaphore, which, unlike a mutex, is a suitable
+ * primitive for that usage
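+ *
+ * As a rough sketch of the recommended wrapping described above (assuming
+ * the std::unique_lock support pulled in by the specialization header
+ * included at the bottom of this file; the Counter class and its members
+ * are hypothetical):
+ *
+ *    class Counter {
+ *     public:
+ *      void bump() {
+ *        auto lock = std::unique_lock{mutex_};
+ *        ++count_;
+ *      }
+ *
+ *     private:
+ *      folly::DistributedMutex mutex_;
+ *      std::uint64_t count_{0};
+ *    };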
+ *
+ * Combined critical sections allow the implementation to elide several
+ * expensive operations during the lifetime of a critical section that cause
+ * slowdowns with regular lock/unlock based usage. DistributedMutex resolves
+ * contention through combining up to a constant factor of 2 contention chains
+ * to prevent issues with fairness and latency outliers, so we retain the
+ * fairness benefits of the lock/unlock implementation with no noticeable
+ * regression when switching between the lock methods. Despite the efficiency
+ * benefits, combined critical sections can only be used when the critical
+ * section does not depend on thread local state and does not introduce new
+ * dependencies between threads when the critical section gets combined. For
+ * example, locking or unlocking an unrelated mutex in a combined critical
+ * section might lead to unexpected results or even undefined behavior. This
+ * can happen if, for example, a different thread unlocks a mutex locked by
+ * the calling thread, leading to undefined behavior as the mutex might not
+ * allow locking and unlocking from unrelated threads (the POSIX and C++
+ * standards disallow this usage for their mutexes)
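+ *
+ * As an illustrative sketch, a well-formed combined critical section only
+ * touches data protected by the mutex and hands its result back through the
+ * return value (mutex_ and counter_ here are hypothetical members):
+ *
+ *    std::uint64_t increment() {
+ *      return mutex_.lock_combine([&]() { return ++counter_; });
+ *    }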
+ *
+ * Timed locking through DistributedMutex is implemented through a centralized
+ * algorithm. The underlying contention-chains framework used in
+ * DistributedMutex is not abortable so we build abortability on the side.
+ * All waiters wait on the central mutex state by setting and resetting bits
+ * within the pointer-length word. Since pointer length atomic integers are
+ * incompatible with futex(FUTEX_WAIT) on most systems, a non-standard
+ * implementation of futex() is used, where wait queues are managed in
+ * user-space (see p1135r0 and folly::ParkingLot for more)
+ */
+template <
+ template <typename> class Atomic = std::atomic,
+ bool TimePublishing = true>
+class DistributedMutex {
+ public:
+ class DistributedMutexStateProxy;
+
+ /**
+ * DistributedMutex is only default constructible, it can neither be moved
+ * nor copied
+ */
+ DistributedMutex();
+ DistributedMutex(DistributedMutex&&) = delete;
+ DistributedMutex(const DistributedMutex&) = delete;
+ DistributedMutex& operator=(DistributedMutex&&) = delete;
+ DistributedMutex& operator=(const DistributedMutex&) = delete;
+
+ /**
+ * Acquires the mutex in exclusive mode
+ *
+ * This returns an ephemeral proxy that contains internal mutex state. This
+ * must be kept around for the duration of the critical section and passed
+ * subsequently to unlock() as an rvalue
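+   *
+   * For example, a minimal sketch of the proxy-passing pattern (mutex here
+   * is a hypothetical DistributedMutex instance):
+   *
+   *    auto proxy = mutex.lock();
+   *    // ... exclusive critical section ...
+   *    mutex.unlock(std::move(proxy));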
+ *
+ * The proxy has no public API and is intended to be for internal usage only
+ *
+ * There are three notable cases where this method causes undefined
+ * behavior:
+ *
+ * - This is not a recursive mutex. Trying to acquire the mutex twice from
+ * the same thread without unlocking it results in undefined behavior
+ * - Thread, coroutine or fiber migrations from within a critical section
+ * are disallowed. This is because the implementation requires owning the
+ * stack frame through the execution of the critical section for both
+   *   lock/unlock and combined critical sections. This also means that you
+ * cannot allow another thread, fiber or coroutine to unlock the mutex
+ * - This mutex cannot be used in a program compiled with segmented stacks,
+ * there is currently no way to detect the presence of segmented stacks
+ * at compile time or runtime, so we have no checks against this
+ */
+ DistributedMutexStateProxy lock();
+
+ /**
+ * Unlocks the mutex
+ *
+ * The proxy returned by lock must be passed to unlock as an rvalue. No
+ * other option is possible here, since the proxy is only movable and not
+ * copyable
+ *
+ * It is undefined behavior to unlock from a thread that did not lock the
+ * mutex
+ */
+ void unlock(DistributedMutexStateProxy);
+
+ /**
+ * Try to acquire the mutex
+ *
+   * A non-blocking version of the lock() function. The returned object is
+   * contextually convertible to bool, and has the value true when the mutex
+   * was successfully acquired, false otherwise
+ *
+ * This is allowed to return false spuriously, i.e. this is not guaranteed
+ * to return true even when the mutex is currently unlocked. In the event
+ * of a failed acquisition, this does not impose any memory ordering
+ * constraints for other threads
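+   *
+   * A minimal usage sketch, relying only on the contextual bool conversion
+   * described above:
+   *
+   *    if (auto proxy = mutex.try_lock()) {
+   *      // acquired; the proxy must still be passed to unlock()
+   *      mutex.unlock(std::move(proxy));
+   *    }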
+ */
+ DistributedMutexStateProxy try_lock();
+
+ /**
+ * Try to acquire the mutex, blocking for the given time
+ *
+   * Like try_lock(), this is allowed to fail spuriously and is not guaranteed
+   * to return true even when the mutex is currently unlocked, but it will
+   * only fail after the given time has elapsed
+ *
+ * try_lock_for() accepts a duration to block for, and try_lock_until()
+ * accepts an absolute wall clock time point
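+   *
+   * For example, a sketch that treats a timeout the same as any other failed
+   * acquisition:
+   *
+   *    if (auto proxy = mutex.try_lock_for(std::chrono::milliseconds{10})) {
+   *      mutex.unlock(std::move(proxy));
+   *    }
+   *
+   *    auto deadline =
+   *        std::chrono::system_clock::now() + std::chrono::seconds{1};
+   *    if (auto proxy = mutex.try_lock_until(deadline)) {
+   *      mutex.unlock(std::move(proxy));
+   *    }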
+ */
+ template <typename Rep, typename Period>
+ DistributedMutexStateProxy try_lock_for(
+ const std::chrono::duration<Rep, Period>& duration);
+
+ /**
+ * Try to acquire the lock, blocking until the given deadline
+ *
+ * Other than the difference in the meaning of the second argument, the
+ * semantics of this function are identical to try_lock_for()
+ */
+ template <typename Clock, typename Duration>
+ DistributedMutexStateProxy try_lock_until(
+ const std::chrono::time_point<Clock, Duration>& deadline);
+
+ /**
+ * Execute a task as a combined critical section
+ *
+ * Unlike traditional lock and unlock methods, lock_combine() enqueues the
+ * passed task for execution on any arbitrary thread. This allows the
+ * implementation to prevent cache line invalidations originating from
+ * expensive synchronization operations. The thread holding the lock is
+ * allowed to execute the task before unlocking, thereby forming a "combined
+ * critical section".
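+   *
+   * For example, a sketch where the task both mutates protected state and
+   * returns a value to the caller (pending_ is a hypothetical container
+   * protected by this mutex):
+   *
+   *    auto drained = mutex.lock_combine([&]() {
+   *      auto count = pending_.size();
+   *      pending_.clear();
+   *      return count;
+   *    });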
+ *
+ * This idea is inspired by Flat Combining. Flat Combining was introduced
+ * in the SPAA 2010 paper titled "Flat Combining and the
+ * Synchronization-Parallelism Tradeoff", by Danny Hendler, Itai Incze, Nir
+ * Shavit, and Moran Tzafrir -
+ * https://www.cs.bgu.ac.il/~hendlerd/papers/flat-combining.pdf. The
+ * implementation used here is significantly different from that described
+ * in the paper. The high-level goal of reducing the overhead of
+ * synchronization, however, is the same.
+ *
+ * Combined critical sections work best when kept simple. Since the
+ * critical section might be executed on any arbitrary thread, relying on
+ * things like thread local state or mutex locking and unlocking might cause
+ * incorrectness. Associativity is important. For example
+ *
+ * auto one = std::unique_lock{one_};
+ * two_.lock_combine([&]() {
+ * if (bar()) {
+ * one.unlock();
+ * }
+ * });
+ *
+ * This has the potential to cause undefined behavior because mutexes are
+ * only meant to be acquired and released from the owning thread. Similar
+ * errors can arise from a combined critical section introducing implicit
+ * dependencies based on the state of the combining thread. For example
+ *
+ * // thread 1
+ * auto one = std::unique_lock{one_};
+ * auto two = std::unique_lock{two_};
+ *
+ * // thread 2
+ * two_.lock_combine([&]() {
+ * auto three = std::unique_lock{three_};
+ * });
+ *
+ * Here, because we used a combined critical section, we have introduced a
+   * dependency from one -> three that might not be obvious to the reader
+ *
+ * This function is exception-safe. If the passed task throws an exception,
+ * it will be propagated to the caller, even if the task is running on
+ * another thread
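+   *
+   * As a sketch of the exception behavior described above:
+   *
+   *    try {
+   *      mutex.lock_combine([&]() { throw std::runtime_error{"err"}; });
+   *    } catch (const std::runtime_error&) {
+   *      // the exception surfaces on the calling thread
+   *    }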
+ *
+ * There are three notable cases where this method causes undefined
+ * behavior:
+ *
+ * - This is not a recursive mutex. Trying to acquire the mutex twice from
+ * the same thread without unlocking it results in undefined behavior
+ * - Thread, coroutine or fiber migrations from within a critical section
+ * are disallowed. This is because the implementation requires owning the
+ * stack frame through the execution of the critical section for both
+   *   lock/unlock and combined critical sections. This also means that you
+ * cannot allow another thread, fiber or coroutine to unlock the mutex
+ * - This mutex cannot be used in a program compiled with segmented stacks,
+ * there is currently no way to detect the presence of segmented stacks
+ * at compile time or runtime, so we have no checks against this
+ */
+ template <typename Task>
+ auto lock_combine(Task task) -> decltype(std::declval<const Task&>()());
+
+ private:
+  // The central mutex word: a single pointer-length atomic that holds the
+  // lock state and the head of the intrusive contention chain of waiters
+  Atomic<std::uintptr_t> state_{0};
+};
+
+} // namespace distributed_mutex
+} // namespace detail
+
+/**
+ * Bring the default instantiation of DistributedMutex into the folly
+ * namespace without requiring any template arguments for public usage
+ */
+extern template class detail::distributed_mutex::DistributedMutex<>;
+using DistributedMutex = detail::distributed_mutex::DistributedMutex<>;
+
+} // namespace folly
+
+#include <folly/synchronization/DistributedMutex-inl.h>
+#include <folly/synchronization/DistributedMutexSpecializations.h>