From 8daa83a594a2e98f39d764422bfbdbc62c9efd44 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Fri, 19 Apr 2024 19:20:00 +0200
Subject: Adding upstream version 2:4.20.0+dfsg.

Signed-off-by: Daniel Baumann
---
 ctdb/doc/readonlyrecords.txt | 343 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 343 insertions(+)
 create mode 100644 ctdb/doc/readonlyrecords.txt

diff --git a/ctdb/doc/readonlyrecords.txt b/ctdb/doc/readonlyrecords.txt
new file mode 100644
index 0000000..e7be1c3
--- /dev/null
+++ b/ctdb/doc/readonlyrecords.txt
@@ -0,0 +1,343 @@
Read-Only locks in CTDB
=======================

Problem
=======
CTDB currently only supports exclusive Read-Write locks for clients (samba) accessing
the TDB databases.
This mostly works well, but when very many clients access the same file at the same
time, the exclusive lock, as well as the record itself, bounces rapidly between nodes
and becomes a scalability limitation.

This primarily affects locking.tdb and brlock.tdb, two databases where record access
is read-mostly and where writes are comparatively rare.

If CTDB provided shared, non-exclusive Read-Only lock semantics for this common case,
scaling for these workloads would improve greatly.


Desired properties
==================
We can not make backward-incompatible changes to the ctdb_ltdb header for the records.

A Read-Only lock enabled ctdb daemon must be able to interoperate with a non-Read-Only
lock enabled daemon.

Getting a Read-Only lock should not be slower than getting a Read-Write lock.

Revoking the Read-Only locks for a record should involve only those nodes that
currently hold a Read-Only lock, and should avoid broadcasting opportunistic
revocations (so we must track which nodes hold delegations).

When a Read-Write lock is requested and there are Read-Only locks delegated to other
nodes, the DMASTER will defer the record migration until all Read-Only locks have
first been revoked (synchronous revoke).

Because revoking Read-Only locks adds cost to getting a Read-Write lock, the
implementation should avoid creating Read-Only locks unless it has an indication that
there is contention. This may mean that even if a client requests a Read-Only lock,
we might still provide a full Read-Write lock in order to avoid the cost of revoking
the locks in some cases.

Read-Only locks require additional state to be stored in a separate database,
containing information about which nodes have been delegated Read-Only locks.
This database should be kept at a minimal size.

Read-Only locks should not significantly complicate the normal record
create/migration/deletion cycle for normal records.

Read-Only locks should not complicate the recovery process.

Read-Only locks should not complicate the vacuuming process.

We should, as far as possible, avoid forking new child processes from the main daemon.

Client-side implementations (samba, libctdb, others) should see minimal impact when
Read-Only locks are implemented.
A client-side implementation must be possible with only minor conditionals added to
the existing lock-check-fetch-unlock loop that clients use today for Read-Write locks,
so that clients need only one single loop that can handle both Read-Write locking and
Read-Only locking. Clients should not need two nearly identical loops.


Implementation
==============

Four new flags are allocated in the ctdb_ltdb record header:
HAVE_DELEGATIONS, HAVE_READONLY_LOCK, REVOKING_READONLY and REVOKE_COMPLETE.
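The sketch below shows one way these flags could be expressed. The names follow this
document, but the bit values are illustrative assumptions, not the values actually
allocated by CTDB:

    /* Sketch only: the four new read-only flags, carried in spare bits of
     * the ctdb_ltdb header flags field.  Bit values are assumptions. */
    #define HAVE_DELEGATIONS        (0x00010000)
    #define HAVE_READONLY_LOCK      (0x00020000)
    #define REVOKING_READONLY       (0x00040000)
    #define REVOKE_COMPLETE         (0x00080000)

    /* Convenience mask, e.g. for stripping all read-only state at once
     * during recovery (see "Recovery process changes" below). */
    #define READONLY_FLAGS_MASK     (HAVE_DELEGATIONS | HAVE_READONLY_LOCK | \
                                     REVOKING_READONLY | REVOKE_COMPLETE)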
HAVE_DELEGATIONS is a flag that can only be set on the node that is currently the
DMASTER for the record. When set, this flag indicates that there are Read-Only locks
delegated to other nodes in the cluster for this record.

HAVE_READONLY_LOCK is a flag that is only set on nodes that are NOT the DMASTER for
the record. If set, this flag indicates that the local copy is an up-to-date Read-Only
version of the record. A client that only needs to read, but not to write, the record
can safely use the content of this record as is, regardless of the value of the
DMASTER field of the record.

REVOKING_READONLY is a flag that is used while a set of Read-Only delegations are
being revoked.
This flag is only set when HAVE_DELEGATIONS is also set, and is cleared at the same
time as HAVE_DELEGATIONS is cleared.
Normal operation is that the HAVE_DELEGATIONS flag is set first, when the first
delegation is generated. When the delegations are about to be revoked, the
REVOKING_READONLY flag is set too.
Once all delegations are revoked, both flags are cleared at the same time.
While REVOKING_READONLY is set, any requests for the record, whether normal requests
or requests for a Read-Only copy, will be deferred.
Deferred requests are linked on a list of deferred requests until the revocation is
completed.
This flag is set by the main ctdb daemon when it starts revoking this record.

REVOKE_COMPLETE:
The actual revoke of records is done by a child process, spawned from the main ctdb
daemon when it starts the process to revoke the records.
Once the child process has finished revoking all delegations, it sets the
REVOKE_COMPLETE flag for this record to signal to the main daemon that the record has
been successfully revoked.
At this stage the child process also triggers an event in the main daemon indicating
that the revoke is complete and that the main daemon should start re-processing all
deferred requests.


Once the revoke process is completed there will be at least one deferred request to
access this record: the initial call for an exclusive fetch_lock() that triggered the
revoke process to be started.
In addition to this deferred request there may be additional requests that have also
become deferred while the revoke was in progress. These can be either exclusive
fetch_lock() requests or Read-Only lock requests.
Once the revoke is completed, the main daemon will reprocess all exclusive
fetch_lock() requests immediately and respond to these clients.
Any Read-Only lock requests will be deferred for an additional period of time before
they are re-processed.
This is to give the client that needs a fetch_lock() to update the record some time
to access and work on the record, without having to compete with the possibly very
many Read-Only requests.


The ctdb_db structure is expanded so that it contains one extra TDB database for each
normal, non-persistent database.
This new database is used for tracking delegations for the records.
A record in the normal database that has HAVE_DELEGATIONS set will always have a
corresponding record at the same key. This record contains the set of all nodes that
the record is delegated to.
This tracking database is lockless, using TDB_NOLOCK, and is only ever accessed by
the main ctdbd daemon.
The lockless nature and the fact that no other process ever accesses this TDB means
we are guaranteed non-blocking access to records in the tracking database.
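As a minimal sketch: tdb_open() and the TDB_NOLOCK flag are the standard tdb API, but
the record layout and the function name below are illustrative assumptions:

    #include <fcntl.h>
    #include <stdint.h>
    #include <tdb.h>

    /* Hypothetical layout of a tracking-db record: the set of nodes (by
     * pnn) that currently hold a delegation for the corresponding key. */
    struct delegation_set {
            uint32_t num_nodes;
            uint32_t pnns[];        /* extended as delegations are added */
    };

    /* Only the main ctdbd process ever touches the tracking db, so it can
     * be opened with TDB_NOLOCK: no fcntl locking, hence no blocking. */
    struct tdb_context *open_tracking_db(const char *path)
    {
            return tdb_open(path, 0, TDB_NOLOCK, O_RDWR | O_CREAT, 0600);
    }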
The ctdb_call PDU gains a new flag, WANT_READONLY, and possibly also a new callid:
CTDB_FETCH_WITH_HEADER_FUNC.
This new function returns not only the record, as CTDB_FETCH_FUNC does, but also
the full ctdb_ltdb record HEADER prepended to the record.
This function is optional; clients that do not care what the header is can continue
using just CTDB_FETCH_FUNC.

The WANT_READONLY flag is used to request a read-only copy of the record from the
DMASTER/LMASTER.
If the record does not yet exist, this is returned as an error to the client and the
client will retry the request loop.

A new control is added to make remote nodes remove the HAVE_READONLY_LOCK flag from
a record and to invalidate any delegated read-only copies in the databases.


Client implementation
=====================
Clients today use a loop for record fetch lock that looks like this:

  try_again:
        lock record in tdb

        if record does not exist in tdb,
                unlock record
                ask ctdb to migrate record onto the node
                goto try_again

        if record dmaster != this node pnn
                unlock record
                ask ctdb to migrate record onto the node
                goto try_again

  finished:

where we basically spin until the record is migrated onto the node and we have
managed to pin it down.

This will instead change to something like:

  try_again:
        lock record in tdb

        if record does not exist in tdb,
                unlock record
                ask ctdb to migrate record onto the node
                goto try_again

        if record dmaster == current node pnn
                goto finished

        if read-only lock
                if HAVE_READONLY_LOCK or HAVE_DELEGATIONS is set
                        goto finished
                else
                        unlock record
                        ask ctdb for read-only copy (WANT_READONLY[|WITH_HEADER])
                        if failed to get read-only copy (*A)
                                ask ctdb to migrate the record onto the node
                                goto try_again
                        lock record in tdb
                        goto finished

        unlock record
        ask ctdb to migrate record onto the node
        goto try_again

  finished:

If the record does not yet exist in the local TDB, we always perform a full fetch for
a Read-Write lock, even if only a Read-Only lock was requested.
This means that for a first access we always grab a Read-Write lock, and thus upgrade
any request for a Read-Only lock into a Read-Write request.
This creates the record, migrates it onto the node and makes the local node become
the DMASTER for the record.

Future references to this same record by the local samba daemons will still
access/lock the record locally, without triggering the creation of a Read-Only
delegation, since the record is already hosted on the local node as DMASTER.

Only if the record is contended, i.e. it has been created and migrated onto the node
but we are no longer the DMASTER for this record, will we create a Read-Only
delegation.
This heuristic provides a mechanism whereby we will not create Read-Only delegations
until we have some indication that the record may be contended.

This avoids creating and revoking Read-Only delegations when only a single client is
repeatedly accessing the same set of records.
This also aims to limit the size of the tracking tdb.
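In C, the single unified loop could look roughly like the sketch below. Every helper
and the header struct are hypothetical placeholders rather than real libctdb calls;
only the control flow mirrors the pseudocode above:

    #include <stdbool.h>
    #include <stdint.h>
    #include <tdb.h>                    /* TDB_DATA */

    /* Hypothetical stand-in for the ctdb_ltdb header fields we need. */
    struct ltdb_header {
            uint64_t rsn;
            uint32_t dmaster;
            uint32_t flags;
    };

    /* Sketch: one loop serving both Read-Write and Read-Only callers.
     * lock_record(), record_exists(), read_local_header(), this_node_pnn(),
     * request_migration() and request_readonly_copy() are placeholders. */
    int fetch_lock(TDB_DATA key, bool want_readonly)
    {
            struct ltdb_header hdr;

    try_again:
            lock_record(key);

            if (!record_exists(key)) {
                    unlock_record(key);
                    request_migration(key);     /* full Read-Write fetch */
                    goto try_again;
            }
            read_local_header(key, &hdr);

            if (hdr.dmaster == this_node_pnn())
                    return 0;                   /* full access, record pinned */

            if (want_readonly) {
                    if (hdr.flags & (HAVE_READONLY_LOCK | HAVE_DELEGATIONS))
                            return 0;           /* local copy is safe to read */
                    unlock_record(key);
                    if (request_readonly_copy(key) != 0) {  /* failed: (*A) */
                            request_migration(key);
                            goto try_again;
                    }
                    lock_record(key);
                    return 0;
            }

            unlock_record(key);
            request_migration(key);
            goto try_again;
    }

The only additions relative to the existing loop are the want_readonly branch and the
flag test, which is exactly the "minor conditionals" requirement from the desired
properties above.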
Server implementation
=====================
When receiving a ctdb_call with the WANT_READONLY flag:

If this is the LMASTER for the record and the record does not yet exist, the LMASTER
will return an error back to the client (*A above) and the client will try to
recover.
In particular, the LMASTER will not create a new record in this case.

If this is the LMASTER for the record and the record exists, the PDU will be
forwarded to the DMASTER for the record.

If this node is not the DMASTER for this record, we forward the PDU back to the
LMASTER, just as we always do today.

If this is the DMASTER for the record, we need to create a Read-Only delegation.
This is done by:

        lock record
        increase the RSN by one for this record
        set the HAVE_DELEGATIONS flag for the record
        write the updated record to the TDB
        create/update the tracking TDB and add this new node to the set of delegations
        send a modified copy of the record back to the requesting client.
          modifications are that the RSN is decremented by one, so delegated records
          are "older" than on the DMASTER, the HAVE_DELEGATIONS flag is stripped off,
          and HAVE_READONLY_LOCK is added.
        unlock record

It is important to note that this does not trigger a record migration.


When receiving a ctdb_call without the WANT_READONLY flag:

If this is the DMASTER for the record, this might trigger a migration. If delegations
exist, we must first revoke them before allowing the Read-Write request to proceed.
So, if the record has HAVE_DELEGATIONS set, we create a child process and defer
processing of this PDU until the child process has completed.

From the child process we call out to all nodes that have delegations for this record
and tell them to invalidate the record by clearing the HAVE_READONLY_LOCK flag from
it.
Once all delegated nodes have responded, the child process signals back to the main
daemon that the revoke has completed. (The child process may not access the tracking
tdb, since it is lockless.)

The main process is triggered to re-process the PDU once the child process has
finished. The main daemon deletes the corresponding record in the tracking database,
clears the HAVE_DELEGATIONS flag for the record and then proceeds to perform the
migration as usual.

When receiving a ctdb_call without the WANT_READONLY flag, we want all delegations to
be revoked, so we must take care that the delegations are revoked unconditionally
before we even check whether we are already the DMASTER (in which case the ctdb_call
would normally just be a no-op (*B below)).



Recovery process changes
========================
A recovery implicitly clears/revokes any Read-Only records and delegations from all
databases.

Delegation of Read-Only locks is done in such a way that delegated records always
have an RSN smaller than that of the copy on the DMASTER.

During recoveries we do not need to take any special action other than always picking
the copy of the record that has the highest RSN, which is what we already do today.

During the recovery process, we strip all flags off all records while writing the new
content of the database during the PUSH_DB control.

During processing of the PUSH_DB control, once the new database has been written, we
also wipe the tracking database.

This keeps the changes to the recovery process minimal and non-intrusive.
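A minimal sketch of the two recovery-side rules, reusing the hypothetical ltdb_header
stand-in and the flag names from the earlier sketches:

    /* Rule 1: between two copies of a record, keep the one with the higher
     * RSN.  Delegated read-only copies were handed out with the RSN
     * decremented by one, so the DMASTER's copy always wins here. */
    static bool copy_is_newer(const struct ltdb_header *a,
                              const struct ltdb_header *b)
    {
            return a->rsn > b->rsn;
    }

    /* Rule 2: while writing the rebuilt database during PUSH_DB, strip all
     * read-only flags; a recovery implicitly revokes every delegation. */
    static void strip_readonly_flags(struct ltdb_header *h)
    {
            h->flags &= ~(HAVE_DELEGATIONS | HAVE_READONLY_LOCK |
                          REVOKING_READONLY | REVOKE_COMPLETE);
    }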
Vacuuming process
=================
Vacuuming needs only minimal changes.

When vacuuming runs, it will do a fetch_lock to migrate any remote records back onto
the LMASTER before the record can be purged. This automatically forces all
delegations for that record to be revoked before the record is copied back onto the
LMASTER.
This handles the case where the LMASTER is not the DMASTER for the record that will
be purged.
The migration in this case does force any delegations to be revoked before the
vacuuming takes place.

Still missing is the case where delegations exist and the LMASTER is also the
DMASTER.
For this case we need to change the vacuuming process to unconditionally always try a
fetch_lock when HAVE_DELEGATIONS is set, even if the record is already stored locally
(*B).
This fetch_lock will not cause any migration by the ctdb daemon, but since it does
not carry the WANT_READONLY flag it will still force the delegations to be revoked;
no migration will be triggered.


Traversal process
=================
The traversal process is changed to ignore any records with the HAVE_READONLY_LOCK
flag set.


Forward/Backward Compatibility
==============================
Non-readonly locking daemons must be able to interoperate with readonly locking
enabled daemons.

Non-readonly enabled daemons fetching records from readonly enabled daemons:
Non-readonly enabled daemons do not know about, and never set, the WANT_READONLY
flag, so these daemons will always request a full migration with a full fetch-lock
for all records. Thus a request from a non-readonly enabled daemon will always cause
any existing delegations to be immediately revoked. Access will work, but performance
may suffer, since there will be a lot of revoking of delegations.

Readonly enabled daemons fetching records with WANT_READONLY from non-readonly
enabled daemons:
Non-readonly enabled daemons ignore the WANT_READONLY flag and never return
delegations; they always return a full record migration.
Full record migration is allowed by the protocol even if the originator only requests
the 'hint' WANT_READONLY, so this access also interoperates between daemons with
different capabilities.
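The hint semantics can be illustrated with a short sketch (names are illustrative,
not real ctdb internals): a daemon that understands WANT_READONLY may answer with a
delegation, while any daemon is always free to ignore the flag and answer with a full
migration.

    /* Sketch: WANT_READONLY is only a hint.  A non-readonly enabled daemon
     * never tests the flag and always takes the full-migration path, which
     * the protocol permits.  Names are illustrative. */
    static void dmaster_handle_call(struct ctdb_call *call)
    {
            if ((call->flags & WANT_READONLY) && record_exists(call->key)) {
                    create_readonly_delegation(call);   /* readonly-enabled path */
                    return;
            }
            start_full_migration(call);                 /* always-valid fallback */
    }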