Diffstat (limited to 'src/doc')
-rw-r--r-- | src/doc/Commitdir.txt            |  24
-rw-r--r-- | src/doc/caching.txt              | 313
-rw-r--r-- | src/doc/dynamic-throttle.txt     | 127
-rw-r--r-- | src/doc/header.txt               |  13
-rw-r--r-- | src/doc/inos.txt                 |  11
-rw-r--r-- | src/doc/killpoints.txt           |  42
-rw-r--r-- | src/doc/lazy_posix.txt           |  53
-rw-r--r-- | src/doc/mds_locks.txt            |  66
-rw-r--r-- | src/doc/modeline.txt             |   2
-rw-r--r-- | src/doc/mon-janitorial-queue.txt |  38
-rw-r--r-- | src/doc/mon-wishlist.txt         |  51
-rw-r--r-- | src/doc/rgw.txt                  |  28
-rw-r--r-- | src/doc/rgw/multisite-reshard.md | 103
13 files changed, 871 insertions, 0 deletions
diff --git a/src/doc/Commitdir.txt b/src/doc/Commitdir.txt
new file mode 100644
index 000000000..05c727be6
--- /dev/null
+++ b/src/doc/Commitdir.txt
@@ -0,0 +1,24 @@
OLD

How Directory Committing Works:

Each CDir has:
  version - current version of directory
  committing_version - which version was sent to stable storage
  last_committed_version - last version to be safely stored

Each Inode has:
  parent_dir_version - what dir version i was in when i was dirtied. (*)

  (*) note that if you change an inode, mark_dirty() again, even if it's already dirty!

How committing works:

A call to commit_dir(dir, context) will ensure that the _current_ version is stored safely on disk before the context is finished.

When a commit completes, inodes in the directory are checked. If they are dirty and belonged to the _committed_ (or earlier) version, they are marked clean. If they belong to a newer version, they are _still dirty_.

diff --git a/src/doc/caching.txt b/src/doc/caching.txt
new file mode 100644
index 000000000..f623824cb
--- /dev/null
+++ b/src/doc/caching.txt
@@ -0,0 +1,313 @@
SPANNING TREE PROPERTY

All metadata that exists in the cache is attached directly or indirectly to the root inode. That is, if the /usr/bin/vi inode is in the cache, then /usr/bin, /usr, and / are too, including the inodes, directory objects, and dentries.

AUTHORITY

The authority maintains a list of which nodes cache each inode. Additionally, each replica is assigned a nonce (initially 0) to disambiguate multiple replicas of the same item (see below).

  map<int, int> replicas;  // maps replicating mds# to nonce

This replica set _always_ includes all nodes that cache a particular object, but may additionally include nodes that used to cache it but no longer do. In those cases, an expire message should be in transit. That is, we have two invariants:

  1) the authority's replica set will always include all actual replicas, and

  2) cache expiration notices will be reliably delivered to the authority.

The second invariant is particularly important because the presence of replicas pins the metadata object in memory on the authority, preventing it from being trimmed from the cache. Notification of the replicas' expiration is required to allow previously replicated objects to eventually be trimmed from the cache as well.

Each metadata object has an authority bit that indicates whether it is authoritative or a replica.

REPLICA NONCE

Each replicated object maintains a "nonce" value, issued by the authority at the time the replica was created. If the authority has already created a replica for the given MDS, the new replica will be issued a new (incremented) nonce. This nonce is attached to cache expirations, and allows the authority to disambiguate expirations when multiple replicas of the same object are created and cache expiration is coincident with replication. That is, when an old replica is expired from the replicating MDS at the same time that a new replica is issued by the authority and the resulting messages cross paths, the authority can tell that it was the old replica that was expired and effectively ignore the expiration message. The replica is removed from the replicas map only if the nonce matches.

SUBTREE PARTITION

Authority of the file system namespace is partitioned using a subtree-based partitioning strategy.
This strategy effectively separates directory inodes from directory contents, such that the directory contents are the unit of redelegation. That is, if / is assigned to mds0 and /usr to mds1, the inode for /usr will be managed by mds0 (it is part of the / directory), while the contents of /usr (and everything nested beneath it) will be managed by mds1.

The description for this partition exists solely in the collective memory of the MDS cluster and in the individual MDS journals. It is not described in the regular on-disk metadata structures. This is related to the fact that authority delegation is a property of the {\it directory} and not the directory's {\it inode}.

Consequently, if an MDS is authoritative for a directory inode and does not yet have any state associated with the directory in its cache, it can assume that it is also authoritative for the directory.

Directory state consists of a data object that describes any cached dentries contained in the directory, information about the relationship between the cached contents and what appears on disk, and any delegation of authority. That is, each CDir object has a dir_auth element. Normally dir_auth has a value of AUTH_PARENT, meaning that the authority for the directory is the same as the directory's inode. When dir_auth specifies another metadata server, that directory is a point of authority delegation and becomes a {\it subtree root}. A CDir is a subtree root iff its dir_auth specifies an MDS id (and is not AUTH_PARENT).

  - A dir is a subtree root iff dir_auth != AUTH_PARENT.

  - If dir_auth = AUTH_PARENT then the inode auth == dir auth, but the converse may not be true.

The authority for any metadata object in the cache can be determined by following the parent pointers toward the root until a subtree root CDir object is reached, at which point the authority is specified by its dir_auth.

Each MDS cache maintains a subtree data structure that describes the subtree partition for all objects currently in the cache:

  map< CDir*, set<CDir*> > subtrees;

  - A dir will appear in the subtree map (as a key) IFF it is a subtree root.

Each subtree root will have an entry in the map. The map value is a set of all other subtree roots nested beneath that point. Nested subtree roots effectively bound or prune a subtree. For example, if we had the following partition:

  mds0 /
  mds1 /usr
  mds0 /usr/local
  mds0 /home

The subtree map on mds0 would be

  /          -> (/usr, /home)
  /usr/local -> ()
  /home      -> ()

and on mds1:

  /usr -> (/usr/local)

AMBIGUOUS DIR_AUTH

While metadata for a subtree is being migrated between two MDS nodes, the dir_auth for the subtree root is allowed to be ambiguous. That is, it will specify both the old and new MDS ids, indicating that a migration is in progress.

If a replicated metadata object is expired from the cache from a subtree whose authority is ambiguous, the cache expiration is sent to both potential authorities. This ensures that the message will be reliably delivered, even if either of those nodes fails. A number of alternative strategies were considered. Sending the expiration to the old or new authority and having it forwarded if authority has been delegated can result in message loss if the forwarding node fails.
Pinning ambiguous metadata in the cache is expensive for implementation reasons, and delaying the transmission of expiration messages is difficult to implement because the replicating node must send the final expiration messages once the subtree authority is disambiguated, forcing it to keep certain elements of its cache in memory. Although duplicated expirations incur a small communications overhead, the implementation is much simpler.

AUTH PINS

Most operations that modify metadata must allow some amount of time to pass in order for the operation to be journaled or for communication to take place between the object's authority and any replicas. For this reason the object must not only be pinned in the authority's metadata cache, but also locked such that the object's authority is not allowed to change until the operation completes. This is accomplished using {\it auth pins}, which increment a reference counter on the object in question, as well as on all parent metadata objects up to the root of the subtree. As long as the pin is in place, it is impossible for that subtree (or any fragment of it that contains one or more pins) to be migrated to a different MDS node. Pins can be placed on both inodes and directories.

Auth pins can only exist for authoritative metadata, because they are only created if the object is authoritative, and their presence prevents the migration of authority.

FREEZING

More specifically, auth pins prevent a subtree from being frozen. When a subtree is frozen, all updates to metadata are forbidden. This includes updates to the replicas map that describes which replicas (and nonces) exist for each object.

In order for metadata to be migrated between MDS nodes, it must first be frozen. The root of the subtree is initially marked as {\it freezing}. This prevents the creation of any new auth pins within the subtree. After all existing auth pins are removed, the subtree is then marked as {\it frozen}, at which point all updates are forbidden. This allows metadata state to be packaged up in a message and transmitted to the new authority, without worrying about intervening updates.

If the directory at the base of a freezing or frozen subtree is not also a subtree root (that is, it has dir_auth == AUTH_PARENT), the directory's parent inode is auth pinned.

  - a frozen tree root dir will auth_pin its inode IFF it is auth AND not a subtree root.

This prevents a parent directory from being concurrently frozen, and avoids a range of resulting implementation complications relating to metadata migration.

CACHE EXPIRATION FOR EXPORTING SUBTREES

Cache expiration messages that are received for a subtree that is being exported are either deferred or handled immediately, based on the sender and receiver states. The importing MDS will always defer until after the export finishes, because the import could fail. The exporting MDS processes the expire UNLESS the expiring MDS does not know about the export or the exporting MDS is no longer auth. Because MDSes get witness notifications on export, this is safe. Either:

a) The expiring MDS knows about the export, and has sent messages to both MDSes involved, or

b) The expiring MDS did not know about the export at the time the message was sent, and so only sent it to the exporting MDS. (This implies that the exporting MDS hasn't yet encoded the state to send to the replica MDS.)
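As a compact restatement of those rules, a sketch in code; this is only a paraphrase of the paragraph above, with invented type and field names rather than the actual MDCache/Migrator logic:

  // Paraphrase of the expire-handling rules above; names are invented and the
  // real MDS code is structured differently.
  enum class ExpireAction { Process, Defer };

  struct ExpireContext {
    bool i_am_importer;         // this MDS is importing the subtree
    bool i_am_still_auth;       // the exporter is still authoritative for it
    bool sender_knows_export;   // expire was sent to both exporter and importer
  };

  inline ExpireAction handle_expire_during_migration(const ExpireContext& c) {
    if (c.i_am_importer)
      return ExpireAction::Defer;          // the import could still fail
    // exporting side: process unless the sender didn't know about the export
    // or we are no longer auth; otherwise defer until the migration resolves
    if (c.sender_knows_export && c.i_am_still_auth)
      return ExpireAction::Process;
    return ExpireAction::Defer;
  }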
When the subtree export completes, deferred expirations are either processed (if the MDS is authoritative) or discarded (if it is not). Because either the exporting or the importing MDS can fail during the migration process, the MDS cannot tell whether it will be authoritative until the process completes.

During a migration, the subtree will first be frozen on both the exporter and importer, and then all other replicas will be informed of the subtree's ambiguous authority. This ensures that all expirations during migration will go to both parties, and nothing will be lost in the event of a failure.

NORMAL MIGRATION

The exporter begins by doing some checks in export_dir() to verify that it is permissible to export the subtree at this time. In particular, the cluster must not be degraded, the subtree root may not be freezing or frozen, and the path must be pinned (\ie not conflicted with a rename). If these conditions are met, the subtree root directory is temporarily auth pinned, the subtree freeze is initiated, and the exporter is committed to the subtree migration, barring an intervening failure of the importer or itself.

The MExportDiscover serves simply to ensure that the inode for the base directory being exported is open on the destination node. It is pinned by the importer to prevent it from being trimmed. This occurs before the exporter completes the freeze of the subtree to ensure that the importer is able to replicate the necessary metadata. When the exporter receives the MExportDiscoverAck, it allows the freeze to proceed by removing its temporary auth pin.

The MExportPrep message then follows to populate the importer with a spanning tree that includes all dirs, inodes, and dentries necessary to reach any nested subtrees within the exported region. This replicates metadata as well, but it is pushed out by the exporter, avoiding deadlock with the regular discover and replication process. The importer is responsible for opening the bounding directories from any third parties authoritative for those subtrees before acknowledging. This ensures that the importer has correct dir_auth information about where authority is redelegated for all points nested beneath the subtree being migrated. While processing the MExportPrep, the importer freezes the entire subtree region to prevent any new replication or cache expiration.

A warning stage occurs only if the base subtree directory is open by nodes other than the importer and exporter. If it is not, then this implies that no metadata within or nested beneath the subtree is replicated by any node other than the importer and exporter. If it is, then an MExportWarning message informs any bystanders that the authority for the region is temporarily ambiguous, and lists both the exporter and importer as authoritative MDS nodes. In particular, bystanders who are trimming items from their cache must send MCacheExpire messages to both the old and new authorities. This is necessary to ensure that the surviving authority reliably receives all expirations even if the importer or exporter fails. While the subtree is frozen (on both the importer and exporter), expirations will not be immediately processed; instead, they will be queued until the region is unfrozen and it can be determined whether the node is authoritative.
The exporter walks the subtree hierarchy and packages up an MExport message containing all metadata and important state (\eg, information about metadata replicas). At the same time, the exporter's metadata objects are flagged as non-authoritative. The MExport message sends the actual subtree metadata to the importer. Upon receipt, the importer inserts the data into its cache, marks all objects as authoritative, and logs a copy of all metadata in an EImportStart journal message. Once that has safely flushed, it replies with an MExportAck. The exporter can now log an EExport journal entry, which ultimately specifies that the export was a success. In the presence of failures, it is the existence of the EExport entry only that disambiguates authority during recovery.

Once logged, the exporter will send an MExportNotify to any bystanders, informing them that the authority is no longer ambiguous and cache expirations should be sent only to the new authority (the importer). Once these are acknowledged back to the exporter, implicitly flushing the bystander-to-exporter message streams of any stray expiration notices, the exporter unfreezes the subtree, cleans up its migration-related state, and sends a final MExportFinish to the importer. Upon receipt, the importer logs an EImportFinish(true) (noting locally that the export was indeed a success), unfreezes its subtree, processes any queued cache expirations, and cleans up its state.

PARTIAL FAILURE RECOVERY

RECOVERY FROM JOURNAL

diff --git a/src/doc/dynamic-throttle.txt b/src/doc/dynamic-throttle.txt
new file mode 100644
index 000000000..39ce34506
--- /dev/null
+++ b/src/doc/dynamic-throttle.txt
@@ -0,0 +1,127 @@
------
TOPIC:
------
The dynamic backoff throttle was introduced in the Jewel timeframe to produce
stable and improved performance from filestore. It should also improve the
average and 99th percentile latency significantly.

-----------
WHAT IS IT?
-----------
The old throttle scheme in filestore allows all IOs until the outstanding
IOs/bytes reach some limit. Once the threshold is crossed, it will not allow
any more IO to go through until the outstanding IOs/bytes drop below the
threshold. That's why, once the threshold is crossed, the write behavior
becomes very spiky.
The idea of the new throttle is that, based on some user-configurable
parameters, it can start throttling (inducing delays) early, and with proper
parameter values we can prevent the outstanding IOs/bytes from ever reaching
the threshold mark. This dynamic backoff throttle is implemented in the
following scenarios within filestore.

  1. The filestore op worker queue holds the transactions that are yet to be applied.
     This queue needs to be bounded, and a backoff throttle is used to gradually
     induce delays to the queueing threads if the ratio of current outstanding
     bytes|ops and filestore_queue_max_(bytes|ops) is more than
     filestore_queue_low_threshhold. The throttle will block IO once the
     outstanding bytes|ops reach the max value (filestore_queue_max_(bytes|ops)).

     The user-configurable options for adjusting the delay are the following.

     filestore_queue_low_threshhold:
       Valid values should be between 0-1 and shouldn't be > *_high_threshold.

     filestore_queue_high_threshhold:
       Valid values should be between 0-1 and shouldn't be < *_low_threshold.

     filestore_expected_throughput_ops:
       Shouldn't be less than or equal to 0.
     filestore_expected_throughput_bytes:
       Shouldn't be less than or equal to 0.

     filestore_queue_high_delay_multiple:
       Shouldn't be less than 0 and shouldn't be > *_max_delay_multiple.

     filestore_queue_max_delay_multiple:
       Shouldn't be less than 0 and shouldn't be < *_high_delay_multiple.

  2. The journal usage throttle is implemented to gradually slow down
     queue_transactions callers as the journal fills up. We don't want the
     journal to become full, as this will again induce spiky behavior. The
     configs work very similarly to the filestore op worker queue throttle.

     journal_throttle_low_threshhold
     journal_throttle_high_threshhold
     filestore_expected_throughput_ops
     filestore_expected_throughput_bytes
     journal_throttle_high_multiple
     journal_throttle_max_multiple

This scheme will not induce any delay in [0, low_threshold].
In [low_threshold, high_threshold), delays should be injected based on a line
from 0 at low_threshold to high_multiple * (1/expected_throughput) at high_threshold.
In [high_threshold, 1), delays should be injected based on a line from
(high_multiple * (1/expected_throughput)) at high_threshold to
(high_multiple * (1/expected_throughput)) + (max_multiple * (1/expected_throughput))
at 1.
Setting *_high_multiple and *_max_multiple to 0 gives an effect similar to the
old throttle (this is the default). Setting filestore_queue_max_ops and
filestore_queue_max_bytes to zero disables the entire backoff throttle.

-------------------------
HOW TO PICK PROPER VALUE ?
-------------------------

The general guidelines are the following.

  filestore_queue_max_bytes and filestore_queue_max_ops:
  ------------------------------------------------------

  This directly depends on how much the filestore op_wq will grow in memory, so
  the user needs to calculate how much memory can be given to each OSD. A bigger
  value means the current/max ratio will be smaller and the throttle will be a
  bit more relaxed. This could help in the case of 100% writes, but in the case
  of mixed read/write it can induce more latency for reads on objects that have
  not yet been applied, i.e. have yet to reach their final filesystem location.
  Ideally, once we reach the max throughput of the backend, increasing further
  will only increase memory usage and latency.

  filestore_expected_throughput_bytes and filestore_expected_throughput_ops:
  ---------------------------------------------------------------------------

  This directly correlates to how much delay needs to be induced per throttle.
  Ideally, this should be a bit above (or the same as) your backend device
  throughput.

  *_low_threshhold and *_high_threshhold:
  ---------------------------------------

  If your backend is fast, you may want to set these threshold values higher so
  that the initial throttling is lighter and the backend transactions get a
  chance to catch up.

  *_high_multiple and *_max_multiple:
  -----------------------------------

  These are the delay tunables, and the goal is to make sure the current value
  never reaches the max. For HDDs and smaller block sizes the old throttle may
  make sense; in that case these delay multiples should be set to 0.

It is recommended that users run the ceph_smalliobenchfs tool to see the effect
of these parameters (and decide) on their hardware before deploying OSDs.

The following parameters need to be supplied.

  1. --journal-path

  2. --filestore-path
     (Just mount the empty data partition and give the path here)

  3. --op-dump-file
'ceph_smalliobenchfs --help' will show all the options that the user can tweak.
If the user wants to supply any ceph.conf parameter related to filestore, it
can be done by adding '--' in front, e.g. --debug_filestore.

Once the test is done, analyze the throughput and latency information dumped
into the file provided in the option --op-dump-file.

diff --git a/src/doc/header.txt b/src/doc/header.txt
new file mode 100644
index 000000000..bccdb8153
--- /dev/null
+++ b/src/doc/header.txt
@@ -0,0 +1,13 @@
// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
// vim: ts=8 sw=2 smarttab
/*
 * Ceph - scalable distributed file system
 *
 * Copyright (C) 2004-2006 Sage Weil <sage@newdream.net>
 *
 * This is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public
 * License version 2.1, as published by the Free Software
 * Foundation.  See file COPYING.
 *
 */

diff --git a/src/doc/inos.txt b/src/doc/inos.txt
new file mode 100644
index 000000000..b5ab1db25
--- /dev/null
+++ b/src/doc/inos.txt
@@ -0,0 +1,11 @@
inodeno_t namespace
 - relevant both for inos, and for the (ino) input for Filer and the object storage namespace...

1        - root inode

100+mds  - mds log/journal
200+mds  - mds ino, fh allocation tables
300+mds  - mds inode files (for non-embedded inodes)

1000+    - regular files and directories
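A tiny illustration of the ranges above; the constants mirror the list, but the helper names are invented for this note and are not the definitions used in the tree:

  // Illustrative only: helpers mirroring the ino ranges listed above.
  #include <cstdint>

  using inodeno_t = uint64_t;

  constexpr inodeno_t INO_ROOT = 1;                 // root inode

  inline inodeno_t mds_log_ino(int mds)        { return 100 + mds; } // mds log/journal
  inline inodeno_t mds_table_ino(int mds)      { return 200 + mds; } // ino/fh allocation tables
  inline inodeno_t mds_inode_file_ino(int mds) { return 300 + mds; } // non-embedded inode files

  inline bool is_regular_ino(inodeno_t ino)    { return ino >= 1000; } // files and directories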
\ No newline at end of file diff --git a/src/doc/killpoints.txt b/src/doc/killpoints.txt new file mode 100644 index 000000000..0813386bd --- /dev/null +++ b/src/doc/killpoints.txt @@ -0,0 +1,42 @@ +Kill arguments in code: +mds_kill_mdstable_at +mds_kill_export_at +mds_kill_import_at + +mds_kill_mdstable_at: +1: (Server) After receiving MMDStableRequest +2: (Server) After logging request, before sending reply to client +3: (Client) After receiving Agree from Server +4: (Client) Prior to sending Commit to server, but after making local change +5: (Server) After receiving commit message from client (before doing so) +6: (Server) After logging commit, before sending Ack +7: (Client) After receiving commit Ack from server +8: (Client) After logging Ack from server + +mds_kill_export_at: +1: After moving to STATE_EXPORTING +2: After sending MExportDirDiscover +3: After receiving MExportDirDiscoverAck and auth_unpin'ing. +4: After sending MExportDirPrep +5: After receiving MExportDirPrepAck +6: After sending out MExportDirNotify to all replicas +7: After switching to state EXPORT_EXPORTING + (all replicas have acked ExportDirNotify) +8: After sending MExportDir to recipient +9: After receipt of MExportAck (new state: EXPORT_LOGGINGFINISH) +10: After logging EExport to journal +11: After sending out MExportDirNotify (new state: EXPORT_NOTIFYING) +12: After receiving MExportDirNotifyAck from all bystanders +13: After sending MExportDirFinish to importer + +mds_kill_import_at: +1: After moving to IMPORT_DISCOVERING +2: After moving to IMPORT_DISCOVERED and sending MExportDirDiscoverAck +3: After moving to IMPORT_PREPPING. +4: After moving to IMPORT_PREPPED and sending MExportDirPrepAck +5: After receiving MExportDir message +6: After moving to IMPORT_LOGGINGSTART and writing EImportStart +7: After moving to IMPORT_ACKING. +8: After sending out MExportDirAck +9: After logging EImportFinish +10: After entering IMPORT_ABORTING.
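Killpoints like these are normally exercised by aborting the daemon when a configured step number is reached. A minimal sketch of that pattern, using an invented stand-in config struct rather than the real Ceph config machinery:

  // Sketch only: shows how a numbered killpoint such as mds_kill_export_at
  // could abort the process at a specific step; FakeConf stands in for the
  // real configuration accessor.
  #include <cstdlib>
  #include <iostream>

  struct FakeConf { int mds_kill_export_at = 0; };
  static FakeConf conf;

  inline void maybe_kill_export_at(int step) {
    if (conf.mds_kill_export_at == step) {
      std::cerr << "mds_kill_export_at = " << step << ", dying for the test\n";
      std::abort();   // simulates the MDS being killed at this point
    }
  }

  void after_sending_export_discover() {
    // ... MExportDirDiscover has just been sent ...
    maybe_kill_export_at(2);   // step 2 in the mds_kill_export_at list above
  }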
\ No newline at end of file diff --git a/src/doc/lazy_posix.txt b/src/doc/lazy_posix.txt new file mode 100644 index 000000000..a7bc34e30 --- /dev/null +++ b/src/doc/lazy_posix.txt @@ -0,0 +1,53 @@ + +http://www.usenix.org/events/fast05/wips/slides/welch.pdf + + + +-- STATLITE + statlite(const char *filename, struct statlite *buf); + fstatlite(int fd, struct statlite *buf); + lstatlite(const char *filename, struct statlite *buf); + + * file size, mtime are optionally not guaranteed to be correct + * mask field to specify which fields you need to be correct + + +-- READDIR+ + + struct dirent_plus *readdirplus(DIR *dirp); + int readdirplus_r(DIR *dirp, struct dirent_plus *entry, struct dirent_plus **result); + struct dirent_lite *readdirlite(DIR *dirp); + int readdirlite_r(DIR *dirp, struct dirent_lite *entry, struct dirent_lite **result); + + * plus returns lstat + * lite returns lstatlite + + +-- lazy i/o integrity + + FIXME: currently missing call to flag an Fd/file has lazy. used to be O_LAZY on open, but no more. + + * relax data coherency + * writes may not be visible until lazyio_propagate, fsync, close + + lazyio_propagate(int fd, off_t offset, size_t count); + * my writes are safe + + lazyio_synchronize(int fd, off_t offset, size_t count); + * i will see everyone else's propagated writes + +-- read/write non-serial vectors + + ssize_t readx(int fd, const struct iovec *iov, size_t iov_count, struct xtvec *xtv, size_t xtv_count); + ssize_t writex(int fd, const struct iovec *iov, size_t iov_count, struct xtvec *xtv, size_t xtv_count); + + * like readv/writev, but serial + * + + +int lockg(int fd, int cmd, lgid_t *lgid) + group locks + +int openg(char *path, int mode, fh_t *handle); + portable file handle +int sutoc(fh_t *fh); diff --git a/src/doc/mds_locks.txt b/src/doc/mds_locks.txt new file mode 100644 index 000000000..d89cc22af --- /dev/null +++ b/src/doc/mds_locks.txt @@ -0,0 +1,66 @@ + +new names + dentry_read (not path_pins) + dentry_xlock + + inode_read + inode_xlock (not inode_write) + +locks are always tied to active_requests. + +read locks can be placed on any node. +xlocks must be applied at the authority. + +for multi-lock operations (link, unlink, rename), we must acquire xlocks on a remote node. lock requests are associated with a reqid. the authoritative node keeps track of which remote xlocks it holds. when forwarded/restarted, it can drop remote locks. + +when restarting, drop all locks. +on remote, drop locks and state, and notify main req node. +recover dist request state on rejoin: + - surviving op initiator will assert read or xlock + - recovering op initiator will restart requests. (from initiator's perspective, ops have either happened or they haven't, depending on whether the event is journaled.) + - recovering or surviving op cohort will determine lock state during rejoin, or get a commit or rollback... + - + + +--- path_pin = read lock on /some/random/path + - blocks a dentry xlock + +--- dnxlock = exclusive lock on /some/random/path + - locking: prevents subsequent path pins. + - locked: prevents dn read + - on auth + +-> grab _all_ path pins at once; hold none while waiting. +-> grab xlocks in order. + +--- auth_pin = pin to authority, on *dir, *in + - prevents freezing -> frozen. + - freezing blocks new auth pins, thus blocking other local auth_pins. (hangs up local export.) + - does not block remote auth_pins, because remote side is not auth (or frozen!) until after local subtree is frozen. + +-> blocking on auth_pins is dangerous. 
_never_ block if we are holding other auth_pins on the same node (subtree?). +-> grab _all_ auth pins at once; hold none while waiting. + +--- hard/file_wrlock = exclusive lock on inode content + - prevents inode read + - on auth + +-> grab locks in order. + + +ORDERING +- namespace(dentries) < inodes +- order dentries on (dirino, dname) +- order inodes on (ino); +- need to order both read and write locks, esp with dentries. so, if we need to lock /usr/bin/foo with read on usr and bin and xwrite on foo, we need to acquire all of those locks using the same ordering. + - on same host, we can be 'nice' and check lockability of all items, then lock all, and drop everything while waiting. (actually, is there any use to this?) + - on multiple hosts, we need to use full ordering (at least as things separate across host boundaries). and if needed lock set changes (such that the order of already acquired locks changes), we need to drop those locks and start over. + +- how do auth pins fit into all this? + - auth pin on xlocks only. no need on read locks. + - pre-grab all auth pins on a node the first time it is visiting during lock acquisition. + - what if things move? if we find we are missing a needed auth pin when we revisit a host at any point, and the item is not still authpinnable, we back off and restart. (we cannot block.) + - + - if we find we are not authpinnable, drop all locks and wait. + + diff --git a/src/doc/modeline.txt b/src/doc/modeline.txt new file mode 100644 index 000000000..1b3956f4d --- /dev/null +++ b/src/doc/modeline.txt @@ -0,0 +1,2 @@ +// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*- +// vim: ts=8 sw=2 smarttab diff --git a/src/doc/mon-janitorial-queue.txt b/src/doc/mon-janitorial-queue.txt new file mode 100644 index 000000000..9114acbe7 --- /dev/null +++ b/src/doc/mon-janitorial-queue.txt @@ -0,0 +1,38 @@ +Items to work on the monitor: + +Low-hanging fruit: + +- audit helpers that put() messages but do not get() them. + where possible, get rid of those put(). No one expects helpers to + put() messages and that may lead to double frees. + +Time consuming / complex: + +- Split the OSDMonitor.cc file into auxiliary files. This will mean: + + 1. Logically split subsystems (osd crush, osd pool, ...) + 2. Split the big badass functions, especially prepare/process_command() + +- Have Tracked Ops on the monitor, similarly to the OSDs. + + 1. Instead of passing messages back and forth, we will pass OpRequests + 2. We may be able to get() the message when we create the OpRequest and + put() it upon OpRequest destruction. This will help controlling the + lifespan of messages and reduce leaks. + 3. There will be a fair amount of work changing stuff from Messages to + OpRequests, and we will need to make sure that we reach a format that + is easily supported throughout the monitor + + Possible format, off the top of my head: + + MonOpRequest: + + int op = m->get_type(); + Message *m = m.get(); + + template<typename T> + T* get_message() { return (T*)m.get(); } + +- Move to Ref'erenced messages instead of pointers all around. This would + also help with the Tracked Ops thing, as we'd be able to simply ignore all + the get() and put() stuff behind it. 
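A slightly fleshed-out version of the "possible format" sketch above; std::shared_ptr stands in for Ceph's message ref-counting, and all names here are illustrative, not the actual monitor code:

  // The shared_ptr holds the single reference for the life of the op, so no
  // explicit get()/put() is needed anywhere in the helpers.
  #include <memory>
  #include <utility>

  struct Message {                         // stand-in for the real Message base
    virtual ~Message() = default;
    virtual int get_type() const = 0;
  };

  class MonOpRequest {
    std::shared_ptr<Message> m;
  public:
    explicit MonOpRequest(std::shared_ptr<Message> msg) : m(std::move(msg)) {}

    int op() const { return m->get_type(); }

    template <typename T>
    T* get_message() const { return static_cast<T*>(m.get()); }
  };                                       // reference dropped on destruction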
diff --git a/src/doc/mon-wishlist.txt b/src/doc/mon-wishlist.txt new file mode 100644 index 000000000..a5fb9422c --- /dev/null +++ b/src/doc/mon-wishlist.txt @@ -0,0 +1,51 @@ +Monitor Wish List (issue #10509) +================================ + +Low-hanging fruit +----------------- + +* audit helpers that put() messages but do not get() them. + where possible, get rid of those put(). No one expects helpers to + put() messages and that may lead to double frees. (issue #9378) + +Time consuming / complex +------------------------ + +* Split the OSDMonitor.cc file into auxiliary files. This will mean: + + 1. Logically split subsystems (osd crush, osd pool, ...) + 2. Split the big badass functions, especially prepare/process_command() + +* Have Tracked Ops on the monitor, similarly to the OSDs. (issue #10507) + + 1. Instead of passing messages back and forth, we will pass OpRequests + 2. We may be able to get() the message when we create the OpRequest and + put() it upon OpRequest destruction. This will help controlling the + lifespan of messages and reduce leaks. + 3. There will be a fair amount of work changing stuff from Messages to + OpRequests, and we will need to make sure that we reach a format that + is easily supported throughout the monitor + + Possible format, off the top of my head: + + MonOpRequest: + + int op = m->get_type(); + Message *m = m.get(); + + template<typename T> + T* get_message() { return (T*)m.get(); } + +* Move to Ref'erenced messages instead of pointers all around. This would + also help with the Tracked Ops thing, as we'd be able to simply ignore all + the get() and put() stuff behind it. (issue #3500) + +Delicate / complex +------------------ + +* Finer-grained Paxos::is_readable() and/or PaxosService::is_readable() + (issue #10508) + + Rationale: a given service S should be able to read its committed state + even though a Paxos proposal is happening, as long as the on-going + proposal is not a value of service S. diff --git a/src/doc/rgw.txt b/src/doc/rgw.txt new file mode 100644 index 000000000..3b55ade1d --- /dev/null +++ b/src/doc/rgw.txt @@ -0,0 +1,28 @@ +rgw_main: contains the web server interface and checks user access keys. +rgw_user: defines the RGWUserBuckets class and contains global functions +to get/store user info, get the anon user, and get UIDs from email. +rgw_user: defines the RGWUID class with some basic bookkeeping operations +rgw_common: houses data types and functions +rgw_access: abstract class providing interface for storage mechanisms +rgw_acl.h: Many different classes, some decoding XML, some encoding XML, some doing checks for owner and permissions. +rgw_fs: rgw_access based on the local fs. +rgw_rados: rgw_access based on an actual RADOS cluster. +rgw_admin: Administer the cluster -- create users, look at the state, etc. +rgw_op: Define the different operations as objects for easy tracking. +rgw_REST: extend the classes in rgw_op for a REST interface + +user IDs are strings, as with S3. + +buckets: +ui_email_bucket: hold objects named by email and containing encoded RGWUIDs +ui_bucket: holds objects named by user_id and containing encoded RGWUserInfos +root_bucket: holds objects corresponding to the other buckets, with ACLs in their attrs. 
+ +Observed schema: +buckets: +.rgw -- contains: .users -- empty + .users.email -- empty + johnny1 -- bucket name -- empty +.users -- contains: anonymous -- empty + bucket for each user id -- contains binary, key, binary, secret key, binary, user name, binary, user email +.users.email -- contains bucket for each user email -- contains binary, then user id diff --git a/src/doc/rgw/multisite-reshard.md b/src/doc/rgw/multisite-reshard.md new file mode 100644 index 000000000..32715290e --- /dev/null +++ b/src/doc/rgw/multisite-reshard.md @@ -0,0 +1,103 @@ +# Dynamic Resharding for Multisite + +## Requirements + +* Each zone manages bucket resharding decisions independently + - With per-bucket replication policies, some zones may only replicate a subset of objects, so require fewer shards. + - Avoids having to coordinate reshards across zones. +* Resharding a bucket does not require a full sync of its objects + - Existing bilogs must be preserved and processed before new bilog shards. +* Backward compatibility + - No zone can reshard until all peer zones upgrade to a supported release. + - Requires a manual zonegroup change to enable resharding. + +## Layout + +A layout describes a set of rados objects, along with some strategy to distribute things across them. A bucket index layout distributes object names across some number of shards via `ceph_str_hash_linux()`. Resharding a bucket enacts a transition from one such layout to another. Each layout could represent data differently. For example, a bucket index layout would be used with cls_rgw to write/delete keys. Whereas a datalog layout may be used with cls_log to append and trim log entries, then later transition to a layout based on some other primitive like cls_queue or cls_fifo. + +## Bucket Index Resharding + +To reshard a bucket, we currently create a new bucket instance with the desired sharding layout, and switch to that instance when resharding completes. In multisite, though, the metadata master zone is authoritative for all bucket metadata, including the sharding layout and reshard status. Any changes to metadata must take place on the metadata master zone and replicate from there to other zones. + +If we want to allow each zone to manage its bucket sharding independently, we can't allow them each to create a new bucket instance, because data sync relies on the consistency of instance ids between zones. We also can't allow metadata sync to overwrite our local sharding information with the metadata master's copy. + +That means that the bucket's sharding information needs to be kept private to the local zone's bucket instance, and that information also needs to track all reshard status that's currently spread between the old and new bucket instance metadata: old shard layout, new shard layout, and current reshard progress. To make this information private, we can just prevent metadata sync from overwriting these fields. + +This change also affects the rados object names of the bucket index shards, currently of the form `.dir.<instance-id>.<shard-id>`. Since we need to represent multiple sharding layouts for a single instance-id, we need to add some unique identifier to the object names. This comes in the form of a generation number, incremented with each reshard, like `.dir.<instance-id>.<generation>.<shard-id>`. The first generation number 0 would be omitted from the object names for backward compatibility. 
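As a sketch of the naming rule just described (the helper name is made up; only the `.dir.<instance-id>[.<generation>].<shard-id>` pattern comes from this document):

```cpp
// Sketch of the generation-aware bucket index object naming described above.
#include <cstdint>
#include <string>

std::string bucket_index_shard_oid(const std::string& instance_id,
                                   uint64_t generation, int shard_id) {
  std::string oid = ".dir." + instance_id;
  if (generation > 0)                       // generation 0 keeps the legacy name
    oid += "." + std::to_string(generation);
  oid += "." + std::to_string(shard_id);
  return oid;
}

// e.g. bucket_index_shard_oid("c0ffee.1", 0, 7) == ".dir.c0ffee.1.7"
//      bucket_index_shard_oid("c0ffee.1", 3, 7) == ".dir.c0ffee.1.3.7"
```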
## Bucket Index Log Resharding

The bucket replication logs for multisite are stored in the same bucket index shards as the keys that they modify. However, we can't reshard these log entries like we do with normal keys, because other zones need to track their position in the logs. If we shuffle the log entries around between shards, other zones no longer have a way to associate their old shard marker positions with the new shards, and their only recourse would be to restart a full sync. So when resharding buckets, we need to preserve the old bucket index logs so that other zones can finish processing their log entries, while any new events are recorded in the new bucket index logs.

An additional goal is to move replication logs out of omap (so out of the bucket index) into separate rados objects. To enable this, the bucket instance metadata should be able to describe a bucket whose *index layout* is different from its *log layout*. For existing buckets, the two layouts would be identical and share the bucket index objects. Alternate log layouts are otherwise out of scope for this design.

To support peer zones that are still processing old logs, the local bucket instance metadata must track the history of all log layouts that haven't been fully trimmed yet. Once bilog trimming advances past an old generation, it can delete the associated rados objects and remove that layout from the bucket instance metadata. To prevent this history from growing too large, we can refuse to reshard bucket index logs until trimming catches up.

The distinction between *index layout* and *log layout* is important, because incremental sync only cares about changes to the *log layout*. Changes to the *index layout* would only affect full sync, which uses a custom RGWListBucket extension to list the objects of each index shard separately. But by changing the scope of full sync from per-bucket-shard to per-bucket and using a normal bucket listing to get all objects, we can make full sync independent of the *index layout*. And once the replication logs are moved out of the bucket index, dynamic resharding is free to change the *index layout* as much as it wants with no effect on multisite replication.

## Tasks

### Bucket Reshard

* Modify the existing state machine for bucket reshard to mutate its existing bucket instance instead of creating a new one.

* Add fields for the log layout. When resharding a bucket whose logs are in the index:
  - Add a new log layout generation to the bucket instance
  - Copy the bucket index entries into their new index layout
  - Commit the log generation change so new entries will be written there
  - Create a datalog entry with the new log generation

### Metadata Sync

* When sync fetches a bucket instance from the master zone, preserve any private fields in the local instance. Use cls_version to guarantee that we write back the most recent version of those private fields.

### Data Sync

* Datalog entries currently include a bucket shard number. We need to add the log generation number to these entries so we can tell which sharding layout the entry refers to. If we see a new generation number, that entry also implies an obligation to finish syncing all shards of prior generations.
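A rough shape for that datalog change, with invented field names (the real entry type carries more information and its own encoding):

```cpp
// Illustrative only: the minimal fields implied by the Data Sync item above.
#include <cstdint>
#include <string>

struct DataLogEntrySketch {
  std::string bucket_instance;  // which bucket instance changed
  int shard_id = 0;             // bucket index shard within that layout
  uint64_t gen = 0;             // log generation the shard id belongs to; a new
                                // gen obliges peers to finish prior generations
};
```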
+ +### Bucket Sync Status + +* Add a per-bucket sync status object that tracks: + - full sync progress, + - the current generation of incremental sync, and + - the set of shards that have completed incremental sync of that generation +* Existing per-bucket-shard sync status objects continue to track incremental sync. + - their object names should include the generation number, except for generation 0 +* For backward compatibility, add special handling when we get ENOENT trying to read this per-bucket sync status: + - If the remote's oldest log layout has generation=0, read any existing per-shard sync status objects. If any are found, resume incremental sync from there. + - Otherwise, initialize for full sync. + +### Bucket Sync + +* Full sync uses a single bucket-wide listing to fetch all objects. + - Use a cls_lock to prevent different shards from duplicating this work. +* When incremental sync gets to the end of a log shard (i.e. listing the log returns truncated=false): + - If the remote has a newer log generation, flag that shard as 'done' in the bucket sync status. + - Once all shards in the current generation reach that 'done' state, incremental bucket sync can advance to the next generation. + - Use cls_version on the bucket sync status object to detect racing writes from other shards. + +### Bucket Sync Disable/Enable + +Reframe in terms of log generations, instead of handling SYNCSTOP events with a special Stopped state: + +* radosgw-admin bucket sync enable: create a new log generation in the bucket instance metadata + - detect races with reshard: fail if reshard in progress, and write with cls_version to detect race with start of reshard + - if the current log generation is shared with the bucket index layout (BucketLogType::InIndex), the new log generation will point at the same index layout/generation. so the log generation increments, but the index objects keep the same generation +* SYNCSTOP in incremental sync: flag the shard as 'done' and ignore datalog events on that bucket until we see a new generation + +### Log Trimming + +* Use generation number from sync status to trim the right logs +* Once all shards of a log generation are trimmed: + - Remove their rados objects. + - Remove the associated incremental sync status objects. + - Remove the log generation from its bucket instance metadata. + +### Admin APIs + +* RGWOp_BILog_List response should include the bucket's highest log generation + - Allows incremental sync to determine whether truncated=false means that it's caught up, or that it needs to transition to the next generation. +* RGWOp_BILog_Info response should include the bucket's lowest and highest log generations + - Allows bucket sync status initialization to decide whether it needs to scan for existing shard status, and where it should resume incremental sync after full sync completes. +* RGWOp_BILog_Status response should include per-bucket status information + - For log trimming of old generations |
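To make the status tracking above concrete, a rough sketch of what the per-bucket sync status could hold; the field names are invented, and the real object would be encoded into rados and updated under cls_version as described above:

```cpp
// Illustrative shape for the per-bucket sync status described above.
#include <cstdint>
#include <set>
#include <string>

struct BucketSyncStatusSketch {
  enum class State { Init, FullSync, Incremental } state = State::Init;
  std::string full_sync_marker;   // coarse progress through the bucket listing
  uint64_t incremental_gen = 0;   // generation currently being incrementally synced
  std::set<int> shards_done;      // shards finished with this generation
};

// Once shards_done covers every shard of incremental_gen and the remote reports
// a newer generation, the zone advances incremental_gen and clears shards_done.
```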