author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 03:20:40 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 03:20:40 +0000 |
commit | a8797415525fe24f8baf71088ec714f3902a1fa7 (patch) | |
tree | 900c6dcf46fca9767ba854e0cac83d5935c44274 /external-reshape-design.txt | |
parent | Adding debian version 4.3-1. (diff) | |
Merging upstream version 4.3+20240412.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'external-reshape-design.txt')
-rw-r--r-- | external-reshape-design.txt | 280 |
1 files changed, 0 insertions, 280 deletions
diff --git a/external-reshape-design.txt b/external-reshape-design.txt
deleted file mode 100644
index e4cf4e1..0000000
--- a/external-reshape-design.txt
+++ /dev/null
@@ -1,280 +0,0 @@
-External Reshape
-
-1 Problem statement
-
-External (third-party metadata) reshape differs from native-metadata
-reshape in three key ways:
-
-1.1 Format specific constraints
-
-In the native case reshape is limited by what is implemented in the
-generic reshape routine (Grow_reshape()) and what is supported by the
-kernel. There are exceptional cases where Grow_reshape() may block
-operations when it knows that the kernel implementation is broken, but
-otherwise the kernel is relied upon to be the final arbiter of what
-reshape operations are supported.
-
-In the external case the kernel, and the generic checks in
-Grow_reshape(), become the super-set of what reshapes are possible. The
-metadata format may not support, or have yet to implement a given
-reshape type. The implication for Grow_reshape() is that it must query
-the metadata handler and effect changes in the metadata before the new
-geometry is posted to the kernel. The ->reshape_super method allows
-Grow_reshape() to validate the requested operation and post the metadata
-update.
-
-1.2 Scope of reshape
-
-Native metadata reshape is always performed at the array scope (no
-metadata relationship with sibling arrays on the same disks). External
-reshape, depending on the format, may not allow the number of member
-disks to be changed in a subarray unless the change is simultaneously
-applied to all subarrays in the container. For example the imsm format
-requires all member disks to be a member of all subarrays, so a 4-disk
-raid5 in a container that also houses a 4-disk raid10 array could not be
-reshaped to 5 disks as the imsm format does not support a 5-disk raid10
-representation. This requires the ->reshape_super method to check the
-contents of the array and ask the user to run the reshape at container
-scope (if all subarrays are agreeable to the change), or report an
-error in the case where one subarray cannot support the change.
-
-1.3 Monitoring / checkpointing
-
-Reshape, unlike rebuild/resync, requires strict checkpointing to survive
-interrupted reshape operations. For example when expanding a raid5
-array the first few stripes of the array will be overwritten in a
-destructive manner. When restarting the reshape process we need to know
-the exact location of the last successfully written stripe, and we need
-to restore the data in any partially overwritten stripe. Native
-metadata stores this backup data in the unused portion of spares that
-are being promoted to array members, or in an external backup file
-(located on a non-involved block device).
-
-The kernel is in charge of recording checkpoints of reshape progress,
-but mdadm is delegated the task of managing the backup space which
-involves:
-1/ Identifying what data will be overwritten in the next unit of reshape
-   operation
-2/ Suspending access to that region so that a snapshot of the data can
-   be transferred to the backup space.
-3/ Allowing the kernel to reshape the saved region and setting the
-   boundary for the next backup.
-
-In the external reshape case we want to preserve this mdadm
-'reshape-manager' arrangement, but have a third actor, mdmon, to
-consider. It is tempting to give the role of managing reshape to mdmon,
-but that is counter to its role as a monitor, and conflicts with the
-existing capabilities and role of mdadm to manage the progress of
-reshape. For clarity the external reshape implementation maintains the
-role of mdmon as a (mostly) passive recorder of raid events, and mdadm
-treats it as it would the kernel in the native reshape case (modulo
-needing to send explicit metadata update messages and checking that
-mdmon took the expected action).
-
-External reshape can use the generic md backup file as a fallback, but in the
-optimal/firmware-compatible case the reshape-manager will use the metadata
-specific areas for managing reshape. The implementation also needs to spawn a
-reshape-manager per subarray when the reshape is being carried out at the
-container level. For these two reasons the ->manage_reshape() method is
-introduced. This method in addition to base tasks mentioned above:
-1/ Processed each subarray one at a time in series - where appropriate.
-2/ Uses either generic routines in Grow.c for md-style backup file
-   support, or uses the metadata-format specific location for storing
-   recovery data.
-This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
-optionally take advantage of generic infrastructure in Grow.c
-
-2 Details for specific reshape requests
-
-There are quite a few moving pieces spread out across md, mdadm, and mdmon for
-the support of external reshape, and there are several different types of
-reshape that need to be comprehended by the implementation. A rundown of
-these details follows.
-
-2.0 General provisions:
-
-Obtain an exclusive open on the container to make sure we are not
-running concurrently with a Create() event.
-
-2.1 Freezing sync_action
-
-   Before making any attempt at a reshape we 'freeze' every array in
-   the container to ensure no spare assignment or recovery happens.
-   This involves writing 'frozen' to sync_action and changing the '/'
-   after 'external:' in metadata_version to a '-'. mdmon knows that
-   this means not to perform any management.
-
-   Before doing this we check that all sync_actions are 'idle', which
-   is racy but still useful.
-   Afterwards we check that all member arrays have no spares
-   or partial spares (recovery_start != 'none') which would indicate a
-   race. If they do, we unfreeze again.
-
-   Once this completes we know all the arrays are stable. They may
-   still have failed devices as devices can fail at any time. However
-   we treat those like failures that happen during the reshape.
-
-2.2 Reshape size
-
-   1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
-      initializes st->update_tail
-   2/ mdadm::Grow_reshape() calls ->reshape_super() to check that the size change
-      is allowed (being performed at subarray scope / enough room) prepares a
-      metadata update
-   3/ mdadm::Grow_reshape(): flushes the metadata update (via
-      flush_metadata_update(), or ->sync_metadata())
-   4/ mdadm::Grow_reshape(): post the new size to the kernel
-
-
-2.3 Reshape level (simple-takeover)
-
-"simple-takeover" implies the level change can be satisfied without touching
-sync_action
-
-   1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
-      initializes st->update_tail
-   2/ mdadm::Grow_reshape() calls ->reshape_super() to check that the level change
-      is allowed (being performed at subarray scope) prepares a
-      metadata update
-      2a/ raid10 --> raid0: degrade all mirror legs prior to calling
-          ->reshape_super
-   3/ mdadm::Grow_reshape(): flushes the metadata update (via
-      flush_metadata_update(), or ->sync_metadata())
-   4/ mdadm::Grow_reshape(): post the new level to the kernel
-
-2.4 Reshape chunk, layout
-
-2.5 Reshape raid disks (grow)
-
-   1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
-      because only redundant raid levels can modify the number of raid disks
-   2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
-      change is allowed (being performed at proper scope / permissible
-      geometry / proper spares available in the container), chooses
-      the spares to use, and prepares a metadata update.
-   3/ mdadm::Grow_reshape(): Converts each subarray in the container to the
-      raid level that can perform the reshape and starts mdmon.
-   4/ mdadm::Grow_reshape(): Pushes the update to mdmon.
-   5/ mdadm::Grow_reshape(): uses container_content to find details of
-      the spares and passes them to the kernel.
-   6/ mdadm::Grow_reshape(): gives raid_disks update to the kernel,
-      sets sync_max, sync_min, suspend_lo, suspend_hi all to zero,
-      and starts the reshape by writing 'reshape' to sync_action.
-   7/ mdmon::monitor notices the sync_action change and tells
-      managemon to check for new devices. managemon notices the new
-      devices, opens relevant sysfs file, and passes them all to
-      monitor.
-   8/ mdadm::Grow_reshape() calls ->manage_reshape to oversee the
-      rest of the reshape.
-
-   9/ mdadm::<format>->manage_reshape(): saves data that will be overwritten by
-      the kernel to either the backup file or the metadata specific location,
-      advances sync_max, waits for reshape, ping mdmon, repeat.
-      Meanwhile mdmon::read_and_act(): records checkpoints.
-      Specifically.
-
-      9a/ if the 'next' stripe to be reshaped will over-write
-          itself during reshape then:
-        9a.1/ increase suspend_hi to cover a suitable number of
-           stripes.
-        9a.2/ backup those stripes safely.
-        9a.3/ advance sync_max to allow those stripes to be backed up
-        9a.4/ when sync_completed indicates that those stripes have
-           been reshaped, manage_reshape must ping_manager
-        9a.5/ when mdmon notices that sync_completed has been updated,
-           it records the new checkpoint in the metadata
-        9a.6/ after the ping_manager, manage_reshape will increase
-           suspend_lo to allow access to those stripes again
-
-      9b/ if the 'next' stripe to be reshaped will over-write unused
-          space during reshape then we apply same process as above,
-          except that there is no need to back anything up.
-          Note that we *do* need to keep suspend_hi progressing as
-          it is not safe to write to the area-under-reshape. For
-          kernel-managed-metadata this protection is provided by
-          ->reshape_safe, but that does not protect us in the case
-          of user-space-managed-metadata.
-
-   10/ mdadm::<format>->manage_reshape(): Once reshape completes changes the raid
-       level back to the nominal raid level (if necessary)
-
-       FIXME: native metadata does not have the capability to record the original
-       raid level in reshape-restart case because the kernel always records current
-       raid level to the metadata, whereas external metadata can masquerade at an
-       alternate level based on the reshape state.
-
-2.6 Reshape raid disks (shrink)
-
-3 Interaction with metadata handle.
-
-  The following calls are made into the metadata handler to assist
-  with initiating and monitoring a 'reshape'.
-
-  1/ ->reshape_super is called quite early (after only minimial
-     checks) to make sure that the metadata can record the new shape
-     and any necessary transitions. It may be passed a 'container'
-     or an individual array within a container, and it should notice
-     the difference and act accordingly.
-     When a reshape is requested against a container it is expected
-     that it should be applied to every array in the container,
-     however it is up to the metadata handler to determine final
-     policy.
-
-     If the reshape is supportable, the internal copy of the metadata
-     should be updated, and a metadata update suitable for sending
-     to mdmon should be queued.
-
-     If the reshape will involve converting spares into array members,
-     this must be recorded in the metadata too.
-
-  2/ ->container_content will be called to find out the new state
-     of all the array, or all arrays in the container. Any newly
-     added devices (with state==0 and raid_disk >= 0) will be added
-     to the array as spares with the relevant slot number.
-
-     It is likely that the info returned by ->container_content will
-     have ->reshape_active set, ->reshape_progress set to e.g. 0, and
-     new_* set appropriately. mdadm will use this information to
-     cause the correct reshape to start at an appropriate time.
-
-  3/ ->set_array_state will be called by mdmon when reshape has
-     started and again periodically as it progresses. This should
-     record the ->last_checkpoint as the point where reshape has
-     progressed to. When the reshape finished this will be called
-     again and it should notice that ->curr_action is no longer
-     'reshape' and so should record that the reshape has finished
-     providing 'last_checkpoint' has progressed suitably.
-
-  4/ ->manage_reshape will be called once the reshape has been set
-     up in the kernel but before sync_max has been moved from 0, so
-     no actual reshape will have happened.
-
-     ->manage_reshape should call progress_reshape() to allow the
-     reshape to progress, and should back-up any data as indicated
-     by the return value. See the documentation of that function
-     for more details.
-     ->manage_reshape will be called multiple times when a
-     container is being reshaped, once for each member array in
-     the container.
-
-
-  The progress of the metadata is as follows:
-   1/ mdadm sends a metadata update to mdmon which marks the array
-      as undergoing a reshape. This is set up by
-      ->reshape_super and applied by ->process_update
-      For container-wide reshape, this happens once for the whole
-      container.
-   2/ mdmon notices progress via the sysfs files and calls
-      ->set_array_state to update the state periodically
-      For container-wide reshape, this happens repeatedly for
-      one array, then repeatedly for the next, etc.
-   3/ mdmon notices when reshape has finished and call
-      ->set_array_state to record the the reshape is complete.
-      For container-wide reshape, this happens once for each
-      member array.
-
-
-
-...
-
-[1]: Linux kernel design patterns - part 3, Neil Brown https://lwn.net/Articles/336262/
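
For readers of the removed design note, the window-by-window loop described in steps 9a.1 to 9a.6 can be sketched as a small simulation. This is only an illustration under stated assumptions, not mdadm code: `ReshapeState` and `manage_reshape_sim` are hypothetical names, the sysfs attributes are modelled as plain integers, positions are counted in stripes rather than the units the kernel actually uses, and the kernel and mdmon are simulated inline instead of being separate actors.

```python
from dataclasses import dataclass, field


@dataclass
class ReshapeState:
    """In-memory stand-in (hypothetical) for the attributes named in step 9."""
    suspend_lo: int = 0
    suspend_hi: int = 0
    sync_max: int = 0
    sync_completed: int = 0
    backups: list = field(default_factory=list)      # (start, end) ranges saved
    checkpoints: list = field(default_factory=list)  # what mdmon would record


def manage_reshape_sim(total_stripes: int, window: int) -> ReshapeState:
    """Walk the array one window at a time, following steps 9a.1 to 9a.6."""
    if window <= 0:
        raise ValueError("window must be positive")
    st = ReshapeState()
    while st.sync_completed < total_stripes:
        start = st.sync_completed
        end = min(start + window, total_stripes)
        st.suspend_hi = end               # 9a.1: block I/O to the next window
        st.backups.append((start, end))   # 9a.2: save the stripes at risk
        st.sync_max = end                 # 9a.3: let the kernel proceed to 'end'
        st.sync_completed = st.sync_max   # simulated kernel progress
        # 9a.4 / 9a.5: ping the manager; mdmon records the new checkpoint
        st.checkpoints.append(st.sync_completed)
        st.suspend_lo = end               # 9a.6: reopen access to those stripes
    return st


if __name__ == "__main__":
    result = manage_reshape_sim(total_stripes=10, window=4)
    print(result.backups, result.checkpoints)
```

In the 9b case the same loop would apply with the backup step skipped, while suspend_hi still advances ahead of the region being reshaped.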