Merging upstream version 4.3+20240412.

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-27 03:20:40 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-27 03:20:40 +0000
commit: a8797415525fe24f8baf71088ec714f3902a1fa7 (patch)
tree: 900c6dcf46fca9767ba854e0cac83d5935c44274 /external-reshape-design.txt
parent: Adding debian version 4.3-1. (diff)
download: mdadm-a8797415525fe24f8baf71088ec714f3902a1fa7.tar.xz
mdadm-a8797415525fe24f8baf71088ec714f3902a1fa7.zip
1 files changed, 0 insertions, 280 deletions
diff --git a/external-reshape-design.txt b/external-reshape-design.txt
deleted file mode 100644
index e4cf4e1..0000000
--- a/external-reshape-design.txt
+++ /dev/null
@@ -1,280 +0,0 @@
-External Reshape
-
-1 Problem statement
-
-External (third-party metadata) reshape differs from native-metadata
-reshape in three key ways:
-
-1.1 Format specific constraints
-
-In the native case reshape is limited by what is implemented in the
-generic reshape routine (Grow_reshape()) and what is supported by the
-kernel.  There are exceptional cases where Grow_reshape() may block
-operations when it knows that the kernel implementation is broken, but
-otherwise the kernel is relied upon to be the final arbiter of what
-reshape operations are supported.
-
-In the external case the kernel, and the generic checks in
-Grow_reshape(), become the super-set of what reshapes are possible.  The
-metadata format may not support, or may not yet implement, a given
-reshape type.  The implication for Grow_reshape() is that it must query
-the metadata handler and effect changes in the metadata before the new
-geometry is posted to the kernel.  The ->reshape_super method allows
-Grow_reshape() to validate the requested operation and post the metadata
-update.
-
-1.2 Scope of reshape
-
-Native metadata reshape is always performed at the array scope (no
-metadata relationship with sibling arrays on the same disks).  External
-reshape, depending on the format, may not allow the number of member
-disks to be changed in a subarray unless the change is simultaneously
-applied to all subarrays in the container.  For example the imsm format
-requires all member disks to be a member of all subarrays, so a 4-disk
-raid5 in a container that also houses a 4-disk raid10 array could not be
-reshaped to 5 disks as the imsm format does not support a 5-disk raid10
-representation.  This requires the ->reshape_super method to check the
-contents of the array and ask the user to run the reshape at container
-scope (if all subarrays are agreeable to the change), or report an
-error in the case where one subarray cannot support the change.
-
-1.3 Monitoring / checkpointing
-
-Reshape, unlike rebuild/resync, requires strict checkpointing to survive
-interrupted reshape operations.  For example when expanding a raid5
-array the first few stripes of the array will be overwritten in a
-destructive manner.  When restarting the reshape process we need to know
-the exact location of the last successfully written stripe, and we need
-to restore the data in any partially overwritten stripe.  Native
-metadata stores this backup data in the unused portion of spares that
-are being promoted to array members, or in an external backup file
-(located on a non-involved block device).
-
-The kernel is in charge of recording checkpoints of reshape progress,
-but mdadm is delegated the task of managing the backup space which
-involves:
-1/ Identifying what data will be overwritten in the next unit of reshape
-   operation
-2/ Suspending access to that region so that a snapshot of the data can
-   be transferred to the backup space.
-3/ Allowing the kernel to reshape the saved region and setting the
-   boundary for the next backup.
-
-In the external reshape case we want to preserve this mdadm
-'reshape-manager' arrangement, but have a third actor, mdmon, to
-consider.  It is tempting to give the role of managing reshape to mdmon,
-but that is counter to its role as a monitor, and conflicts with the
-existing capabilities and role of mdadm to manage the progress of
-reshape.  For clarity the external reshape implementation maintains the
-role of mdmon as a (mostly) passive recorder of raid events, and mdadm
-treats it as it would the kernel in the native reshape case (modulo
-needing to send explicit metadata update messages and checking that
-mdmon took the expected action).
-
-External reshape can use the generic md backup file as a fallback, but in the
-optimal/firmware-compatible case the reshape-manager will use the metadata
-specific areas for managing reshape.  The implementation also needs to spawn a
-reshape-manager per subarray when the reshape is being carried out at the
-container level.  For these two reasons the ->manage_reshape() method is
-introduced.  This method, in addition to the base tasks mentioned above:
-1/ Processes each subarray one at a time in series - where appropriate.
-2/ Uses either generic routines in Grow.c for md-style backup file
-   support, or uses the metadata-format specific location for storing
-   recovery data.
-This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
-optionally take advantage of generic infrastructure in Grow.c
-
-2 Details for specific reshape requests
-
-There are quite a few moving pieces spread out across md, mdadm, and mdmon for
-the support of external reshape, and there are several different types of
-reshape that need to be comprehended by the implementation.  A rundown of
-these details follows.
-
-2.0 General provisions:
-
-Obtain an exclusive open on the container to make sure we are not
-running concurrently with a Create() event.
-
-2.1 Freezing sync_action
-
-   Before making any attempt at a reshape we 'freeze' every array in
-   the container to ensure no spare assignment or recovery happens.
-   This involves writing 'frozen' to sync_action and changing the '/'
-   after 'external:' in metadata_version to a '-'. mdmon knows that
-   this means not to perform any management.
-
-   Before doing this we check that all sync_actions are 'idle', which
-   is racy but still useful.
-   Afterwards we check that all member arrays have no spares
-   or partial spares (recovery_start != 'none') which would indicate a
-   race.  If they do, we unfreeze again.
-
-   Once this completes we know all the arrays are stable.  They may
-   still have failed devices as devices can fail at any time.  However
-   we treat those like failures that happen during the reshape.
-
-2.2 Reshape size
-
-   1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
-      initializes st->update_tail
-   2/ mdadm::Grow_reshape() calls ->reshape_super() to check that the size change
-      is allowed (being performed at subarray scope / enough room) and prepares a
-      metadata update
-   3/ mdadm::Grow_reshape(): flushes the metadata update (via
-      flush_metadata_update(), or ->sync_metadata())
-   4/ mdadm::Grow_reshape(): post the new size to the kernel
-
-
-2.3 Reshape level (simple-takeover)
-
-"simple-takeover" implies the level change can be satisfied without touching
-sync_action
-
-    1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
-       initializes st->update_tail
-    2/ mdadm::Grow_reshape() calls ->reshape_super() to check that the level change
-       is allowed (being performed at subarray scope) and prepares a
-       metadata update
-       2a/ raid10 --> raid0: degrade all mirror legs prior to calling
-           ->reshape_super
-    3/ mdadm::Grow_reshape(): flushes the metadata update (via
-       flush_metadata_update(), or ->sync_metadata())
-    4/ mdadm::Grow_reshape(): post the new level to the kernel
-
-2.4 Reshape chunk, layout
-
-2.5 Reshape raid disks (grow)
-
-    1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
-       because only redundant raid levels can modify the number of raid disks
-    2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the
-       change is allowed (being performed at proper scope / permissible
-       geometry / proper spares available in the container), chooses
-       the spares to use, and prepares a metadata update.
-    3/ mdadm::Grow_reshape(): Converts each subarray in the container to the
-       raid level that can perform the reshape and starts mdmon.
-    4/ mdadm::Grow_reshape(): Pushes the update to mdmon.
-    5/ mdadm::Grow_reshape(): uses container_content to find details of
-       the spares and passes them to the kernel.
-    6/ mdadm::Grow_reshape(): gives raid_disks update to the kernel,
-       sets sync_max, sync_min, suspend_lo, suspend_hi all to zero,
-       and starts the reshape by writing 'reshape' to sync_action.
-    7/ mdmon::monitor notices the sync_action change and tells
-       managemon to check for new devices.  managemon notices the new
-       devices, opens the relevant sysfs files, and passes them all to
-       monitor.
-    8/ mdadm::Grow_reshape() calls ->manage_reshape to oversee the
-       rest of the reshape.
-
-    9/ mdadm::<format>->manage_reshape(): saves data that will be overwritten by
-       the kernel to either the backup file or the metadata specific location,
-       advances sync_max, waits for the reshape, pings mdmon, and repeats.
-       Meanwhile mdmon::read_and_act(): records checkpoints.
-       Specifically:
-
-       9a/ if the 'next' stripe to be reshaped will over-write
-           itself during reshape then:
-	9a.1/ increase suspend_hi to cover a suitable number of
-           stripes.
-	9a.2/ backup those stripes safely.
-	9a.3/ advance sync_max to allow those stripes to be reshaped
-	9a.4/ when sync_completed indicates that those stripes have
-           been reshaped, manage_reshape must ping_manager
-	9a.5/ when mdmon notices that sync_completed has been updated,
-           it records the new checkpoint in the metadata
-	9a.6/ after the ping_manager, manage_reshape will increase
-           suspend_lo to allow access to those stripes again
-
-       9b/ if the 'next' stripe to be reshaped will over-write unused
-           space during reshape then we apply same process as above,
-	   except that there is no need to back anything up.
-	   Note that we *do* need to keep suspend_hi progressing as
-	   it is not safe to write to the area-under-reshape.  For
-	   kernel-managed-metadata this protection is provided by
-	   ->reshape_safe, but that does not protect us in the case
-	   of user-space-managed-metadata.
-
-   10/ mdadm::<format>->manage_reshape(): Once reshape completes changes the raid
-       level back to the nominal raid level (if necessary)
-
-       FIXME: native metadata does not have the capability to record the original
-       raid level in reshape-restart case because the kernel always records current
-       raid level to the metadata, whereas external metadata can masquerade at an
-       alternate level based on the reshape state.
-
-2.6 Reshape raid disks (shrink)
-
-3 Interaction with metadata handler
-
-  The following calls are made into the metadata handler to assist
-  with initiating and monitoring a 'reshape'.
-
-  1/ ->reshape_super is called quite early (after only minimal
-     checks) to make sure that the metadata can record the new shape
-     and any necessary transitions.  It may be passed a 'container'
-     or an individual array within a container, and it should notice
-     the difference and act accordingly.
-     When a reshape is requested against a container it is expected
-     that it should be applied to every array in the container,
-     however it is up to the metadata handler to determine final
-     policy.
-
-     If the reshape is supportable, the internal copy of the metadata
-     should be updated, and a metadata update suitable for sending
-     to mdmon should be queued.
-
-     If the reshape will involve converting spares into array members,
-     this must be recorded in the metadata too.
-
-  2/ ->container_content will be called to find out the new state
-     of the array, or of all arrays in the container.  Any newly
-     added devices (with state==0 and raid_disk >= 0) will be added
-     to the array as spares with the relevant slot number.
-
-     It is likely that the info returned by  ->container_content will
-     have ->reshape_active set, ->reshape_progress set to e.g. 0, and
-     new_* set appropriately.  mdadm will use this information to
-     cause the correct reshape to start at an appropriate time.
-
-  3/ ->set_array_state will be called by mdmon when reshape has
-     started and again periodically as it progresses.  This should
-     record the ->last_checkpoint as the point where reshape has
-     progressed to.  When the reshape finishes this will be called
-     again and it should notice that ->curr_action is no longer
-     'reshape' and so should record that the reshape has finished
-     providing 'last_checkpoint' has progressed suitably.
-
-  4/ ->manage_reshape will be called once the reshape has been set
-     up in the kernel but before sync_max has been moved from 0, so
-     no actual reshape will have happened.
-
-     ->manage_reshape should call progress_reshape() to allow the
-     reshape to progress, and should back-up any data as indicated
-     by the return value.  See the documentation of that function
-     for more details.
-     ->manage_reshape will be called multiple times when a
-     container is being reshaped, once for each member array in
-     the container.
-
-
-   The progress of the metadata is as follows:
-    1/ mdadm sends a metadata update to mdmon which marks the array
-       as undergoing a reshape. This is set up by
-       ->reshape_super and applied by ->process_update
-       For container-wide reshape, this happens once for the whole
-       container.
-    2/ mdmon notices progress via the sysfs files and calls
-       ->set_array_state to update the state periodically
-       For container-wide reshape, this happens repeatedly for
-       one array, then repeatedly for the next, etc.
-    3/ mdmon notices when reshape has finished and calls
-       ->set_array_state to record that the reshape is complete.
-       For container-wide reshape, this happens once for each
-       member array.
-
-
-
-...
-
-[1]: Linux kernel design patterns - part 3, Neil Brown https://lwn.net/Articles/336262/
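
The checkpoint loop described in steps 9a.1 to 9a.6 of the removed document can be illustrated with a small simulation.  What follows is a minimal Python sketch, not mdadm code: ReshapeState and this manage_reshape() function are hypothetical stand-ins, the sysfs attributes (suspend_lo, suspend_hi, sync_max, sync_completed) are modelled as plain integers counted in stripes, and both the kernel's progress and mdmon's checkpoint are simulated inline rather than observed.

```python
from dataclasses import dataclass, field


@dataclass
class ReshapeState:
    """Hypothetical in-memory stand-in for the sysfs attributes in step 9a."""
    total_stripes: int
    suspend_lo: int = 0
    suspend_hi: int = 0
    sync_max: int = 0
    sync_completed: int = 0
    checkpoint: int = 0  # what mdmon would have recorded in the metadata
    backups: list = field(default_factory=list)


def manage_reshape(state, batch):
    """Walk the array in batches, following steps 9a.1 to 9a.6 (simulated)."""
    while state.sync_completed < state.total_stripes:
        start = state.sync_completed
        end = min(start + batch, state.total_stripes)

        state.suspend_hi = end               # 9a.1: suspend I/O to the batch
        state.backups.append((start, end))   # 9a.2: back up those stripes
        state.sync_max = end                 # 9a.3: let the reshape proceed
        state.sync_completed = end           # 9a.4: (simulated) kernel progress
        state.checkpoint = end               # 9a.5: (simulated) mdmon checkpoint
        state.suspend_lo = end               # 9a.6: allow access again

        # The suspended window never trails the reshaped region.
        assert state.suspend_lo <= state.sync_completed <= state.suspend_hi
    return state


if __name__ == "__main__":
    final = manage_reshape(ReshapeState(total_stripes=10), batch=4)
    print(final.backups, final.checkpoint)
```

The ordering is the point of the sketch: suspend_hi is raised before the backup is taken, sync_max is raised only after the backup exists, and suspend_lo is raised only after the checkpoint has been recorded.  Case 9b would follow the same loop with the backup step skipped.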