From a8c8b888d4bc9152a17cba6fb0a58856f53d3ff8 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sat, 27 Apr 2024 05:20:40 +0200
Subject: Adding upstream version 4.3+20240412.

Signed-off-by: Daniel Baumann
---
 mdmon-design.txt | 146 -------------------------------------------------------
 1 file changed, 146 deletions(-)
 delete mode 100644 mdmon-design.txt

(limited to 'mdmon-design.txt')

diff --git a/mdmon-design.txt b/mdmon-design.txt
deleted file mode 100644
index f09184a..0000000
--- a/mdmon-design.txt
+++ /dev/null
@@ -1,146 +0,0 @@
-
-When managing a RAID1 array which uses metadata other than the
-"native" metadata understood by the kernel, mdadm makes use of a
-partner program named 'mdmon' to manage some aspects of updating
-that metadata and synchronising the metadata with the array state.
-
-This document provides some details on how mdmon works.
-
-Containers
-----------
-
-As background: mdadm makes a distinction between an 'array' and a
-'container'. Other sources sometimes use the term 'volume' or
-'device' for an 'array', and may use the term 'array' for a
-'container'.
-
-For our purposes:
- - a 'container' is a collection of devices which are described by a
-   single set of metadata. The metadata may be stored equally
-   on all devices, or different devices may have quite different
-   subsets of the total metadata. But there is conceptually one set
-   of metadata that unifies the devices.
-
- - an 'array' is a set of data blocks from various devices which
-   together are used to present the abstraction of a single linear
-   sequence of blocks, which may provide data redundancy or enhanced
-   performance.
-
-So a container has some metadata and provides a number of arrays which
-are described by that metadata.
-
-Sometimes this model doesn't work perfectly. For example, global
-spares may have their own metadata which is quite different from the
-metadata from any device that participates in one or more arrays.
-Such a global spare might still need to belong to some container so
-that it is available to be used should a failure arise. In that case
-we consider the 'metadata' to be the union of the metadata on the
-active devices which describes the arrays, and the metadata on the
-global spares which only describes the spares. In this case different
-devices in the one container will have quite different metadata.
-
-
-Purpose
--------
-
-The main purpose of mdmon is to update the metadata in response to
-changes to the array which need to be reflected in the metadata before
-future writes to the array can safely be performed.
-These include:
- - transitions from 'clean' to 'dirty'.
- - recording that devices have failed.
- - recording the progress of a 'reshape'.
-
-This requires mdmon to be running at any time that the array is
-writable (a read-only array does not require mdmon to be running).
-
-Because mdmon must be able to process these metadata updates at any
-time, it must (when running) have exclusive write access to the
-metadata. Any other changes (e.g. reconfiguration of the array) must
-go through mdmon.
-
-A secondary role for mdmon is to activate spares when a device fails.
-This role is much less time-critical than the other metadata updates,
-so it could be performed by a separate process, possibly
-"mdadm --monitor" which has a related role of moving devices between
-arrays. A main reason for including this functionality in mdmon is
-that in the native-metadata case this function is handled in the
-kernel, and mdmon's reason for existence is to provide functionality
-which is otherwise handled by the kernel.
-
-
-Design overview
----------------
-
-mdmon is structured as two threads with a common address space and
-common data structures. These threads are known as the 'monitor' and
-the 'manager'.
-
-The 'monitor' has the primary role of monitoring the array for
-important state changes and updating the metadata accordingly.
-As writes to the array can be blocked until 'monitor' completes and
-acknowledges the update, it must be very careful not to block itself.
-In particular it must not block waiting for any write to complete else
-it could deadlock. This means that it must not allocate memory as
-doing this can require dirty memory to be written out and if the
-system chooses to write to the array that mdmon is monitoring, the
-memory allocation could deadlock.
-
-So 'monitor' must never allocate memory and must limit the number of
-other system calls it performs. It may:
- - use select (or poll) to wait for activity on a file descriptor
- - read from a sysfs file descriptor
- - write to a sysfs file descriptor
- - write the metadata out to the block devices using O_DIRECT
- - send a signal (kill) to the manager thread
-
-It must not e.g. open files or do anything similar that might allocate
-resources.
-
-The 'manager' thread does everything else that is needed. If any
-files are to be opened (e.g. because a device has been added to the
-array), the manager does that. If any memory needs to be allocated
-(e.g. to hold data about a new array as can happen when one set of
-metadata describes several arrays), the manager performs that
-allocation.
-
-The 'manager' is also responsible for communicating with mdadm and
-assigning spares to replace failed devices.
-
-
-Handling metadata updates
--------------------------
-
-There are a number of cases in which mdadm needs to update the
-metadata which mdmon is managing. These include:
- - creating a new array in an active container
- - adding a device to a container
- - reconfiguring an array
-etc.
-
-To complete these updates, mdadm must send a message to mdmon which
-will merge the update into the metadata as it is at that moment.
-
-To achieve this, mdmon creates a Unix Domain Socket which the manager
-thread listens on. mdadm sends a message over this socket.
-The manager thread examines the message to see if it will require
-allocating any memory and, if so, allocates it. This is done in the
-'prepare_update' metadata method.
-
-The update message is then queued for handling by the monitor thread,
-which will process it when convenient. The monitor thread calls
-->process_update which should atomically make the required changes to
-the metadata, making use of the pre-allocated memory as required. Any
-memory that is no longer needed can be placed back in the request and
-the manager thread will free it.
-
-The exact format of a metadata update is up to the implementer of the
-metadata handlers. It will simply describe a change that needs to be
-made. It will sometimes contain fragments of the metadata to be
-copied into place. However, the ->process_update routine must make
-sure not to overwrite any field that the monitor thread might have
-updated, such as a 'device failed' or 'array is dirty' state.
-
-When the monitor thread has completed the update and written it to the
-devices, an acknowledgement message is sent back over the socket so
-that mdadm knows it is complete.
-- 
cgit v1.2.3
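
The update handoff described in the deleted document (the manager
allocates, the monitor applies the change without allocating, and the
manager frees what is left over) can be illustrated with a small
single-threaded sketch. This is not mdmon's code: the struct layouts,
the function signatures and the finish_update helper are hypothetical,
and only the names prepare_update and process_update come from the
text. Threads, the socket and the on-disk format are left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified types; these are NOT mdadm's real structures. */
struct metadata {
    int array_dirty;   /* owned by the monitor; updates must not clobber it */
    int n_devices;     /* number of valid entries in device_ids */
    int capacity;      /* number of slots available in device_ids */
    int *device_ids;   /* storage always supplied by the manager */
};

struct update {
    int new_device_id; /* the change requested by mdadm */
    int *space;        /* memory pre-allocated by the manager, or NULL */
    int space_len;     /* number of int slots in space */
};

/* Manager side: the only step that is allowed to allocate memory. */
int prepare_update(const struct metadata *md, struct update *u)
{
    u->space = NULL;
    u->space_len = 0;
    if (md->n_devices + 1 > md->capacity) {
        int len = md->capacity ? md->capacity * 2 : 4;
        u->space = malloc((size_t)len * sizeof(int));
        if (!u->space)
            return -1;
        u->space_len = len;
    }
    return 0;
}

/* Monitor side: applies the change using only pre-allocated memory.
 * It neither allocates nor frees, and it leaves array_dirty alone. */
void process_update(struct metadata *md, struct update *u)
{
    if (u->space) {
        int *old = md->device_ids;
        int old_cap = md->capacity;

        if (md->n_devices > 0)
            memcpy(u->space, old, (size_t)md->n_devices * sizeof(int));
        md->device_ids = u->space;
        md->capacity = u->space_len;
        /* Put the buffer that is no longer needed back in the request. */
        u->space = old;
        u->space_len = old_cap;
    }
    md->device_ids[md->n_devices++] = u->new_device_id;
}

/* Manager side again: frees whatever the monitor handed back. */
void finish_update(struct update *u)
{
    free(u->space);
    u->space = NULL;
    u->space_len = 0;
}
```

A caller would run prepare_update, then process_update, then
finish_update, in that order. Swapping buffers inside process_update
rather than resizing in place is one way to keep every malloc and free
on the manager side, which is the property the design text asks for.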