Diffstat (limited to 'mdmon-design.txt')
-rw-r--r-- mdmon-design.txt 146
1 files changed, 0 insertions, 146 deletions
diff --git a/mdmon-design.txt b/mdmon-design.txt
deleted file mode 100644
index f09184a..0000000
--- a/mdmon-design.txt
+++ /dev/null
@@ -1,146 +0,0 @@
-
-When managing a RAID1 array which uses metadata other than the
-"native" metadata understood by the kernel, mdadm makes use of a
-partner program named 'mdmon' to manage some aspects of updating
-that metadata and synchronising the metadata with the array state.
-
-This document provides some details on how mdmon works.
-
-Containers
-----------
-
-As background: mdadm makes a distinction between an 'array' and a
-'container'. Other sources sometimes use the term 'volume' or
-'device' for an 'array', and may use the term 'array' for a
-'container'.
-
-For our purposes:
- - a 'container' is a collection of devices which are described by a
- single set of metadata. The metadata may be stored equally
- on all devices, or different devices may have quite different
- subsets of the total metadata. But there is conceptually one set
- of metadata that unifies the devices.
-
- - an 'array' is a set of data blocks from various devices which
- together are used to present the abstraction of a single linear
- sequence of blocks, which may provide data redundancy or enhanced
- performance.
-
-So a container has some metadata and provides a number of arrays which
-are described by that metadata.
-
-Sometimes this model doesn't work perfectly. For example, global
-spares may have their own metadata which is quite different from the
-metadata from any device that participates in one or more arrays.
-Such a global spare might still need to belong to some container so
-that it is available to be used should a failure arise. In that case
-we consider the 'metadata' to be the union of the metadata on the
-active devices which describes the arrays, and the metadata on the
-global spares which only describes the spares. In this case different
-devices in the one container will have quite different metadata.
-
-
-Purpose
--------
-
-The main purpose of mdmon is to update the metadata in response to
-changes to the array which need to be reflected in the metadata before
-future writes to the array can safely be performed.
-These include:
- - transitions from 'clean' to 'dirty'.
- - recording that devices have failed.
- - recording the progress of a 'reshape'
-
-This requires mdmon to be running at any time that the array is
-writable (a read-only array does not require mdmon to be running).
-
-Because mdmon must be able to process these metadata updates at any
-time, it must (when running) have exclusive write access to the
-metadata. Any other changes (e.g. reconfiguration of the array) must
-go through mdmon.
-
-A secondary role for mdmon is to activate spares when a device fails.
-This role is much less time-critical than the other metadata updates,
-so it could be performed by a separate process, possibly
-"mdadm --monitor" which has a related role of moving devices between
-arrays. A main reason for including this functionality in mdmon is
-that in the native-metadata case this function is handled in the
-kernel, and mdmon's reason for existence is to provide functionality
-which is otherwise handled by the kernel.
-
-
-Design overview
----------------
-
-mdmon is structured as two threads with a common address space and
-common data structures. These threads are known as the 'monitor' and
-the 'manager'.
-
-The 'monitor' has the primary role of monitoring the array for
-important state changes and updating the metadata accordingly. As
-writes to the array can be blocked until 'monitor' completes and
-acknowledges the update, it must be very careful not to block itself.
-In particular it must not block waiting for any write to complete, or
-it could deadlock. This means that it must not allocate memory: an
-allocation can require dirty memory to be written out, and if the
-system chooses to write to the array that mdmon is monitoring, the
-memory allocation could deadlock.
-
-So 'monitor' must never allocate memory and must limit the number of
-other system calls it performs. It may:
- - use select (or poll) to wait for activity on a file descriptor
- - read from a sysfs file descriptor
- - write to a sysfs file descriptor
- - write the metadata out to the block devices using O_DIRECT
- - send a signal (kill) to the manager thread
-
-It must not e.g. open files or do anything similar that might allocate
-resources.
-
-The 'manager' thread does everything else that is needed. If any
-files are to be opened (e.g. because a device has been added to the
-array), the manager does that. If any memory needs to be allocated
-(e.g. to hold data about a new array as can happen when one set of
-metadata describes several arrays), the manager performs that
-allocation.
-
-The 'manager' is also responsible for communicating with mdadm and
-assigning spares to replace failed devices.
-
-
-Handling metadata updates
--------------------------
-
-There are a number of cases in which mdadm needs to update the
-metadata which mdmon is managing. These include:
- - creating a new array in an active container
- - adding a device to a container
- - reconfiguring an array
-etc.
-
-To complete these updates, mdadm must send a message to mdmon which
-will merge the update into the metadata as it is at that moment.
-
-To achieve this, mdmon creates a Unix Domain Socket which the manager
-thread listens on. mdadm sends a message over this socket. The
-manager thread examines the message to see if it will require
-allocating any memory and, if so, allocates it. This is done in the
-'prepare_update' metadata method.
-
-The update message is then queued for handling by the monitor thread,
-which processes it when convenient. The monitor thread calls
-->process_update which should atomically make the required changes to
-the metadata, making use of the pre-allocated memory as required. Any
-memory that is no longer needed can be placed back in the request and
-the manager thread will free it.
-
-The exact format of a metadata update is up to the implementer of the
-metadata handlers. It will simply describe a change that needs to be
-made. It will sometimes contain fragments of the metadata to be
-copied into place. However, the ->process_update routine must make
-sure not to overwrite any field that the monitor thread might have
-updated, such as a 'device failed' or 'array is dirty' state.
-
-When the monitor thread has completed the update and written it to the
-devices, an acknowledgement message is sent back over the socket so
-that mdadm knows it is complete.
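
The update handoff described under "Handling metadata updates" can be
illustrated with a small sketch. This is not mdmon's code (mdmon is a C
program); it is a Python model written only to show the division of
labour. Only the method names prepare_update and process_update come
from the text above. The Update class, the MONITOR_OWNED tuple, the
dictionary layout of the metadata, the array names and the use of a
queue and an event in place of the Unix domain socket are all
assumptions made for illustration. Python cannot actually guarantee
that the monitor side allocates nothing; the point is which thread
performs which step.

```python
import queue
import threading

# Hypothetical field names: state that only the monitor thread may change.
MONITOR_OWNED = ("dirty", "failed")


class Update:
    """Hypothetical update message passed from manager to monitor."""

    def __init__(self, array, changes):
        self.array = array              # name of the array to change
        self.changes = changes          # metadata fragment to merge in
        self.space = None               # record pre-allocated by the manager
        self.acked = threading.Event()  # stands in for the socket reply


def prepare_update(metadata, update):
    """Manager side: perform any allocation the update will need."""
    if update.array not in metadata:
        update.space = {"dirty": False, "failed": set()}


def process_update(metadata, update):
    """Monitor side: merge the change using only pre-allocated space."""
    if update.array not in metadata:
        metadata[update.array] = update.space
        update.space = None  # consumed, so nothing is left for the manager to free
    record = metadata[update.array]
    for field, value in update.changes.items():
        # Never overwrite state the monitor itself may have updated.
        if field not in MONITOR_OWNED:
            record[field] = value


def monitor(metadata, pending):
    """Monitor thread: apply queued updates, then acknowledge each one."""
    while True:
        update = pending.get()
        if update is None:  # sentinel used by this sketch to stop the thread
            break
        process_update(metadata, update)
        update.acked.set()


def demo():
    metadata = {"vol0": {"dirty": True, "failed": {"sdb"}, "size": 100}}
    pending = queue.Queue()
    worker = threading.Thread(target=monitor, args=(metadata, pending))
    worker.start()
    updates = [
        Update("vol0", {"size": 200, "dirty": False}),  # reconfigure an array
        Update("vol1", {"size": 50}),                   # create a new array
    ]
    for update in updates:
        prepare_update(metadata, update)  # manager allocates first
        pending.put(update)               # then queues for the monitor
        update.acked.wait()               # requester waits for the acknowledgement
    pending.put(None)
    worker.join()
    return metadata
```

In this model the attempt to clear 'dirty' on vol0 is ignored, because
that field belongs to the monitor, while the size change is applied.
The record for vol1 is created by the manager in prepare_update and
only slotted into place by the monitor.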