diff options
Diffstat (limited to 'doc/cephfs/disaster-recovery-experts.rst')
-rw-r--r-- | doc/cephfs/disaster-recovery-experts.rst | 318 |
1 files changed, 318 insertions, 0 deletions
diff --git a/doc/cephfs/disaster-recovery-experts.rst b/doc/cephfs/disaster-recovery-experts.rst new file mode 100644 index 000000000..11df16e38 --- /dev/null +++ b/doc/cephfs/disaster-recovery-experts.rst @@ -0,0 +1,318 @@ + +.. _disaster-recovery-experts: + +Advanced: Metadata repair tools +=============================== + +.. warning:: + + If you do not have expert knowledge of CephFS internals, you will + need to seek assistance before using any of these tools. + + The tools mentioned here can easily cause damage as well as fixing it. + + It is essential to understand exactly what has gone wrong with your + file system before attempting to repair it. + + If you do not have access to professional support for your cluster, + consult the ceph-users mailing list or the #ceph IRC channel. + + +Journal export +-------------- + +Before attempting dangerous operations, make a copy of the journal like so: + +:: + + cephfs-journal-tool journal export backup.bin + +Note that this command may not always work if the journal is badly corrupted, +in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902). + + +Dentry recovery from journal +---------------------------- + +If a journal is damaged or for any reason an MDS is incapable of replaying it, +attempt to recover what file metadata we can like so: + +:: + + cephfs-journal-tool event recover_dentries summary + +This command by default acts on MDS rank 0, pass --rank=<n> to operate on other ranks. + +This command will write any inodes/dentries recoverable from the journal +into the backing store, if these inodes/dentries are higher-versioned +than the previous contents of the backing store. If any regions of the journal +are missing/damaged, they will be skipped. + +Note that in addition to writing out dentries and inodes, this command will update +the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers +are now in use. In simple cases, this will result in an entirely valid backing +store state. + +.. warning:: + + The resulting state of the backing store is not guaranteed to be self-consistent, + and an online MDS scrub will be required afterwards. The journal contents + will not be modified by this command, you should truncate the journal + separately after recovering what you can. + +Journal truncation +------------------ + +If the journal is corrupt or MDSs cannot replay it for any reason, you can +truncate it like so: + +:: + + cephfs-journal-tool [--rank=N] journal reset + +Specify the MDS rank using the ``--rank`` option when the file system has/had +multiple active MDS. + +.. warning:: + + Resetting the journal *will* lose metadata unless you have extracted + it by other means such as ``recover_dentries``. It is likely to leave + some orphaned objects in the data pool. It may result in re-allocation + of already-written inodes, such that permissions rules could be violated. + +MDS table wipes +--------------- + +After the journal has been reset, it may no longer be consistent with respect +to the contents of the MDS tables (InoTable, SessionMap, SnapServer). + +To reset the SessionMap (erase all sessions), use: + +:: + + cephfs-table-tool all reset session + +This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS +rank to operate on that rank only. + +The session table is the table most likely to need resetting, but if you know you +also need to reset the other tables then replace 'session' with 'snap' or 'inode'. + +MDS map reset +------------- + +Once the in-RADOS state of the file system (i.e. contents of the metadata pool) +is somewhat recovered, it may be necessary to update the MDS map to reflect +the contents of the metadata pool. Use the following command to reset the MDS +map to a single MDS: + +:: + + ceph fs reset <fs name> --yes-i-really-mean-it + +Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored: +as a result it is possible for this to result in data loss. + +One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The +key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such +that it would overwrite any existing root inode on disk and orphan any existing files. In +contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS +daemon to claim the rank will go ahead and use the existing in-RADOS metadata. + +Recovery from missing metadata objects +-------------------------------------- + +Depending on what objects are missing or corrupt, you may need to +run various commands to regenerate default versions of the +objects. + +:: + + # Session table + cephfs-table-tool 0 reset session + # SnapServer + cephfs-table-tool 0 reset snap + # InoTable + cephfs-table-tool 0 reset inode + # Journal + cephfs-journal-tool --rank=0 journal reset + # Root inodes ("/" and MDS directory) + cephfs-data-scan init + +Finally, you can regenerate metadata objects for missing files +and directories based on the contents of a data pool. This is +a three-phase process. First, scanning *all* objects to calculate +size and mtime metadata for inodes. Second, scanning the first +object from every file to collect this metadata and inject it into +the metadata pool. Third, checking inode linkages and fixing found +errors. + +:: + + cephfs-data-scan scan_extents <data pool> + cephfs-data-scan scan_inodes <data pool> + cephfs-data-scan scan_links + +'scan_extents' and 'scan_inodes' commands may take a *very long* time +if there are many files or very large files in the data pool. + +To accelerate the process, run multiple instances of the tool. + +Decide on a number of workers, and pass each worker a number within +the range 0-(worker_m - 1). + +The example below shows how to run 4 workers simultaneously: + +:: + + # Worker 0 + cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool> + # Worker 1 + cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool> + # Worker 2 + cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool> + # Worker 3 + cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool> + + # Worker 0 + cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool> + # Worker 1 + cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool> + # Worker 2 + cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool> + # Worker 3 + cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool> + +It is **important** to ensure that all workers have completed the +scan_extents phase before any workers enter the scan_inodes phase. + +After completing the metadata recovery, you may want to run cleanup +operation to delete ancillary data geneated during recovery. + +:: + + cephfs-data-scan cleanup <data pool> + + + +Using an alternate metadata pool for recovery +--------------------------------------------- + +.. warning:: + + There has not been extensive testing of this procedure. It should be + undertaken with great care. + +If an existing file system is damaged and inoperative, it is possible to create +a fresh metadata pool and attempt to reconstruct the file system metadata into +this new pool, leaving the old metadata in place. This could be used to make a +safer attempt at recovery since the existing metadata pool would not be +modified. + +.. caution:: + + During this process, multiple metadata pools will contain data referring to + the same data pool. Extreme caution must be exercised to avoid changing the + data pool contents while this is the case. Once recovery is complete, the + damaged metadata pool should be archived or deleted. + +To begin, the existing file system should be taken down, if not done already, +to prevent further modification of the data pool. Unmount all clients and then +mark the file system failed: + +:: + + ceph fs fail <fs_name> + +Next, create a recovery file system in which we will populate a new metadata pool +backed by the original data pool. + +:: + + ceph fs flag set enable_multiple true --yes-i-really-mean-it + ceph osd pool create cephfs_recovery_meta + ceph fs new cephfs_recovery recovery <data_pool> --allow-dangerous-metadata-overlay + + +The recovery file system starts with an MDS rank that will initialize the new +metadata pool with some metadata. This is necessary to bootstrap recovery. +However, now we will take the MDS down as we do not want it interacting with +the metadata pool further. + +:: + + ceph fs fail cephfs_recovery + +Next, we will reset the initial metadata the MDS created: + +:: + + cephfs-table-tool cephfs_recovery:all reset session + cephfs-table-tool cephfs_recovery:all reset snap + cephfs-table-tool cephfs_recovery:all reset inode + +Now perform the recovery of the metadata pool from the data pool: + +:: + + cephfs-data-scan init --force-init --filesystem cephfs_recovery --alternate-pool cephfs_recovery_meta + cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <fs_name> <data_pool> + cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <fs_name> --force-corrupt <data_pool> + cephfs-data-scan scan_links --filesystem cephfs_recovery + +.. note:: + + Each scan procedure above goes through the entire data pool. This may take a + significant amount of time. See the previous section on how to distribute + this task among workers. + +If the damaged file system contains dirty journal data, it may be recovered next +with: + +:: + + cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries list --alternate-pool cephfs_recovery_meta + cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force + +After recovery, some recovered directories will have incorrect statistics. +Ensure the parameters ``mds_verify_scatter`` and ``mds_debug_scatterstat`` are +set to false (the default) to prevent the MDS from checking the statistics: + +:: + + ceph config rm mds mds_verify_scatter + ceph config rm mds mds_debug_scatterstat + +(Note, the config may also have been set globally or via a ceph.conf file.) +Now, allow an MDS to join the recovery file system: + +:: + + ceph fs set cephfs_recovery joinable true + +Finally, run a forward :doc:`scrub </cephfs/scrub>` to repair the statistics. +Ensure you have an MDS running and issue: + +:: + + ceph fs status # get active MDS + ceph tell mds.<id> scrub start / recursive repair + +.. note:: + + Symbolic links are recovered as empty regular files. `Symbolic link recovery + <https://tracker.ceph.com/issues/46166>`_ is scheduled to be supported in + Pacific. + +It is recommended to migrate any data from the recovery file system as soon as +possible. Do not restore the old file system while the recovery file system is +operational. + +.. note:: + + If the data pool is also corrupt, some files may not be restored because + backtrace information is lost. If any data objects are missing (due to + issues like lost Placement Groups on the data pool), the recovered files + will contain holes in place of the missing data. + +.. _Symbolic link recovery: https://tracker.ceph.com/issues/46166 |