Diffstat (limited to 'doc/dev/osd_internals/partial_object_recovery.rst')
-rw-r--r-- | doc/dev/osd_internals/partial_object_recovery.rst | 148 |
1 files changed, 148 insertions, 0 deletions
diff --git a/doc/dev/osd_internals/partial_object_recovery.rst b/doc/dev/osd_internals/partial_object_recovery.rst
new file mode 100644
index 000000000..a22f63348
--- /dev/null
+++ b/doc/dev/osd_internals/partial_object_recovery.rst
@@ -0,0 +1,148 @@
+=======================
+Partial Object Recovery
+=======================
+
+Partial Object Recovery improves the efficiency of log-based recovery (vs
+backfill). Original log-based recovery calculates missing_set based on pg_log
+differences.
+
+The whole object has to be recovered from one OSD to another
+if pg_log indicates that the object was modified, regardless of how much
+of its content was really modified. That means a 4M object
+in which only 4k was modified has to be recovered as a whole 4M object
+rather than just the modified 4k of content. In addition, the object map is
+also recovered even if it was not modified at all.
+
+Partial Object Recovery is designed to solve the problem mentioned above.
+To achieve this, two things must be done:
+
+1. log where the object is modified
+2. log whether the object_map of the object is modified
+
+The class ObjectCleanRegion is introduced for this purpose.
+clean_offsets is an interval_set<uint64_t>
+that indicates the unmodified content in an object.
+clean_omap is a bool indicating whether object_map is unmodified (true means clean).
+new_object indicates that the object does not yet exist on the OSD.
+max_num_intervals is an upper bound on the number of intervals in clean_offsets
+so that the memory cost of clean_offsets is always bounded.
+
+The shortest clean interval is trimmed if the number of intervals
+in clean_offsets exceeds this bound.
+
+    e.g. max_num_intervals=2, clean_offsets: {[5~10], [20~5]}
+
+    then a new interval [30~10] evicts the shortest one, [20~5]
+
+    finally, clean_offsets becomes {[5~10], [30~10]}
+
+Procedures for Partial Object Recovery
+======================================
+
+Firstly, OpContext and pg_log_entry_t should contain ObjectCleanRegion.
+In do_osd_ops(), finish_copyfrom(), and finish_promote(), the corresponding content
+in ObjectCleanRegion should be marked dirty in order to track the modification of an object.
+The ObjectCleanRegion in OpContext is then copied to its pg_log_entry_t.
+
+Secondly, pg_missing_set must be built and rebuilt correctly.
+When calculating pg_missing_set during the peering process,
+the ObjectCleanRegion in each pg_log_entry_t is also merged.
+
+    e.g. object aa has pg_log (clean intervals, then whether object_map is modified):
+        26'101 {[0~4096, 8192~MAX], false}
+
+        26'104 {[0~8192, 12288~MAX], false}
+
+        28'108 {[0~12288, 16384~MAX], true}
+
+        missing_set for object aa: merge pg_log above --> {[0~4096, 16384~MAX], true}.
+        which means the range from 4096 to 16384 is modified and object_map is also modified as of version 28'108
+
+Also, the OSD may crash after merging the log.
+Therefore, we need to read_log and rebuild pg_missing_set. For example, pg_log is:
+
+    object aa: 26'101 {[0~4096, 8192~MAX], false}
+
+    object bb: 26'102 {[0~4096, 8192~MAX], false}
+
+    object cc: 26'103 {[0~4096, 8192~MAX], false}
+
+    object aa: 26'104 {[0~8192, 12288~MAX], false}
+
+    object dd: 26'105 {[0~4096, 8192~MAX], false}
+
+    object aa: 28'108 {[0~12288, 16384~MAX], true}
+
+Originally, if bb, cc, and dd are recovered and aa is not,
+we need to rebuild pg_missing_set for object aa
+and find that aa is modified in version 28'108.
+If the version in object_info is 26'96 < 28'108,
+we don't need to consider 26'104 and 26'101 because the whole object will be recovered.
+However, Partial Object Recovery also requires us to rebuild ObjectCleanRegion.
+
+Knowing whether the object is modified is not enough.
+
+Therefore, we also need to traverse the earlier pg_log entries,
+that is, 26'104 and 26'101, which are also > object_info (26'96),
+and rebuild pg_missing_set for object aa based on those three log entries: 28'108, 26'104, 26'101.
+The logs are merged in the same way as described above.
+
+Finally, finish the push and pull process based on pg_missing_set.
+copy_subset in recovery_info is updated based on the ObjectCleanRegion in pg_missing_set.
+copy_subset indicates the intervals of content that need to be pulled and pushed.
+
+The complicated part here is submit_push_data,
+where several cases should be considered separately.
+What we need to consider is how to deal with the object data,
+which is made up of omap_header, xattrs, omap, and data:
+
+case 1: first && complete: since object recovery is finished in a single PushOp,
+we would like to preserve the original object and overwrite it directly.
+The object is not removed and no new object is created.
+
+    issue 1: As the object is not removed, old xattrs remain in the old object
+    but may have been updated in the new object. Overwriting the same key or adding new keys is correct,
+    but keys that were removed would wrongly remain.
+    To solve this issue, we need to remove all original xattrs of the object and then set the new xattrs.
+
+    issue 2: As the object is not removed,
+    object_map may be recovered depending on clean_omap.
+    Therefore, if omap is being recovered, we need to remove the old omap of the object for the same reason,
+    since an omap update may also be a deletion.
+    Thus, in this case, we should do:
+
+    1) clear xattrs of the object
+    2) clear omap of the object if omap recovery is needed
+    3) truncate the object to recovery_info.size
+    4) recover omap_header
+    5) recover xattrs, and recover omap if needed
+    6) punch zeros in the original object where fiemap reports no data
+    7) overwrite the object content that is modified
+    8) finish recovery
+
+case 2: first && !complete: object recovery has to be done in multiple passes.
+Here, target_oid will indicate a new temp_object in pgid_TEMP,
+so the issues are a bit different.
+
+    issue 1: As the object is newly created, there is no need to deal with xattrs
+
+    issue 2: As the object is newly created,
+    object_map may not be transmitted, depending on clean_omap.
+    Therefore, if clean_omap is true, we need to clone object_map from the original object.
+    issue 3: As the object is newly created, unmodified data will not be transmitted.
+    Therefore, we need to clone unmodified data from the original object.
+    Thus, in this case, we should do:
+
+    1) remove the temp object
+    2) create a new temp object
+    3) set alloc_hint for the new temp object
+    4) truncate the new temp object to recovery_info.size
+    5) recover omap_header
+    6) clone object_map from the original object if omap is clean
+    7) clone unmodified object_data from the original object
+    8) punch zeros for the new temp object
+    9) recover xattrs, and recover omap if needed
+    10) overwrite the object content that is modified
+    11) remove the original object
+    12) move and rename the new temp object to replace the original object
+    13) finish recovery
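
A minimal Python sketch of the merge rule used in the pg_log example for object aa. It is an illustration under stated assumptions, not Ceph's ObjectCleanRegion code. The names `intersect`, `merge_entries`, `dirty_ranges`, and `MAX` are hypothetical. Intervals are written as half-open `(start, end)` pairs rather than in the document's notation. The boolean follows the example, where true means object_map was modified. The clean intervals of all entries are intersected and the omap flags are OR-ed. The remaining dirty range is roughly what copy_subset would have to cover.

```python
# Hypothetical sketch; not Ceph's ObjectCleanRegion implementation.
MAX = 2**64 - 1  # stand-in for the "MAX" end marker used in the examples


def intersect(a, b):
    """Intersect two sorted lists of half-open (start, end) clean intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def merge_entries(entries):
    """Merge (clean_intervals, omap_modified) pairs from several log entries."""
    clean, omap_modified = [(0, MAX)], False
    for intervals, modified in entries:
        clean = intersect(clean, intervals)          # clean only if clean everywhere
        omap_modified = omap_modified or modified    # dirty if dirty anywhere
    return clean, omap_modified


def dirty_ranges(clean, size):
    """Return the half-open ranges within [0, size) that clean does not cover."""
    out, pos = [], 0
    for start, end in clean:
        if start >= size:
            break
        if pos < start:
            out.append((pos, start))
        pos = max(pos, end)
    if pos < size:
        out.append((pos, size))
    return out


# The three log entries for object aa from the example.
log_aa = [
    ([(0, 4096), (8192, MAX)], False),    # 26'101
    ([(0, 8192), (12288, MAX)], False),   # 26'104
    ([(0, 12288), (16384, MAX)], True),   # 28'108
]
merged_clean, merged_omap = merge_entries(log_aa)
to_copy = dirty_ranges(merged_clean, 4 * 1024 * 1024)  # assumed 4M object
```

With these inputs, `merged_clean` keeps only the ranges that every entry reports as clean, which matches the merged result given in the example, and `to_copy` is the single modified range of a 4M object.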