author Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-18 17:35:05 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-18 17:39:31 +0000
commit 85c675d0d09a45a135bddd15d7b385f8758c32fb
tree 76267dbc9b9a130337be3640948fe397b04ac629 /Documentation/mm
parent Adding upstream version 6.6.15.
Adding upstream version 6.7.7. [upstream/6.7.7]
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
-rw-r--r-- Documentation/mm/damon/design.rst 26
-rw-r--r-- Documentation/mm/overcommit-accounting.rst 3
-rw-r--r-- Documentation/mm/page_tables.rst 127
-rw-r--r-- Documentation/mm/vmemmap_dedup.rst 2
4 files changed, 151 insertions, 7 deletions
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index a20383d01a..1f7e0586b5 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -154,6 +154,8 @@ The monitoring overhead of this mechanism will arbitrarily increase as the
size of the target workload grows.
+.. _damon_design_region_based_sampling:
+
Region Based Sampling
~~~~~~~~~~~~~~~~~~~~~
@@ -163,9 +165,10 @@ assumption (pages in a region have the same access frequencies) is kept, only
one page in the region is required to be checked. Thus, for each ``sampling
interval``, DAMON randomly picks one page in each region, waits for one
``sampling interval``, checks whether the page is accessed meanwhile, and
-increases the access frequency of the region if so. Therefore, the monitoring
-overhead is controllable by setting the number of regions. DAMON allows users
-to set the minimum and the maximum number of regions for the trade-off.
+increases the access frequency counter of the region if so. The counter is
+called ``nr_accesses`` of the region. Therefore, the monitoring overhead is
+controllable by setting the number of regions. DAMON allows users to set the
+minimum and the maximum number of regions for the trade-off.
This scheme, however, cannot preserve the quality of the output if the
assumption is not guaranteed.
@@ -190,6 +193,8 @@ In this way, DAMON provides its best-effort quality and minimal overhead while
keeping the bounds users set for their trade-off.
+.. _damon_design_age_tracking:
+
Age Tracking
~~~~~~~~~~~~
@@ -254,7 +259,8 @@ works, DAMON provides a feature called Data Access Monitoring-based Operation
Schemes (DAMOS). It lets users specify their desired schemes at a high
level. For such specifications, DAMON starts monitoring, finds regions having
the access pattern of interest, and applies the user-desired operation actions
-to the regions as soon as found.
+to the regions, for every user-specified time interval called
+``apply_interval``.
.. _damon_design_damos_action:
@@ -471,3 +477,15 @@ modules for proactive reclamation and LRU lists manipulation are provided. For
more detail, please read the usage documents for those
(:doc:`/admin-guide/mm/damon/reclaim` and
:doc:`/admin-guide/mm/damon/lru_sort`).
+
+
+.. _damon_design_execution_model_and_data_structures:
+
+Execution Model and Data Structures
+===================================
+
+The monitoring-related information including the monitoring request
+specification and DAMON-based operation schemes are stored in a data structure
+called DAMON ``context``. DAMON executes each context with a kernel thread
+called ``kdamond``. Multiple kdamonds could run in parallel, for different
+types of monitoring.
diff --git a/Documentation/mm/overcommit-accounting.rst b/Documentation/mm/overcommit-accounting.rst
index a4895d6fc1..e2263477f6 100644
--- a/Documentation/mm/overcommit-accounting.rst
+++ b/Documentation/mm/overcommit-accounting.rst
@@ -8,8 +8,7 @@ The Linux kernel supports the following overcommit handling modes
Heuristic overcommit handling. Obvious overcommits of address
space are refused. Used for a typical system. It ensures a
seriously wild allocation fails while allowing overcommit to
- reduce swap usage. root is allowed to allocate slightly more
- memory in this mode. This is the default.
+ reduce swap usage. This is the default.
1
Always overcommit. Appropriate for some scientific
diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
index 7840c18917..be47b192a5 100644
--- a/Documentation/mm/page_tables.rst
+++ b/Documentation/mm/page_tables.rst
@@ -152,3 +152,130 @@ Page table handling code that wishes to be architecture-neutral, such as the
virtual memory manager, will need to be written so that it traverses all of the
currently five levels. This style should also be preferred for
architecture-specific code, so as to be robust to future changes.
+
+
+MMU, TLB, and Page Faults
+=========================
+
+The `Memory Management Unit (MMU)` is a hardware component that handles virtual
+to physical address translations. It may use relatively small caches in hardware
+called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
+these translations.
+
+When the CPU accesses a memory location, it provides a virtual address to the
+MMU, which checks whether a translation exists in the TLB or in the Page Walk
+Caches (on architectures that support them). If no translation is found, the
+MMU uses page walks to determine the physical address and create the map.
+
+The dirty bit for a page is set (i.e., turned on) when the page is written to.
+Each page of memory has associated permission and dirty bits. The latter
+indicate that the page has been modified since it was loaded into memory.
+
+If nothing prevents it, eventually the physical memory can be accessed and the
+requested operation on the physical frame is performed.
+
+There are several reasons why the MMU can't find certain translations. It could
+happen because the CPU is trying to access memory that the current task is not
+permitted to, or because the data is not present in physical memory.
+
+When these conditions happen, the MMU triggers page faults, which are types of
+exceptions that signal the CPU to pause the current execution and run a special
+function to handle the mentioned exceptions.
+
+There are common and expected causes of page faults. These are triggered by
+process management optimization techniques called "Lazy Allocation" and
+"Copy-on-Write". Page faults may also happen when frames have been swapped out
+to persistent storage (swap partition or file) and evicted from their physical
+locations.
+
+These techniques improve memory efficiency, reduce latency, and minimize space
+occupation. This document won't go deeper into the details of "Lazy Allocation"
+and "Copy-on-Write" because these subjects are out of scope as they belong to
+Process Address Management.
+
+Swapping differentiates itself from the other mentioned techniques because it's
+undesirable since it's performed as a means to reduce memory under heavy
+pressure.
+
+Swapping can't work for memory mapped by kernel logical addresses. These are a
+subset of the kernel virtual space that directly maps a contiguous range of
+physical memory. Given any logical address, its physical address is determined
+with simple arithmetic on an offset. Accesses to logical addresses are fast
+because they avoid the need for complex page table lookups at the expense of
+frames not being evictable or pageable out.
+
+If the kernel fails to make room for the data that must be present in the
+physical frames, the kernel invokes the out-of-memory (OOM) killer to make room
+by terminating lower priority processes until pressure reduces under a safe
+threshold.
+
+Additionally, page faults may also be caused by code bugs or by maliciously
+crafted addresses that the CPU is instructed to access. A thread of a process
+could use instructions to address (non-shared) memory which does not belong to
+its own address space, or could try to execute an instruction that writes to
+a read-only location.
+
+If the above-mentioned conditions happen in user-space, the kernel sends a
+`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually
+causes the termination of the thread and of the process it belongs to.
+
+This document is going to simplify and show a high altitude view of how the
+Linux kernel handles these page faults, creates tables and tables' entries,
+checks if memory is present and, if not, requests to load data from persistent
+storage or from other devices, and updates the MMU and its caches.
+
+The first steps are architecture dependent. Most architectures jump to
+`do_page_fault()`, whereas the x86 interrupt handler is defined by the
+`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro which calls `handle_page_fault()`.
+
+Whatever the route, all architectures end up invoking
+`handle_mm_fault()` which, in turn, (likely) ends up calling
+`__handle_mm_fault()` to carry out the actual work of allocating the page
+tables.
+
+The unfortunate case of not being able to call `__handle_mm_fault()` means
+that the virtual address is pointing to areas of physical memory which are not
+permitted to be accessed (at least from the current context). This
+condition resolves to the kernel sending the above-mentioned SIGSEGV signal
+to the process and leads to the consequences already explained.
+
+`__handle_mm_fault()` carries out its work by calling several functions to
+find the entry's offsets of the upper layers of the page tables and allocate
+the tables that it may need.
+
+The functions that look for the offset have names like `*_offset()`, where the
+"*" is for pgd, p4d, pud, pmd, pte; instead the functions to allocate the
+corresponding tables, layer by layer, are called `*_alloc`, using the
+above-mentioned convention to name them after the corresponding types of tables
+in the hierarchy.
+
+The page table walk may end at one of the middle or upper layers (PMD, PUD).
+
+Linux supports larger page sizes than the usual 4KB (i.e., the so called
+`huge pages`). When using these kinds of larger pages, higher level pages can
+directly map them, with no need to use lower level page entries (PTE). Huge
+pages contain large contiguous physical regions that usually span from 2MB to
+1GB. They are respectively mapped by the PMD and PUD page entries.
+
+The huge pages bring with them several benefits like reduced TLB pressure,
+reduced page table overhead, memory allocation efficiency, and performance
+improvement for certain workloads. However, these benefits come with
+trade-offs, like wasted memory and allocation challenges.
+
+At the very end of the walk with allocations, if it didn't return errors,
+`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via `do_fault()`
+performs one of `do_read_fault()`, `do_cow_fault()`, `do_shared_fault()`.
+"read", "cow", "shared" give hints about the reasons and the kind of fault it's
+handling.
+
+The actual implementation of the workflow is very complex. Its design allows
+Linux to handle page faults in a way that is tailored to the specific
+characteristics of each architecture, while still sharing a common overall
+structure.
+
+To conclude this high altitude view of how Linux handles page faults, let's
+add that the page faults handler can be disabled and enabled respectively with
+`pagefault_disable()` and `pagefault_enable()`.
+
+Several code paths make use of the latter two functions because they need to
+disable traps into the page faults handler, mostly to prevent deadlocks.
diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
index 59891f7242..593ede6d31 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -211,7 +211,7 @@ the device (altmap).
The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
-For powerpc equivalent details see Documentation/powerpc/vmemmap_dedup.rst
+For powerpc equivalent details see Documentation/arch/powerpc/vmemmap_dedup.rst
The differences with HugeTLB are relatively minor.