Diffstat (limited to 'src/backend/utils/mmgr/README')
-rw-r--r-- | src/backend/utils/mmgr/README | 487 |
1 file changed, 487 insertions, 0 deletions
diff --git a/src/backend/utils/mmgr/README b/src/backend/utils/mmgr/README
new file mode 100644
index 0000000..221b4bd
--- /dev/null
+++ b/src/backend/utils/mmgr/README
@@ -0,0 +1,487 @@

src/backend/utils/mmgr/README

Memory Context System Design Overview
=====================================

Background
----------

We do most of our memory allocation in "memory contexts", which are usually
AllocSets as implemented by src/backend/utils/mmgr/aset.c.  The key to
successful memory management without lots of overhead is to define a useful
set of contexts with appropriate lifespans.

The basic operations on a memory context are:

* create a context

* allocate a chunk of memory within a context (equivalent of standard
  C library's malloc())

* delete a context (including freeing all the memory allocated therein)

* reset a context (free all memory allocated in the context, but not the
  context object itself)

* inquire about the total amount of memory allocated to the context
  (the raw memory from which the context allocates chunks; not the
  chunks themselves)

Given a chunk of memory previously allocated from a context, one can
free it or reallocate it larger or smaller (corresponding to standard C
library's free() and realloc() routines).  These operations return memory
to or get more memory from the same context the chunk was originally
allocated in.

At all times there is a "current" context denoted by the
CurrentMemoryContext global variable.  palloc() implicitly allocates space
in that context.  The MemoryContextSwitchTo() operation selects a new current
context (and returns the previous context, so that the caller can restore the
previous context before exiting).

The main advantage of memory contexts over plain use of malloc/free is
that the entire contents of a memory context can be freed easily, without
having to request freeing of each individual chunk within it.  This is
both faster and more reliable than per-chunk bookkeeping.  We use this
fact to clean up at transaction end: by resetting all the active contexts
of transaction or shorter lifespan, we can reclaim all transient memory.
Similarly, we can clean up at the end of each query, or after each tuple
is processed during a query.


Some Notes About the palloc API Versus Standard C Library
----------------------------------------------------------

The behavior of palloc and friends is similar to the standard C library's
malloc and friends, but there are some deliberate differences too.  Here
are some notes to clarify the behavior.

* If out of memory, palloc and repalloc exit via elog(ERROR).  They
never return NULL, and it is not necessary or useful to test for such
a result.  With palloc_extended() that behavior can be overridden
using the MCXT_ALLOC_NO_OOM flag.

* palloc(0) is explicitly a valid operation.  It does not return a NULL
pointer, but a valid chunk of which no bytes may be used.  However, the
chunk might later be repalloc'd larger; it can also be pfree'd without
error.  Similarly, repalloc allows realloc'ing to zero size.

* pfree and repalloc do not accept a NULL pointer.  This is intentional.
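As an illustration of how these operations fit together, here is a minimal
sketch of ordinary backend code that does scratch work in a throwaway
context.  The function and the context name are invented for this example;
ALLOCSET_DEFAULT_SIZES is the stock macro supplying aset.c's usual
block-size parameters (discussed further under "More Control Over aset.c
Behavior" below).

    #include "postgres.h"

    #include "utils/memutils.h"

    /* Invented example routine: do some scratch work in a throwaway context. */
    static void
    work_in_temp_context(void)
    {
        MemoryContext tmpcxt;
        MemoryContext oldcxt;
        char       *buf;
        char       *maybe;

        /* Create a context as a child of whatever context is current. */
        tmpcxt = AllocSetContextCreate(CurrentMemoryContext,
                                       "demo temp context",
                                       ALLOCSET_DEFAULT_SIZES);
        oldcxt = MemoryContextSwitchTo(tmpcxt);

        buf = palloc(128);              /* allocated in tmpcxt */
        buf = repalloc(buf, 256);       /* grown within the same context */

        /* palloc_extended() can return NULL instead of elog(ERROR) on OOM. */
        maybe = palloc_extended(1024 * 1024, MCXT_ALLOC_NO_OOM);
        if (maybe == NULL)
            elog(LOG, "large allocation failed; continuing without it");

        MemoryContextSwitchTo(oldcxt);  /* restore the previous context */

        /* One call frees buf, maybe, and anything else allocated above. */
        MemoryContextDelete(tmpcxt);
    }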
The Current Memory Context
--------------------------

Because it would be too much notational overhead to always pass an
appropriate memory context to called routines, there always exists the
notion of the current memory context CurrentMemoryContext.  Without it,
for example, the copyObject routines would need to be passed a context, as
would function execution routines that return a pass-by-reference
datatype.  The same goes for routines that temporarily allocate space
internally but don't return it to their caller.  We certainly don't
want to clutter every call in the system with "here is a context to
use for any temporary memory allocation you might want to do".

The upshot of that reasoning, though, is that CurrentMemoryContext should
generally point at a short-lifespan context if at all possible.  During
query execution it usually points to a context that gets reset after each
tuple.  Only in *very* circumscribed code should it ever point at a
context having greater than transaction lifespan, since doing so risks
permanent memory leaks.


pfree/repalloc Do Not Depend On CurrentMemoryContext
----------------------------------------------------

pfree() and repalloc() can be applied to any chunk whether it belongs
to CurrentMemoryContext or not --- the chunk's owning context will be
invoked to handle the operation, regardless.


"Parent" and "Child" Contexts
-----------------------------

If all contexts were independent, it'd be hard to keep track of them,
especially in error cases.  That is solved by creating a tree of
"parent" and "child" contexts.  When creating a memory context, the
new context can be specified to be a child of some existing context.
A context can have many children, but only one parent.  In this way
the contexts form a forest (not necessarily a single tree, since there
could be more than one top-level context; although in current practice
there is only one top context, TopMemoryContext).

Deleting a context deletes all its direct and indirect children as
well.  When resetting a context it's almost always more useful to
delete child contexts, thus MemoryContextReset() means that, and if
you really do want a tree of empty contexts you need to call
MemoryContextResetOnly() plus MemoryContextResetChildren().

These features allow us to manage a lot of contexts without fear that
some will be leaked; we only need to keep track of one top-level
context that we are going to delete at transaction end, and make sure
that any shorter-lived contexts we create are descendants of that
context.  Since the tree can have multiple levels, we can deal easily
with nested lifetimes of storage, such as per-transaction,
per-statement, per-scan, per-tuple.  Storage lifetimes that only
partially overlap can be handled by allocating from different trees of
the context forest (there are some examples in the next section).

For convenience we also provide operations like "reset/delete all children
of a given context, but don't reset or delete that context itself".
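A sketch of how such a subtree is typically built and torn down follows;
the context names and the function are invented, and the example assumes
we are inside a transaction so that TopTransactionContext exists.

    #include "postgres.h"

    #include "utils/memutils.h"

    /* Invented example; assumes we are inside a transaction. */
    static void
    demo_context_tree(void)
    {
        MemoryContext per_query;
        MemoryContext per_tuple;

        per_query = AllocSetContextCreate(TopTransactionContext,
                                          "demo per-query context",
                                          ALLOCSET_DEFAULT_SIZES);
        per_tuple = AllocSetContextCreate(per_query,
                                          "demo per-tuple context",
                                          ALLOCSET_DEFAULT_SIZES);

        /* Frees everything in per_query and deletes per_tuple entirely. */
        MemoryContextReset(per_query);

        /* per_tuple is gone; recreate it if an empty child is still wanted. */
        per_tuple = AllocSetContextCreate(per_query,
                                          "demo per-tuple context",
                                          ALLOCSET_DEFAULT_SIZES);

        /* The two-step form keeps the (now empty) child context around. */
        MemoryContextResetOnly(per_query);
        MemoryContextResetChildren(per_query);

        /* Deleting the parent also deletes all remaining descendants. */
        MemoryContextDelete(per_query);
    }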
Memory Context Reset/Delete Callbacks
-------------------------------------

A feature introduced in Postgres 9.5 allows memory contexts to be used
for managing more resources than just plain palloc'd memory.  This is
done by registering a "reset callback function" for a memory context.
Such a function will be called, once, just before the context is next
reset or deleted.  It can be used to give up resources that are in some
sense associated with an object allocated within the context.  Possible
use-cases include
* closing open files associated with a tuplesort object;
* releasing reference counts on long-lived cache objects that are held
  by some object within the context being reset;
* freeing malloc-managed memory associated with some palloc'd object.
That last case would just represent bad programming practice for pure
Postgres code; better to have made all the allocations using palloc,
in the target context or some child context.  However, it could well
come in handy for code that interfaces to non-Postgres libraries.

Any number of reset callbacks can be established for a memory context;
they are called in reverse order of registration.  Also, callbacks
attached to child contexts are called before callbacks attached to
parent contexts, if a tree of contexts is being reset or deleted.

The API for this requires the caller to provide a MemoryContextCallback
memory chunk to hold the state for a callback.  Typically this should be
allocated in the same context it is logically attached to, so that it
will be released automatically after use.  The reason for asking the
caller to provide this memory is that in most usage scenarios, the caller
will be creating some larger struct within the target context, and the
MemoryContextCallback struct can be made "for free" without a separate
palloc() call by including it in this larger struct.
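A sketch of that usage pattern follows, with the callback struct embedded
in a larger object.  The "ExternalBlob" struct and ext_blob_release() are
hypothetical stand-ins for an object that owns some non-palloc resource.

    #include "postgres.h"

    #include "utils/memutils.h"

    /*
     * Hypothetical object that owns a resource palloc doesn't know about.
     * The MemoryContextCallback is embedded in the struct, so registering
     * the callback costs no extra palloc.
     */
    typedef struct ExternalBlob
    {
        void       *malloced_data;      /* e.g. memory from a foreign library */
        MemoryContextCallback cb;
    } ExternalBlob;

    static void
    ext_blob_release(void *arg)
    {
        ExternalBlob *blob = (ExternalBlob *) arg;

        free(blob->malloced_data);      /* give the foreign resource back */
    }

    static ExternalBlob *
    make_external_blob(MemoryContext cxt)
    {
        ExternalBlob *blob = MemoryContextAlloc(cxt, sizeof(ExternalBlob));

        blob->malloced_data = malloc(1024);
        if (blob->malloced_data == NULL)
            elog(ERROR, "out of memory");

        /* Fires once, just before cxt is next reset or deleted. */
        blob->cb.func = ext_blob_release;
        blob->cb.arg = blob;
        MemoryContextRegisterResetCallback(cxt, &blob->cb);

        return blob;
    }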
Memory Contexts in Practice
===========================

Globally Known Contexts
-----------------------

There are a few widely-known contexts that are typically referenced
through global variables.  At any instant the system may contain many
additional contexts, but all other contexts should be direct or indirect
children of one of these contexts to ensure they are not leaked in event
of an error.

TopMemoryContext --- this is the actual top level of the context tree;
every other context is a direct or indirect child of this one.  Allocating
here is essentially the same as "malloc", because this context will never
be reset or deleted.  This is for stuff that should live forever, or for
stuff that the controlling module will take care of deleting at the
appropriate time.  An example is fd.c's tables of open files.  Avoid
allocating stuff here unless really necessary, and especially avoid
running with CurrentMemoryContext pointing here.

PostmasterContext --- this is the postmaster's normal working context.
After a backend is spawned, it can delete PostmasterContext to free its
copy of memory the postmaster was using that it doesn't need.
Note that in non-EXEC_BACKEND builds, the postmaster's copy of pg_hba.conf
and pg_ident.conf data is used directly during authentication in backend
processes; so backends can't delete PostmasterContext until that's done.
(The postmaster has only TopMemoryContext, PostmasterContext, and
ErrorContext --- the remaining top-level contexts are set up in each
backend during startup.)

CacheMemoryContext --- permanent storage for relcache, catcache, and
related modules.  This will never be reset or deleted, either, so it's
not truly necessary to distinguish it from TopMemoryContext.  But it
seems worthwhile to maintain the distinction for debugging purposes.
(Note: CacheMemoryContext has child contexts with shorter lifespans.
For example, a child context is the best place to keep the subsidiary
storage associated with a relcache entry; that way we can free rule
parsetrees and so forth easily, without having to depend on constructing
a reliable version of freeObject().)

MessageContext --- this context holds the current command message from the
frontend, as well as any derived storage that need only live as long as
the current message (for example, in simple-Query mode the parse and plan
trees can live here).  This context will be reset, and any children
deleted, at the top of each cycle of the outer loop of PostgresMain.  This
is kept separate from per-transaction and per-portal contexts because a
query string might need to live either a longer or shorter time than any
single transaction or portal.

TopTransactionContext --- this holds everything that lives until end of the
top-level transaction.  This context will be reset, and all its children
deleted, at conclusion of each top-level transaction cycle.  In most cases
you don't want to allocate stuff directly here, but in CurTransactionContext;
what does belong here is control information that exists explicitly to manage
status across multiple subtransactions.  Note: this context is NOT cleared
immediately upon error; its contents will survive until the transaction block
is exited by COMMIT/ROLLBACK.

CurTransactionContext --- this holds data that has to survive until the end
of the current transaction, and in particular will be needed at top-level
transaction commit.  When we are in a top-level transaction this is the same
as TopTransactionContext, but in subtransactions it points to a child context.
It is important to understand that if a subtransaction aborts, its
CurTransactionContext is thrown away after finishing the abort processing;
but a committed subtransaction's CurTransactionContext is kept until top-level
commit (unless of course one of the intermediate levels of subtransaction
aborts).  This ensures that we do not keep data from a failed subtransaction
longer than necessary.  Because of this behavior, you must be careful to clean
up properly during subtransaction abort --- the subtransaction's state must be
delinked from any pointers or lists kept in upper transactions, or you will
have dangling pointers leading to a crash at top-level commit.  An example of
data kept here is pending NOTIFY messages, which are sent at top-level commit,
but only if the generating subtransaction did not abort.

PortalContext --- this is not actually a separate context, but a
global variable pointing to the per-portal context of the currently active
execution portal.  This can be used if it's necessary to allocate storage
that will live just as long as the execution of the current portal requires.

ErrorContext --- this permanent context is switched into for error
recovery processing, and then reset on completion of recovery.  We arrange
to have a few KB of memory available in it at all times.  In this way, we
can ensure that some memory is available for error recovery even if the
backend has run out of memory otherwise.  This allows out-of-memory to be
treated as a normal ERROR condition, not a FATAL error.
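When code does need to put a single object into a long-lived context such
as CacheMemoryContext, the usual idiom is to switch contexts only around
the allocation itself and switch right back, rather than leaving
CurrentMemoryContext pointing at the permanent context.  A sketch (the
helper function is invented for illustration):

    #include "postgres.h"

    #include "utils/memutils.h"

    /* Invented helper: keep a permanent copy of a string. */
    static char *
    save_string_forever(const char *src)
    {
        MemoryContext oldcxt;
        char       *copy;

        /* Switch only around the allocation itself, then switch right back. */
        oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
        copy = pstrdup(src);
        MemoryContextSwitchTo(oldcxt);

        /* Equivalent, without touching CurrentMemoryContext at all: */
        /* copy = MemoryContextStrdup(CacheMemoryContext, src); */

        return copy;
    }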
Contexts For Prepared Statements And Portals
--------------------------------------------

A prepared-statement object has an associated private context, in which
the parse and plan trees for its query are stored.  Because these trees
are read-only to the executor, the prepared statement can be re-used many
times without further copying of these trees.

An execution-portal object has a private context that is referenced by
PortalContext when the portal is active.  In the case of a portal created
by DECLARE CURSOR, this private context contains the query parse and plan
trees (there being no other object that can hold them).  Portals created
from prepared statements simply reference the prepared statements' trees,
and don't actually need any storage allocated in their private contexts.


Logical Replication Worker Contexts
-----------------------------------

ApplyContext --- permanent for the whole lifetime of the apply worker.  It
would be possible to use TopMemoryContext here as well, but for simplicity
of memory-usage analysis we spin up a separate context.

ApplyMessageContext --- short-lived context that is reset after each
logical replication protocol message is processed.


Transient Contexts During Execution
-----------------------------------

When creating a prepared statement, the parse and plan trees will be built
in a temporary context that's a child of MessageContext (so that it will
go away automatically upon error).  On success, the finished plan is
copied to the prepared statement's private context, and the temp context
is released; this allows planner temporary space to be recovered before
execution begins.  (In simple-Query mode we don't bother with the extra
copy step, so the planner temp space stays around till end of query.)

The top-level executor routines, as well as most of the "plan node"
execution code, will normally run in a context that is created by
ExecutorStart and destroyed by ExecutorEnd; this context also holds the
"plan state" tree built during ExecutorStart.  Most of the memory
allocated in these routines is intended to live until end of query,
so this is appropriate for those purposes.  The executor's top context
is a child of PortalContext, that is, the per-portal context of the
portal that represents the query's execution.

The main memory-management consideration in the executor is that
expression evaluation --- both for qual testing and for computation of
targetlist entries --- needs to not leak memory.  To do this, each
ExprContext (expression-eval context) created in the executor has a
private memory context associated with it, and we switch into that context
when evaluating expressions in that ExprContext.  The plan node that owns
the ExprContext is responsible for resetting the private context to empty
when it no longer needs the results of expression evaluations.  Typically
the reset is done at the start of each tuple-fetch cycle in the plan node.

Note that this design gives each plan node its own expression-eval memory
context.  This appears necessary to handle nested joins properly, since
an outer plan node might need to retain expression results it has computed
while obtaining the next tuple from an inner node --- but the inner node
might execute many tuple cycles and many expressions before returning a
tuple.  The inner node must be able to reset its own expression context
more often than once per outer tuple cycle.  Fortunately, memory contexts
are cheap enough that giving one to each plan node doesn't seem like a
problem.
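A sketch of the reset-at-start-of-cycle convention as it appears inside a
plan node's per-tuple work; the function here is a simplified, invented
stand-in for real node code, while ResetExprContext() and
ecxt_per_tuple_memory are the actual executor facilities involved.

    #include "postgres.h"

    #include "executor/executor.h"

    /*
     * Simplified stand-in for a plan node's per-tuple work; econtext is
     * assumed to be the node's ps_ExprContext.
     */
    static void
    per_tuple_cycle(ExprContext *econtext, int tupno)
    {
        MemoryContext oldcxt;
        char       *scratch;

        /* Reset at the START of the cycle: the prior cycle's results die here. */
        ResetExprContext(econtext);

        /* Transient allocations go into the per-tuple memory context. */
        oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
        scratch = psprintf("processing tuple %d", tupno);
        elog(DEBUG1, "%s", scratch);
        MemoryContextSwitchTo(oldcxt);

        /* No pfree needed; the next ResetExprContext() reclaims "scratch". */
    }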
A problem with running index accesses and sorts in a query-lifespan context
is that these operations invoke datatype-specific comparison functions,
and if the comparators leak any memory then that memory won't be recovered
till end of query.  The comparator functions all return bool or int32,
so there's no problem with their result data, but there can be a problem
with leakage of internal temporary data.  In particular, comparator
functions that operate on TOAST-able data types need to be careful
not to leak detoasted versions of their inputs.  This is annoying, but
it appeared a lot easier to make the comparators conform than to fix the
index and sort routines, so that's what was done for 7.1.  This remains
the state of affairs in btree and hash indexes, so btree and hash support
functions still need to not leak memory.  Most of the other index AMs
have been modified to run opclass support functions in short-lived
contexts, so that leakage is not a problem; this is necessary in view
of the fact that their support functions tend to be far more complex.

There are some special cases, such as aggregate functions.  nodeAgg.c
needs to remember the results of evaluation of aggregate transition
functions from one tuple cycle to the next, so it can't just discard
all per-tuple state in each cycle.  The easiest way to handle this seems
to be to have two per-tuple contexts in an aggregate node, and to
ping-pong between them, so that at each tuple one is the active allocation
context and the other holds any results allocated by the prior cycle's
transition function.

Executor routines that switch the active CurrentMemoryContext may need
to copy data into their caller's current memory context before returning.
However, we have minimized the need for that, because of the convention
of resetting the per-tuple context at the *start* of an execution cycle
rather than at its end.  With that rule, an execution node can return a
tuple that is palloc'd in its per-tuple context, and the tuple will remain
good until the node is called for another tuple or told to end execution.
This parallels the situation with pass-by-reference values at the table
scan level, since a scan node can return a direct pointer to a tuple in a
disk buffer that is only guaranteed to remain good that long.

A more common reason for copying data is to transfer a result from
per-tuple context to per-query context; for example, a Unique node will
save the last distinct tuple value in its per-query context, requiring a
copy step.
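The ping-pong arrangement can be sketched schematically as follows.  This
is not nodeAgg.c's actual code, just the shape of the idea, with invented
names.

    #include "postgres.h"

    #include "utils/memutils.h"

    /* Invented names; nodeAgg.c's real state is considerably richer. */
    typedef struct PingPongState
    {
        MemoryContext ctx[2];   /* the two per-tuple contexts */
        int         current;    /* index of the active allocation context */
    } PingPongState;

    static void
    pingpong_init(PingPongState *st, MemoryContext parent)
    {
        st->ctx[0] = AllocSetContextCreate(parent, "pingpong context 0",
                                           ALLOCSET_DEFAULT_SIZES);
        st->ctx[1] = AllocSetContextCreate(parent, "pingpong context 1",
                                           ALLOCSET_DEFAULT_SIZES);
        st->current = 0;
    }

    static void
    pingpong_advance(PingPongState *st)
    {
        /* The other context becomes the active allocation context... */
        st->current = 1 - st->current;

        /*
         * ...and is reset before use.  The previous cycle's transition
         * results still live in ctx[1 - current], so the transition function
         * can read them while building this cycle's results in ctx[current].
         */
        MemoryContextReset(st->ctx[st->current]);
    }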
Mechanisms to Allow Multiple Types of Contexts
----------------------------------------------

To efficiently allow for different allocation patterns, and for
experimentation, we allow for different types of memory contexts with
different allocation policies but similar external behavior.  To
handle this, memory allocation functions are accessed via function
pointers, and we require all context types to obey the conventions
given here.

A memory context is represented by struct MemoryContextData (see
memnodes.h).  This struct identifies the exact type of the context, and
contains information common to the different types of MemoryContext,
like the parent and child contexts, and the name of the context.

This is essentially an abstract superclass, and the behavior is
determined by the "methods" pointer to its virtual function table
(struct MemoryContextMethods).  Specific memory context types will use
derived structs having these fields as their first fields.  All the
contexts of a specific type will have methods pointers that point to
the same static table of function pointers.

While operations like allocating from and resetting a context take the
relevant MemoryContext as a parameter, operations like free and
realloc are trickier.  To make those work, we require all memory
context types to produce allocated chunks that are immediately,
without any padding, preceded by a pointer to the corresponding
MemoryContext.

If a type of allocator needs additional information about its chunks,
e.g. the size of the allocation, that information can in turn precede
the MemoryContext.  This means the only overhead implied by the memory
context mechanism is a pointer to its context, so we're not
constraining context-type designers very much.

Given this, routines like pfree determine their corresponding context
with an operation like (although that is usually encapsulated in
GetMemoryChunkContext())

    MemoryContext context = *(MemoryContext *) (((char *) pointer) - sizeof(void *));

and then invoke the corresponding method for the context

    context->methods->free_p(pointer);


More Control Over aset.c Behavior
---------------------------------

By default aset.c always allocates an 8K block upon the first
allocation in a context, and doubles that size for each successive
block request.  That's good behavior for a context that might hold
*lots* of data.  But if there are dozens if not hundreds of smaller
contexts in the system, we need to be able to fine-tune things a
little better.

The creator of a context is able to specify an initial block size and
a maximum block size.  Selecting smaller values can prevent wastage of
space in contexts that aren't expected to hold very much (an example
is the relcache's per-relation contexts).

Also, it is possible to specify a minimum context size, in case for some
reason that should be different from the initial size for additional
blocks.  An aset.c context will always contain at least one block,
of size minContextSize if that is specified, otherwise initBlockSize.

We expect that per-tuple contexts will be reset frequently and typically
will not allocate very much space per tuple cycle.  To make this usage
pattern cheap, the first block allocated in a context is not given
back to malloc() during reset, but just cleared.  This avoids malloc
thrashing.


Alternative Memory Context Implementations
------------------------------------------

aset.c is our default general-purpose implementation, working fine
in most situations.  We also have two implementations optimized for
special use cases, providing either better performance or lower memory
usage compared to aset.c (or both).

* slab.c (SlabContext) is designed for allocations of fixed-length
  chunks, and does not allow allocations of chunks with different size.

* generation.c (GenerationContext) is designed for cases when chunks
  are allocated in groups with similar lifespan (generations), or
  roughly in FIFO order.

Both memory contexts aim to free memory back to the operating system
(unlike aset.c, which keeps the freed chunks in a freelist, and only
returns the memory when reset/deleted).

These memory contexts were initially developed for ReorderBuffer, but
may be useful elsewhere as long as the allocation patterns match.
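For illustration, here is a sketch of creating contexts with non-default
parameters.  The AllocSetContextCreate() sizing macros are stable, but the
exact signatures of SlabContextCreate() and GenerationContextCreate() have
varied somewhat across major releases, so take these calls as indicative
rather than definitive.

    #include "postgres.h"

    #include "utils/memutils.h"

    static void
    create_tuned_contexts(MemoryContext parent)
    {
        MemoryContext small;
        MemoryContext slab;
        MemoryContext gen;

        /*
         * ALLOCSET_SMALL_SIZES expands to small minContextSize, initBlockSize
         * and maxBlockSize values, suitable for contexts that are expected to
         * stay tiny; the three sizes can also be written out explicitly.
         */
        small = AllocSetContextCreate(parent, "small demo context",
                                      ALLOCSET_SMALL_SIZES);

        /* slab.c: every chunk in this context must be exactly 64 bytes. */
        slab = SlabContextCreate(parent, "slab demo context",
                                 SLAB_DEFAULT_BLOCK_SIZE, 64);

        /* generation.c: chunks are expected to die in roughly FIFO order. */
        gen = GenerationContextCreate(parent, "generation demo context",
                                      SLAB_DEFAULT_BLOCK_SIZE);

        (void) small;
        (void) slab;
        (void) gen;
    }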
Memory Accounting
-----------------

One of the basic memory context operations is determining the amount of
memory used in the context (and its children).  We have multiple places
that implement their own ad hoc memory accounting, and this is meant to
provide a unified approach.  Ad hoc accounting solutions work for places
with tight control over the allocations or when it's easy to determine
sizes of allocated chunks (e.g. places that only work with tuples).

The accounting built into the memory contexts is transparent: it works
for all allocations as long as they end up in the right memory context
subtree.

Consider for example aggregate functions --- the aggregate state is often
represented by an arbitrary structure, allocated from the transition
function, so ad hoc accounting is unlikely to work.  The built-in
accounting will however handle such cases just fine.

To minimize overhead, the accounting is done at the block level, not for
individual allocation chunks.

The accounting is lazy --- after a block is allocated (or freed), only the
context owning that block is updated.  This means that when inquiring
about the memory usage in a given context, we have to walk all child
contexts recursively.  Consequently, the memory accounting is not intended
for cases with very many memory contexts in the relevant subtree.
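A sketch of inquiring about usage from C code follows.
MemoryContextMemAllocated() is the entry point added along with this
block-level accounting, and MemoryContextStats() is the long-standing
debugging dump; the reporting helper itself is invented for illustration.

    #include "postgres.h"

    #include "utils/memutils.h"

    /* Invented reporting helper. */
    static void
    report_context_usage(MemoryContext cxt)
    {
        Size        self_only = MemoryContextMemAllocated(cxt, false);
        Size        with_children = MemoryContextMemAllocated(cxt, true);

        elog(LOG, "context \"%s\": %zu bytes allocated, %zu including children",
             cxt->name, self_only, with_children);

        MemoryContextStats(cxt);        /* per-context dump, written to stderr */
    }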