From 293913568e6a7a86fd1479e1cff8e2ecb58d6568 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sat, 13 Apr 2024 15:44:03 +0200 Subject: Adding upstream version 16.2. Signed-off-by: Daniel Baumann --- src/backend/jit/README | 295 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 295 insertions(+) create mode 100644 src/backend/jit/README (limited to 'src/backend/jit/README') diff --git a/src/backend/jit/README b/src/backend/jit/README new file mode 100644 index 0000000..5427bdf --- /dev/null +++ b/src/backend/jit/README @@ -0,0 +1,295 @@ +What is Just-in-Time Compilation? +================================= + +Just-in-Time compilation (JIT) is the process of turning some form of +interpreted program evaluation into a native program, and doing so at +runtime. + +For example, instead of using a facility that can evaluate arbitrary +SQL expressions to evaluate an SQL predicate like WHERE a.col = 3, it +is possible to generate a function than can be natively executed by +the CPU that just handles that expression, yielding a speedup. + +This is JIT, rather than ahead-of-time (AOT) compilation, because it +is done at query execution time, and perhaps only in cases where the +relevant task is repeated a number of times. Given the way JIT +compilation is used in PostgreSQL, the lines between interpretation, +AOT and JIT are somewhat blurry. + +Note that the interpreted program turned into a native program does +not necessarily have to be a program in the classical sense. E.g. it +is highly beneficial to JIT compile tuple deforming into a native +function just handling a specific type of table, despite tuple +deforming not commonly being understood as a "program". + + +Why JIT? +======== + +Parts of PostgreSQL are commonly bottlenecked by comparatively small +pieces of CPU intensive code. In a number of cases that is because the +relevant code has to be very generic (e.g. handling arbitrary SQL +level expressions, over arbitrary tables, with arbitrary extensions +installed). This often leads to a large number of indirect jumps and +unpredictable branches, and generally a high number of instructions +for a given task. E.g. just evaluating an expression comparing a +column in a database to an integer ends up needing several hundred +cycles. + +By generating native code large numbers of indirect jumps can be +removed by either making them into direct branches (e.g. replacing the +indirect call to an SQL operator's implementation with a direct call +to that function), or by removing it entirely (e.g. by evaluating the +branch at compile time because the input is constant). Similarly a lot +of branches can be entirely removed (e.g. by again evaluating the +branch at compile time because the input is constant). The latter is +particularly beneficial for removing branches during tuple deforming. + + +How to JIT +========== + +PostgreSQL, by default, uses LLVM to perform JIT. LLVM was chosen +because it is developed by several large corporations and therefore +unlikely to be discontinued, because it has a license compatible with +PostgreSQL, and because its IR can be generated from C using the Clang +compiler. + + +Shared Library Separation +------------------------- + +To avoid the main PostgreSQL binary directly depending on LLVM, which +would prevent LLVM support being independently installed by OS package +managers, the LLVM dependent code is located in a shared library that +is loaded on-demand. + +An additional benefit of doing so is that it is relatively easy to +evaluate JIT compilation that does not use LLVM, by changing out the +shared library used to provide JIT compilation. + +To achieve this, code intending to perform JIT (e.g. expression evaluation) +calls an LLVM independent wrapper located in jit.c to do so. If the +shared library providing JIT support can be loaded (i.e. PostgreSQL was +compiled with LLVM support and the shared library is installed), the task +of JIT compiling an expression gets handed off to the shared library. This +obviously requires that the function in jit.c is allowed to fail in case +no JIT provider can be loaded. + +Which shared library is loaded is determined by the jit_provider GUC, +defaulting to "llvmjit". + +Cloistering code performing JIT into a shared library unfortunately +also means that code doing JIT compilation for various parts of code +has to be located separately from the code doing so without +JIT. E.g. the JIT version of execExprInterp.c is located in jit/llvm/ +rather than executor/. + + +JIT Context +----------- + +For performance and convenience reasons it is useful to allow JITed +functions to be emitted and deallocated together. It is e.g. very +common to create a number of functions at query initialization time, +use them during query execution, and then deallocate all of them +together at the end of the query. + +Lifetimes of JITed functions are managed via JITContext. Exactly one +such context should be created for work in which all created JITed +function should have the same lifetime. E.g. there's exactly one +JITContext for each query executed, in the query's EState. Only the +release of a JITContext is exposed to the provider independent +facility, as the creation of one is done on-demand by the JIT +implementations. + +Emitting individual functions separately is more expensive than +emitting several functions at once, and emitting them together can +provide additional optimization opportunities. To facilitate that, the +LLVM provider separates defining functions from optimizing and +emitting functions in an executable manner. + +Creating functions into the current mutable module (a module +essentially is LLVM's equivalent of a translation unit in C) is done +using + extern LLVMModuleRef llvm_mutable_module(LLVMJitContext *context); +in which it then can emit as much code using the LLVM APIs as it +wants. Whenever a function actually needs to be called + extern void *llvm_get_function(LLVMJitContext *context, const char *funcname); +returns a pointer to it. + +E.g. in the expression evaluation case this setup allows most +functions in a query to be emitted during ExecInitNode(), delaying the +function emission to the time the first time a function is actually +used. + + +Error Handling +-------------- + +There are two aspects of error handling. Firstly, generated (LLVM IR) +and emitted functions (mmap()ed segments) need to be cleaned up both +after a successful query execution and after an error. This is done by +registering each created JITContext with the current resource owner, +and cleaning it up on error / end of transaction. If it is desirable +to release resources earlier, jit_release_context() can be used. + +The second, less pretty, aspect of error handling is OOM handling +inside LLVM itself. The above resowner based mechanism takes care of +cleaning up emitted code upon ERROR, but there's also the chance that +LLVM itself runs out of memory. LLVM by default does *not* use any C++ +exceptions. Its allocations are primarily funneled through the +standard "new" handlers, and some direct use of malloc() and +mmap(). For the former a 'new handler' exists: +http://en.cppreference.com/w/cpp/memory/new/set_new_handler +For the latter LLVM provides callbacks that get called upon failure +(unfortunately mmap() failures are treated as fatal rather than OOM errors). +What we've chosen to do for now is have two functions that LLVM using code +must use: +extern void llvm_enter_fatal_on_oom(void); +extern void llvm_leave_fatal_on_oom(void); +before interacting with LLVM code. + +When a libstdc++ new or LLVM error occurs, the handlers set up by the +above functions trigger a FATAL error. We have to use FATAL rather +than ERROR, as we *cannot* reliably throw ERROR inside a foreign +library without risking corrupting its internal state. + +Users of the above sections do *not* have to use PG_TRY/CATCH blocks, +the handlers instead are reset on toplevel sigsetjmp() level. + +Using a relatively small enter/leave protected section of code, rather +than setting up these handlers globally, avoids negative interactions +with extensions that might use C++ such as PostGIS. As LLVM code +generation should never execute arbitrary code, just setting these +handlers temporarily ought to suffice. + + +Type Synchronization +-------------------- + +To be able to generate code that can perform tasks done by "interpreted" +PostgreSQL, it obviously is required that code generation knows about at +least a few PostgreSQL types. While it is possible to inform LLVM about +type definitions by recreating them manually in C code, that is failure +prone and labor intensive. + +Instead there is one small file (llvmjit_types.c) which references each of +the types required for JITing. That file is translated to bitcode at +compile time, and loaded when LLVM is initialized in a backend. + +That works very well to synchronize the type definition, but unfortunately +it does *not* synchronize offsets as the IR level representation doesn't +know field names. Instead, required offsets are maintained as defines in +the original struct definition, like so: +#define FIELDNO_TUPLETABLESLOT_NVALID 9 + int tts_nvalid; /* # of valid values in tts_values */ +While that still needs to be defined, it's only required for a +relatively small number of fields, and it's bunched together with the +struct definition, so it's easily kept synchronized. + + +Inlining +-------- + +One big advantage of JITing expressions is that it can significantly +reduce the overhead of PostgreSQL's extensible function/operator +mechanism, by inlining the body of called functions/operators. + +It obviously is undesirable to maintain a second implementation of +commonly used functions, just for inlining purposes. Instead we take +advantage of the fact that the Clang compiler can emit LLVM IR. + +The ability to do so allows us to get the LLVM IR for all operators +(e.g. int8eq, float8pl etc), without maintaining two copies. These +bitcode files get installed into the server's + $pkglibdir/bitcode/postgres/ +Using existing LLVM functionality (for parallel LTO compilation), +additionally an index is over these is stored to +$pkglibdir/bitcode/postgres.index.bc + +Similarly extensions can install code into + $pkglibdir/bitcode/[extension]/ +accompanied by + $pkglibdir/bitcode/[extension].index.bc + +just alongside the actual library. An extension's index will be used +to look up symbols when located in the corresponding shared +library. Symbols that are used inside the extension, when inlined, +will be first looked up in the main binary and then the extension's. + + +Caching +------- + +Currently it is not yet possible to cache generated functions, even +though that'd be desirable from a performance point of view. The +problem is that the generated functions commonly contain pointers into +per-execution memory. The expression evaluation machinery needs to +be redesigned a bit to avoid that. Basically all per-execution memory +needs to be referenced as an offset to one block of memory stored in +an ExprState, rather than absolute pointers into memory. + +Once that is addressed, adding an LRU cache that's keyed by the +generated LLVM IR will allow the usage of optimized functions even for +faster queries. + +A longer term project is to move expression compilation to the planner +stage, allowing e.g. to tie compiled expressions to prepared +statements. + +An even more advanced approach would be to use JIT with few +optimizations initially, and build an optimized version in the +background. But that's even further off. + + +What to JIT +=========== + +Currently expression evaluation and tuple deforming are JITed. Those +were chosen because they commonly are major CPU bottlenecks in +analytics queries, but are by no means the only potentially beneficial cases. + +For JITing to be beneficial a piece of code first and foremost has to +be a CPU bottleneck. But also importantly, JITing can only be +beneficial if overhead can be removed by doing so. E.g. in the tuple +deforming case the knowledge about the number of columns and their +types can remove a significant number of branches, and in the +expression evaluation case a lot of indirect jumps/calls can be +removed. If neither of these is the case, JITing is a waste of +resources. + +Future avenues for JITing are tuple sorting, COPY parsing/output +generation, and later compiling larger parts of queries. + + +When to JIT +=========== + +Currently there are a number of GUCs that influence JITing: + +- jit_above_cost = -1, 0-DBL_MAX - all queries with a higher total cost + get JITed, *without* optimization (expensive part), corresponding to + -O0. This commonly already results in significant speedups if + expression/deforming is a bottleneck (removing dynamic branches + mostly). +- jit_optimize_above_cost = -1, 0-DBL_MAX - all queries with a higher total cost + get JITed, *with* optimization (expensive part). +- jit_inline_above_cost = -1, 0-DBL_MAX - inlining is tried if query has + higher cost. + +Whenever a query's total cost is above these limits, JITing is +performed. + +Alternative costing models, e.g. by generating separate paths for +parts of a query with lower cpu_* costs, are also a possibility, but +it's doubtful the overhead of doing so is sufficient. Another +alternative would be to count the number of times individual +expressions are estimated to be evaluated, and perform JITing of these +individual expressions. + +The obvious seeming approach of JITing expressions individually after +a number of execution turns out not to work too well. Primarily +because emitting many small functions individually has significant +overhead. Secondarily because the time until JITing occurs causes +relative slowdowns that eat into the gain of JIT compilation. -- cgit v1.2.3