Diffstat (limited to 'js/src/vm/PortableBaselineInterpret.h')
-rw-r--r-- | js/src/vm/PortableBaselineInterpret.h | 353 |
1 files changed, 353 insertions, 0 deletions
diff --git a/js/src/vm/PortableBaselineInterpret.h b/js/src/vm/PortableBaselineInterpret.h
new file mode 100644
index 0000000000..a9a6cb356e
--- /dev/null
+++ b/js/src/vm/PortableBaselineInterpret.h
@@ -0,0 +1,353 @@
+/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*-
+ * vim: set ts=8 sts=2 et sw=2 tw=80:
+ * This Source Code Form is subject to the terms of the Mozilla Public
+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
+
+#ifndef vm_PortableBaselineInterpret_h
+#define vm_PortableBaselineInterpret_h
+
+/*
+ * [SMDOC] Portable Baseline Interpreter
+ * =====================================
+ *
+ * The Portable Baseline Interpreter (PBL) is a portable interpreter
+ * that supports executing ICs by directly interpreting CacheIR.
+ *
+ * This interpreter tier fits into the hierarchy between the C++
+ * interpreter, which is fully generic and does not specialize with
+ * ICs, and the native baseline interpreter, which does attach and
+ * execute ICs but requires native codegen (JIT). The distinguishing
+ * feature of PBL is that it *does not require codegen*: it can run on
+ * any platform for which SpiderMonkey supports an interpreter-only
+ * build. This is useful both for platforms that do not support
+ * runtime addition of new code (e.g., running within a WebAssembly
+ * module with a `wasm32-wasi` build) and for those that disallow it for
+ * security reasons.
+ *
+ * The main idea of PBL is to emulate, as much as possible, how the
+ * native baseline interpreter works, so that the rest of the engine
+ * can work the same either way. The main aspect of this "emulation"
+ * comes with stack frames: unlike the native blinterp and JIT tiers,
+ * we cannot use the machine stack, because we are still executing in
+ * portable C++ code and the platform's C++ compiler controls the
+ * machine stack's layout. Instead, we use an auxiliary stack.
+ *
+ * Auxiliary Stack
+ * ---------------
+ *
+ * PBL creates baseline stack frames (see `BaselineFrame` and related
+ * structs) on an *auxiliary stack*, contiguous memory allocated and
+ * owned by the JSRuntime.
+ *
+ * This stack operates nearly identically to the machine stack: it
+ * grows downward, we push stack frames, we maintain a linked list of
+ * frame pointers, and a series of contiguous frames form a
+ * `JitActivation`, with the most recent activation reachable from the
+ * `JSContext`. The only actual difference is that the return address
+ * slots in the frame layouts are always null pointers, because there
+ * is no need to save return addresses: we always know where we are
+ * going to return to (either post-IC code -- the return point of
+ * which is known because we actually do a C++-level call from the
+ * JSOp interpreter to the IC interpreter -- or to dispatch the next
+ * JSOp).
+ *
+ * The same invariants as for native baseline code apply here: when we
+ * are in `PortableBaselineInterpret` (the PBL interpreter body) or
+ * `ICInterpretOps` (the IC interpreter) or related helpers, it is as
+ * if we are in JIT code, and local state determines the top-of-stack
+ * and innermost frame. The activation is not "finished" and cannot be
+ * traversed. When we need to call into the rest of SpiderMonkey, we
+ * emulate how that would work in JIT code, via an exit frame (that
+ * would ordinarily be pushed by a trampoline) and saving that frame
+ * as the exit-frame pointer in the VM state.
+ *
+ * To add a little compile-time enforcement of this strategy, and
+ * ensure that we don't accidentally call something that will want to
+ * traverse the (in-progress and not-completed) JIT activation, we use
+ * a helper class `VMFrame` that pushes and pops the exit frame,
+ * wrapping the callsite into the rest of SM with an RAII idiom. Then,
+ * we *hide the `JSContext`*, and rely on the idiom that `cx` is
+ * passed to anything that can GC or otherwise observe the JIT
+ * state. The `JSContext` is passed in as `cx_`, and we name the
+ * `VMFrame` local `cx` in the macro that invokes it; this `cx` then
+ * has an implicit conversion to a `JSContext*` value and reveals the
+ * real context.
+ *
+ * Interpreter Loops
+ * -----------------
+ *
+ * There are two interpreter loops: the JSOp interpreter and the
+ * CacheIR interpreter. These closely correspond to (i) the blinterp
+ * body that is generated at startup for the native baseline
+ * interpreter, and (ii) an interpreter version of the code generated
+ * by the `BaselineCacheIRCompiler`, respectively.
+ *
+ * Execution begins in the JSOp interpreter, and for any op(*) that
+ * has an IC site (`JOF_IC` flag), we invoke the IC interpreter. The
+ * IC interpreter runs a loop that traverses the IC stub chain, either
+ * reaching CacheIR bytecode and executing it in a virtual machine, or
+ * reaching the fallback stub and executing that (likely pushing an
+ * exit frame and calling into the rest of SpiderMonkey).
+ *
+ * (*) As an optimization, some opcodes that would have IC sites in
+ * native baseline skip their IC chains and run generic code instead
+ * in PBL. See "Hybrid IC mode" below for more details.
+ *
+ * IC Interpreter State
+ * --------------------
+ *
+ * While the JS opcode interpreter's abstract machine model and its
+ * mapping of those abstract semantics to real machine state are
+ * well-defined (by the other baseline tiers), the IC interpreter's
+ * mapping is less so. When executing in native baseline tiers,
+ * CacheIR is compiled to machine code that undergoes register
+ * allocation and several optimizations (e.g., handling constants
+ * specially, and eliding type-checks on values when we know their
+ * actual types). No other interpreter for CacheIR exists, so we get
+ * to define how we map the semantics to interpreter state.
+ *
+ * We choose to keep an array of uint64_t values as "virtual
+ * registers", each corresponding to a particular OperandId, and we
+ * store the same values that would exist in the native machine
+ * registers. In other words, we do not do any sort of register
+ * allocation or reclamation of storage slots, because we don't have
+ * any lookahead in the interpreter. We rely on the typesafe writer
+ * API, with newtype'd wrappers for different kinds of values
+ * (`ValOperandId`, `ObjOperandId`, `Int32OperandId`, etc.), producing
+ * typesafe CacheIR bytecode, in order to properly store and interpret
+ * unboxed values in the virtual registers.
+ *
+ * There are several subtle details usually handled by register
+ * allocation in the CacheIR compilers that need to be handled here
+ * too, mainly around input arguments and restoring state when
+ * chaining to the next IC stub. IC callsites place inputs into the
+ * first N OperandId registers directly, corresponding to what the
+ * CacheIR expects. There are some CacheIR opcodes that mutate their
+ * argument in-place (e.g., guarding that a Value is an Object strips
+ * the tag-bits from the Value and turns it into a raw pointer), so we
+ * cannot rely on these remaining unmodified if we need to invoke the
+ * next IC in the chain; instead, we save and restore the first N
+ * values in the chain-walking loop (according to the arity of the IC
+ * kind).
+ *
+ * Optimizations
+ * ------------
+ *
+ * There are several implementation details that are critical for
+ * performance, and thus should be carefully maintained or verified
+ * with any changes:
+ *
+ * - Caching values in locals: in order to be competitive with "native
+ *   baseline interpreter", which has the advantage of using machine
+ *   registers for commonly-accessed values such as the
+ *   top-of-operand-stack and the JS opcode PC, we are careful to
+ *   ensure that the C++ compiler can keep these values in registers
+ *   in PBL as well. One might naively store `pc`, `sp`, `fp`, and the
+ *   like in a context struct (of "virtual CPU registers") that is
+ *   passed to e.g. the IC interpreter. This would be a mistake: if
+ *   the values exist in memory, the compiler cannot "lift" them to
+ *   locals that can live in registers, and so every push and pop (for
+ *   example) performs a store. This overhead is significant,
+ *   especially when executing more "lightweight" opcodes.
+ *
+ *   We make use of an important property -- the balanced-stack
+ *   invariant -- so that we can pass SP *into* calls but not take an
+ *   updated SP *from* them. When invoking an IC, we expect that when
+ *   it returns, SP will be at the same location (one could think of
+ *   SP as a "callee-saved register", though it's not usually
+ *   described that way). Thus, we can avoid a dependency on a value
+ *   that would have to be passed back through memory.
+ *
+ * - Hybrid IC mode: the fact that we *interpret* ICs now means that
+ *   they are more expensive to invoke. Whereas a small IC that guards
+ *   two int32 arguments, performs an int32 add, and returns might
+ *   have been a handful of instructions before, and the call/ret pair
+ *   would have been very fast (and easy to predict) instructions at
+ *   the machine level, the setup and context transition and the
+ *   CacheIR opcode dispatch overhead would likely be much slower than
+ *   a generic "if both int32, add" fastpath in the interpreter case
+ *   for `JSOp::Add`.
+ *
+ *   We thus take a hybrid approach, and include these static
+ *   fastpaths for what would have been ICs in "native
+ *   baseline". These are enabled by the `kHybridICs` global and may
+ *   be removed in the future (transitioning back to ICs) if/when we
+ *   can reduce the cost of interpreted ICs further.
+ *
+ *   Right now, calls and property accesses use ICs:
+ *
+ *   - Calls can often be special-cased with CacheIR when intrinsics
+ *     are invoked. For example, a call to `String.length` can turn
+ *     into a CacheIR opcode that directly reads a `JSString`'s length
+ *     field.
+ *   - Property accesses are so frequent, and the shape-lookup path
+ *     is slow enough, that it still makes sense to guard on shape
+ *     and quickly return a particular slot.
+ *
+ * - Static branch prediction for opcode dispatch: we adopt an
+ *   interpreter optimization we call "static branch prediction": when
+ *   one opcode is often followed by another, it is often more
+ *   efficient to check for those specific cases first and branch
+ *   directly to the case for the following opcode, doing the full
+ *   switch otherwise. This is especially true when the indirect
+ *   branches used by `switch` statements or computed gotos are
+ *   expensive on a given platform, such as Wasm.
+ *
+ * - Inlining: on some platforms, calls are expensive, and we want to
+ *   avoid them whenever possible. We have found that it is quite
+ *   important for performance to inline the IC interpreter into the
+ *   JSOp interpreter at IC sites: both functions are quite large,
+ *   with significant local state, and so otherwise, each IC call
+ *   involves a lot of "context switching" as the code generated by
+ *   the C++ compiler saves registers and constructs a new native
+ *   frame. This is certainly a code-size tradeoff, but we have
+ *   optimized for speed here.
+ *
+ * - Amortized stack checks: a naive interpreter implementation would
+ *   check for auxiliary stack overflow on every push. We instead do
+ *   this once when we enter a new JS function frame, using the
+ *   script's precomputed "maximum stack depth" value. We keep a small
+ *   stack margin always available, so that we have enough space to
+ *   push an exit frame and invoke the "over-recursed" helper (which
+ *   throws an exception) when we would otherwise overflow. The stack
+ *   checks take this margin into account, failing if there would be
+ *   less than the margin available at any point in the called
+ *   function.
+ *
+ * - Fastpaths for calls and returns: we are able to push and pop JS
+ *   stack frames while remaining in one native (C++ interpreter
+ *   function) frame, just as the C++ interpreter does. This means
+ *   that there is a one-to-many mapping from native stack frame to JS
+ *   stack frame. This does create some complications at points that
+ *   pop frames: we might remain in the same C++ frame, or we might
+ *   return at the C++ level. We handle this in a unified way for
+ *   returns and exception unwinding as described below.
+ *
+ * Unwinding
+ * ---------
+ *
+ * Because one C++ interpreter frame can correspond to multiple JS
+ * frames, we need to disambiguate the two cases whenever leaving a
+ * frame: we may need to return, or we may stay in the current
+ * function and dispatch the next opcode at the caller's next PC.
+ *
+ * Exception unwinding complicates this further. PBL uses the same
+ * exception-handling code that native baseline does, and this code
+ * computes a `ResumeFromException` struct that tells us what our new
+ * stack pointer and frame pointer must be. These values could be
+ * arbitrarily far "up" the stack in the current activation. It thus
+ * wouldn't be sufficient to count how many JS frames we have, and
+ * return at the C++ level when this reaches zero: we need to "unwind"
+ * the C++ frames until we reach the appropriate JS frame.
+ *
+ * To solve both issues, we remember the "entry frame" when we enter a
+ * new invocation of `PortableBaselineInterpret()`, and when returning
+ * or unwinding, if the new frame is *above* this entry frame, we
+ * return. We have an enum `PBIResult` that can encode, when
+ * unwinding, *which* kind of unwinding we are doing, because when we
+ * do eventually reach the C++ frame that owns the newly active JS
+ * frame, we may resume into a different action depending on this
+ * information.
+ *
+ * Completeness
+ * ------------
+ *
+ * Whenever a new JSOp is added, the opcode needs to be added to
+ * PBL. The compiler should enforce this: if no case is implemented
+ * for an opcode, then the label in the computed-goto table will be
+ * missing and PBL will not compile.
+ *
+ * In contrast, CacheIR opcodes need not be implemented right away,
+ * and in fact right now most of the less-common ones are not
+ * implemented by PBL. If the IC interpreter hits an unimplemented
+ * opcode, it acts as if a guard had failed, and transfers to the next
+ * stub in the chain. Every chain ends with a fallback stub that can
+ * handle every case (it does not execute CacheIR at all, but instead
+ * calls into the runtime), so this will always give the correct
+ * result, albeit more slowly. Implementing the remainder of the
+ * CacheIR opcodes, and new ones as they are added, is thus purely a
+ * performance concern.
+ *
+ * PBL currently does not implement async resume into a suspended
+ * generator. There is no particular reason that this cannot be
+ * implemented; it just has not been done yet. Such an action will
+ * currently call back into the C++ interpreter to run the resumed
+ * generator body. Execution up to the first yield-point can still
+ * occur in PBL, and PBL can successfully save the suspended state.
+ */
+
+#include "jspubtd.h"
+
+#include "jit/BaselineFrame.h"
+#include "jit/BaselineIC.h"
+#include "jit/JitContext.h"
+#include "jit/JitScript.h"
+#include "vm/Interpreter.h"
+#include "vm/Stack.h"
+
+namespace js {
+namespace pbl {
+
+// Trampoline invoked by EnterJit that sets up PBL state and invokes
+// the main interpreter loop.
+bool PortableBaselineTrampoline(JSContext* cx, size_t argc, Value* argv,
+                                size_t numActuals, size_t numFormals,
+                                jit::CalleeToken calleeToken,
+                                JSObject* envChain, Value* result);
+
+// Predicate: are all conditions satisfied to allow execution within
+// PBL? This depends only on properties of the function to be invoked,
+// and not on other runtime state, like the current stack depth, so if
+// it returns `true` once, it can be assumed to always return `true`
+// for that function. See `PortableBaselineInterpreterStackCheck`
+// below for a complementary check that does not have this property.
+jit::MethodStatus CanEnterPortableBaselineInterpreter(JSContext* cx,
+                                                      RunState& state);
+
+// A check for available stack space on the PBL auxiliary stack that
+// is invoked before the main trampoline. This is required for entry
+// into PBL and should be checked before invoking the trampoline
+// above. Unlike `CanEnterPortableBaselineInterpreter`, the result of
+// this check cannot be cached: it must be checked on each potential
+// entry.
+bool PortablebaselineInterpreterStackCheck(JSContext* cx, RunState& state,
+                                           size_t numActualArgs);
+
+struct State;
+struct Stack;
+struct StackVal;
+struct StackValNative;
+struct ICRegs;
+class VMFrameManager;
+
+enum class PBIResult {
+  Ok,
+  Error,
+  Unwind,
+  UnwindError,
+  UnwindRet,
+};
+
+PBIResult PortableBaselineInterpret(JSContext* cx_, State& state, Stack& stack,
+                                    StackVal* sp, JSObject* envChain,
+                                    Value* ret);
+
+enum class ICInterpretOpResult {
+  NextIC,
+  Return,
+  Error,
+  Unwind,
+  UnwindError,
+  UnwindRet,
+};
+
+ICInterpretOpResult MOZ_ALWAYS_INLINE
+ICInterpretOps(jit::BaselineFrame* frame, VMFrameManager& frameMgr,
+               State& state, ICRegs& icregs, Stack& stack, StackVal* sp,
+               jit::ICCacheIRStub* cstub, jsbytecode* pc);
+
+} /* namespace pbl */
+} /* namespace js */
+
+#endif /* vm_PortableBaselineInterpret_h */
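
Reviewer note on the "IC Interpreter State" and "Completeness" sections: the chain-walking behavior described there can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the code in this patch. Every name in it (`ToyOp`, `ToyStub`, `ToyCallIC`, `kToyTag`) is hypothetical, the "tag" is a single made-up bit, and the IC arity is fixed at two inputs. It shows two things the comment claims: inputs are restored before each stub because a guard may strip the tag in place, and an unimplemented opcode behaves like a failed guard so the chain ends at the fallback.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy model; these are not SpiderMonkey types or opcodes.
constexpr uint64_t kToyTag = 1ull << 63;  // made-up "boxed value" tag bit
constexpr int kToyNumRegs = 8;            // virtual registers, one per operand id

enum class ToyOp { StripTag, GuardSmall, AddReturn, Unimplemented };
struct ToyInsn {
  ToyOp op;
  int reg;
};
struct ToyStub {
  std::vector<ToyInsn> code;
};
enum class ToyStubResult { NextIC, Return };
struct ToyOutcome {
  uint64_t value;
  bool usedFallback;
};

// Interpret one stub. A failed guard or an unimplemented opcode both
// yield NextIC, so the caller moves on to the next stub in the chain.
inline ToyStubResult ToyRunStub(const ToyStub& stub, uint64_t* regs,
                                uint64_t* out) {
  for (const ToyInsn& insn : stub.code) {
    switch (insn.op) {
      case ToyOp::StripTag:
        if (!(regs[insn.reg] & kToyTag)) return ToyStubResult::NextIC;
        regs[insn.reg] &= ~kToyTag;  // mutates the input register in place
        break;
      case ToyOp::GuardSmall:
        if (regs[insn.reg] >= 100) return ToyStubResult::NextIC;
        break;
      case ToyOp::AddReturn:
        *out = regs[0] + regs[1];
        return ToyStubResult::Return;
      case ToyOp::Unimplemented:
        return ToyStubResult::NextIC;
    }
  }
  return ToyStubResult::NextIC;
}

// Walk the chain, restoring the two inputs before every stub. The
// fallback at the end handles every case without running any stub code.
inline ToyOutcome ToyCallIC(const std::vector<ToyStub>& chain, uint64_t a,
                            uint64_t b) {
  uint64_t regs[kToyNumRegs] = {};
  const uint64_t saved[2] = {a, b};
  for (const ToyStub& stub : chain) {
    regs[0] = saved[0];
    regs[1] = saved[1];
    uint64_t out = 0;
    if (ToyRunStub(stub, regs, &out) == ToyStubResult::Return) {
      return {out, false};
    }
  }
  return {(saved[0] & ~kToyTag) + (saved[1] & ~kToyTag), true};
}
```

In this model, dropping the two restore assignments would make a second stub see an already-untagged `regs[0]`, fail its own `StripTag` guard, and fall through to the slower fallback even though it could have handled the inputs.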
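
Reviewer note on "Amortized stack checks": the arithmetic behind a once-per-frame check can be sketched as below. This is a simplified illustration, not the check this patch performs. `ToyAuxStack`, `ToyCheckOnFrameEntry`, and the margin of 32 words are assumptions made for the example; the real margin and frame sizes are not stated in the header.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy model of a once-per-frame stack check.
constexpr std::size_t kToyStackMarginWords = 32;  // assumed reserve for an exit frame

struct ToyAuxStack {
  std::size_t capacityWords;
  std::size_t usedWords;
  std::size_t freeWords() const { return capacityWords - usedWords; }
};

// Performed once on function entry: the frame header, the script's
// precomputed maximum operand depth, and the margin must all fit.
// After this succeeds, individual pushes inside the function need no
// further overflow checks.
inline bool ToyCheckOnFrameEntry(const ToyAuxStack& stack,
                                 std::size_t frameHeaderWords,
                                 std::size_t maxOperandDepthWords) {
  std::size_t needed =
      frameHeaderWords + maxOperandDepthWords + kToyStackMarginWords;
  return stack.freeWords() >= needed;
}
```

Because the margin is included in `needed`, a failing check still leaves room to push an exit frame and report the over-recursion, which matches the intent described in the comment.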
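
Reviewer note on the `VMFrame` idiom from the "Auxiliary Stack" section: the name-hiding trick can be shown in isolation. The sketch below is hypothetical and much simpler than the real helper. `ToyContext`, `ToyVMFrame`, and `ToyRuntimeCall` are invented for illustration, and "pushing an exit frame" is reduced to recording a pointer. The point is only that the raw context arrives as `cx_`, so code that passes `cx` to the runtime has to construct the RAII guard first.

```cpp
#include <cassert>

// Hypothetical toy model; not the VMFrame class from this patch.
struct ToyContext {
  void* exitFrame = nullptr;  // stands in for the exit-frame pointer in VM state
};

// RAII guard: records an exit frame on construction and restores the
// previous one on destruction. It converts implicitly to ToyContext*,
// so a local named `cx` of this type can be passed wherever a context
// is expected.
class ToyVMFrame {
 public:
  ToyVMFrame(ToyContext* cx, void* frame) : cx_(cx), prev_(cx->exitFrame) {
    cx_->exitFrame = frame;
  }
  ~ToyVMFrame() { cx_->exitFrame = prev_; }
  operator ToyContext*() const { return cx_; }

 private:
  ToyContext* cx_;
  void* prev_;
};

// Stands in for a runtime function that may walk the activation; it
// reports whether an exit frame was in place when it was called.
inline bool ToyRuntimeCall(ToyContext* cx) { return cx->exitFrame != nullptr; }

// The interpreter receives the context as `cx_`. Writing `cx` here
// only compiles once the guard named `cx` exists.
inline bool ToyInterpretOne(ToyContext* cx_, void* fp) {
  ToyVMFrame cx(cx_, fp);
  return ToyRuntimeCall(cx);
}
```

The enforcement is by convention rather than by the type system: nothing stops a caller from passing `cx_` directly, which is why the source comment describes it as "a little compile-time enforcement".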