/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*-
 * vim: set ts=8 sts=2 et sw=2 tw=80:
 * This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

#ifndef vm_PortableBaselineInterpret_h
#define vm_PortableBaselineInterpret_h

/*
 * [SMDOC] Portable Baseline Interpreter
 * =====================================
 *
 * The Portable Baseline Interpreter (PBL) is a portable interpreter
 * that supports executing ICs by directly interpreting CacheIR.
 *
 * This interpreter tier fits into the hierarchy between the C++
 * interpreter, which is fully generic and does not specialize with
 * ICs, and the native baseline interpreter, which does attach and
 * execute ICs but requires native codegen (JIT). The distinguishing
 * feature of PBL is that it *does not require codegen*: it can run on
 * any platform for which SpiderMonkey supports an interpreter-only
 * build. This is useful both for platforms that do not support
 * runtime addition of new code (e.g., running within a WebAssembly
 * module with a `wasm32-wasi` build) and for platforms that disallow
 * it for security reasons.
 *
 * The main idea of PBL is to emulate, as much as possible, how the
 * native baseline interpreter works, so that the rest of the engine
 * can work the same either way. The main aspect of this "emulation"
 * comes with stack frames: unlike the native blinterp and JIT tiers,
 * we cannot use the machine stack, because we are still executing in
 * portable C++ code and the platform's C++ compiler controls the
 * machine stack's layout. Instead, we use an auxiliary stack.
 *
 * Auxiliary Stack
 * ---------------
 *
 * PBL creates baseline stack frames (see `BaselineFrame` and related
 * structs) on an *auxiliary stack*, contiguous memory allocated and
 * owned by the JSRuntime.
 *
 * This stack operates nearly identically to the machine stack: it
 * grows downward, we push stack frames, we maintain a linked list of
 * frame pointers, and a series of contiguous frames form a
 * `JitActivation`, with the most recent activation reachable from the
 * `JSContext`. The only actual difference is that the return address
 * slots in the frame layouts are always null pointers, because there
 * is no need to save return addresses: we always know where we are
 * going to return to (either post-IC code -- the return point of
 * which is known because we actually do a C++-level call from the
 * JSOp interpreter to the IC interpreter -- or to dispatch the next
 * JSOp).
 *
 * The same invariants as for native baseline code apply here: when we
 * are in `PortableBaselineInterpret` (the PBL interpreter body) or
 * `ICInterpretOps` (the IC interpreter) or related helpers, it is as
 * if we are in JIT code, and local state determines the top-of-stack
 * and innermost frame. The activation is not "finished" and cannot be
 * traversed. When we need to call into the rest of SpiderMonkey, we
 * emulate how that would work in JIT code, via an exit frame (that
 * would ordinarily be pushed by a trampoline) and saving that frame
 * as the exit-frame pointer in the VM state.
 *
 * To add a little compile-time enforcement of this strategy, and
 * ensure that we don't accidentally call something that will want to
 * traverse the (in-progress and not-completed) JIT activation, we use
 * a helper class `VMFrame` that pushes and pops the exit frame,
 * wrapping the callsite into the rest of SM with an RAII idiom. Then,
 * we *hide the `JSContext`*, and rely on the idiom that `cx` is
 * passed to anything that can GC or otherwise observe the JIT
 * state. The `JSContext` is passed in as `cx_`, and we name the
 * `VMFrame` local `cx` in the macro that invokes it; this `cx` then
 * has an implicit conversion to a `JSContext*` value and reveals the
 * real context.
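 *
 * The context-hiding idiom described above can be sketched roughly as
 * follows. This is a minimal, self-contained illustration and not the
 * actual implementation: `Context`, `RuntimeHelper`, `ExitFrameGuard`,
 * and `InterpretOneCall` are hypothetical stand-ins for `JSContext`, a
 * call into the rest of the engine, `VMFrame`, and the interpreter
 * body, respectively, and the real exit-frame bookkeeping is more
 * involved than a single pointer.
 *

```cpp
#include <cassert>

// Hypothetical stand-in for the real context type.
struct Context {
  const void* exitFrame = nullptr;  // null while we are "in JIT code"
};

// Hypothetical runtime helper: like anything that can GC, it takes the
// real context and expects an exit frame to have been recorded.
inline bool RuntimeHelper(Context* cx) { return cx->exitFrame != nullptr; }

// RAII guard modeled on the described `VMFrame`: it records the exit
// frame on construction, restores the previous value on destruction,
// and converts implicitly to the real context pointer.
class ExitFrameGuard {
 public:
  ExitFrameGuard(Context* cx, const void* frame)
      : cx_(cx), prev_(cx->exitFrame) {
    cx_->exitFrame = frame;
  }
  ~ExitFrameGuard() { cx_->exitFrame = prev_; }
  ExitFrameGuard(const ExitFrameGuard&) = delete;
  ExitFrameGuard& operator=(const ExitFrameGuard&) = delete;

  operator Context*() const { return cx_; }

 private:
  Context* cx_;
  const void* prev_;
};

// The interpreter body receives the context as `cx_`, so the only
// thing named `cx` is the guard, and only inside the guarded block.
inline bool InterpretOneCall(Context* cx_, const void* currentFrame) {
  bool ok;
  {
    ExitFrameGuard cx(cx_, currentFrame);
    ok = RuntimeHelper(cx);  // implicit conversion reveals the context
  }
  // Back "in JIT code": the exit frame has been popped again.
  return ok && cx_->exitFrame == nullptr;
}
```

 *
 * In this sketch, a call that passes `cx` outside the guarded block
 * does not compile, because no such name is in scope there; that is
 * the kind of compile-time enforcement the paragraph above refers to.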
 *
 * Interpreter Loops
 * -----------------
 *
 * There are two interpreter loops: the JSOp interpreter and the
 * CacheIR interpreter. These closely correspond to (i) the blinterp
 * body that is generated at startup for the native baseline
 * interpreter, and (ii) an interpreter version of the code generated
 * by the `BaselineCacheIRCompiler`, respectively.
 *
 * Execution begins in the JSOp interpreter, and for any op(*) that
 * has an IC site (`JOF_IC` flag), we invoke the IC interpreter. The
 * IC interpreter runs a loop that traverses the IC stub chain, either
 * reaching CacheIR bytecode and executing it in a virtual machine, or
 * reaching the fallback stub and executing that (likely pushing an
 * exit frame and calling into the rest of SpiderMonkey).
 *
 * (*) As an optimization, some opcodes that would have IC sites in
 * native baseline skip their IC chains and run generic code instead
 * in PBL. See "Hybrid IC mode" below for more details.
 *
 * IC Interpreter State
 * --------------------
 *
 * While the JS opcode interpreter's abstract machine model and its
 * mapping of those abstract semantics to real machine state are
 * well-defined (by the other baseline tiers), the IC interpreter's
 * mapping is less so. When executing in native baseline tiers,
 * CacheIR is compiled to machine code that undergoes register
 * allocation and several optimizations (e.g., handling constants
 * specially, and eliding type-checks on values when we know their
 * actual types). No other interpreter for CacheIR exists, so we get
 * to define how we map the semantics to interpreter state.
 *
 * We choose to keep an array of uint64_t values as "virtual
 * registers", each corresponding to a particular OperandId, and we
 * store the same values that would exist in the native machine
 * registers. In other words, we do not do any sort of register
 * allocation or reclamation of storage slots, because we don't have
 * any lookahead in the interpreter.
 * We rely on the typesafe writer
 * API, with newtype'd wrappers for different kinds of values
 * (`ValOperandId`, `ObjOperandId`, `Int32OperandId`, etc.), producing
 * typesafe CacheIR bytecode, in order to properly store and interpret
 * unboxed values in the virtual registers.
 *
 * There are several subtle details usually handled by register
 * allocation in the CacheIR compilers that need to be handled here
 * too, mainly around input arguments and restoring state when
 * chaining to the next IC stub. IC callsites place inputs into the
 * first N OperandId registers directly, corresponding to what the
 * CacheIR expects. There are some CacheIR opcodes that mutate their
 * argument in-place (e.g., guarding that a Value is an Object strips
 * the tag-bits from the Value and turns it into a raw pointer), so we
 * cannot rely on these remaining unmodified if we need to invoke the
 * next IC in the chain; instead, we save and restore the first N
 * values in the chain-walking loop (according to the arity of the IC
 * kind).
 *
 * Optimizations
 * -------------
 *
 * There are several implementation details that are critical for
 * performance, and thus should be carefully maintained or verified
 * with any changes:
 *
 * - Caching values in locals: in order to be competitive with "native
 *   baseline interpreter", which has the advantage of using machine
 *   registers for commonly-accessed values such as the
 *   top-of-operand-stack and the JS opcode PC, we are careful to
 *   ensure that the C++ compiler can keep these values in registers
 *   in PBL as well. One might naively store `pc`, `sp`, `fp`, and the
 *   like in a context struct (of "virtual CPU registers") that is
 *   passed to e.g. the IC interpreter. This would be a mistake: if
 *   the values exist in memory, the compiler cannot "lift" them to
 *   locals that can live in registers, and so every push and pop (for
 *   example) performs a store. This overhead is significant,
 *   especially when executing more "lightweight" opcodes.
 *
 *   We make use of an important property -- the balanced-stack
 *   invariant -- so that we can pass SP *into* calls but not take an
 *   updated SP *from* them. When invoking an IC, we expect that when
 *   it returns, SP will be at the same location (one could think of
 *   SP as a "callee-saved register", though it's not usually
 *   described that way). Thus, we can avoid a dependency on a value
 *   that would have to be passed back through memory.
 *
 * - Hybrid IC mode: the fact that we *interpret* ICs now means that
 *   they are more expensive to invoke. Whereas a small IC that guards
 *   two int32 arguments, performs an int32 add, and returns might
 *   have been a handful of instructions before, and the call/ret pair
 *   would have been very fast (and easy to predict) instructions at
 *   the machine level, the setup and context transition and the
 *   CacheIR opcode dispatch overhead would likely be much slower than
 *   a generic "if both int32, add" fastpath in the interpreter case
 *   for `JSOp::Add`.
 *
 *   We thus take a hybrid approach, and include these static
 *   fastpaths for what would have been ICs in "native
 *   baseline". These are enabled by the `kHybridICs` global and may
 *   be removed in the future (transitioning back to ICs) if/when we
 *   can reduce the cost of interpreted ICs further.
 *
 *   Right now, calls and property accesses use ICs:
 *
 *   - Calls can often be special-cased with CacheIR when intrinsics
 *     are invoked. For example, a call to `String.length` can turn
 *     into a CacheIR opcode that directly reads a `JSString`'s length
 *     field.
 *   - Property accesses are so frequent, and the shape-lookup path
 *     is slow enough, that it still makes sense to guard on shape
 *     and quickly return a particular slot.
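 *
 *   The "if both int32, add" fastpath mentioned above could look
 *   roughly like the following. This is only a sketch under stated
 *   assumptions: `SimpleValue`, `kUseFastPaths`, and `TryInt32Add` are
 *   hypothetical names (the real engine operates on boxed `Value`s and
 *   is gated by `kHybridICs`), and the overflow handling shown here is
 *   one plausible choice rather than the engine's actual behavior.
 *

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplified value representation.
struct SimpleValue {
  bool isInt32;
  int32_t i32;    // meaningful only when isInt32 is true
  double number;  // meaningful only when isInt32 is false
};

// Hypothetical analogue of the `kHybridICs` switch.
constexpr bool kUseFastPaths = true;

// Returns true and writes the sum to `*out` only when both operands
// are int32 and the sum fits in int32. Otherwise returns false, and
// the caller is expected to fall back to the IC chain or generic path.
inline bool TryInt32Add(const SimpleValue& lhs, const SimpleValue& rhs,
                        int32_t* out) {
  if (!kUseFastPaths || !lhs.isInt32 || !rhs.isInt32) {
    return false;
  }
  // Add in 64 bits so that the overflow check itself cannot overflow.
  int64_t wide = int64_t(lhs.i32) + int64_t(rhs.i32);
  if (wide < INT32_MIN || wide > INT32_MAX) {
    return false;
  }
  *out = int32_t(wide);
  return true;
}
```

 *
 *   The point of such a check is that it costs a couple of compares
 *   and one add, with no context transition or opcode dispatch.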
 *
 * - Static branch prediction for opcode dispatch: we adopt an
 *   interpreter optimization we call "static branch prediction": when
 *   one opcode is often followed by another, it is often more
 *   efficient to check for those specific cases first and branch
 *   directly to the case for the following opcode, doing the full
 *   switch otherwise. This is especially true when the indirect
 *   branches used by `switch` statements or computed gotos are
 *   expensive on a given platform, such as Wasm.
 *
 * - Inlining: on some platforms, calls are expensive, and we want to
 *   avoid them whenever possible. We have found that it is quite
 *   important for performance to inline the IC interpreter into the
 *   JSOp interpreter at IC sites: both functions are quite large,
 *   with significant local state, and so otherwise, each IC call
 *   involves a lot of "context switching" as the code generated by
 *   the C++ compiler saves registers and constructs a new native
 *   frame. This is certainly a code-size tradeoff, but we have
 *   optimized for speed here.
 *
 * - Amortized stack checks: a naive interpreter implementation would
 *   check for auxiliary stack overflow on every push. We instead do
 *   this once when we enter a new JS function frame, using the
 *   script's precomputed "maximum stack depth" value. We keep a small
 *   stack margin always available, so that we have enough space to
 *   push an exit frame and invoke the "over-recursed" helper (which
 *   throws an exception) when we would otherwise overflow. The stack
 *   checks take this margin into account, failing if there would be
 *   less than the margin available at any point in the called
 *   function.
 *
 * - Fastpaths for calls and returns: we are able to push and pop JS
 *   stack frames while remaining in one native (C++ interpreter
 *   function) frame, just as the C++ interpreter does. This means
 *   that there is a one-to-many mapping from native stack frame to JS
 *   stack frame.
 *   This does create some complications at points that
 *   pop frames: we might remain in the same C++ frame, or we might
 *   return at the C++ level. We handle this in a unified way for
 *   returns and exception unwinding as described below.
 *
 * Unwinding
 * ---------
 *
 * Because one C++ interpreter frame can correspond to multiple JS
 * frames, we need to disambiguate the two cases whenever leaving a
 * frame: we may need to return, or we may stay in the current
 * function and dispatch the next opcode at the caller's next PC.
 *
 * Exception unwinding complicates this further. PBL uses the same
 * exception-handling code that native baseline does, and this code
 * computes a `ResumeFromException` struct that tells us what our new
 * stack pointer and frame pointer must be. These values could be
 * arbitrarily far "up" the stack in the current activation. It thus
 * wouldn't be sufficient to count how many JS frames we have, and
 * return at the C++ level when this reaches zero: we need to "unwind"
 * the C++ frames until we reach the appropriate JS frame.
 *
 * To solve both issues, we remember the "entry frame" when we enter a
 * new invocation of `PortableBaselineInterpret()`, and when returning
 * or unwinding, if the new frame is *above* this entry frame, we
 * return. We have an enum `PBIResult` that can encode, when
 * unwinding, *which* kind of unwinding we are doing, because when we
 * do eventually reach the C++ frame that owns the newly active JS
 * frame, we may resume into a different action depending on this
 * information.
 *
 * Completeness
 * ------------
 *
 * Whenever a new JSOp is added, the opcode needs to be added to
 * PBL. The compiler should enforce this: if no case is implemented
 * for an opcode, then the label in the computed-goto table will be
 * missing and PBL will not compile.
 *
 * In contrast, CacheIR opcodes need not be implemented right away,
 * and in fact right now most of the less-common ones are not
 * implemented by PBL.
 * If the IC interpreter hits an unimplemented
 * opcode, it acts as if a guard had failed, and transfers to the next
 * stub in the chain. Every chain ends with a fallback stub that can
 * handle every case (it does not execute CacheIR at all, but instead
 * calls into the runtime), so this will always give the correct
 * result, albeit more slowly. Implementing the remainder of the
 * CacheIR opcodes, and new ones as they are added, is thus purely a
 * performance concern.
 *
 * PBL currently does not implement async resume into a suspended
 * generator. There is no particular reason that this cannot be
 * implemented; it just has not been done yet. Such an action will
 * currently call back into the C++ interpreter to run the resumed
 * generator body. Execution up to the first yield-point can still
 * occur in PBL, and PBL can successfully save the suspended state.
 */

#include "jspubtd.h"

#include "jit/BaselineFrame.h"
#include "jit/BaselineIC.h"
#include "jit/JitContext.h"
#include "jit/JitScript.h"
#include "vm/Interpreter.h"
#include "vm/Stack.h"

namespace js {
namespace pbl {

// Trampoline invoked by EnterJit that sets up PBL state and invokes
// the main interpreter loop.
bool PortableBaselineTrampoline(JSContext* cx, size_t argc, Value* argv,
                                size_t numActuals, size_t numFormals,
                                jit::CalleeToken calleeToken,
                                JSObject* envChain, Value* result);

// Predicate: are all conditions satisfied to allow execution within
// PBL? This depends only on properties of the function to be invoked,
// and not on other runtime state, like the current stack depth, so if
// it returns `true` once, it can be assumed to always return `true`
// for that function. See `PortableBaselineInterpreterStackCheck`
// below for a complementary check that does not have this property.
jit::MethodStatus CanEnterPortableBaselineInterpreter(JSContext* cx,
                                                      RunState& state);

// A check for available stack space on the PBL auxiliary stack that
// is invoked before the main trampoline.
// This is required for entry
// into PBL and should be checked before invoking the trampoline
// above. Unlike `CanEnterPortableBaselineInterpreter`, the result of
// this check cannot be cached: it must be checked on each potential
// entry.
bool PortableBaselineInterpreterStackCheck(JSContext* cx, RunState& state,
                                           size_t numActualArgs);

struct State;
struct Stack;
struct StackVal;
struct StackValNative;
struct ICRegs;
class VMFrameManager;

enum class PBIResult {
  Ok,
  Error,
  Unwind,
  UnwindError,
  UnwindRet,
};

PBIResult PortableBaselineInterpret(JSContext* cx_, State& state, Stack& stack,
                                    StackVal* sp, JSObject* envChain,
                                    Value* ret);

enum class ICInterpretOpResult {
  NextIC,
  Return,
  Error,
  Unwind,
  UnwindError,
  UnwindRet,
};

ICInterpretOpResult MOZ_ALWAYS_INLINE
ICInterpretOps(jit::BaselineFrame* frame, VMFrameManager& frameMgr,
               State& state, ICRegs& icregs, Stack& stack, StackVal* sp,
               jit::ICCacheIRStub* cstub, jsbytecode* pc);

} /* namespace pbl */
} /* namespace js */

#endif /* vm_PortableBaselineInterpret_h */