js/src/vm/PortableBaselineInterpret.h


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353

/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*-
 * vim: set ts=8 sts=2 et sw=2 tw=80:
 * This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

#ifndef vm_PortableBaselineInterpret_h
#define vm_PortableBaselineInterpret_h

/*
 * [SMDOC] Portable Baseline Interpreter
 * =====================================
 *
 * The Portable Baseline Interpreter (PBL) is a portable interpreter
 * that supports executing ICs by directly interpreting CacheIR.
 *
 * This interpreter tier fits into the hierarchy between the C++
 * interpreter, which is fully generic and does not specialize with
 * ICs, and the native baseline interpreter, which does attach and
 * execute ICs but requires native codegen (JIT). The distinguishing
 * feature of PBL is that it *does not require codegen*: it can run on
 * any platform for which SpiderMonkey supports an interpreter-only
 * build. This is useful both for platforms that do not support
 * runtime addition of new code (e.g., running within a WebAssembly
 * module with a `wasm32-wasi` build) or may disallow it for security
 * reasons.
 *
 * The main idea of PBL is to emulate, as much as possible, how the
 * native baseline interpreter works, so that the rest of the engine
 * can work the same either way. The main aspect of this "emulation"
 * comes with stack frames: unlike the native blinterp and JIT tiers,
 * we cannot use the machine stack, because we are still executing in
 * portable C++ code and the platform's C++ compiler controls the
 * machine stack's layout. Instead, we use an auxiliary stack.
 *
 * Auxiliary Stack
 * ---------------
 *
 * PBL creates baseline stack frames (see `BaselineFrame` and related
 * structs) on an *auxiliary stack*, contiguous memory allocated and
 * owned by the JSRuntime.
 *
 * This stack operates nearly identically to the machine stack: it
 * grows downward, we push stack frames, we maintain a linked list of
 * frame pointers, and a series of contiguous frames form a
 * `JitActivation`, with the most recent activation reachable from the
 * `JSContext`. The only actual difference is that the return address
 * slots in the frame layouts are always null pointers, because there
 * is no need to save return addresses: we always know where we are
 * going to return to (either post-IC code -- the return point of
 * which is known because we actually do a C++-level call from the
 * JSOp interpreter to the IC interpreter -- or to dispatch the next
 * JSOp).
 *
 * The same invariants as for native baseline code apply here: when we
 * are in `PortableBaselineInterpret` (the PBL interpreter body) or
 * `ICInterpretOps` (the IC interpreter) or related helpers, it is as
 * if we are in JIT code, and local state determines the top-of-stack
 * and innermost frame. The activation is not "finished" and cannot be
 * traversed. When we need to call into the rest of SpiderMonkey, we
 * emulate how that would work in JIT code, via an exit frame (that
 * would ordinarily be pushed by a trampoline) and saving that frame
 * as the exit-frame pointer in the VM state.
 *
 * To add a little compile-time enforcement of this strategy, and
 * ensure that we don't accidentally call something that will want to
 * traverse the (in-progress and not-completed) JIT activation, we use
 * a helper class `VMFrame` that pushes and pops the exit frame,
 * wrapping the callsite into the rest of SM with an RAII idiom. Then,
 * we *hide the `JSContext`*, and rely on the idiom that `cx` is
 * passed to anything that can GC or otherwise observe the JIT
 * state. The `JSContext` is passed in as `cx_`, and we name the
 * `VMFrame` local `cx` in the macro that invokes it; this `cx` then
 * has an implicit conversion to a `JSContext*` value and reveals the
 * real context.
 *
 * Interpreter Loops
 * -----------------
 *
 * There are two interpreter loops: the JSOp interpreter and the
 * CacheIR interpreter. These closely correspond to (i) the blinterp
 * body that is generated at startup for the native baseline
 * interpreter, and (ii) an interpreter version of the code generated
 * by the `BaselineCacheIRCompiler`, respectively.
 *
 * Execution begins in the JSOp interpreter, and for any op(*) that
 * has an IC site (`JOF_IC` flag), we invoke the IC interpreter. The
 * IC interpreter runs a loop that traverses the IC stub chain, either
 * reaching CacheIR bytecode and executing it in a virtual machine, or
 * reaching the fallback stub and executing that (likely pushing an
 * exit frame and calling into the rest of SpiderMonkey).
 *
 * (*) As an optimization, some opcodes that would have IC sites in
 * native baseline skip their IC chains and run generic code instead
 * in PBL. See "Hybrid IC mode" below for more details.
 *
 * IC Interpreter State
 * --------------------
 *
 * While the JS opcode interpreter's abstract machine model and its
 * mapping of those abstract semantics to real machine state are
 * well-defined (by the other baseline tiers), the IC interpreter's
 * mapping is less so. When executing in native baseline tiers,
 * CacheIR is compiled to machine code that undergoes register
 * allocation and several optimizations (e.g., handling constants
 * specially, and eliding type-checks on values when we know their
 * actual types). No other interpreter for CacheIR exists, so we get
 * to define how we map the semantics to interpreter state.
 *
 * We choose to keep an array of uint64_t values as "virtual
 * registers", each corresponding to a particular OperandId, and we
 * store the same values that would exist in the native machine
 * registers. In other words, we do not do any sort of register
 * allocation or reclamation of storage slots, because we don't have
 * any lookahead in the interpreter. We rely on the typesafe writer
 * API, with newtype'd wrappers for different kinds of values
 * (`ValOperandId`, `ObjOperandId`, `Int32OperandId`, etc.), producing
 * typesafe CacheIR bytecode, in order to properly store and interpret
 * unboxed values in the virtual registers.
 *
 * There are several subtle details usually handled by register
 * allocation in the CacheIR compilers that need to be handled here
 * too, mainly around input arguments and restoring state when
 * chaining to the next IC stub. IC callsites place inputs into the
 * first N OperandId registers directly, corresponding to what the
 * CacheIR expects. There are some CacheIR opcodes that mutate their
 * argument in-place (e.g., guarding that a Value is an Object strips
 * the tag-bits from the Value and turns it into a raw pointer), so we
 * cannot rely on these remaining unmodified if we need to invoke the
 * next IC in the chain; instead, we save and restore the first N
 * values in the chain-walking loop (according to the arity of the IC
 * kind).
 *
 * Optimizations
 * ------------
 *
 * There are several implementation details that are critical for
 * performance, and thus should be carefully maintained or verified
 * with any changes:
 *
 * - Caching values in locals: in order to be competitive with "native
 *   baseline interpreter", which has the advantage of using machine
 *   registers for commonly-accessed values such as the
 *   top-of-operand-stack and the JS opcode PC, we are careful to
 *   ensure that the C++ compiler can keep these values in registers
 *   in PBL as well. One might naively store `pc`, `sp`, `fp`, and the
 *   like in a context struct (of "virtual CPU registers") that is
 *   passed to e.g. the IC interpreter. This would be a mistake: if
 *   the values exist in memory, the compiler cannot "lift" them to
 *   locals that can live in registers, and so every push and pop (for
 *   example) performs a store. This overhead is significant,
 *   especially when executing more "lightweight" opcodes.
 *
 *   We make use of an important property -- the balanced-stack
 *   invariant -- so that we can pass SP *into* calls but not take an
 *   updated SP *from* them. When invoking an IC, we expect that when
 *   it returns, SP will be at the same location (one could think of
 *   SP as a "callee-saved register", though it's not usually
 *   described that way). Thus, we can avoid a dependency on a value
 *   that would have to be passed back through memory.
 *
 * - Hybrid IC mode: the fact that we *interpret* ICs now means that
 *   they are more expensive to invoke. Whereas a small IC that guards
 *   two int32 arguments, performs an int32 add, and returns might
 *   have been a handful of instructions before, and the call/ret pair
 *   would have been very fast (and easy to predict) instructions at
 *   the machine level, the setup and context transition and the
 *   CacheIR opcode dispatch overhead would likely be much slower than
 *   a generic "if both int32, add" fastpath in the interpreter case
 *   for `JSOp::Add`.
 *
 *   We thus take a hybrid approach, and include these static
 *   fastpaths for what would have been ICs in "native
 *   baseline". These are enabled by the `kHybridICs` global and may
 *   be removed in the future (transitioning back to ICs) if/when we
 *   can reduce the cost of interpreted ICs further.
 *
 *   Right now, calls and property accesses use ICs:
 *
 *   - Calls can often be special-cased with CacheIR when intrinsics
 *     are invoked. For example, a call to `String.length` can turn
 *     into a CacheIR opcode that directly reads a `JSString`'s length
 *     field.
 *   - Property accesses are so frequent, and the shape-lookup path
 *     is slow enough, that it still makes sense to guard on shape
 *     and quickly return a particular slot.
 *
 * - Static branch prediction for opcode dispatch: we adopt an
 *   interpreter optimization we call "static branch prediction": when
 *   one opcode is often followed by another, it is often more
 *   efficient to check for those specific cases first and branch
 *   directly to the case for the following opcode, doing the full
 *   switch otherwise. This is especially true when the indirect
 *   branches used by `switch` statements or computed gotos are
 *   expensive on a given platform, such as Wasm.
 *
 * - Inlining: on some platforms, calls are expensive, and we want to
 *   avoid them whenever possible. We have found that it is quite
 *   important for performance to inline the IC interpreter into the
 *   JSOp interpreter at IC sites: both functions are quite large,
 *   with significant local state, and so otherwise, each IC call
 *   involves a lot of "context switching" as the code generated by
 *   the C++ compiler saves registers and constructs a new native
 *   frame. This is certainly a code-size tradeoff, but we have
 *   optimized for speed here.
 *
 * - Amortized stack checks: a naive interpreter implementation would
 *   check for auxiliary stack overflow on every push. We instead do
 *   this once when we enter a new JS function frame, using the
 *   script's precomputed "maximum stack depth" value. We keep a small
 *   stack margin always available, so that we have enough space to
 *   push an exit frame and invoke the "over-recursed" helper (which
 *   throws an exception) when we would otherwise overflow. The stack
 *   checks take this margin into account, failing if there would be
 *   less than the margin available at any point in the called
 *   function.
 *
 * - Fastpaths for calls and returns: we are able to push and pop JS
 *   stack frames while remaining in one native (C++ interpreter
 *   function) frame, just as the C++ interpreter does. This means
 *   that there is a one-to-many mapping from native stack frame to JS
 *   stack frame. This does create some complications at points that
 *   pop frames: we might remain in the same C++ frame, or we might
 *   return at the C++ level. We handle this in a unified way for
 *   returns and exception unwinding as described below.
 *
 * Unwinding
 * ---------
 *
 * Because one C++ interpreter frame can correspond to multiple JS
 * frames, we need to disambiguate the two cases whenever leaving a
 * frame: we may need to return, or we may stay in the current
 * function and dispatch the next opcode at the caller's next PC.
 *
 * Exception unwinding compilcates this further. PBL uses the same
 * exception-handling code that native baseline does, and this code
 * computes a `ResumeFromException` struct that tells us what our new
 * stack pointer and frame pointer must be. These values could be
 * arbitrarily far "up" the stack in the current activation. It thus
 * wouldn't be sufficient to count how many JS frames we have, and
 * return at the C++ level when this reaches zero: we need to "unwind"
 * the C++ frames until we reach the appropriate JS frame.
 *
 * To solve both issues, we remember the "entry frame" when we enter a
 * new invocation of `PortableBaselineInterpret()`, and when returning
 * or unwinding, if the new frame is *above* this entry frame, we
 * return. We have an enum `PBIResult` that can encode, when
 * unwinding, *which* kind of unwinding we are doing, because when we
 * do eventually reach the C++ frame that owns the newly active JS
 * frame, we may resume into a different action depending on this
 * information.
 *
 * Completeness
 * ------------
 *
 * Whenever a new JSOp is added, the opcode needs to be added to
 * PBL. The compiler should enforce this: if no case is implemented
 * for an opcode, then the label in the computed-goto table will be
 * missing and PBL will not compile.
 *
 * In contrast, CacheIR opcodes need not be implemented right away,
 * and in fact right now most of the less-common ones are not
 * implemented by PBL. If the IC interpreter hits an unimplemented
 * opcode, it acts as if a guard had failed, and transfers to the next
 * stub in the chain. Every chain ends with a fallback stub that can
 * handle every case (it does not execute CacheIR at all, but instead
 * calls into the runtime), so this will always give the correct
 * result, albeit more slowly. Implementing the remainder of the
 * CacheIR opcodes, and new ones as they are added, is thus purely a
 * performance concern.
 *
 * PBL currently does not implement async resume into a suspended
 * generator. There is no particular reason that this cannot be
 * implemented; it just has not been done yet. Such an action will
 * currently call back into the C++ interpreter to run the resumed
 * generator body. Execution up to the first yield-point can still
 * occur in PBL, and PBL can successfully save the suspended state.
 */

#include "jspubtd.h"

#include "jit/BaselineFrame.h"
#include "jit/BaselineIC.h"
#include "jit/JitContext.h"
#include "jit/JitScript.h"
#include "vm/Interpreter.h"
#include "vm/Stack.h"

namespace js {
namespace pbl {

// Trampoline invoked by EnterJit that sets up PBL state and invokes
// the main interpreter loop.
bool PortableBaselineTrampoline(JSContext* cx, size_t argc, Value* argv,
                                size_t numActuals, size_t numFormals,
                                jit::CalleeToken calleeToken,
                                JSObject* envChain, Value* result);

// Predicate: are all conditions satisfied to allow execution within
// PBL? This depends only on properties of the function to be invoked,
// and not on other runtime state, like the current stack depth, so if
// it returns `true` once, it can be assumed to always return `true`
// for that function. See `PortableBaselineInterpreterStackCheck`
// below for a complimentary check that does not have this property.
jit::MethodStatus CanEnterPortableBaselineInterpreter(JSContext* cx,
                                                      RunState& state);

// A check for availbale stack space on the PBL auxiliary stack that
// is invoked before the main trampoline. This is required for entry
// into PBL and should be checked before invoking the trampoline
// above. Unlike `CanEnterPortableBaselineInterpreter`, the result of
// this check cannot be cached: it must be checked on each potential
// entry.
bool PortablebaselineInterpreterStackCheck(JSContext* cx, RunState& state,
                                           size_t numActualArgs);

struct State;
struct Stack;
struct StackVal;
struct StackValNative;
struct ICRegs;
class VMFrameManager;

enum class PBIResult {
  Ok,
  Error,
  Unwind,
  UnwindError,
  UnwindRet,
};

PBIResult PortableBaselineInterpret(JSContext* cx_, State& state, Stack& stack,
                                    StackVal* sp, JSObject* envChain,
                                    Value* ret);

enum class ICInterpretOpResult {
  NextIC,
  Return,
  Error,
  Unwind,
  UnwindError,
  UnwindRet,
};

ICInterpretOpResult MOZ_ALWAYS_INLINE
ICInterpretOps(jit::BaselineFrame* frame, VMFrameManager& frameMgr,
               State& state, ICRegs& icregs, Stack& stack, StackVal* sp,
               jit::ICCacheIRStub* cstub, jsbytecode* pc);

} /* namespace pbl */
} /* namespace js */

#endif /* vm_PortableBaselineInterpret_h */