Diffstat (limited to 'gfx/docs')

 gfx/docs/AdvancedLayers.rst            | 370
 gfx/docs/AsyncPanZoom.rst              | 687
 gfx/docs/AsyncPanZoomArchitecture.png  | bin 0 -> 67837 bytes
 gfx/docs/GraphicsOverview.rst          | 159
 gfx/docs/LayersHistory.rst             |  63
 gfx/docs/OffMainThreadPainting.rst     | 237
 gfx/docs/RenderingOverview.rst         | 384
 gfx/docs/RenderingOverviewBlurTask.png | bin 0 -> 16264 bytes
 gfx/docs/RenderingOverviewDetail.png   | bin 0 -> 148839 bytes
 gfx/docs/RenderingOverviewSimple.png   | bin 0 -> 54981 bytes
 gfx/docs/RenderingOverviewTrees.png    | bin 0 -> 80062 bytes
 gfx/docs/Silk.rst                      | 472
 gfx/docs/SilkArchitecture.png          | bin 0 -> 221047 bytes
 gfx/docs/index.rst                     |  18

 14 files changed, 2390 insertions, 0 deletions
diff --git a/gfx/docs/AdvancedLayers.rst b/gfx/docs/AdvancedLayers.rst new file mode 100644 index 0000000000..b4bcc132cb --- /dev/null +++ b/gfx/docs/AdvancedLayers.rst @@ -0,0 +1,370 @@ +Advanced Layers +=============== + +Advanced Layers is a new method of compositing layers in Gecko. This +document serves as a technical overview and provides a short +walk-through of its source code. + +Overview +-------- + +Advanced Layers attempts to group as many GPU operations as it can into +a single draw call. This is a common technique in GPU-based rendering +called “batching”. It is not always trivial, as a batching algorithm can +easily waste precious CPU resources trying to build optimal draw calls. + +Advanced Layers reuses the existing Gecko layers system as much as +possible. Huge layer trees do not currently scale well (see the future +work section), so opportunities for batching are currently limited +without expending unnecessary resources elsewhere. However, Advanced +Layers has a few benefits: + +- It submits smaller GPU workloads and buffer uploads than the existing + compositor. +- It needs only a single pass over the layer tree. +- It uses occlusion information more intelligently. +- It is easier to add new specialized rendering paths and new layer + types. +- It separates compositing logic from device logic, unlike the existing + compositor. +- It is much faster at rendering 3d scenes or complex layer trees. +- It has experimental code to use the z-buffer for occlusion culling. + +Because of these benefits we hope that it provides a significant +improvement over the existing compositor. + +Advanced Layers uses the acronym “MLG” and “MLGPU” in many places. This +stands for “Mid-Level Graphics”, the idea being that it is optimized for +Direct3D 11-style rendering systems as opposed to Direct3D 12 or Vulkan. + +LayerManagerMLGPU +----------------- + +Advanced layers does not change client-side rendering at all. Content +still uses Direct2D (when possible), and creates identical layer trees +as it would with a normal Direct3D 11 compositor. In fact, Advanced +Layers re-uses all of the existing texture handling and video +infrastructure as well, replacing only the composite-side layer types. + +Advanced Layers does not create a ``LayerManagerComposite`` - instead, +it creates a ``LayerManagerMLGPU``. This layer manager does not have a +``Compositor`` - instead, it has an ``MLGDevice``, which roughly +abstracts the Direct3D 11 API. (The hope is that this API is easily +interchangeable for something else when cross-platform or software +support is needed.) + +``LayerManagerMLGPU`` also dispenses with the old “composite” layers for +new layer types. For example, ``ColorLayerComposite`` becomes +``ColorLayerMLGPU``. Since these layer types implement ``HostLayer``, +they integrate with ``LayerTransactionParent`` as normal composite +layers would. + +Rendering Overview +------------------ + +The steps for rendering are described in more detail below, but roughly +the process is: + +1. Sort layers front-to-back. +2. Create a dependency tree of render targets (called “views”). +3. Accumulate draw calls for all layers in each view. +4. Upload draw call buffers to the GPU. +5. Execute draw commands for each view. + +Advanced Layers divides the layer tree into “views” +(``RenderViewMLGPU``), which correspond to a render target. The root +layer is represented by a view corresponding to the screen. Layers that +require intermediate surfaces have temporary views. 
Layers are analyzed +front-to-back, and rendered back-to-front within a view. Views +themselves are rendered front-to-back, to minimize render target +switching. + +Each view contains one or more rendering passes (``RenderPassMLGPU``). A +pass represents a single draw command with one or more rendering items +attached to it. For example, a ``SolidColorPass`` item contains a +rectangle and an RGBA value, and many of these can be drawn with a +single GPU call. + +When considering a layer, views will first try to find an existing +rendering batch that can support it. If so, that pass will accumulate +another draw item for the layer. Otherwise, a new pass will be added. + +When trying to find a matching pass for a layer, there is a tradeoff in +CPU time versus the GPU time saved by not issuing another draw commands. +We generally care more about CPU time, so we do not try too hard in +matching items to an existing batch. + +After all layers have been processed, there is a “prepare” step. This +copies all accumulated draw data and uploads it into vertex and constant +buffers in the GPU. + +Finally, we execute rendering commands. At the end of the frame, all +batches and (most) constant buffers are thrown away. + +Shaders Overview +---------------- + +Advanced Layers currently has five layer-related shader pipelines: + +- Textured (PaintedLayer, ImageLayer, CanvasLayer) +- ComponentAlpha (PaintedLayer with component-alpha) +- YCbCr (ImageLayer with YCbCr video) +- Color (ColorLayers) +- Blend (ContainerLayers with mix-blend modes) + +There are also three special shader pipelines: + +- MaskCombiner, which is used to combine mask layers into a single + texture. +- Clear, which is used for fast region-based clears when not directly + supported by the GPU. +- Diagnostic, which is used to display the diagnostic overlay texture. + +The layer shaders follow a unified structure. Each pipeline has a vertex +and pixel shader. The vertex shader takes a layers ID, a z-buffer depth, +a unit position in either a unit square or unit triangle, and either +rectangular or triangular geometry. Shaders can also have ancillary data +needed like texture coordinates or colors. + +Most of the time, layers have simple rectangular clips with simple +rectilinear transforms, and pixel shaders do not need to perform masking +or clipping. For these layers we use a fast-path pipeline, using +unit-quad shaders that are able to clip geometry so the pixel shader +does not have to. This type of pipeline does not support complex masks. + +If a layer has a complex mask, a rotation or 3d transform, or a complex +operation like blending, then we use shaders capable of handling +arbitrary geometry. Their input is a unit triangle, and these shaders +are generally more expensive. + +All of the shader-specific data is modelled in ShaderDefinitionsMLGPU.h. + +CPU Occlusion Culling +--------------------- + +By default, Advanced Layers performs occlusion culling on the CPU. Since +layers are visited front-to-back, this is simply a matter of +accumulating the visible region of opaque layers, and subtracting it +from the visible region of subsequent layers. There is a major +difference between this occlusion culling and PostProcessLayers of the +old compositor: AL performs culling after invalidation, not before. +Completely valid layers will have an empty visible region. + +Most layer types (with the exception of images) will intelligently split +their draw calls into a batch of individual rectangles, based on their +visible region. 
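
For illustration, a minimal sketch of this front-to-back culling pass might
look like the following. The ``Region`` type here is a deliberately simplified
stand-in (a set of covered cells) for Gecko's ``nsIntRegion``, and the
``Layer`` fields are illustrative rather than the real MLGPU interfaces::

    // Sketch of front-to-back CPU occlusion culling. "Region" is a toy
    // stand-in for nsIntRegion so the example stays self-contained.
    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <vector>

    using Region = std::set<int>;  // set of covered "cells"

    struct Layer {
      Region visible;       // visible region of this layer
      bool opaque = false;  // whether the layer's content is opaque
    };

    void CullOccluded(std::vector<Layer>& aLayersFrontToBack) {
      Region occluded;  // union of opaque regions seen so far
      for (Layer& layer : aLayersFrontToBack) {
        // Remove everything already covered by opaque layers in front of us;
        // fully occluded layers end up with an empty visible region and
        // therefore need no draw calls.
        Region remaining;
        std::set_difference(layer.visible.begin(), layer.visible.end(),
                            occluded.begin(), occluded.end(),
                            std::inserter(remaining, remaining.begin()));
        layer.visible = std::move(remaining);

        // Opaque layers occlude whatever is behind them.
        if (layer.opaque) {
          occluded.insert(layer.visible.begin(), layer.visible.end());
        }
      }
    }

The real code works on pixel regions and, as noted above, runs after
invalidation rather than before it.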
+ +Z-Buffering and Occlusion +------------------------- + +Advanced Layers also supports occlusion culling on the GPU, using a +z-buffer. This is disabled by default currently since it is +significantly costly on integrated GPUs. When using the z-buffer, we +separate opaque layers into a separate list of passes. The render +process then uses the following steps: + +1. The depth buffer is set to read-write. +2. Opaque batches are executed., +3. The depth buffer is set to read-only. +4. Transparent batches are executed. + +The problem we have observed is that the depth buffer increases writes +to the GPU, and on integrated GPUs this is expensive - we have seen draw +call times increase by 20-30%, which is the wrong direction we want to +take on battery life. In particular on a full screen video, the call to +ClearDepthStencilView plus the actual depth buffer write of the video +can double GPU time. + +For now the depth-buffer is disabled until we can find a compelling case +for it on non-integrated hardware. + +Clipping +-------- + +Clipping is a bit tricky in Advanced Layers. We cannot use the hardware +“scissor” feature, since the clip can change from instance to instance +within a batch. And if using the depth buffer, we cannot write +transparent pixels for the clipped area. As a result we always clip +opaque draw rects in the vertex shader (and sometimes even on the CPU, +as is needed for sane texture coordinates). Only transparent items are +clipped in the pixel shader. As a result, masked layers and layers with +non-rectangular transforms are always considered transparent, and use a +more flexible clipping pipeline. + +Plane Splitting +--------------- + +Plane splitting is when a 3D transform causes a layer to be split - for +example, one transparent layer may intersect another on a separate +plane. When this happens, Gecko sorts layers using a BSP tree and +produces a list of triangles instead of draw rects. + +These layers cannot use the “unit quad” shaders that support the fast +clipping pipeline. Instead they always use the full triangle-list +shaders that support extended vertices and clipping. + +This is the slowest path we can take when building a draw call, since we +must interact with the polygon clipping and texturing code. + +Masks +----- + +For each layer with a mask attached, Advanced Layers builds a +``MaskOperation``. These operations must resolve to a single mask +texture, as well as a rectangular area to which the mask applies. All +batched pixel shaders will automatically clip pixels to the mask if a +mask texture is bound. (Note that we must use separate batches if the +mask texture changes.) + +Some layers have multiple mask textures. In this case, the MaskOperation +will store the list of masks, and right before rendering, it will invoke +a shader to combine these masks into a single texture. + +MaskOperations are shared across layers when possible, but are not +cached across frames. + +BigImage Support +---------------- + +ImageLayers and CanvasLayers can be tiled with many individual textures. +This happens in rare cases where the underlying buffer is too big for +the GPU. Early on this caused problems for Advanced Layers, since AL +required one texture per layer. We implemented BigImage support by +creating temporary ImageLayers for each visible tile, and throwing those +layers away at the end of the frame. + +Advanced Layers no longer has a 1:1 layer:texture restriction, but we +retain the temporary layer solution anyway. 
It is not much code and it +means we do not have to split ``TexturedLayerMLGPU`` methods into +iterated and non-iterated versions. + +Texture Locking +--------------- + +Advanced Layers has a different texture locking scheme than the existing +compositor. If a texture needs to be locked, then it is locked by the +MLGDevice automatically when bound to the current pipeline. The +MLGDevice keeps a set of the locked textures to avoid double-locking. At +the end of the frame, any textures in the locked set are unlocked. + +We cannot easily replicate the locking scheme in the old compositor, +since the duration of using the texture is not scoped to when we visit +the layer. + +Buffer Measurements +------------------- + +Advanced Layers uses constant buffers to send layer information and +extended instance data to the GPU. We do this by pre-allocating large +constant buffers and mapping them with ``MAP_DISCARD`` at the beginning +of the frame. Batches may allocate into this up to the maximum bindable +constant buffer size of the device (currently, 64KB). + +There are some downsides to this approach. Constant buffers are +difficult to work with - they have specific alignment requirements, and +care must be taken not too run over the maximum number of constants in a +buffer. Another approach would be to store constants in a 2D texture and +use vertex shader texture fetches. Advanced Layers implemented this and +benchmarked it to decide which approach to use. Textures seemed to skew +better on GPU performance, but worse on CPU, but this varied depending +on the GPU. Overall constant buffers performed best and most +consistently, so we have kept them. + +Additionally, we tested different ways of performing buffer uploads. +Buffer creation itself is costly, especially on integrated GPUs, and +especially so for immutable, immediate-upload buffers. As a result we +aggressively cache buffer objects and always allocate them as +MAP_DISCARD unless they are write-once and long-lived. + +Buffer Types +------------ + +Advanced Layers has a few different classes to help build and upload +buffers to the GPU. They are: + +- ``MLGBuffer``. This is the low-level shader resource that + ``MLGDevice`` exposes. It is the building block for buffer helper + classes, but it can also be used to make one-off, immutable, + immediate-upload buffers. MLGBuffers, being a GPU resource, are + reference counted. +- ``SharedBufferMLGPU``. These are large, pre-allocated buffers that + are read-only on the GPU and write-only on the CPU. They usually + exceed the maximum bindable buffer size. There are three shared + buffers created by default and they are automatically unmapped as + needed: one for vertices, one for vertex shader constants, and one + for pixel shader constants. When callers allocate into a shared + buffer they get back a mapped pointer, a GPU resource, and an offset. + When the underlying device supports offsetable buffers (like + ``ID3D11DeviceContext1`` does), this results in better GPU + utilization, as there are less resources and fewer upload commands. +- ``ConstantBufferSection`` and ``VertexBufferSection``. These are + “views” into a ``SharedBufferMLGPU``. They contain the underlying + ``MLGBuffer``, and when offsetting is supported, the offset + information necessary for resource binding. Sections are not + reference counted. +- ``StagingBuffer``. A dynamically sized CPU buffer where items can be + appended in a free-form manner. 
The stride of a single “item” is + computed by the first item written, and successive items must have + the same stride. The buffer must be uploaded to the GPU manually. + Staging buffers are appropriate for creating general constant or + vertex buffer data. They can also write items in reverse, which is + how we render back-to-front when layers are visited front-to-back. + They can be uploaded to a ``SharedBufferMLGPU`` or an immutabler + ``MLGBuffer`` very easily. Staging buffers are not reference counted. + +Unsupported Features +-------------------- + +Currently, these features of the old compositor are not yet implemented. + +- OpenGL and software support (currently AL only works on D3D11). +- APZ displayport overlay. +- Diagnostic/developer overlays other than the FPS/timing overlay. +- DEAA. It was never ported to the D3D11 compositor, but we would like + it. +- Component alpha when used inside an opaque intermediate surface. +- Effects prefs. Possibly not needed post-B2G removal. +- Widget overlays and underlays used by macOS and Android. +- DefaultClearColor. This is Android specific, but is easy to added + when needed. +- Frame uniformity info in the profiler. Possibly not needed post-B2G + removal. +- LayerScope. There are no plans to make this work. + +Future Work +----------- + +- Refactor for D3D12/Vulkan support (namely, split MLGDevice into + something less stateful and something else more low-level). +- Remove “MLG” moniker and namespace everything. +- Other backends (D3D12/Vulkan, OpenGL, Software) +- Delete CompositorD3D11 +- Add DEAA support +- Re-enable the depth buffer by default for fast GPUs +- Re-enable right-sizing of inaccurately sized containers +- Drop constant buffers for ancillary vertex data +- Fast shader paths for simple video/painted layer cases + +History +------- + +Advanced Layers has gone through four major design iterations. The +initial version used tiling - each render view divided the screen into +128x128 tiles, and layers were assigned to tiles based on their +screen-space draw area. This approach proved not to scale well to 3d +transforms, and so tiling was eliminated. + +We replaced it with a simple system of accumulating draw regions to each +batch, thus ensuring that items could be assigned to batches while +maintaining correct z-ordering. This second iteration also coincided +with plane-splitting support. + +On large layer trees, accumulating the affected regions of batches +proved to be quite expensive. This led to a third iteration, using depth +buffers and separate opaque and transparent batch lists to achieve +z-ordering and occlusion culling. + +Finally, depth buffers proved to be too expensive, and we introduced a +simple CPU-based occlusion culling pass. This iteration coincided with +using more precise draw rects and splitting pipelines into unit-quad, +cpu-clipped and triangle-list, gpu-clipped variants. diff --git a/gfx/docs/AsyncPanZoom.rst b/gfx/docs/AsyncPanZoom.rst new file mode 100644 index 0000000000..01bf2776df --- /dev/null +++ b/gfx/docs/AsyncPanZoom.rst @@ -0,0 +1,687 @@ +.. _apz: + +Asynchronous Panning and Zooming +================================ + +**This document is a work in progress. Some information may be missing +or incomplete.** + +.. image:: AsyncPanZoomArchitecture.png + +Goals +----- + +We need to be able to provide a visual response to user input with +minimal latency. In particular, on devices with touch input, content +must track the finger exactly while panning, or the user experience is +very poor. 
According to the UX team, 120ms is an acceptable latency +between user input and response. + +Context and surrounding architecture +------------------------------------ + +The fundamental problem we are trying to solve with the Asynchronous +Panning and Zooming (APZ) code is that of responsiveness. By default, +web browsers operate in a “game loop” that looks like this: + +:: + + while true: + process input + do computations + repaint content + display repainted content + +In browsers the “do computation” step can be arbitrarily expensive +because it can involve running event handlers in web content. Therefore, +there can be an arbitrary delay between the input being received and the +on-screen display getting updated. + +Responsiveness is always good, and with touch-based interaction it is +even more important than with mouse or keyboard input. In order to +ensure responsiveness, we split the “game loop” model of the browser +into a multithreaded variant which looks something like this: + +:: + + Thread 1 (compositor thread) + while true: + receive input + send a copy of input to thread 2 + adjust painted content based on input + display adjusted painted content + + Thread 2 (main thread) + while true: + receive input from thread 1 + do computations + repaint content + update the copy of painted content in thread 1 + +This multithreaded model is called off-main-thread compositing (OMTC), +because the compositing (where the content is displayed on-screen) +happens on a separate thread from the main thread. Note that this is a +very very simplified model, but in this model the “adjust painted +content based on input” is the primary function of the APZ code. + +The “painted content” is stored on a set of “layers”, that are +conceptually double-buffered. That is, when the main thread does its +repaint, it paints into one set of layers (the “client” layers). The +update that is sent to the compositor thread copies all the changes from +the client layers into another set of layers that the compositor holds. +These layers are called the “shadow” layers or the “compositor” layers. +The compositor in theory can continuously composite these shadow layers +to the screen while the main thread is busy doing other things and +painting a new set of client layers. + +The APZ code takes the input events that are coming in from the hardware +and uses them to figure out what the user is trying to do (e.g. pan the +page, zoom in). It then expresses this user intention in the form of +translation and/or scale transformation matrices. These transformation +matrices are applied to the shadow layers at composite time, so that +what the user sees on-screen reflects what they are trying to do as +closely as possible. + +Technical overview +------------------ + +As per the heavily simplified model described above, the fundamental +purpose of the APZ code is to take input events and produce +transformation matrices. This section attempts to break that down and +identify the different problems that make this task non-trivial. + +Checkerboarding +~~~~~~~~~~~~~~~ + +The content area that is painted and stored in a shadow layer is called +the “displayport”. The APZ code is responsible for determining how large +the displayport should be. On the one hand, we want the displayport to +be as large as possible. At the very least it needs to be larger than +what is visible on-screen, because otherwise, as soon as the user pans, +there will be some unpainted area of the page exposed. 
However, we +cannot always set the displayport to be the entire page, because the +page can be arbitrarily long and this would require an unbounded amount +of memory to store. Therefore, a good displayport size is one that is +larger than the visible area but not so large that it is a huge drain on +memory. Because the displayport is usually smaller than the whole page, +it is always possible for the user to scroll so fast that they end up in +an area of the page outside the displayport. When this happens, they see +unpainted content; this is referred to as “checkerboarding”, and we try +to avoid it where possible. + +There are many possible ways to determine what the displayport should be +in order to balance the tradeoffs involved (i.e. having one that is too +big is bad for memory usage, and having one that is too small results in +excessive checkerboarding). Ideally, the displayport should cover +exactly the area that we know the user will make visible. Although we +cannot know this for sure, we can use heuristics based on current +panning velocity and direction to ensure a reasonably-chosen displayport +area. This calculation is done in the APZ code, and a new desired +displayport is frequently sent to the main thread as the user is panning +around. + +Multiple layers +~~~~~~~~~~~~~~~ + +Consider, for example, a scrollable page that contains an iframe which +itself is scrollable. The iframe can be scrolled independently of the +top-level page, and we would like both the page and the iframe to scroll +responsively. This means that we want independent asynchronous panning +for both the top-level page and the iframe. In addition to iframes, +elements that have the overflow:scroll CSS property set are also +scrollable, and also end up on separate scrollable layers. In the +general case, the layers are arranged in a tree structure, and so within +the APZ code we have a matching tree of AsyncPanZoomController (APZC) +objects, one for each scrollable layer. To manage this tree of APZC +instances, we have a single APZCTreeManager object. Each APZC is +relatively independent and handles the scrolling for its associated +layer, but there are some cases in which they need to interact; these +cases are described in the sections below. + +Hit detection +~~~~~~~~~~~~~ + +Consider again the case where we have a scrollable page that contains an +iframe which itself is scrollable. As described above, we will have two +APZC instances - one for the page and one for the iframe. When the user +puts their finger down on the screen and moves it, we need to do some +sort of hit detection in order to determine whether their finger is on +the iframe or on the top-level page. Based on where their finger lands, +the appropriate APZC instance needs to handle the input. This hit +detection is also done in the APZCTreeManager, as it has the necessary +information about the sizes and positions of the layers. Currently this +hit detection is not perfect, as it uses rects and does not account for +things like rounded corners and opacity. + +Also note that for some types of input (e.g. when the user puts two +fingers down to do a pinch) we do not want the input to be “split” +across two different APZC instances. In the case of a pinch, for +example, we find a “common ancestor” APZC instance - one that is +zoomable and contains all of the touch input points, and direct the +input to that APZC instance. 
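
As a rough illustration of that common-ancestor search (this is not the actual
``APZCTreeManager`` code; the type and field names are hypothetical, and
"contains all touch points" is modelled simply as being an ancestor of every
hit APZC)::

    // Hypothetical sketch: hit-test each touch point first, then walk up from
    // the first hit towards the root until we find a zoomable APZC that is an
    // ancestor of every hit.
    #include <vector>

    struct Apzc {
      Apzc* parent = nullptr;
      bool zoomable = false;
    };

    static bool IsAncestorOf(const Apzc* aMaybeAncestor, const Apzc* aNode) {
      for (const Apzc* cur = aNode; cur; cur = cur->parent) {
        if (cur == aMaybeAncestor) {
          return true;
        }
      }
      return false;
    }

    Apzc* FindPinchTarget(const std::vector<Apzc*>& aHitPerTouchPoint) {
      if (aHitPerTouchPoint.empty()) {
        return nullptr;
      }
      for (Apzc* candidate = aHitPerTouchPoint[0]; candidate;
           candidate = candidate->parent) {
        bool containsAll = true;
        for (Apzc* hit : aHitPerTouchPoint) {
          if (!IsAncestorOf(candidate, hit)) {
            containsAll = false;
            break;
          }
        }
        if (containsAll && candidate->zoomable) {
          return candidate;
        }
      }
      return nullptr;  // no suitable common ancestor found
    }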
+ +Scroll Handoff +~~~~~~~~~~~~~~ + +Consider yet again the case where we have a scrollable page that +contains an iframe which itself is scrollable. Say the user scrolls the +iframe so that it reaches the bottom. If the user continues panning on +the iframe, the expectation is that the top-level page will start +scrolling. However, as discussed in the section on hit detection, the +APZC instance for the iframe is separate from the APZC instance for the +top-level page. Thus, we need the two APZC instances to communicate in +some way such that input events on the iframe result in scrolling on the +top-level page. This behaviour is referred to as “scroll handoff” (or +“fling handoff” in the case where analogous behaviour results from the +scrolling momentum of the page after the user has lifted their finger). + +Input event untransformation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The APZC architecture by definition results in two copies of a “scroll +position” for each scrollable layer. There is the original copy on the +main thread that is accessible to web content and the layout and +painting code. And there is a second copy on the compositor side, which +is updated asynchronously based on user input, and corresponds to what +the user visually sees on the screen. Although these two copies may +diverge temporarily, they are reconciled periodically. In particular, +they diverge while the APZ code is performing an async pan or zoom +action on behalf of the user, and are reconciled when the APZ code +requests a repaint from the main thread. + +Because of the way input events are stored, this has some unfortunate +consequences. Input events are stored relative to the device screen - so +if the user touches at the same physical spot on the device, the same +input events will be delivered regardless of the content scroll +position. When the main thread receives a touch event, it combines that +with the content scroll position in order to figure out what DOM element +the user touched. However, because we now have two different scroll +positions, this process may not work perfectly. A concrete example +follows: + +Consider a device with screen size 600 pixels tall. On this device, a +user is viewing a document that is 1000 pixels tall, and that is +scrolled down by 200 pixels. That is, the vertical section of the +document from 200px to 800px is visible. Now, if the user touches a +point 100px from the top of the physical display, the hardware will +generate a touch event with y=100. This will get sent to the main +thread, which will add the scroll position (200) and get a +document-relative touch event with y=300. This new y-value will be used +in hit detection to figure out what the user touched. If the document +had a absolute-positioned div at y=300, then that would receive the +touch event. + +Now let us add some async scrolling to this example. Say that the user +additionally scrolls the document by another 10 pixels asynchronously +(i.e. only on the compositor thread), and then does the same touch +event. The same input event is generated by the hardware, and as before, +the document will deliver the touch event to the div at y=300. However, +visually, the document is scrolled by an additional 10 pixels so this +outcome is wrong. What needs to happen is that the APZ code needs to +intercept the touch event and account for the 10 pixels of asynchronous +scroll. Therefore, the input event with y=100 gets converted to y=110 in +the APZ code before being passed on to the main thread. 
The main thread +then adds the scroll position it knows about and determines that the +user touched at a document-relative position of y=310. + +Analogous input event transformations need to be done for horizontal +scrolling and zooming. + +Content independently adjusting scrolling +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As described above, there are two copies of the scroll position in the +APZ architecture - one on the main thread and one on the compositor +thread. Usually for architectures like this, there is a single “source +of truth” value and the other value is simply a copy. However, in this +case that is not easily possible to do. The reason is that both of these +values can be legitimately modified. On the compositor side, the input +events the user is triggering modify the scroll position, which is then +propagated to the main thread. However, on the main thread, web content +might be running Javascript code that programmatically sets the scroll +position (via window.scrollTo, for example). Scroll changes driven from +the main thread are just as legitimate and need to be propagated to the +compositor thread, so that the visual display updates in response. + +Because the cross-thread messaging is asynchronous, reconciling the two +types of scroll changes is a tricky problem. Our design solves this +using various flags and generation counters. The general heuristic we +have is that content-driven scroll position changes (e.g. scrollTo from +JS) are never lost. For instance, if the user is doing an async scroll +with their finger and content does a scrollTo in the middle, then some +of the async scroll would occur before the “jump” and the rest after the +“jump”. + +Content preventing default behaviour of input events +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Another problem that we need to deal with is that web content is allowed +to intercept touch events and prevent the “default behaviour” of +scrolling. This ability is defined in web standards and is +non-negotiable. Touch event listeners in web content are allowed call +preventDefault() on the touchstart or first touchmove event for a touch +point; doing this is supposed to “consume” the event and prevent +touch-based panning. As we saw in a previous section, the input event +needs to be untransformed by the APZ code before it can be delivered to +content. But, because of the preventDefault problem, we cannot fully +process the touch event in the APZ code until content has had a chance +to handle it. Web browsers in general solve this problem by inserting a +delay of up to 300ms before processing the input - that is, web content +is allowed up to 300ms to process the event and call preventDefault on +it. If web content takes longer than 300ms, or if it completes handling +of the event without calling preventDefault, then the browser +immediately starts processing the events. + +The way the APZ implementation deals with this is that upon receiving a +touch event, it immediately returns an untransformed version that can be +dispatched to content. It also schedules a 400ms timeout (600ms on +Android) during which content is allowed to prevent scrolling. There is +an API that allows the main-thread event dispatching code to notify the +APZ as to whether or not the default action should be prevented. 
If the +APZ content response timeout expires, or if the main-thread event +dispatching code notifies the APZ of the preventDefault status, then the +APZ continues with the processing of the events (which may involve +discarding the events). + +The touch-action CSS property from the pointer-events spec is intended +to allow eliminating this 400ms delay in many cases (although for +backwards compatibility it will still be needed for a while). Note that +even with touch-action implemented, there may be cases where the APZ +code does not know the touch-action behaviour of the point the user +touched. In such cases, the APZ code will still wait up to 400ms for the +main thread to provide it with the touch-action behaviour information. + +Technical details +----------------- + +This section describes various pieces of the APZ code, and goes into +more specific detail on APIs and code than the previous sections. The +primary purpose of this section is to help people who plan on making +changes to the code, while also not going into so much detail that it +needs to be updated with every patch. + +Overall flow of input events +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This section describes how input events flow through the APZ code. + +1. Input events arrive from the hardware/widget code into the APZ via + APZCTreeManager::ReceiveInputEvent. The thread that invokes this is + called the input thread, and may or may not be the same as the Gecko + main thread. +2. Conceptually the first thing that the APZCTreeManager does is to + associate these events with “input blocks”. An input block is a set + of events that share certain properties, and generally are intended + to represent a single gesture. For example with touch events, all + events following a touchstart up to but not including the next + touchstart are in the same block. All of the events in a given block + will go to the same APZC instance and will either all be processed + or all be dropped. +3. Using the first event in the input block, the APZCTreeManager does a + hit-test to see which APZC it hits. This hit-test uses the event + regions populated on the layers, which may be larger than the true + hit area of the layer. If no APZC is hit, the events are discarded + and we jump to step 6. Otherwise, the input block is tagged with the + hit APZC as a tentative target and put into a global APZ input + queue. +4. + + i. If the input events landed outside the dispatch-to-content event + region for the layer, any available events in the input block + are processed. These may trigger behaviours like scrolling or + tap gestures. + ii. If the input events landed inside the dispatch-to-content event + region for the layer, the events are left in the queue and a + 400ms timeout is initiated. If the timeout expires before step 9 + is completed, the APZ assumes the input block was not cancelled + and the tentative target is correct, and processes them as part + of step 10. + +5. The call stack unwinds back to APZCTreeManager::ReceiveInputEvent, + which does an in-place modification of the input event so that any + async transforms are removed. +6. The call stack unwinds back to the widget code that called + ReceiveInputEvent. This code now has the event in the coordinate + space Gecko is expecting, and so can dispatch it to the Gecko main + thread. +7. Gecko performs its own usual hit-testing and event dispatching for + the event. As part of this, it records whether any touch listeners + cancelled the input block by calling preventDefault(). 
It also + activates inactive scrollframes that were hit by the input events. +8. The call stack unwinds back to the widget code, which sends two + notifications to the APZ code on the input thread. The first + notification is via APZCTreeManager::ContentReceivedInputBlock, and + informs the APZ whether the input block was cancelled. The second + notification is via APZCTreeManager::SetTargetAPZC, and informs the + APZ of the results of the Gecko hit-test during event dispatch. Note + that Gecko may report that the input event did not hit any + scrollable frame at all. The SetTargetAPZC notification happens only + once per input block, while the ContentReceivedInputBlock + notification may happen once per block, or multiple times per block, + depending on the input type. +9. + + i. If the events were processed as part of step 4(i), the + notifications from step 8 are ignored and step 10 is skipped. + ii. If events were queued as part of step 4(ii), and steps 5-8 take + less than 400ms, the arrival of both notifications from step 8 + will mark the input block ready for processing. + iii. If events were queued as part of step 4(ii), but steps 5-8 take + longer than 400ms, the notifications from step 8 will be + ignored and step 10 will already have happened. + +10. If events were queued as part of step 4(ii) they are now either + processed (if the input block was not cancelled and Gecko detected a + scrollframe under the input event, or if the timeout expired) or + dropped (all other cases). Note that the APZC that processes the + events may be different at this step than the tentative target from + step 3, depending on the SetTargetAPZC notification. Processing the + events may trigger behaviours like scrolling or tap gestures. + +If the CSS touch-action property is enabled, the above steps are +modified as follows: \* In step 4, the APZC also requires the allowed +touch-action behaviours for the input event. This might have been +determined as part of the hit-test in APZCTreeManager; if not, the +events are queued. \* In step 6, the widget code determines the content +element at the point under the input element, and notifies the APZ code +of the allowed touch-action behaviours. This notification is sent via a +call to APZCTreeManager::SetAllowedTouchBehavior on the input thread. \* +In step 9(ii), the input block will only be marked ready for processing +once all three notifications arrive. + +Threading considerations +^^^^^^^^^^^^^^^^^^^^^^^^ + +The bulk of the input processing in the APZ code happens on what we call +“the input thread”. In practice the input thread could be the Gecko main +thread, the compositor thread, or some other thread. There are obvious +downsides to using the Gecko main thread - that is, “asynchronous” +panning and zooming is not really asynchronous as input events can only +be processed while Gecko is idle. In an e10s environment, using the +Gecko main thread of the chrome process is acceptable, because the code +running in that process is more controllable and well-behaved than +arbitrary web content. Using the compositor thread as the input thread +could work on some platforms, but may be inefficient on others. For +example, on Android (Fennec) we receive input events from the system on +a dedicated UI thread. We would have to redispatch the input events to +the compositor thread if we wanted to the input thread to be the same as +the compositor thread. 
This introduces a potential for higher latency, +particularly if the compositor does any blocking operations - blocking +SwapBuffers operations, for example. As a result, the APZ code itself +does not assume that the input thread will be the same as the Gecko main +thread or the compositor thread. + +Active vs. inactive scrollframes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The number of scrollframes on a page is potentially unbounded. However, +we do not want to create a separate layer for each scrollframe right +away, as this would require large amounts of memory. Therefore, +scrollframes as designated as either “active” or “inactive”. Active +scrollframes are the ones that do have their contents put on a separate +layer (or set of layers), and inactive ones do not. + +Consider a page with a scrollframe that is initially inactive. When +layout generates the layers for this page, the content of the +scrollframe will be flattened into some other PaintedLayer (call it P). +The layout code also adds the area (or bounding region in case of weird +shapes) of the scrollframe to the dispatch-to-content region of P. + +When the user starts interacting with that content, the hit-test in the +APZ code finds the dispatch-to-content region of P. The input block +therefore has a tentative target of P when it goes into step 4(ii) in +the flow above. When gecko processes the input event, it must detect the +inactive scrollframe and activate it, as part of step 7. Finally, the +widget code sends the SetTargetAPZC notification in step 8 to notify the +APZ that the input block should really apply to this new layer. The +issue here is that the layer transaction containing the new layer must +reach the compositor and APZ before the SetTargetAPZC notification. If +this does not occur within the 400ms timeout, the APZ code will be +unable to update the tentative target, and will continue to use P for +that input block. Input blocks that start after the layer transaction +will get correctly routed to the new layer as there will now be a layer +and APZC instance for the active scrollframe. + +This model implies that when the user initially attempts to scroll an +inactive scrollframe, it may end up scrolling an ancestor scrollframe. +(This is because in the absence of the SetTargetAPZC notification, the +input events will get applied to the closest ancestor scrollframe’s +APZC.) Only after the round-trip to the gecko thread is complete is +there a layer for async scrolling to actually occur on the scrollframe +itself. At that point the scrollframe will start receiving new input +blocks and will scroll normally. + +WebRender Integration +~~~~~~~~~~~~~~~~~~~~~ + +The APZ code was originally written to work with the "layers" graphics +backend. Many of the concepts (and therefore variable/function names) +stem from the integration with the layers backend. After the WebRender +backend was added, the existing code evolved over time to integrate +with that backend as well, resulting in a bit of a hodge-podge effect. +With that cautionary note out of the way, there are three main pieces +that need to be understood to grasp the integration between the APZ +code and WebRender. These are detailed below. + +HitTestingTree +^^^^^^^^^^^^^^ + +The APZCTreeManager keeps as part of its internal state a tree of +HitTestingTreeNode instances. This is referred to as the HitTestingTree. 
+As the name implies, this was used for hit-testing purposes, so that +APZ could determine which scrollframe a particular incoming input event +would be targeting. Doing the hit-test requires access to a bunch of state, +such as CSS transforms and clip rects, as well as ancillary data like +event regions, which affect how APZ reacts to input events. + +With the layers backend, all this information was provided by a layer tree +update, and so the HitTestingTree was created to mirror the layer tree, +allowing APZ access to that information from other threads. The structure +of the tree was identical to the layer tree. But with WebRender, there +is no "layer tree" per se, and instead we "fake it" by creating a +HitTestingTree structure that is similar to what it would be like on the +equivalent layer tree. But the bigger difference is that with WebRender, +the HitTestingTree is not actually used for hit-testing at all; instead +we get WebRender to do the hit-test for us, as it can do so using its +own internal state and produce a more precise result. + +Information stored in the HitTestingTree (e.g. CSS transforms) is still +used by other pieces of APZ (e.g. some of the scrollbar manipulation code) +so it is still needed, even with the WebRender backend. For this reason, +and for consistency between the two backends, we try to populate as much +information in the HitTestingTree that we can, even with the WebRender +backend. + +With the layers backend, the way the HitTestingTree is created is by +walking the layer tree with a LayerMetricsWrapper class. This wraps +a layer tree but also expands layers with multiple ScrollMetadata into +multiple nodes. The equivalent in the WebRender world is the +WebRenderScrollDataWrapper, which wraps a WebRenderScrollData object. The +WebRenderScrollData object is roughly analogous to a layer tree, but +is something that is constructed deliberately rather than being a natural +output from the WebRender paint transaction (i.e. we create it explicitly +for APZ's consumption, rather than something that we would create anyway +for WebRender's consumption). + +The WebRenderScrollData structure contains within it a tree of +WebRenderLayerScrollData instances, which are analogous to individual +layers in a layer tree. These instances contain various fields like +CSS transforms, fixed/sticky position info, etc. that would normally be +found on individual layers in the layer tree. This allows the code +that builds the HitTestingTree to consume either a WebRenderScrollData +or a layer tree in a more-or-less unified fashion. + +Working backwards a bit more, the WebRenderLayerScrollData instances +are created as we traverse the Gecko display list and build the +WebRender display list. In the layers world, the code in FrameLayerBuilder +was responsible for building the layer tree from the Gecko display list, +but in the WebRender world, this happens primarily in WebRenderCommandBuilder. +As of this writing, the architecture for this is that, as we walk +the Gecko display list, we query it to see if it contains any information +that APZ might need to know (e.g. CSS transforms) via a call to +`nsDisplayItem::UpdateScrollData(nullptr, nullptr)`. If this call +returns true, we create a WebRenderLayerScrollData instance for the item, +and populate it with the necessary information in +`WebRenderLayerScrollData::Initialize`. 
We also create +WebRenderLayerScrollData instances if we detect (via ASR changes) that +we are now processing a Gecko display item that is in a different scrollframe +than the previous item. This is equivalent to how FrameLayerBuilder will +flatten items with different ASRs into different layers, so that it +is cheap to scroll scrollframes in the compositor. + +The main sources of complexity in this code come from: + +1. Ensuring the ScrollMetadata instances end on the proper + WebRenderLayerScrollData instances (such that every path from a leaf + WebRenderLayerScrollData node to the root has a consistent ordering of + scrollframes without duplications). +2. The deferred-transform optimization that is described in more detail + at the declaration of StackingContextHelper::mDeferredTransformItem. + +Hit-testing +^^^^^^^^^^^ + +Since the HitTestingTree is not used for actual hit-testing purposes +with the WebRender backend (see previous section), this section describes +how hit-testing actually works with WebRender. + +With both layers and WebRender, the Gecko display list contains display items +(`nsDisplayCompositorHitTestInfo`) that store hit-testing state. These +items implement the `CreateWebRenderCommands` method and generate a "hit-test +item" into the WebRender display list. This is basically just a rectangle +item in the WebRender display list that is no-op for painting purposes, +but contains information that should be returned by the hit-test (specifically +the hit info flags and the scrollId of the enclosing scrollframe). The +hit-test item gets clipped and transformed in the same way that all the other +items in the WebRender display list do, via clip chains and enclosing +reference frame/stacking context items. + +When WebRender needs to do a hit-test, it goes through its display list, +taking into account the current clips and transforms, adjusted for the +most recent async scroll/zoom, and determines which hit-test item(s) are under +the target point, and returns those items. APZ can then take the frontmost +item from that list (or skip over it if it happens to be inside a OOP +subdocument that's pointer-events:none) and use that as the hit target. +It's important to note that when APZ does hit-testing for the layers backend, +it uses the most recent available async transforms, even if those transforms +have not yet been composited. With WebRender, the hit-test uses the last +transform provided by the `SampleForWebRender` API (see next section) which +generally reflects the last composite, and doesn't take into account further +changes to the transforms that have occurred since then. + +When debugging hit-test issues, it is often useful to apply the patches +on bug 1656260, which introduce a guid on Gecko display items and propagate +it all the way through to where APZ gets the hit-test result. This allows +answering the question "which nsDisplayCompositorHitTestInfo was responsible +for this hit-test result?" which is often a very good first step in +solving the bug. From there, one can determine if there was some other +display item in front that should have generated a +nsDisplayCompositorHitTestInfo but didn't, or if display item itself had +incorrect information. The second patch on that bug further allows exposing +hand-written debug info to the APZ code, so that the WR hit-testing +mechanism itself can be more effectively debugged, in case there is a problem +with the WR display items getting improperly transformed or clipped. 
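
As a concrete but hypothetical illustration (the real types live in the
WebRender and APZ code, and the field names below are simplified stand-ins),
consuming such a hit-test result amounts to scanning the front-to-back list
and taking the first usable entry::

    // Sketch of picking a target from a WebRender hit-test result list.
    #include <cstdint>
    #include <optional>
    #include <vector>

    struct HitResult {
      uint64_t scrollId;       // scrollframe the hit item belongs to
      uint16_t hitInfoFlags;   // e.g. dispatch-to-content, scrollbar bits
      bool inInertOopSubdoc;   // inside a pointer-events:none OOP subdocument
    };

    std::optional<HitResult> PickTarget(
        const std::vector<HitResult>& aResultsFrontToBack) {
      for (const HitResult& hit : aResultsFrontToBack) {
        if (hit.inInertOopSubdoc) {
          continue;  // skip over inert out-of-process content
        }
        return hit;  // frontmost usable item wins
      }
      return std::nullopt;
    }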
+ +Sampling +^^^^^^^^ + +With both the layers and WebRender backend, the compositing step needs to +read the latest async transforms from APZ in order to ensure scrollframes +are rendered at the right position. In both cases, the API for this is +exposed via the `APZSampler` class. The difference is that with the layers +backend, the `AsyncCompositionManager` walks the layer tree and queries +the transform components for each layer individually via the various getters +on `APZSampler`. In contrast, with the WebRender backend, there is a single +`APZSampler::SampleForWebRender` API that returns all the information needed +for all the scrollframes, scrollthumbs, etc. Conceptually though, the +functionality is pretty similar, because the compositor needs the same +information from APZ regardless of which backend is in use. + +Along with sampling the APZ transforms, the compositor also triggers APZ +animations to advance to the next timestep (usually the next vsync). Again, +with both the WebRender and layers backend, this happens just before reading +the APZ transforms. The only difference is that with the layers backend, +the `AsyncCompositionManager` invokes the `APZSampler::AdvanceAnimations` API +directly, whereas with the WebRender backend this happens as part of the +`APZSampler::SampleForWebRender` implementation. + +Threading / Locking Overview +---------------------------- + +Threads +~~~~~~~ + +There are three threads relevant to APZ: the **controller thread**, +the **updater thread**, and the **sampler thread**. This table lists +which threads play these roles on each platform / configuration: + +===================== ========== =========== ============= ============== ========== ============= +APZ Thread Name Desktop Desktop+GPU Desktop+WR Desktop+WR+GPU Android Android+WR +===================== ========== =========== ============= ============== ========== ============= +**controller thread** UI main GPU main UI main GPU main Java UI Java UI +**updater thread** Compositor Compositor SceneBuilder SceneBuilder Compositor SceneBuilder +**sampler thread** Compositor Compositor RenderBackend RenderBackend Compositor RenderBackend +===================== ========== =========== ============= ============== ========== ============= + +Locks +~~~~~ + +There are also a number of locks used in APZ code: + +======================= ============================== +Lock type How many instances +======================= ============================== +APZ tree lock one per APZCTreeManager +APZC map lock one per APZCTreeManager +APZC instance lock one per AsyncPanZoomController +APZ test lock one per APZCTreeManager +Checkerboard event lock one per AsyncPanZoomController +======================= ============================== + +Thread / Lock Ordering +~~~~~~~~~~~~~~~~~~~~~~ + +To avoid deadlocks, the threads and locks have a global **ordering** +which must be respected. + +Respecting the ordering means the following: + +- Let "A < B" denote that A occurs earlier than B in the ordering +- Thread T may only acquire lock L, if T < L +- A thread may only acquire lock L2 while holding lock L1, if L1 < L2 +- A thread may only block on a response from another thread T while holding a lock L, if L < T + +**The lock ordering is as follows**: + +1. UI main +2. GPU main (only if GPU enabled) +3. Compositor thread +4. SceneBuilder thread (only if WR enabled) +5. **APZ tree lock** +6. RenderBackend thread (only if WR enabled) +7. **APZC map lock** +8. **APZC instance lock** +9. **APZ test lock** +10. 
**Checkerboard event lock** + +Example workflows +^^^^^^^^^^^^^^^^^ + +Here are some example APZ workflows. Observe how they all obey +the global thread/lock ordering. Feel free to add others: + +- **Input handling** (in WR+GPU) case: UI main -> GPU main -> APZ tree lock -> RenderBackend thread +- **Sync messages** in ``PCompositorBridge.ipdl``: UI main thread -> Compositor thread +- **GetAPZTestData**: Compositor thread -> SceneBuilder thread -> test lock +- **Scene swap**: SceneBuilder thread -> APZ tree lock -> RenderBackend thread +- **Updating hit-testing tree**: SceneBuilder thread -> APZ tree lock -> APZC instance lock +- **Updating APZC map**: SceneBuilder thread -> APZ tree lock -> APZC map lock +- **Sampling and animation deferred tasks** [1]_: RenderBackend thread -> APZC map lock -> APZC instance lock +- **Advancing animations**: RenderBackend thread -> APZC instance lock + +.. [1] It looks like there are two deferred tasks that actually need the tree lock, + ``AsyncPanZoomController::HandleSmoothScrollOverscroll`` and + ``AsyncPanZoomController::HandleFlingOverscroll``. We should be able to rewrite + these to use the map lock instead of the tree lock. + This will allow us to continue running the deferred tasks on the sampler + thread rather than having to bounce them to another thread. diff --git a/gfx/docs/AsyncPanZoomArchitecture.png b/gfx/docs/AsyncPanZoomArchitecture.png Binary files differnew file mode 100644 index 0000000000..d19dcb7c8b --- /dev/null +++ b/gfx/docs/AsyncPanZoomArchitecture.png diff --git a/gfx/docs/GraphicsOverview.rst b/gfx/docs/GraphicsOverview.rst new file mode 100644 index 0000000000..a065101a8d --- /dev/null +++ b/gfx/docs/GraphicsOverview.rst @@ -0,0 +1,159 @@ +Graphics Overview +========================= + +Work in progress. Possibly incorrect or incomplete. +--------------------------------------------------- + +Jargon +------ + +There's a lot of jargon in the graphics stack. We try to maintain a list +of common words and acronyms `here <https://wiki.mozilla.org/Platform/GFX/Jargon>`__. + +Overview +-------- + +The graphics systems is responsible for rendering (painting, drawing) +the frame tree (rendering tree) elements as created by the layout +system. Each leaf in the tree has content, either bounded by a rectangle +(or perhaps another shape, in the case of SVG.) + +The simple approach for producing the result would thus involve +traversing the frame tree, in a correct order, drawing each frame into +the resulting buffer and displaying (printing non-withstanding) that +buffer when the traversal is done. It is worth spending some time on the +“correct order” note above. If there are no overlapping frames, this is +fairly simple - any order will do, as long as there is no background. If +there is background, we just have to worry about drawing that first. +Since we do not control the content, chances are the page is more +complicated. There are overlapping frames, likely with transparency, so +we need to make sure the elements are draw “back to front”, in layers, +so to speak. Layers are an important concept, and we will revisit them +shortly, as they are central to fixing a major issue with the above +simple approach. + +While the above simple approach will work, the performance will suffer. +Each time anything changes in any of the frames, the complete process +needs to be repeated, everything needs to be redrawn. Further, there is +very little space to take advantage of the modern graphics (GPU) +hardware, or multi-core computers. 
If you recall from the previous +sections, the frame tree is only accessible from the UI thread, so while +we’re doing all this work, the UI is basically blocked. + +(Retained) Layers +~~~~~~~~~~~~~~~~~ + +Layers framework was introduced to address the above performance issues, +by having a part of the design address each item. At the high level: + +1. We create a layer tree. The leaf elements of the tree contain all + frames (possibly multiple frames per leaf). +2. We render each layer tree element and cache (retain) the result. +3. We composite (combine) all the leaf elements into the final result. + +Let’s examine each of these steps, in reverse order. + +Compositing +~~~~~~~~~~~ + +We use the term composite as it implies that the order is important. If +the elements being composited overlap, whether there is transparency +involved or not, the order in which they are combined will effect the +result. Compositing is where we can use some of the power of the modern +graphics hardware. It is optimal for doing this job. In the scenarios +where only the position of individual frames changes, without the +content inside them changing, we see why caching each layer would be +advantageous - we only need to repeat the final compositing step, +completely skipping the layer tree creation and the rendering of each +leaf, thus speeding up the process considerably. + +Another benefit is equally apparent in the context of the stated +deficiencies of the simple approach. We can use the available graphics +hardware accelerated APIs to do the compositing step. Direct3D, OpenGL +can be used on different platforms and are well suited to accelerate +this step. + +Finally, we can now envision performing the compositing step on a +separate thread, unblocking the UI thread for other work, and doing more +work in parallel. More on this below. + +It is important to note that the number of operations in this step is +proportional to the number of layer tree (leaf) elements, so there is +additional work and complexity involved, when the layer tree is large. + +Render and retain layer elements +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As we saw, the compositing step benefits from caching the intermediate +result. This does result in the extra memory usage, so needs to be +considered during the layer tree creation. Beyond the caching, we can +accelerate the rendering of each element by (indirectly) using the +available platform APIs (e.g., Direct2D, CoreGraphics, even some of the +3D APIs like OpenGL or Direct3D) as available. This is actually done +through a platform independent API (see Moz2D) below, but is important +to realize it does get accelerated appropriately. + +Creating the layer tree +~~~~~~~~~~~~~~~~~~~~~~~ + +We need to create a layer tree (from the frames tree), which will give +us the correct result while striking the right balance between a layer +per frame element and a single layer for the complete frames tree. As +was mentioned above, there is an overhead in traversing the whole tree +and caching each of the elements, balanced by the performance +improvements. Some of the performance improvements are only noticed when +something changes (e.g., one element is moving, we only need to redo the +compositing step). + +Refresh Driver +~~~~~~~~~~~~~~ + +Layers +~~~~~~ + +Rendering each layer +~~~~~~~~~~~~~~~~~~~~ + +Tiling vs. Buffer Rotation vs. 
Full paint +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Compositing for the final result +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Graphics API +~~~~~~~~~~~~ + +Moz2D +~~~~~ + +- The Moz2D graphics API, part of the Azure project, is a + cross-platform interface onto the various graphics backends that + Gecko uses for rendering such as Direct2D (1.0 and 1.1), Skia, Cairo, + Quartz, and NV Path. Adding a new graphics platform to Gecko is + accomplished by adding a backend to Moz2D. + See `Moz2D documentation on wiki <https://wiki.mozilla.org/Platform/GFX/Moz2D>`__. + +Compositing +~~~~~~~~~~~ + +Image Decoding +~~~~~~~~~~~~~~ + +Image Animation +~~~~~~~~~~~~~~~ + +`Historical Documents <http://www.youtube.com/watch?v=lLZQz26-kms>`__ +--------------------------------------------------------------------- + +A number of posts and blogs that will give you more details or more +background, or reasoning that led to different solutions and approaches. + +- 2010-01 `Layers: Cross Platform Acceleration <http://www.basschouten.com/blog1.php/layers-cross-platform-acceleration>`__ +- 2010-04 `Layers <http://robert.ocallahan.org/2010/04/layers_01.html>`__ +- 2010-07 `Retained Layers <http://robert.ocallahan.org/2010/07/retained-layers_16.html>`__ +- 2011-04 `Introduction <https://web.archive.org/web/20140604005804/https://blog.mozilla.org/joe/2011/04/26/introducing-the-azure-project/>`__ +- 2011-07 `Layers <http://chrislord.net/index.php/2011/07/25/shadow-layers-and-learning-by-failing/%20Shadow>`__ +- 2011-09 `Graphics API Design <http://robert.ocallahan.org/2011/09/graphics-api-design.html>`__ +- 2012-04 `Moz2D Canvas on OSX <http://muizelaar.blogspot.ca/2012/04/azure-canvas-on-os-x.html>`__ +- 2012-05 `Mask Layers <http://featherweightmusings.blogspot.co.uk/2012/05/mask-layers_26.html>`__ +- 2013-07 `Graphics related <http://www.basschouten.com/blog1.php>`__ diff --git a/gfx/docs/LayersHistory.rst b/gfx/docs/LayersHistory.rst new file mode 100644 index 0000000000..360df9b37d --- /dev/null +++ b/gfx/docs/LayersHistory.rst @@ -0,0 +1,63 @@ +Layers History +============== + +This is an overview of the major events in the history of our Layers +infrastructure. + +- iPhone released in July 2007 (Built on a toolkit called LayerKit) + +- Core Animation (October 2007) LayerKit was publicly renamed to OS X + 10.5 + +- Webkit CSS 3d transforms (July 2009) + +- Original layers API (March 2010) Introduced the idea of a layer + manager that would composite. One of the first use cases for this was + hardware accelerated YUV conversion for video. + +- Retained layers (July 7 2010 - Bug 564991) This was an important + concept that introduced the idea of persisting the layer content + across paints in gecko controlled buffers instead of just by the OS. + This introduced the concept of buffer rotation to deal with scrolling + instead of using the native scrolling APIs like ScrollWindowEx + +- Layers IPC (July 2010 - Bug 570294) This introduced shadow layers and + edit lists and was originally done for e10s v1 + +- 3D transforms (September 2011 - Bug 505115) + +- OMTC (December 2011 - Bug 711168) This was prototyped on OS X but + shipped first for Fennec + +- Tiling v1 (April 2012 - Bug 739679) Originally done for Fennec. This + was done to avoid situations where we had to do a bunch of work for + scrolling a small amount. i.e. buffer rotation. It allowed us to have + a variety of interesting features like progressive painting and lower + resolution painting. 
- C++ Async pan zoom controller (July 2012 - Bug 750974) The existing
  APZ code was in Java for Fennec, so this was reimplemented.

- Streaming WebGL Buffers (February 2013 - Bug 716859) Infrastructure
  to allow OMTC WebGL and avoid the need to glFinish() every frame.

- Compositor API (April 2013 - Bug 825928) The planning for this
  started around November 2012. Layers refactoring created a compositor
  API that abstracted away the differences between D3D and OpenGL.
  The main piece of the API is DrawQuad.

- Tiling v2 (Mar 7 2014 - Bug 963073) Tiling for B2G. This work is
  mainly porting tiled layers to new textures, implementing
  double-buffered tiles and implementing a texture client pool, to be
  used by tiled content clients.

  A large motivation for the pool was the very slow performance of
  allocating tiles because of the sync messages to the compositor.

  The slow performance of allocating was directly addressed by bug 959089
  which allowed us to allocate gralloc buffers without sync messages to
  the compositor thread.

- B2G WebGL performance (May 2014 - Bug 1006957, 1001417, 1024144) This
  work improved the synchronization mechanism between the compositor
  and the producer.
diff --git a/gfx/docs/OffMainThreadPainting.rst b/gfx/docs/OffMainThreadPainting.rst
new file mode 100644
index 0000000000..c5a75f6025
--- /dev/null
+++ b/gfx/docs/OffMainThreadPainting.rst
@@ -0,0 +1,237 @@
Off Main Thread Painting
========================

OMTP, or 'off main thread painting', is the component of Gecko that
allows us to perform painting of web content off of the main thread.
This gives us more time on the main thread for javascript, layout,
display list building, and other tasks, which allows us to increase our
responsiveness.

Take a look at this `blog
post <https://mozillagfx.wordpress.com/2017/12/05/off-main-thread-painting/>`__
for an introduction.

Background
----------

Painting (or rasterization) is the last operation that happens in a
layer transaction before we forward it to the compositor. At this point,
all display items have been assigned to a layer and invalid regions have
been calculated and assigned to each layer.

The painted layer uses a content client to acquire a buffer for
painting. The main purpose of the content client is to allow us to
retain already painted content when we are scrolling a layer. We have
two main strategies for this, rotated buffer and tiling.

This is implemented with two class hierarchies. ``ContentClient`` for
rotated buffer and ``TiledContentClient`` for tiling. Additionally we
have two different painted layer implementations, ``ClientPaintedLayer``
and ``ClientTiledPaintedLayer``.

The main distinction between rotated buffer and tiling is the amount of
graphics surfaces required. Rotated buffer utilizes just a single buffer
for a frame but potentially requires painting into it multiple times.
Tiling uses multiple buffers but doesn't require painting into the
buffers multiple times.

Once the painted layer has a surface (or surfaces with tiling) to paint
into, they are wrapped in a ``DrawTarget`` of some form and a callback
to ``FrameLayerBuilder`` is called. This callback uses the assigned
display items and invalid regions to trigger rasterization. Each
``nsDisplayItem`` has its ``Paint`` method called with the provided
``DrawTarget`` that represents the surface, and paints into it.
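To make the shape of this final step concrete, below is a minimal,
self-contained sketch of the idea. The types (``DrawTarget``,
``DisplayItem``, ``PaintedLayer``) are simplified stand-ins invented for
the example, not the real Gecko interfaces:

.. code-block:: cpp

   #include <memory>
   #include <vector>

   // Simplified stand-ins for the real DrawTarget, nsDisplayItem and
   // painted layer classes.
   struct Rect { int x, y, w, h; };

   struct DrawTarget {
     void ClipTo(const Rect&) { /* restrict subsequent drawing */ }
     void FillRect(const Rect&) { /* rasterize */ }
   };

   struct DisplayItem {
     Rect bounds;
     virtual ~DisplayItem() = default;
     // Mirrors the role of nsDisplayItem::Paint: draw into the target.
     virtual void Paint(DrawTarget& target) { target.FillRect(bounds); }
   };

   struct PaintedLayer {
     std::vector<std::unique_ptr<DisplayItem>> assignedItems;
     Rect invalidRegion;  // the real code uses a full region, not one rect

     // The content client hands back a buffer wrapped in a DrawTarget;
     // the layer-builder callback then walks the assigned items.
     void PaintInvalidArea(DrawTarget& buffer) {
       buffer.ClipTo(invalidRegion);   // only redraw what changed
       for (auto& item : assignedItems) {
         item->Paint(buffer);          // each item rasterizes itself
       }
     }
   };

The important point is that rasterization is driven entirely by the
display items assigned to the layer, restricted to the layer's invalid
region.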
High level
----------

The key abstraction that allows us to paint off the main thread is
``DrawTargetCapture`` [1]_. ``DrawTargetCapture`` is a special
``DrawTarget`` which records all draw commands for replaying to another
draw target in the local process. This is similar to
``DrawTargetRecording``, but it only holds a reference to resources
instead of copying them into the command stream. This allows the command
stream to be much more lightweight than ``DrawTargetRecording``.

OMTP works by instrumenting the content clients to use a capture target
for all painting [2]_ [3]_ [4]_ [5]_. This capture draw target records all
the operations that would normally be performed directly on the
surface's draw target. Once we have all of the commands, we send the
capture and surface draw target to the ``PaintThread`` [6]_ where the
commands are replayed onto the surface. Once the rasterization is done,
we forward the layer transaction to the compositor.

Tiling and parallel painting
----------------------------

We can make one additional improvement if we are using tiling as our
content client backend.

When we are tiling, the screen is subdivided into a grid of equally
sized surfaces and draw commands are performed on the tiles they affect.
Each tile is independent of the others, so we're able to parallelize
painting by using a worker thread pool and dispatching a task for each
tile individually.

This is commonly referred to as P-OMTP or parallel painting.

Main thread rasterization
-------------------------

Even with OMTP it's still possible for the main thread to perform
rasterization. A common pattern for painting code is to create a
temporary draw target, perform drawing with it, take a snapshot, and
then draw the snapshot onto the main draw target. This is done for
blurs, box shadows, text shadows, and with the basic layer manager
fallback.

If the temporary draw target is not a draw target capture, then this
will perform rasterization on the main thread. This can be bad as it
lowers our parallelism and can cause contention with content backends,
like Direct2D, that use locking around shared resources.

To work around this, we changed the main thread painting code to use a
draw target capture for these operations and added a source surface
capture [7]_ which only resolves the painting of the draw commands when
needed on the paint thread.

There are still cases where we can perform main thread rasterization,
but we try to address them as they come up.

Out of memory issues
--------------------

The web is very complex, so we can sometimes have a very large number of
draw commands for a content paint. We've observed OOM errors for capture
command lists that have grown to be 200MiB large.

We initially tried to mitigate this by lowering the overhead of capture
command lists. We do this by filtering commands that don't actually
change the draw target state and folding consecutive transform changes,
but that was not always enough. So we added the ability for our draw
target captures to flush their command lists to the surface draw target
while we are capturing on the main thread [8]_.

This flush is triggered by a configurable memory limit. Because it
introduces a new source of main thread rasterization, we try to strike a
balance between setting the limit too low (and suffering poor
performance) and setting it too high (and risking crashes).
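Below is a rough, self-contained model of the capture-and-replay idea,
including the memory-limit flush just described. It is an illustrative
sketch of the concept only; the class, method names, and signatures are
invented for the example and do not match the real ``DrawTargetCapture``
API:

.. code-block:: cpp

   #include <functional>
   #include <mutex>
   #include <vector>

   struct DrawTarget { /* the real, surface-backed draw target */ };

   // Illustrative model of a capture target: it records draw commands as
   // callables that hold references/handles to resources (keeping the
   // command stream lightweight) and replays them later on another thread.
   class CaptureTarget {
    public:
     explicit CaptureTarget(size_t flushLimitBytes)
         : mFlushLimit(flushLimitBytes) {}

     void SetSurface(DrawTarget* surface) { mSurface = surface; }

     // Called on the main thread while "painting".
     void Record(std::function<void(DrawTarget&)> cmd, size_t approxBytes) {
       std::lock_guard<std::mutex> lock(mLock);
       mCommands.push_back(std::move(cmd));
       mBytes += approxBytes;
       // Mirrors the out-of-memory mitigation: if the command list grows
       // past a configurable limit, flush synchronously to the surface.
       if (mBytes > mFlushLimit && mSurface) {
         ReplayLocked(*mSurface);
       }
     }

     // Called on the paint thread once the transaction is forwarded.
     void Replay(DrawTarget& surface) {
       std::lock_guard<std::mutex> lock(mLock);
       ReplayLocked(surface);
     }

    private:
     void ReplayLocked(DrawTarget& surface) {
       for (auto& cmd : mCommands) cmd(surface);
       mCommands.clear();
       mBytes = 0;
     }

     std::mutex mLock;
     std::vector<std::function<void(DrawTarget&)>> mCommands;
     size_t mBytes = 0;
     size_t mFlushLimit;
     DrawTarget* mSurface = nullptr;
   };

With tiling, each tile gets its own capture/surface pair, so every replay
can be dispatched as an independent task to the worker thread pool, which
is what enables parallel painting.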
Synchronization
---------------

OMTP is conceptually simple, but in practice it relies on subtle code to
ensure thread safety. This was arguably the most difficult part of the
project.

There are roughly four areas that are critical.

1. Compositor message ordering

   Immediately after we queue the async paints to be asynchronously
   completed, we have a problem. We need to forward the layer
   transaction at some point, but the compositor cannot process the
   transaction until all async paints have finished. If it did, it could
   access unfinished painted content.

   We obviously can't block on the async paints completing, as that
   would defeat the whole point of OMTP. We also can't hold off on
   sending the layer transaction to ``IPDL``, as we'd trigger race
   conditions for messages sent after the layer transaction is built but
   before it is forwarded. Reftests and other code assume that messages
   sent after a layer transaction to the compositor are processed after
   that layer transaction is processed.

   The solution is to forward the layer transaction to the compositor
   over ``IPDL``, but flag the message channel to start postponing
   messages [9]_. Then, once all async paints have completed, we unflag
   the message channel and all postponed messages are sent [10]_. This
   allows us to keep our message ordering guarantees without having to
   worry about scheduling a runnable in the future.

2. Texture clients

   The backing store for content surfaces is managed by a texture
   client. While async paints are executing, it's possible for shutdown
   or any number of things to happen that could cause the layer manager,
   all layers, all content clients, and therefore all texture clients to
   be destroyed. Therefore it's important that we keep these texture
   clients alive throughout async painting. Texture clients also manage
   IPC resources and must be destroyed on the main thread, so we are
   careful to do that [11]_.

3. Double buffering

   We currently double buffer our content painting - our content clients
   only ever have zero or one texture that is available to be painted
   into at any moment.

   This implies that we cannot start async painting a layer tree while
   previous async paints are still active, as this would lead to awful
   races. We also don't support multiple nested sets of postponed IPC
   messages, which would be needed to send the first layer transaction
   to the compositor but not the second.

   To prevent issues with this, we flush all active async paints before
   we begin to paint a new layer transaction [12]_.

   There was some initial debate about implementing triple buffering for
   content painting, but we have not seen evidence it would help us
   significantly.

4. Moz2D thread safety

   Finally, most Moz2D objects were not thread safe. We had to insert
   special locking into draw target and source surface, as they have a
   special copy-on-write relationship that must be consistent even if
   they are on different threads.

   Some platform specific resources like fonts needed locking added in
   order to be thread safe. We also did some work to make filter nodes
   work with multiple threads executing them at the same time.

Browser process
---------------

Currently only content processes are able to use OMTP.

This restriction was added because of concern about message ordering
between ``APZ`` and OMTP. It might be possible to lift it in the future.

Important bugs
--------------

1.
`OMTP Meta <https://bugzilla.mozilla.org/show_bug.cgi?id=omtp>`__ +2. `Enable on + Windows <https://bugzilla.mozilla.org/show_bug.cgi?id=1403935>`__ +3. `Enable on + OSX <https://bugzilla.mozilla.org/show_bug.cgi?id=1422392>`__ +4. `Enable on + Linux <https://bugzilla.mozilla.org/show_bug.cgi?id=1432531>`__ +5. `Parallel + painting <https://bugzilla.mozilla.org/show_bug.cgi?id=1425056>`__ + +Code links +---------- + +.. [1] `DrawTargetCapture <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/2d/DrawTargetCapture.h#22>`__ +.. [2] `Creating DrawTargetCapture for rotated + buffer <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/client/ContentClient.cpp#185>`__ +.. [3] `Dispatch DrawTargetCapture for rotated + buffer <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/client/ClientPaintedLayer.cpp#99>`__ +.. [4] `Creating DrawTargetCapture for + tiling <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/client/TiledContentClient.cpp#714>`__ +.. [5] `Dispatch DrawTargetCapture for + tiling <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/client/MultiTiledContentClient.cpp#288>`__ +.. [6] `PaintThread <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/PaintThread.h#53>`__ +.. [7] `SourceSurfaceCapture <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/2d/SourceSurfaceCapture.h#19>`__ +.. [8] `Sync flushing draw + commands <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/2d/DrawTargetCapture.h#165>`__ +.. [9] `Postponing messages for + PCompositorBridge <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/ipc/CompositorBridgeChild.cpp#1319>`__ +.. [10] `Releasing messages for + PCompositorBridge <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/ipc/CompositorBridgeChild.cpp#1303>`__ +.. [11] `Releasing texture clients on main + thread <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/ipc/CompositorBridgeChild.cpp#1170>`__ +.. [12] `Flushing async + paints <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/client/ClientLayerManager.cpp#289>`__ diff --git a/gfx/docs/RenderingOverview.rst b/gfx/docs/RenderingOverview.rst new file mode 100644 index 0000000000..50b146d9b9 --- /dev/null +++ b/gfx/docs/RenderingOverview.rst @@ -0,0 +1,384 @@ +Rendering Overview +================== + +This document is an overview of the steps to render a webpage, and how HTML +gets transformed and broken down, step by step, into commands that can execute +on the GPU. + +If you're coming into the graphics team with not a lot of background +in browsers, start here :) + +.. contents:: + +High level overview +------------------- + +.. image:: RenderingOverviewSimple.png + :width: 100% + +Layout +~~~~~~ +Starting at the left in the above image, we have a document +represented by a DOM - a Document Object Model. A Javascript engine +will execute JS code, either to make changes to the DOM, or to respond to +events generated by the DOM (or do both). + +The DOM is a high level description and we don't know what to draw or +where until it is combined with a Cascading Style Sheet (CSS). 
+Combining these two and figuring out what, where and how to draw +things is the responsibility of the Layout team. The +DOM is converted into a hierarchical Frame Tree, which nests visual +elements (boxes). Each element points to some node in a Style Tree +that describes what it should look like -- color, transparency, etc. +The result is that now we know exactly what to render where, what goes +on top of what (layering and blending) and at what pixel coordinate. +This is the Display List. + +The Display List is a light-weight data structure because it's shallow +-- it mostly points back to the Frame Tree. There are two problems +with this. First, we want to cross process boundaries at this point. +Everything up until now happens in a Content Process (of which there are +several). Actual GPU rendering happens in a GPU Process (on some +platforms). Second, everything up until now was written in C++; but +WebRender is written in Rust. Thus the shallow Display List needs to +be serialized in a completely self-contained binary blob that will +survive Interprocess Communication (IPC) and a language switch (C++ to +Rust). The result is the WebRender Display List. + +WebRender +~~~~~~~~~ + +The GPU process receives the WebRender Display List blob and +de-serializes it into a Scene. This Scene contains more than the +strictly visible elements; for example, to anticipate scrolling, we +might have several paragraphs of text extending past the visible page. + +For a given viewport, the Scene gets culled and stripped down to a +Frame. This is also where we start preparing data structures for GPU +rendering, for example getting some font glyphs into an atlas for +rasterizing text. + +The final step takes the Frame and submits commands to the GPU to +actually render it. The GPU will execute the commands and composite +the final page. + +Software +~~~~~~~~ + +The above is the new WebRender-enabled way to do things. But in the +schematic you'll note a second branch towards the bottom: this is the +legacy code path which does not use WebRender (nor Rust). In this +case, the Display List is converted into a Layer Tree. The purpose of +this Tree is to try and avoid having to re-render absolutely +everything when the page needs to be refreshed. For example, when +scrolling we should be able to redraw the page by mostly shifting +things around. However that requires those 'things' to still be around +from last time we drew the page. In other words, visual elements that +are likely to be static and reusable need to be drawn into their own +private "page" (a cache). Then we can recombine (composite) all of +these when redrawing the actual page. + +Figuring out which elements would be good candidates for this, and +striking a balance between good performance versus excessive memory +use, is the purpose of the Layer Tree. Each 'layer' is a cached image +of some element(s). This logic also takes occlusion into account, eg. +don't allocate and render a layer for elements that are known to be +completely obscured by something in front of them. + +Redrawing the page by combining the Layer Tree with any newly +rasterized elements is the job of the Compositor. + + +Even when a layer cannot be reused in its entirety, it is likely +that only a small part of it was invalidated. Thus there is an +elaborate system for tracking dirty rectangles, starting an update by +copying the area that can be salvaged, and then redrawing only what +cannot. 
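As a rough illustration of the dirty-rectangle idea, here is a small,
self-contained sketch: a retained layer keeps its cached pixels,
accumulates invalidation rectangles, and on the next update re-rasterizes
only those areas while everything else is salvaged from the cache. The
types and names are placeholders invented for the example, not the
actual layers or invalidation classes:

.. code-block:: cpp

   #include <functional>
   #include <vector>

   struct Rect { int x, y, w, h; };

   // Illustrative model of a retained layer with dirty-rect tracking; the
   // real code tracks full regions and also handles buffer rotation or
   // tiling, but the principle is the same.
   struct RetainedLayer {
     std::vector<unsigned char> cachedPixels;  // the retained rasterization
     std::vector<Rect> dirtyRects;             // invalidated since last paint

     void Invalidate(const Rect& r) { dirtyRects.push_back(r); }

     void Update(const std::function<void(const Rect&,
                                          std::vector<unsigned char>&)>& rasterize) {
       // Everything outside the dirty rects is salvaged from the cache;
       // only the invalidated areas are redrawn.
       for (const Rect& r : dirtyRects) {
         rasterize(r, cachedPixels);
       }
       dirtyRects.clear();
     }
   };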
+ +In fact, this idea can be extended to delta-tracking of display lists +themselves. Traversing the layout tree and building a display list is +also not cheap, so the code tries to partially invalidate and rebuild +the display list incrementally when possible. +This optimization is used both for non-WebRender and WebRender in +fact. + + +Asynchronous Panning And Zooming +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Earlier we mentioned that a Scene might contain more elements than are +strictly necessary for rendering what's visible (the Frame). The +reason for that is Asynchronous Panning and Zooming, or APZ for short. +The browser will feel much more responsive if scrolling & zooming can +short-circuit all of these data transformations and IPC boundaries, +and instead directly update an offset of some layer and recomposite. +(Think of late-latching in a VR context) + +This simple idea introduces a lot of complexity: how much extra do you +rasterize, and in which direction? How much memory can we afford? +What about Javascript that responds to scroll events and perhaps does +something 'interesting' with the page in return? What about nested +frames or nested scrollbars? What if we scroll so much that we go +past the boundaries of the Scene that we know about? + +See AsyncPanZoom.rst for all that and more. + +A Few More Details +~~~~~~~~~~~~~~~~~~ + +Here's another schematic which basically repeats the previous one, but +showing a little bit more detail. Note that the direction is reversed +-- the data flow starts at the right. Sorry about that :) + +.. image:: RenderingOverviewDetail.png + :width: 100% + +Some things to note: + +- there are multiple content processes, currently 4 of them. This is + for security reasons (sandboxing), stability (isolate crashes) and + performance (multi-core machines); +- ideally each "webpage" would run in its own process for security; + this is being developed under the term 'fission'; +- there is only a single GPU process, if there is one at all; + some platforms have it as part of the Parent; +- not shown here is the Extension process that isolates WebExtensions; +- for non-WebRender, rasterization happens in the Content Process, and + we send entire Layers to the GPU/Compositor process (via shared + memory, only using actual IPC for its metadata like width & height); +- if the GPU process crashes (a bug or a driver issue) we can simply + restart it, resend the display list, and the browser itself doesn't crash; +- the browser UI is just another set of DOM+JS, albeit one that runs + with elevated privileges. That is, its JS can do things that + normal JS cannot. It lives in the Parent Process, which then uses + IPC to get it rendered, same as regular Content. (the IPC arrow also + goes to WebRender Display List but is omitted to reduce clutter); +- UI events get routed to APZ first, to minimize latency. By running + inside the GPU process, we may have access to data such + as rasterized clipping masks that enables finer grained hit testing; +- the GPU process talks back to the content process; in particular, + when APZ scrolls out of bounds, it asks Content to enlarge/shift the + Scene with a new "display port"; +- we still use the GPU when we can for compositing even in the + non-WebRender case; + + +WebRender In Detail +------------------- + +Converting a display list into GPU commands is broken down into a +number of steps and intermediate data structures. + + +.. image:: RenderingOverviewTrees.png + :width: 75% + :align: center + +.. 
+ + *Each element in the picture tree points to exactly one node in the spatial + tree. Only a few of these links are shown for clarity (the dashed lines).* + +The Picture Tree +~~~~~~~~~~~~~~~~ + +The incoming display list uses "stacking contexts". For example, to +render some text with a drop shadow, a display list will contain three +items: + +- "enable shadow" with some parameters such as shadow color, blur size, and offset; +- the text item; +- "pop all shadows" to deactivate shadows; + +WebRender will break this down into two distinct elements, or +"pictures". The first represents the shadow, so it contains a copy of the +text item, but modified to use the shadow's color, and to shift the +text by the shadow's offset. The second picture contains the original text +to draw on top of the shadow. + +The fact that the first picture, the shadow, needs to be blurred, is a +"compositing" property of the picture which we'll deal with later. + +Thus, the stack-based display list gets converted into a list of pictures +-- or more generally, a hierarchy of pictures, since items are nested +as per the original HTML. + +Example visual elements are a TextRun, a LineDecoration, or an Image +(like a .png file). + +Compared to 3D rendering, the picture tree is similar to a scenegraph: it's a +parent/child hierarchy of all the drawable elements that make up the "scene", in +this case the webpage. One important difference is that the transformations are +stored in a separate tree, the spatial tree. + +The Spatial Tree +~~~~~~~~~~~~~~~~ + +The nodes in the spatial tree represent coordinate transforms. Every time the +DOM hierarchy needs child elements to be transformed relative to their parent, +we add a new Spatial Node to the tree. All those child elements will then point +to this node as their "local space" reference (aka coordinate frame). In +traditional 3D terms, it's a scenegraph but only containing transform nodes. + +The nodes are called frames, as in "coordinate frame": + +- a Reference Frame corresponds to a ``<div>``; +- a Scrolling Frame corresponds to a scrollable part of the page; +- a Sticky Frame corresponds to some fixed position CSS style. + +Each element in the picture tree then points to a spatial node inside this tree, +so by walking up and down the tree we can find the absolute position of where +each element should render (traversing down) and how large each element needs to +be (traversing up). Originally the transform information was part of the +picture tree, as in a traditional scenegraph, but visual elements and their +transforms were split apart for technical reasons. + +Some of these nodes are dynamic. A scroll-frame can obviously scroll, but a +Reference Frame might also use a property binding to enable a live link with +JavaScript, for dynamic updates of (currently) the transform and opacity. + +Axis-aligned transformations (scales and translations) are considered "simple", +and are conceptually combined into a single "CoordinateSystem". When we +encounter a non-axis-aligned transform, we start a new CoordinateSystem. We +start in CoordinateSystem 0 at the root, and would bump this to CoordinateSystem +1 when we encounter a Reference Frame with a rotation or 3D transform, for +example. This would then be the CoordinateSystem index for all its children, +until we run into another (nested) non-simple transform, and so on. 
Roughly +speaking, as long as we're in the same CoordinateSystem, the transform stack is +simple enough that we have a reasonable chance of being able to flatten it. That +lets us directly rasterize text at its final scale for example, optimizing +away some of the intermediate pictures (offscreen textures). + +The layout code positions elements relative to their parent. Thus to position +the element on the actual page, we need to walk the Spatial Tree all the way to +the root and apply each transform; the result is a ``LayoutToWorldTransform``. + +One final step transforms from World to Device coordinates, which deals with +DPI scaling and such. + +.. csv-table:: + :header: "WebRender term", "Rough analogy" + + Spatial Tree, Scenegraph -- transforms only + Picture Tree, Scenegraph -- drawables only (grouping) + Spatial Tree Rootnode, World Space + Layout space, Local/Object Space + Picture, RenderTarget (sort of; see RenderTask below) + Layout-To-World transform, Local-To-World transform + World-To-Device transform, World-To-Clipspace transform + + +The Clip Tree +~~~~~~~~~~~~~ + +Finally, we also have a Clip Tree, which contains Clip Shapes. For +example, a rounded corner div will produce a clip shape, and since +divs can be nested, you end up with another tree. By pointing at a Clip Shape, +visual elements will be clipped against this shape plus all parent shapes above it +in the Clip Tree. + +As with CoordinateSystems, a chain of simple 2D clip shapes can be collapsed +into something that can be handled in the vertex shader, at very little extra +cost. More complex clips must be rasterized into a mask first, which we then +sample from to ``discard`` in the pixel shader as needed. + +In summary, at the end of scene building the display list turned into +a picture tree, plus a spatial tree that tells us what goes where +relative to what, plus a clip tree. + +RenderTask Tree +~~~~~~~~~~~~~~~ + +Now in a perfect world we could simply traverse the picture tree and start +drawing things: one drawcall per picture to render its contents, plus one +drawcall to draw the picture into its parent. However, recall that the first +picture in our example is a "text shadow" that needs to be blurred. We can't +just rasterize blurry text directly, so we need a number of steps or "render +passes" to get the intended effect: + +.. image:: RenderingOverviewBlurTask.png + :align: right + :height: 400px + +- rasterize the text into an offscreen rendertarget; +- apply one or more downscaling passes until the blur radius is reasonable; +- apply a horizontal Gaussian blur; +- apply a vertical Gaussian blur; +- use the result as an input for whatever comes next, or blit it to + its final position on the page (or more generally, on the containing + parent surface/picture). + +In the general case, which passes we need and how many of them depends +on how the picture is supposed to be composited (CSS filters, SVG +filters, effects) and its parameters (very large vs. small blur +radius, say). + +Thus, we walk the picture tree and build a render task tree: each high +level abstraction like "blur me" gets broken down into the necessary +render passes to get the effect. The result is again a tree because a +render pass can have multiple input dependencies (eg. blending). + +(Cfr. games, this has echoes of the Frostbite Framegraph in that it +dynamically builds up a renderpass DAG and dynamically allocates storage +for the outputs). 
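As a small illustration, here is a self-contained sketch of the render
task chain for the blurred text-shadow example above. The types and
names are placeholders invented for the example; the real render task
graph is built inside WebRender (in Rust) during frame building:

.. code-block:: cpp

   #include <memory>
   #include <string>
   #include <vector>

   // Illustrative model of a render task tree: each task names a render
   // pass and lists the tasks whose output it consumes.
   struct RenderTask {
     std::string pass;                                 // e.g. "blur-h"
     std::vector<std::shared_ptr<RenderTask>> inputs;  // dependencies
   };

   // Build the chain of passes needed for the blurred text shadow:
   // rasterize glyphs -> downscale -> horizontal blur -> vertical blur.
   std::shared_ptr<RenderTask> BuildBlurTask() {
     auto text      = std::make_shared<RenderTask>(RenderTask{"rasterize-text", {}});
     auto downscale = std::make_shared<RenderTask>(RenderTask{"downscale", {text}});
     auto blurH     = std::make_shared<RenderTask>(RenderTask{"blur-h", {downscale}});
     auto blurV     = std::make_shared<RenderTask>(RenderTask{"blur-v", {blurH}});
     return blurV;  // the parent picture samples from this task's output
   }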
+ +If there are complicated clip shapes that need to be rasterized first, +so their output can be sampled as a texture for clip/discard +operations, that would also end up in this tree as a dependency... (I think?). + +Once we have the entire tree of dependencies, we analyze it to see +which tasks can be combined into a single pass for efficiency. We +ping-pong rendertargets when we can, but sometimes the dependencies +cut across more than one level of the rendertask tree, and some +copying is necessary. + +Once we've figured out the passes and allocated storage for anything +we wish to persist in the texture cache, we finally start rendering. + +When rasterizing the elements into the Picture's offscreen texture, we'd +position them by walking the transform hierarchy as far up as the picture's +transform node, resulting in a ``Layout To Picture`` transform. The picture +would then go onto the page using a ``Picture To World`` coordinate transform. + +Caching +``````` + +Just as with layers in the software rasterizer, it is not always necessary to +redraw absolutely everything when parts of a document change. The webrender +equivalent of layers is Slices -- a grouping of pictures that are expected to +render and update together. Slices are automatically created based on +heuristics and layout hints/flags. + +Implementation wise, slices re-use a lot of the existing machinery for Pictures; +in fact they're implemented as a "Virtual picture" of sorts. The similarities +make sense: both need to allocate offscreen textures in a cache, both will +position and render all their children into it, and both then draw themselves +into their parent as part of the parent's draw. + +If a slice isn't expected to change much, we give it a TileCacheInstance. It is +itself made up of Tiles, where each tile will track what's in it, what's +changing, and if it needs to be invalidated and redrawn or not as a result. +Thus the "damage" from changes can be localized to single tiles, while we +salvage the rest of the cache. If tiles keep seeing a lot of invalidations, +they will recursively divide themselves in a quad-tree like structure to try and +localize the invalidations. (And conversely, they'll recombine children if +nothing is invalidating them "for a while"). + +Interning +````````` + +To spot invalidated tiles, we need a fast way to compare its contents from the +previous frame with the current frame. To speed this up, we use interning; +similar to string-interning, this means that each ``TextRun``, ``Decoration``, +``Image`` and so on is registered in a repository (a ``DataStore``) and +consequently referred to by its unique ID. Cache contents can then be encoded as a +list of IDs (one such list per internable element type). Diffing is then just a +fast list comparison. + + +Callbacks +````````` +GPU text rendering assumes that the individual font-glyphs are already +available in a texture atlas. Likewise SVG is not being rendered on +the GPU. Both inputs are prepared during scene building; glyph +rasterization via a thread pool from within Rust itself, and SVG via +opaque callbacks (back to C++) that produce blobs. 
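To close with a concrete picture of the interning idea described above:
drawables are registered once in a data store and referred to by small
IDs, so a tile's contents become a list of IDs and change detection is a
cheap list comparison. The sketch below is self-contained and the names
are placeholders, not the actual WebRender types:

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>
   #include <unordered_map>
   #include <vector>

   using ItemId = uint64_t;

   // Illustrative model of a DataStore: every drawable (TextRun, Image,
   // ...) is registered once and referred to by a small ID afterwards.
   struct DataStore {
     std::unordered_map<std::size_t, ItemId> byHash;  // content hash -> ID
     ItemId next = 1;

     ItemId Intern(std::size_t contentHash) {
       auto it = byHash.find(contentHash);
       if (it != byHash.end()) return it->second;     // already interned
       return byHash[contentHash] = next++;
     }
   };

   // A tile needs repainting if the list of interned IDs it contained
   // last frame differs from the list it would contain this frame.
   bool TileNeedsRepaint(const std::vector<ItemId>& previous,
                         const std::vector<ItemId>& current) {
     return previous != current;
   }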
diff --git a/gfx/docs/RenderingOverviewBlurTask.png b/gfx/docs/RenderingOverviewBlurTask.png Binary files differnew file mode 100644 index 0000000000..baffc08f32 --- /dev/null +++ b/gfx/docs/RenderingOverviewBlurTask.png diff --git a/gfx/docs/RenderingOverviewDetail.png b/gfx/docs/RenderingOverviewDetail.png Binary files differnew file mode 100644 index 0000000000..2909a811e4 --- /dev/null +++ b/gfx/docs/RenderingOverviewDetail.png diff --git a/gfx/docs/RenderingOverviewSimple.png b/gfx/docs/RenderingOverviewSimple.png Binary files differnew file mode 100644 index 0000000000..43c0a59439 --- /dev/null +++ b/gfx/docs/RenderingOverviewSimple.png diff --git a/gfx/docs/RenderingOverviewTrees.png b/gfx/docs/RenderingOverviewTrees.png Binary files differnew file mode 100644 index 0000000000..ffdf0812fa --- /dev/null +++ b/gfx/docs/RenderingOverviewTrees.png diff --git a/gfx/docs/Silk.rst b/gfx/docs/Silk.rst new file mode 100644 index 0000000000..45ec627a1e --- /dev/null +++ b/gfx/docs/Silk.rst @@ -0,0 +1,472 @@ +Silk Overview +========================== + +.. image:: SilkArchitecture.png + +Architecture +------------ + +Our current architecture is to align three components to hardware vsync +timers: + +1. Compositor +2. RefreshDriver / Painting +3. Input Events + +The flow of our rendering engine is as follows: + +1. Hardware Vsync event occurs on an OS specific *Hardware Vsync Thread* + on a per monitor basis. +2. The *Hardware Vsync Thread* attached to the monitor notifies the + ``CompositorVsyncDispatchers`` and ``RefreshTimerVsyncDispatcher``. +3. For every Firefox window on the specific monitor, notify a + ``CompositorVsyncDispatcher``. The ``CompositorVsyncDispatcher`` is + specific to one window. +4. The ``CompositorVsyncDispatcher`` notifies a + ``CompositorWidgetVsyncObserver`` when remote compositing, or a + ``CompositorVsyncScheduler::Observer`` when compositing in-process. +5. If remote compositing, a vsync notification is sent from the + ``CompositorWidgetVsyncObserver`` to the ``VsyncBridgeChild`` on the + UI process, which sends an IPDL message to the ``VsyncBridgeParent`` + on the compositor thread of the GPU process, which then dispatches to + ``CompositorVsyncScheduler::Observer``. +6. The ``RefreshTimerVsyncDispatcher`` notifies the Chrome + ``RefreshTimer`` that a vsync has occurred. +7. The ``RefreshTimerVsyncDispatcher`` sends IPC messages to all content + processes to tick their respective active ``RefreshTimer``. +8. The ``Compositor`` dispatches input events on the *Compositor + Thread*, then composites. Input events are only dispatched on the + *Compositor Thread* on b2g. +9. The ``RefreshDriver`` paints on the *Main Thread*. + +Hardware Vsync +-------------- + +Hardware vsync events from (1), occur on a specific ``Display`` Object. +The ``Display`` object is responsible for enabling / disabling vsync on +a per connected display basis. For example, if two monitors are +connected, two ``Display`` objects will be created, each listening to +vsync events for their respective displays. We require one ``Display`` +object per monitor as each monitor may have different vsync rates. As a +fallback solution, we have one global ``Display`` object that can +synchronize across all connected displays. The global ``Display`` is +useful if a window is positioned halfway between the two monitors. Each +platform will have to implement a specific ``Display`` object to hook +and listen to vsync events. 
As of this writing, both Firefox OS and OS X +create their own hardware specific *Hardware Vsync Thread* that executes +after a vsync has occurred. OS X creates one *Hardware Vsync Thread* per +``CVDisplayLinkRef``. We do not currently support multiple displays, so +we use one global ``CVDisplayLinkRef`` that works across all active +displays. On Windows, we have to create a new platform ``thread`` that +waits for DwmFlush(), which works across all active displays. Once the +thread wakes up from DwmFlush(), the actual vsync timestamp is retrieved +from DwmGetCompositionTimingInfo(), which is the timestamp that is +actually passed into the compositor and refresh driver. + +When a vsync occurs on a ``Display``, the *Hardware Vsync Thread* +callback fetches all ``CompositorVsyncDispatchers`` associated with the +``Display``. Each ``CompositorVsyncDispatcher`` is notified that a vsync +has occurred with the vsync’s timestamp. It is the responsibility of the +``CompositorVsyncDispatcher`` to notify the ``Compositor`` that is +awaiting vsync notifications. The ``Display`` will then notify the +associated ``RefreshTimerVsyncDispatcher``, which should notify all +active ``RefreshDrivers`` to tick. + +All ``Display`` objects are encapsulated in a ``VsyncSource`` object. +The ``VsyncSource`` object lives in ``gfxPlatform`` and is instantiated +only on the parent process when ``gfxPlatform`` is created. The +``VsyncSource`` is destroyed when ``gfxPlatform`` is destroyed. It can +also be destroyed when the layout frame rate pref (or other prefs that +influence frame rate) are changed. This may mean we switch from hardware +to software vsync (or vice versa) at runtime. During the switch, there +may briefly be 2 vsync sources. Otherwise, there is only one +``VsyncSource`` object throughout the entire lifetime of Firefox. Each +platform is expected to implement their own ``VsyncSource`` to manage +vsync events. On OS X, this is through ``CVDisplayLinkRef``. On +Windows, it should be through ``DwmGetCompositionTimingInfo``. + +Compositor +---------- + +When the ``CompositorVsyncDispatcher`` is notified of the vsync event, +the ``CompositorVsyncScheduler::Observer`` associated with the +``CompositorVsyncDispatcher`` begins execution. Since the +``CompositorVsyncDispatcher`` executes on the *Hardware Vsync Thread* +and the ``Compositor`` composites on the ``CompositorThread``, the +``CompositorVsyncScheduler::Observer`` posts a task to the +``CompositorThread``. The ``CompositorBridgeParent`` then composites. +The model where the ``CompositorVsyncDispatcher`` notifies components on +the *Hardware Vsync Thread*, and the component schedules the task on the +appropriate thread is used everywhere. + +The ``CompositorVsyncScheduler::Observer`` listens to vsync events as +needed and stops listening to vsync when composites are no longer +scheduled or required. Every ``CompositorBridgeParent`` is associated +and tied to one ``CompositorVsyncScheduler::Observer``, which is +associated with the ``CompositorVsyncDispatcher``. Each +``CompositorBridgeParent`` is associated with one widget and is created +when a new platform window or ``nsBaseWidget`` is created. The +``CompositorBridgeParent``, ``CompositorVsyncDispatcher``, +``CompositorVsyncScheduler::Observer``, and ``nsBaseWidget`` all have +the same lifetimes, which are created and destroyed together. + +Out-of-process Compositors +-------------------------- + +When compositing out-of-process, this model changes slightly. 
In this +case there are effectively two observers: a UI process observer +(``CompositorWidgetVsyncObserver``), and the +``CompositorVsyncScheduler::Observer`` in the GPU process. There are +also two dispatchers: the widget dispatcher in the UI process +(``CompositorVsyncDispatcher``), and the IPDL-based dispatcher in the +GPU process (``CompositorBridgeParent::NotifyVsync``). The UI process +observer and the GPU process dispatcher are linked via an IPDL protocol +called PVsyncBridge. ``PVsyncBridge`` is a top-level protocol for +sending vsync notifications to the compositor thread in the GPU process. +The compositor controls vsync observation through a separate actor, +``PCompositorWidget``, which (as a subactor for +``CompositorBridgeChild``) links the compositor thread in the GPU +process to the main thread in the UI process. + +Out-of-process compositors do not go through +``CompositorVsyncDispatcher`` directly. Instead, the +``CompositorWidgetDelegate`` in the UI process creates one, and gives it +a ``CompositorWidgetVsyncObserver``. This observer forwards +notifications to a Vsync I/O thread, where ``VsyncBridgeChild`` then +forwards the notification again to the compositor thread in the GPU +process. The notification is received by a ``VsyncBridgeParent``. The +GPU process uses the layers ID in the notification to find the correct +compositor to dispatch the notification to. + +CompositorVsyncDispatcher +------------------------- + +The ``CompositorVsyncDispatcher`` executes on the *Hardware Vsync +Thread*. It contains references to the ``nsBaseWidget`` it is associated +with and has a lifetime equal to the ``nsBaseWidget``. The +``CompositorVsyncDispatcher`` is responsible for notifying the +``CompositorBridgeParent`` that a vsync event has occurred. There can be +multiple ``CompositorVsyncDispatchers`` per ``Display``, one +``CompositorVsyncDispatcher`` per window. The only responsibility of the +``CompositorVsyncDispatcher`` is to notify components when a vsync event +has occurred, and to stop listening to vsync when no components require +vsync events. We require one ``CompositorVsyncDispatcher`` per window so +that we can handle multiple ``Displays``. When compositing in-process, +the ``CompositorVsyncDispatcher`` is attached to the CompositorWidget +for the window. When out-of-process, it is attached to the +CompositorWidgetDelegate, which forwards observer notifications over +IPDL. In the latter case, its lifetime is tied to a CompositorSession +rather than the nsIWidget. + +Multiple Displays +----------------- + +The ``VsyncSource`` has an API to switch a ``CompositorVsyncDispatcher`` +from one ``Display`` to another ``Display``. For example, when one +window either goes into full screen mode or moves from one connected +monitor to another. When one window moves to another monitor, we expect +a platform specific notification to occur. The detection of when a +window enters full screen mode or moves is not covered by Silk itself, +but the framework is built to support this use case. The expected flow +is that the OS notification occurs on ``nsIWidget``, which retrieves the +associated ``CompositorVsyncDispatcher``. The +``CompositorVsyncDispatcher`` then notifies the ``VsyncSource`` to +switch to the correct ``Display`` the ``CompositorVsyncDispatcher`` is +connected to. Because the notification works through the ``nsIWidget``, +the actual switching of the ``CompositorVsyncDispatcher`` to the correct +``Display`` should occur on the *Main Thread*. 
The current
implementation of Silk does not handle this case and needs to be built
out.

CompositorVsyncScheduler::Observer
----------------------------------

The ``CompositorVsyncScheduler::Observer`` handles the vsync
notifications and interactions with the ``CompositorVsyncDispatcher``.
When the ``Compositor`` requires a scheduled composite, it notifies the
``CompositorVsyncScheduler::Observer`` that it needs to listen to vsync.
The ``CompositorVsyncScheduler::Observer`` then observes / unobserves
vsync as needed from the ``CompositorVsyncDispatcher`` to enable
composites.

GeckoTouchDispatcher
--------------------

The ``GeckoTouchDispatcher`` is a singleton that resamples touch events
to smooth out jank while tracking a user's finger. Because input and
composite are linked together, the
``CompositorVsyncScheduler::Observer`` has a reference to the
``GeckoTouchDispatcher`` and vice versa.

Input Events
------------

One large goal of Silk is to align touch events with vsync events. On
Firefox OS, touchscreens often have different touch scan rates than the
display refresh rate. A Flame device has a touch refresh rate of 75 Hz
and a Nexus 4 has a touch refresh rate of 100 Hz, while the device's
display refresh rate is 60 Hz. When a vsync event occurs, we resample
touch events, and then dispatch the resampled touch event to APZ. Touch
events on Firefox OS occur on a *Touch Input Thread*, whereas they are
processed by APZ on the *APZ Controller Thread*. We use `Google
Android's touch
resampling <https://web.archive.org/web/20200909082458/http://www.masonchang.com/blog/2014/8/25/androids-touch-resampling-algorithm>`__
algorithm to resample touch events.

Currently, we have a strict ordering between composites and touch
events. When a touch event occurs on the *Touch Input Thread*, we store
the touch event in a queue. When a vsync event occurs, the
``CompositorVsyncDispatcher`` notifies the ``Compositor`` of a vsync
event, which notifies the ``GeckoTouchDispatcher``. The
``GeckoTouchDispatcher`` processes the touch event first on the *APZ
Controller Thread*, which is the same as the *Compositor Thread* on b2g,
then the ``Compositor`` finishes compositing. We require this strict
ordering because if a vsync notification were dispatched to both the
``Compositor`` and the ``GeckoTouchDispatcher`` at the same time, there
would be a race between processing the touch event (and therefore the
updated position) and compositing. In practice, this creates very janky
scrolling. As of this writing, we have not analyzed input events on
desktop platforms.

One slight quirk is that input events can start a composite, for example
during a scroll, after the ``Compositor`` is no longer listening to
vsync events. In these cases, we notify the ``Compositor`` to observe
vsync so that it dispatches touch events. If touch events were not
dispatched, and since the ``Compositor`` is not listening to vsync
events, the touch events would never be dispatched. The
``GeckoTouchDispatcher`` handles this case by always forcing the
``Compositor`` to listen to vsync events while touch events are
occurring.

Widget, Compositor, CompositorVsyncDispatcher, GeckoTouchDispatcher Shutdown Procedure
----------------------------------------------------------------------------------------

When the `nsBaseWidget shuts
down <https://hg.mozilla.org/mozilla-central/file/0df249a0e4d3/widget/nsBaseWidget.cpp#l182>`__,
it calls nsBaseWidget::DestroyCompositor on the *Gecko Main Thread*.
+During nsBaseWidget::DestroyCompositor, it first destroys the +CompositorBridgeChild. CompositorBridgeChild sends a sync IPC call to +CompositorBridgeParent::RecvStop, which calls +`CompositorBridgeParent::Destroy <https://hg.mozilla.org/mozilla-central/file/ab0490972e1e/gfx/layers/ipc/CompositorParent.cpp#l509>`__. +During this time, the *main thread* is blocked on the parent process. +CompositorBridgeParent::RecvStop runs on the *Compositor thread* and +cleans up some resources, including setting the +``CompositorVsyncScheduler::Observer`` to nullptr. +CompositorBridgeParent::RecvStop also explicitly keeps the +CompositorBridgeParent alive and posts another task to run +CompositorBridgeParent::DeferredDestroy on the Compositor loop so that +all ipdl code can finish executing. The +``CompositorVsyncScheduler::Observer`` also unobserves from vsync and +cancels any pending composite tasks. Once +CompositorBridgeParent::RecvStop finishes, the *main thread* in the +parent process continues shutting down the nsBaseWidget. + +At the same time, the *Compositor thread* is executing tasks until +CompositorBridgeParent::DeferredDestroy runs, which flushes the +compositor message loop. Now we have two tasks as both the nsBaseWidget +releases a reference to the Compositor on the *main thread* during +destruction and the CompositorBridgeParent::DeferredDestroy releases a +reference to the CompositorBridgeParent on the *Compositor Thread*. +Finally, the CompositorBridgeParent itself is destroyed on the *main +thread* once both references are gone due to explicit `main thread +destruction <https://hg.mozilla.org/mozilla-central/file/50b95032152c/gfx/layers/ipc/CompositorParent.h#l148>`__. + +With the ``CompositorVsyncScheduler::Observer``, any accesses to the +widget after nsBaseWidget::DestroyCompositor executes are invalid. Any +accesses to the compositor between the time the +nsBaseWidget::DestroyCompositor runs and the +CompositorVsyncScheduler::Observer’s destructor runs aren’t safe yet a +hardware vsync event could occur between these times. Since any tasks +posted on the Compositor loop after +CompositorBridgeParent::DeferredDestroy is posted are invalid, we make +sure that no vsync tasks can be posted once +CompositorBridgeParent::RecvStop executes and DeferredDestroy is posted +on the Compositor thread. When the sync call to +CompositorBridgeParent::RecvStop executes, we explicitly set the +CompositorVsyncScheduler::Observer to null to prevent vsync +notifications from occurring. If vsync notifications were allowed to +occur, since the ``CompositorVsyncScheduler::Observer``\ ’s vsync +notification executes on the *hardware vsync thread*, it would post a +task to the Compositor loop and may execute after +CompositorBridgeParent::DeferredDestroy. Thus, we explicitly shut down +vsync events in the ``CompositorVsyncDispatcher`` and +``CompositorVsyncScheduler::Observer`` during nsBaseWidget::Shutdown to +prevent any vsync tasks from executing after +CompositorBridgeParent::DeferredDestroy. + +The ``CompositorVsyncDispatcher`` may be destroyed on either the *main +thread* or *Compositor Thread*, since both the nsBaseWidget and +``CompositorVsyncScheduler::Observer`` race to destroy on different +threads. nsBaseWidget is destroyed on the *main thread* and releases a +reference to the ``CompositorVsyncDispatcher`` during destruction. 
The +``CompositorVsyncScheduler::Observer`` has a race to be destroyed either +during CompositorBridgeParent shutdown or from the +``GeckoTouchDispatcher`` which is destroyed on the main thread with +`ClearOnShutdown <https://hg.mozilla.org/mozilla-central/file/21567e9a6e40/xpcom/base/ClearOnShutdown.h#l15>`__. +Whichever object, the CompositorBridgeParent or the +``GeckoTouchDispatcher`` is destroyed last will hold the last reference +to the ``CompositorVsyncDispatcher``, which destroys the object. + +Refresh Driver +-------------- + +The Refresh Driver is ticked from a `single active +timer <https://hg.mozilla.org/mozilla-central/file/ab0490972e1e/layout/base/nsRefreshDriver.cpp#l11>`__. +The assumption is that there are multiple ``RefreshDrivers`` connected +to a single ``RefreshTimer``. There are two ``RefreshTimers``: an active +and an inactive ``RefreshTimer``. Each Tab has its own +``RefreshDriver``, which connects to one of the global +``RefreshTimers``. The ``RefreshTimers`` execute on the *Main Thread* +and tick their connected ``RefreshDrivers``. We do not want to break +this model of multiple ``RefreshDrivers`` per a set of two global +``RefreshTimers``. Each ``RefreshDriver`` switches between the active +and inactive ``RefreshTimer``. + +Instead, we create a new ``RefreshTimer``, the ``VsyncRefreshTimer`` +which ticks based on vsync messages. We replace the current active timer +with a ``VsyncRefreshTimer``. All tabs will then tick based on this new +active timer. Since the ``RefreshTimer`` has a lifetime of the process, +we only need to create a single ``RefreshTimerVsyncDispatcher`` per +``Display`` when Firefox starts. Even if we do not have any content +processes, the Chrome process will still need a ``VsyncRefreshTimer``, +thus we can associate the ``RefreshTimerVsyncDispatcher`` with each +``Display``. + +When Firefox starts, we initially create a new ``VsyncRefreshTimer`` in +the Chrome process. The ``VsyncRefreshTimer`` will listen to vsync +notifications from ``RefreshTimerVsyncDispatcher`` on the global +``Display``. When nsRefreshDriver::Shutdown executes, it will delete the +``VsyncRefreshTimer``. This creates a problem as all the +``RefreshTimers`` are currently manually memory managed whereas +``VsyncObservers`` are ref counted. To work around this problem, we +create a new ``RefreshDriverVsyncObserver`` as an inner class to +``VsyncRefreshTimer``, which actually receives vsync notifications. It +then ticks the ``RefreshDrivers`` inside ``VsyncRefreshTimer``. + +With Content processes, the start up process is more complicated. We +send vsync IPC messages via the use of the PBackground thread on the +parent process, which allows us to send messages from the Parent +process’ without waiting on the *main thread*. This sends messages from +the Parent::\ *PBackground Thread* to the Child::\ *Main Thread*. The +*main thread* receiving IPC messages on the content process is +acceptable because ``RefreshDrivers`` must execute on the *main thread*. +However, there is some amount of time required to setup the IPC +connection upon process creation and during this time, the +``RefreshDrivers`` must tick to set up the process. To get around this, +we initially use software ``RefreshTimers`` that already exist during +content process startup and swap in the ``VsyncRefreshTimer`` once the +IPC connection is created. + +During nsRefreshDriver::ChooseTimer, we create an async PBackground IPC +open request to create a ``VsyncParent`` and ``VsyncChild``. 
At the same
time, we create a software ``RefreshTimer`` and tick the
``RefreshDrivers`` as normal. Once the PBackground callback is executed
and an IPC connection exists, we take all ``RefreshDrivers`` currently
associated with the active ``RefreshTimer`` and swap them over to the
``VsyncRefreshTimer``. Since all interactions on the content process
occur on the main thread, there is no need for locks. The ``VsyncParent``
listens to vsync events through the ``RefreshTimerVsyncDispatcher`` on
the parent side and sends vsync IPC messages to the ``VsyncChild``. The
``VsyncChild`` notifies the ``VsyncRefreshTimer`` on the content process.

During the shutdown process of the content process, ActorDestroy is
called on the ``VsyncChild`` and ``VsyncParent`` due to the normal
PBackground shutdown process. Once ActorDestroy is called, no IPC
messages should be sent across the channel. After ActorDestroy is
called, the IPDL machinery will delete the **VsyncParent/Child** pair.
The ``VsyncParent``, due to being a ``VsyncObserver``, is ref counted.
After ``VsyncParent::ActorDestroy`` is called, it unregisters itself
from the ``RefreshTimerVsyncDispatcher``, which holds the last reference
to the ``VsyncParent``, and the object will be deleted.

Thus the overall flow during normal execution is:

1. VsyncSource::Display::RefreshTimerVsyncDispatcher receives a vsync
   notification from the OS in the parent process.
2. RefreshTimerVsyncDispatcher notifies
   VsyncRefreshTimer::RefreshDriverVsyncObserver on the hardware vsync
   thread that a vsync occurred in the parent process.
3. RefreshTimerVsyncDispatcher notifies the VsyncParent on the hardware
   vsync thread that a vsync occurred.
4. The VsyncRefreshTimer::RefreshDriverVsyncObserver in the parent
   process posts a task to the main thread that ticks the refresh
   drivers.
5. VsyncParent posts a task to the PBackground thread to send a vsync
   IPC message to VsyncChild.
6. VsyncChild receives the vsync notification on the main thread of the
   content process and ticks its respective RefreshDrivers.

Compressing Vsync Messages
--------------------------

Vsync messages occur quite often and the *main thread* can be busy for
long periods of time due to JavaScript. Consistently sending vsync
messages to the refresh driver timer can flood the *main thread* with
refresh driver ticks, causing even more delays. To avoid this problem,
we compress vsync messages on both the parent and child processes.

On the parent process, newer vsync messages update a vsync timestamp but
do not actually queue any tasks on the *main thread*. Once the parent
process' *main thread* executes the refresh driver tick, it uses the
most recent vsync timestamp to tick the refresh driver. After the
refresh driver has ticked, a single vsync message is queued for another
refresh driver tick task. On the content process, the IPDL ``compress``
keyword automatically compresses IPC messages.

Multiple Monitors
-----------------

In order to have multiple monitor support for the ``RefreshDrivers``, we
have multiple active ``RefreshTimers``. Each ``RefreshTimer`` is
associated with a specific ``Display`` via an id and ticks when its
respective ``Display`` vsync occurs. We have **N RefreshTimers**, where
N is the number of connected displays. Each ``RefreshTimer`` still has
multiple ``RefreshDrivers``.

When a tab or window changes monitors, the ``nsIWidget`` receives a
display changed notification.
Based on which display the window is on, +the window switches to the correct ``RefreshTimerVsyncDispatcher`` and +``CompositorVsyncDispatcher`` on the parent process based on the display +id. Each ``TabParent`` should also send a notification to their child. +Each ``TabChild``, given the display ID, switches to the correct +``RefreshTimer`` associated with the display ID. When each display vsync +occurs, it sends one IPC message to notify vsync. The vsync message +contains a display ID, to tick the appropriate ``RefreshTimer`` on the +content process. There is still only one **VsyncParent/VsyncChild** +pair, just each vsync notification will include a display ID, which maps +to the correct ``RefreshTimer``. + +Object Lifetime +--------------- + +1. CompositorVsyncDispatcher - Lives as long as the nsBaseWidget + associated with the VsyncDispatcher +2. CompositorVsyncScheduler::Observer - Lives and dies the same time as + the CompositorBridgeParent. +3. RefreshTimerVsyncDispatcher - As long as the associated display + object, which is the lifetime of Firefox. +4. VsyncSource - Lives as long as the gfxPlatform on the chrome process, + which is the lifetime of Firefox. +5. VsyncParent/VsyncChild - Lives as long as the content process +6. RefreshTimer - Lives as long as the process + +Threads +------- + +All ``VsyncObservers`` are notified on the *Hardware Vsync Thread*. It +is the responsibility of the ``VsyncObservers`` to post tasks to their +respective correct thread. For example, the +``CompositorVsyncScheduler::Observer`` will be notified on the *Hardware +Vsync Thread*, and post a task to the *Compositor Thread* to do the +actual composition. + +1. Compositor Thread - Nothing changes +2. Main Thread - PVsyncChild receives IPC messages on the main thread. + We also enable/disable vsync on the main thread. +3. PBackground Thread - Creates a connection from the PBackground thread + on the parent process to the main thread in the content process. +4. Hardware Vsync Thread - Every platform is different, but we always + have the concept of a hardware vsync thread. Sometimes this is + actually created by the host OS. On Windows, we have to create a + separate platform thread that blocks on DwmFlush(). diff --git a/gfx/docs/SilkArchitecture.png b/gfx/docs/SilkArchitecture.png Binary files differnew file mode 100644 index 0000000000..938c585e40 --- /dev/null +++ b/gfx/docs/SilkArchitecture.png diff --git a/gfx/docs/index.rst b/gfx/docs/index.rst new file mode 100644 index 0000000000..223ae0f02a --- /dev/null +++ b/gfx/docs/index.rst @@ -0,0 +1,18 @@ +Graphics +======== + +This collection of linked pages contains design documents for the +Mozilla graphics architecture. The design documents live in gfx/docs directory. + +This `wiki page <https://wiki.mozilla.org/Platform/GFX>`__ contains +information about graphics and the graphics team at Mozilla. + +.. toctree:: + :maxdepth: 1 + + GraphicsOverview + LayersHistory + OffMainThreadPainting + AsyncPanZoom + AdvancedLayers + Silk |