Diffstat (limited to 'gfx/docs')

 gfx/docs/AdvancedLayers.rst            | 370
 gfx/docs/AsyncPanZoom.rst              | 687
 gfx/docs/AsyncPanZoomArchitecture.png  | bin 0 -> 67837 bytes
 gfx/docs/GraphicsOverview.rst          | 159
 gfx/docs/LayersHistory.rst             |  63
 gfx/docs/OffMainThreadPainting.rst     | 237
 gfx/docs/RenderingOverview.rst         | 384
 gfx/docs/RenderingOverviewBlurTask.png | bin 0 -> 16264 bytes
 gfx/docs/RenderingOverviewDetail.png   | bin 0 -> 148839 bytes
 gfx/docs/RenderingOverviewSimple.png   | bin 0 -> 54981 bytes
 gfx/docs/RenderingOverviewTrees.png    | bin 0 -> 80062 bytes
 gfx/docs/Silk.rst                      | 472
 gfx/docs/SilkArchitecture.png          | bin 0 -> 221047 bytes
 gfx/docs/index.rst                     |  18

 14 files changed, 2390 insertions, 0 deletions
diff --git a/gfx/docs/AdvancedLayers.rst b/gfx/docs/AdvancedLayers.rst new file mode 100644 index 0000000000..b4bcc132cb --- /dev/null +++ b/gfx/docs/AdvancedLayers.rst @@ -0,0 +1,370 @@ +Advanced Layers +=============== + +Advanced Layers is a new method of compositing layers in Gecko. This +document serves as a technical overview and provides a short +walk-through of its source code. + +Overview +-------- + +Advanced Layers attempts to group as many GPU operations as it can into +a single draw call. This is a common technique in GPU-based rendering +called “batching”. It is not always trivial, as a batching algorithm can +easily waste precious CPU resources trying to build optimal draw calls. + +Advanced Layers reuses the existing Gecko layers system as much as +possible. Huge layer trees do not currently scale well (see the future +work section), so opportunities for batching are currently limited +without expending unnecessary resources elsewhere. However, Advanced +Layers has a few benefits: + +- It submits smaller GPU workloads and buffer uploads than the existing + compositor. +- It needs only a single pass over the layer tree. +- It uses occlusion information more intelligently. +- It is easier to add new specialized rendering paths and new layer + types. +- It separates compositing logic from device logic, unlike the existing + compositor. +- It is much faster at rendering 3d scenes or complex layer trees. +- It has experimental code to use the z-buffer for occlusion culling. + +Because of these benefits we hope that it provides a significant +improvement over the existing compositor. + +Advanced Layers uses the acronym “MLG” and “MLGPU” in many places. This +stands for “Mid-Level Graphics”, the idea being that it is optimized for +Direct3D 11-style rendering systems as opposed to Direct3D 12 or Vulkan. + +LayerManagerMLGPU +----------------- + +Advanced layers does not change client-side rendering at all. Content +still uses Direct2D (when possible), and creates identical layer trees +as it would with a normal Direct3D 11 compositor. In fact, Advanced +Layers re-uses all of the existing texture handling and video +infrastructure as well, replacing only the composite-side layer types. + +Advanced Layers does not create a ``LayerManagerComposite`` - instead, +it creates a ``LayerManagerMLGPU``. This layer manager does not have a +``Compositor`` - instead, it has an ``MLGDevice``, which roughly +abstracts the Direct3D 11 API. (The hope is that this API is easily +interchangeable for something else when cross-platform or software +support is needed.) + +``LayerManagerMLGPU`` also dispenses with the old “composite” layers for +new layer types. For example, ``ColorLayerComposite`` becomes +``ColorLayerMLGPU``. Since these layer types implement ``HostLayer``, +they integrate with ``LayerTransactionParent`` as normal composite +layers would. + +Rendering Overview +------------------ + +The steps for rendering are described in more detail below, but roughly +the process is: + +1. Sort layers front-to-back. +2. Create a dependency tree of render targets (called “views”). +3. Accumulate draw calls for all layers in each view. +4. Upload draw call buffers to the GPU. +5. Execute draw commands for each view. + +Advanced Layers divides the layer tree into “views” +(``RenderViewMLGPU``), which correspond to a render target. The root +layer is represented by a view corresponding to the screen. Layers that +require intermediate surfaces have temporary views. 
Layers are analyzed +front-to-back, and rendered back-to-front within a view. Views +themselves are rendered front-to-back, to minimize render target +switching. + +Each view contains one or more rendering passes (``RenderPassMLGPU``). A +pass represents a single draw command with one or more rendering items +attached to it. For example, a ``SolidColorPass`` item contains a +rectangle and an RGBA value, and many of these can be drawn with a +single GPU call. + +When considering a layer, views will first try to find an existing +rendering batch that can support it. If so, that pass will accumulate +another draw item for the layer. Otherwise, a new pass will be added. + +When trying to find a matching pass for a layer, there is a tradeoff in +CPU time versus the GPU time saved by not issuing another draw commands. +We generally care more about CPU time, so we do not try too hard in +matching items to an existing batch. + +After all layers have been processed, there is a “prepare” step. This +copies all accumulated draw data and uploads it into vertex and constant +buffers in the GPU. + +Finally, we execute rendering commands. At the end of the frame, all +batches and (most) constant buffers are thrown away. + +Shaders Overview +---------------- + +Advanced Layers currently has five layer-related shader pipelines: + +- Textured (PaintedLayer, ImageLayer, CanvasLayer) +- ComponentAlpha (PaintedLayer with component-alpha) +- YCbCr (ImageLayer with YCbCr video) +- Color (ColorLayers) +- Blend (ContainerLayers with mix-blend modes) + +There are also three special shader pipelines: + +- MaskCombiner, which is used to combine mask layers into a single + texture. +- Clear, which is used for fast region-based clears when not directly + supported by the GPU. +- Diagnostic, which is used to display the diagnostic overlay texture. + +The layer shaders follow a unified structure. Each pipeline has a vertex +and pixel shader. The vertex shader takes a layers ID, a z-buffer depth, +a unit position in either a unit square or unit triangle, and either +rectangular or triangular geometry. Shaders can also have ancillary data +needed like texture coordinates or colors. + +Most of the time, layers have simple rectangular clips with simple +rectilinear transforms, and pixel shaders do not need to perform masking +or clipping. For these layers we use a fast-path pipeline, using +unit-quad shaders that are able to clip geometry so the pixel shader +does not have to. This type of pipeline does not support complex masks. + +If a layer has a complex mask, a rotation or 3d transform, or a complex +operation like blending, then we use shaders capable of handling +arbitrary geometry. Their input is a unit triangle, and these shaders +are generally more expensive. + +All of the shader-specific data is modelled in ShaderDefinitionsMLGPU.h. + +CPU Occlusion Culling +--------------------- + +By default, Advanced Layers performs occlusion culling on the CPU. Since +layers are visited front-to-back, this is simply a matter of +accumulating the visible region of opaque layers, and subtracting it +from the visible region of subsequent layers. There is a major +difference between this occlusion culling and PostProcessLayers of the +old compositor: AL performs culling after invalidation, not before. +Completely valid layers will have an empty visible region. + +Most layer types (with the exception of images) will intelligently split +their draw calls into a batch of individual rectangles, based on their +visible region. 
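
For illustration, a minimal sketch of this front-to-back culling pass might
look like the following. The ``Region`` type here is a deliberately simplified
stand-in (a set of covered cells) for Gecko's ``nsIntRegion``, and the
``Layer`` fields are illustrative rather than the real MLGPU interfaces::

    // Sketch of front-to-back CPU occlusion culling. "Region" is a toy
    // stand-in for nsIntRegion so the example stays self-contained.
    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <vector>

    using Region = std::set<int>;  // set of covered "cells"

    struct Layer {
      Region visible;       // visible region of this layer
      bool opaque = false;  // whether the layer's content is opaque
    };

    void CullOccluded(std::vector<Layer>& aLayersFrontToBack) {
      Region occluded;  // union of opaque regions seen so far
      for (Layer& layer : aLayersFrontToBack) {
        // Remove everything already covered by opaque layers in front of us;
        // fully occluded layers end up with an empty visible region and
        // therefore need no draw calls.
        Region remaining;
        std::set_difference(layer.visible.begin(), layer.visible.end(),
                            occluded.begin(), occluded.end(),
                            std::inserter(remaining, remaining.begin()));
        layer.visible = std::move(remaining);

        // Opaque layers occlude whatever is behind them.
        if (layer.opaque) {
          occluded.insert(layer.visible.begin(), layer.visible.end());
        }
      }
    }

The real code works on pixel regions and, as noted above, runs after
invalidation rather than before it.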
+ +Z-Buffering and Occlusion +------------------------- + +Advanced Layers also supports occlusion culling on the GPU, using a +z-buffer. This is disabled by default currently since it is +significantly costly on integrated GPUs. When using the z-buffer, we +separate opaque layers into a separate list of passes. The render +process then uses the following steps: + +1. The depth buffer is set to read-write. +2. Opaque batches are executed., +3. The depth buffer is set to read-only. +4. Transparent batches are executed. + +The problem we have observed is that the depth buffer increases writes +to the GPU, and on integrated GPUs this is expensive - we have seen draw +call times increase by 20-30%, which is the wrong direction we want to +take on battery life. In particular on a full screen video, the call to +ClearDepthStencilView plus the actual depth buffer write of the video +can double GPU time. + +For now the depth-buffer is disabled until we can find a compelling case +for it on non-integrated hardware. + +Clipping +-------- + +Clipping is a bit tricky in Advanced Layers. We cannot use the hardware +“scissor” feature, since the clip can change from instance to instance +within a batch. And if using the depth buffer, we cannot write +transparent pixels for the clipped area. As a result we always clip +opaque draw rects in the vertex shader (and sometimes even on the CPU, +as is needed for sane texture coordinates). Only transparent items are +clipped in the pixel shader. As a result, masked layers and layers with +non-rectangular transforms are always considered transparent, and use a +more flexible clipping pipeline. + +Plane Splitting +--------------- + +Plane splitting is when a 3D transform causes a layer to be split - for +example, one transparent layer may intersect another on a separate +plane. When this happens, Gecko sorts layers using a BSP tree and +produces a list of triangles instead of draw rects. + +These layers cannot use the “unit quad” shaders that support the fast +clipping pipeline. Instead they always use the full triangle-list +shaders that support extended vertices and clipping. + +This is the slowest path we can take when building a draw call, since we +must interact with the polygon clipping and texturing code. + +Masks +----- + +For each layer with a mask attached, Advanced Layers builds a +``MaskOperation``. These operations must resolve to a single mask +texture, as well as a rectangular area to which the mask applies. All +batched pixel shaders will automatically clip pixels to the mask if a +mask texture is bound. (Note that we must use separate batches if the +mask texture changes.) + +Some layers have multiple mask textures. In this case, the MaskOperation +will store the list of masks, and right before rendering, it will invoke +a shader to combine these masks into a single texture. + +MaskOperations are shared across layers when possible, but are not +cached across frames. + +BigImage Support +---------------- + +ImageLayers and CanvasLayers can be tiled with many individual textures. +This happens in rare cases where the underlying buffer is too big for +the GPU. Early on this caused problems for Advanced Layers, since AL +required one texture per layer. We implemented BigImage support by +creating temporary ImageLayers for each visible tile, and throwing those +layers away at the end of the frame. + +Advanced Layers no longer has a 1:1 layer:texture restriction, but we +retain the temporary layer solution anyway. 
It is not much code and it +means we do not have to split ``TexturedLayerMLGPU`` methods into +iterated and non-iterated versions. + +Texture Locking +--------------- + +Advanced Layers has a different texture locking scheme than the existing +compositor. If a texture needs to be locked, then it is locked by the +MLGDevice automatically when bound to the current pipeline. The +MLGDevice keeps a set of the locked textures to avoid double-locking. At +the end of the frame, any textures in the locked set are unlocked. + +We cannot easily replicate the locking scheme in the old compositor, +since the duration of using the texture is not scoped to when we visit +the layer. + +Buffer Measurements +------------------- + +Advanced Layers uses constant buffers to send layer information and +extended instance data to the GPU. We do this by pre-allocating large +constant buffers and mapping them with ``MAP_DISCARD`` at the beginning +of the frame. Batches may allocate into this up to the maximum bindable +constant buffer size of the device (currently, 64KB). + +There are some downsides to this approach. Constant buffers are +difficult to work with - they have specific alignment requirements, and +care must be taken not too run over the maximum number of constants in a +buffer. Another approach would be to store constants in a 2D texture and +use vertex shader texture fetches. Advanced Layers implemented this and +benchmarked it to decide which approach to use. Textures seemed to skew +better on GPU performance, but worse on CPU, but this varied depending +on the GPU. Overall constant buffers performed best and most +consistently, so we have kept them. + +Additionally, we tested different ways of performing buffer uploads. +Buffer creation itself is costly, especially on integrated GPUs, and +especially so for immutable, immediate-upload buffers. As a result we +aggressively cache buffer objects and always allocate them as +MAP_DISCARD unless they are write-once and long-lived. + +Buffer Types +------------ + +Advanced Layers has a few different classes to help build and upload +buffers to the GPU. They are: + +- ``MLGBuffer``. This is the low-level shader resource that + ``MLGDevice`` exposes. It is the building block for buffer helper + classes, but it can also be used to make one-off, immutable, + immediate-upload buffers. MLGBuffers, being a GPU resource, are + reference counted. +- ``SharedBufferMLGPU``. These are large, pre-allocated buffers that + are read-only on the GPU and write-only on the CPU. They usually + exceed the maximum bindable buffer size. There are three shared + buffers created by default and they are automatically unmapped as + needed: one for vertices, one for vertex shader constants, and one + for pixel shader constants. When callers allocate into a shared + buffer they get back a mapped pointer, a GPU resource, and an offset. + When the underlying device supports offsetable buffers (like + ``ID3D11DeviceContext1`` does), this results in better GPU + utilization, as there are less resources and fewer upload commands. +- ``ConstantBufferSection`` and ``VertexBufferSection``. These are + “views” into a ``SharedBufferMLGPU``. They contain the underlying + ``MLGBuffer``, and when offsetting is supported, the offset + information necessary for resource binding. Sections are not + reference counted. +- ``StagingBuffer``. A dynamically sized CPU buffer where items can be + appended in a free-form manner. 
The stride of a single “item” is + computed by the first item written, and successive items must have + the same stride. The buffer must be uploaded to the GPU manually. + Staging buffers are appropriate for creating general constant or + vertex buffer data. They can also write items in reverse, which is + how we render back-to-front when layers are visited front-to-back. + They can be uploaded to a ``SharedBufferMLGPU`` or an immutabler + ``MLGBuffer`` very easily. Staging buffers are not reference counted. + +Unsupported Features +-------------------- + +Currently, these features of the old compositor are not yet implemented. + +- OpenGL and software support (currently AL only works on D3D11). +- APZ displayport overlay. +- Diagnostic/developer overlays other than the FPS/timing overlay. +- DEAA. It was never ported to the D3D11 compositor, but we would like + it. +- Component alpha when used inside an opaque intermediate surface. +- Effects prefs. Possibly not needed post-B2G removal. +- Widget overlays and underlays used by macOS and Android. +- DefaultClearColor. This is Android specific, but is easy to added + when needed. +- Frame uniformity info in the profiler. Possibly not needed post-B2G + removal. +- LayerScope. There are no plans to make this work. + +Future Work +----------- + +- Refactor for D3D12/Vulkan support (namely, split MLGDevice into + something less stateful and something else more low-level). +- Remove “MLG” moniker and namespace everything. +- Other backends (D3D12/Vulkan, OpenGL, Software) +- Delete CompositorD3D11 +- Add DEAA support +- Re-enable the depth buffer by default for fast GPUs +- Re-enable right-sizing of inaccurately sized containers +- Drop constant buffers for ancillary vertex data +- Fast shader paths for simple video/painted layer cases + +History +------- + +Advanced Layers has gone through four major design iterations. The +initial version used tiling - each render view divided the screen into +128x128 tiles, and layers were assigned to tiles based on their +screen-space draw area. This approach proved not to scale well to 3d +transforms, and so tiling was eliminated. + +We replaced it with a simple system of accumulating draw regions to each +batch, thus ensuring that items could be assigned to batches while +maintaining correct z-ordering. This second iteration also coincided +with plane-splitting support. + +On large layer trees, accumulating the affected regions of batches +proved to be quite expensive. This led to a third iteration, using depth +buffers and separate opaque and transparent batch lists to achieve +z-ordering and occlusion culling. + +Finally, depth buffers proved to be too expensive, and we introduced a +simple CPU-based occlusion culling pass. This iteration coincided with +using more precise draw rects and splitting pipelines into unit-quad, +cpu-clipped and triangle-list, gpu-clipped variants. diff --git a/gfx/docs/AsyncPanZoom.rst b/gfx/docs/AsyncPanZoom.rst new file mode 100644 index 0000000000..01bf2776df --- /dev/null +++ b/gfx/docs/AsyncPanZoom.rst @@ -0,0 +1,687 @@ +.. _apz: + +Asynchronous Panning and Zooming +================================ + +**This document is a work in progress. Some information may be missing +or incomplete.** + +.. image:: AsyncPanZoomArchitecture.png + +Goals +----- + +We need to be able to provide a visual response to user input with +minimal latency. In particular, on devices with touch input, content +must track the finger exactly while panning, or the user experience is +very poor. 
According to the UX team, 120ms is an acceptable latency +between user input and response. + +Context and surrounding architecture +------------------------------------ + +The fundamental problem we are trying to solve with the Asynchronous +Panning and Zooming (APZ) code is that of responsiveness. By default, +web browsers operate in a “game loop” that looks like this: + +:: + + while true: + process input + do computations + repaint content + display repainted content + +In browsers the “do computation” step can be arbitrarily expensive +because it can involve running event handlers in web content. Therefore, +there can be an arbitrary delay between the input being received and the +on-screen display getting updated. + +Responsiveness is always good, and with touch-based interaction it is +even more important than with mouse or keyboard input. In order to +ensure responsiveness, we split the “game loop” model of the browser +into a multithreaded variant which looks something like this: + +:: + + Thread 1 (compositor thread) + while true: + receive input + send a copy of input to thread 2 + adjust painted content based on input + display adjusted painted content + + Thread 2 (main thread) + while true: + receive input from thread 1 + do computations + repaint content + update the copy of painted content in thread 1 + +This multithreaded model is called off-main-thread compositing (OMTC), +because the compositing (where the content is displayed on-screen) +happens on a separate thread from the main thread. Note that this is a +very very simplified model, but in this model the “adjust painted +content based on input” is the primary function of the APZ code. + +The “painted content” is stored on a set of “layers”, that are +conceptually double-buffered. That is, when the main thread does its +repaint, it paints into one set of layers (the “client” layers). The +update that is sent to the compositor thread copies all the changes from +the client layers into another set of layers that the compositor holds. +These layers are called the “shadow” layers or the “compositor” layers. +The compositor in theory can continuously composite these shadow layers +to the screen while the main thread is busy doing other things and +painting a new set of client layers. + +The APZ code takes the input events that are coming in from the hardware +and uses them to figure out what the user is trying to do (e.g. pan the +page, zoom in). It then expresses this user intention in the form of +translation and/or scale transformation matrices. These transformation +matrices are applied to the shadow layers at composite time, so that +what the user sees on-screen reflects what they are trying to do as +closely as possible. + +Technical overview +------------------ + +As per the heavily simplified model described above, the fundamental +purpose of the APZ code is to take input events and produce +transformation matrices. This section attempts to break that down and +identify the different problems that make this task non-trivial. + +Checkerboarding +~~~~~~~~~~~~~~~ + +The content area that is painted and stored in a shadow layer is called +the “displayport”. The APZ code is responsible for determining how large +the displayport should be. On the one hand, we want the displayport to +be as large as possible. At the very least it needs to be larger than +what is visible on-screen, because otherwise, as soon as the user pans, +there will be some unpainted area of the page exposed. 
However, we +cannot always set the displayport to be the entire page, because the +page can be arbitrarily long and this would require an unbounded amount +of memory to store. Therefore, a good displayport size is one that is +larger than the visible area but not so large that it is a huge drain on +memory. Because the displayport is usually smaller than the whole page, +it is always possible for the user to scroll so fast that they end up in +an area of the page outside the displayport. When this happens, they see +unpainted content; this is referred to as “checkerboarding”, and we try +to avoid it where possible. + +There are many possible ways to determine what the displayport should be +in order to balance the tradeoffs involved (i.e. having one that is too +big is bad for memory usage, and having one that is too small results in +excessive checkerboarding). Ideally, the displayport should cover +exactly the area that we know the user will make visible. Although we +cannot know this for sure, we can use heuristics based on current +panning velocity and direction to ensure a reasonably-chosen displayport +area. This calculation is done in the APZ code, and a new desired +displayport is frequently sent to the main thread as the user is panning +around. + +Multiple layers +~~~~~~~~~~~~~~~ + +Consider, for example, a scrollable page that contains an iframe which +itself is scrollable. The iframe can be scrolled independently of the +top-level page, and we would like both the page and the iframe to scroll +responsively. This means that we want independent asynchronous panning +for both the top-level page and the iframe. In addition to iframes, +elements that have the overflow:scroll CSS property set are also +scrollable, and also end up on separate scrollable layers. In the +general case, the layers are arranged in a tree structure, and so within +the APZ code we have a matching tree of AsyncPanZoomController (APZC) +objects, one for each scrollable layer. To manage this tree of APZC +instances, we have a single APZCTreeManager object. Each APZC is +relatively independent and handles the scrolling for its associated +layer, but there are some cases in which they need to interact; these +cases are described in the sections below. + +Hit detection +~~~~~~~~~~~~~ + +Consider again the case where we have a scrollable page that contains an +iframe which itself is scrollable. As described above, we will have two +APZC instances - one for the page and one for the iframe. When the user +puts their finger down on the screen and moves it, we need to do some +sort of hit detection in order to determine whether their finger is on +the iframe or on the top-level page. Based on where their finger lands, +the appropriate APZC instance needs to handle the input. This hit +detection is also done in the APZCTreeManager, as it has the necessary +information about the sizes and positions of the layers. Currently this +hit detection is not perfect, as it uses rects and does not account for +things like rounded corners and opacity. + +Also note that for some types of input (e.g. when the user puts two +fingers down to do a pinch) we do not want the input to be “split” +across two different APZC instances. In the case of a pinch, for +example, we find a “common ancestor” APZC instance - one that is +zoomable and contains all of the touch input points, and direct the +input to that APZC instance. 
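
As a rough illustration of that common-ancestor search (this is not the actual
``APZCTreeManager`` code; the type and field names are hypothetical, and
"contains all touch points" is modelled simply as being an ancestor of every
hit APZC)::

    // Hypothetical sketch: hit-test each touch point first, then walk up from
    // the first hit towards the root until we find a zoomable APZC that is an
    // ancestor of every hit.
    #include <vector>

    struct Apzc {
      Apzc* parent = nullptr;
      bool zoomable = false;
    };

    static bool IsAncestorOf(const Apzc* aMaybeAncestor, const Apzc* aNode) {
      for (const Apzc* cur = aNode; cur; cur = cur->parent) {
        if (cur == aMaybeAncestor) {
          return true;
        }
      }
      return false;
    }

    Apzc* FindPinchTarget(const std::vector<Apzc*>& aHitPerTouchPoint) {
      if (aHitPerTouchPoint.empty()) {
        return nullptr;
      }
      for (Apzc* candidate = aHitPerTouchPoint[0]; candidate;
           candidate = candidate->parent) {
        bool containsAll = true;
        for (Apzc* hit : aHitPerTouchPoint) {
          if (!IsAncestorOf(candidate, hit)) {
            containsAll = false;
            break;
          }
        }
        if (containsAll && candidate->zoomable) {
          return candidate;
        }
      }
      return nullptr;  // no suitable common ancestor found
    }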
+ +Scroll Handoff +~~~~~~~~~~~~~~ + +Consider yet again the case where we have a scrollable page that +contains an iframe which itself is scrollable. Say the user scrolls the +iframe so that it reaches the bottom. If the user continues panning on +the iframe, the expectation is that the top-level page will start +scrolling. However, as discussed in the section on hit detection, the +APZC instance for the iframe is separate from the APZC instance for the +top-level page. Thus, we need the two APZC instances to communicate in +some way such that input events on the iframe result in scrolling on the +top-level page. This behaviour is referred to as “scroll handoff” (or +“fling handoff” in the case where analogous behaviour results from the +scrolling momentum of the page after the user has lifted their finger). + +Input event untransformation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The APZC architecture by definition results in two copies of a “scroll +position” for each scrollable layer. There is the original copy on the +main thread that is accessible to web content and the layout and +painting code. And there is a second copy on the compositor side, which +is updated asynchronously based on user input, and corresponds to what +the user visually sees on the screen. Although these two copies may +diverge temporarily, they are reconciled periodically. In particular, +they diverge while the APZ code is performing an async pan or zoom +action on behalf of the user, and are reconciled when the APZ code +requests a repaint from the main thread. + +Because of the way input events are stored, this has some unfortunate +consequences. Input events are stored relative to the device screen - so +if the user touches at the same physical spot on the device, the same +input events will be delivered regardless of the content scroll +position. When the main thread receives a touch event, it combines that +with the content scroll position in order to figure out what DOM element +the user touched. However, because we now have two different scroll +positions, this process may not work perfectly. A concrete example +follows: + +Consider a device with screen size 600 pixels tall. On this device, a +user is viewing a document that is 1000 pixels tall, and that is +scrolled down by 200 pixels. That is, the vertical section of the +document from 200px to 800px is visible. Now, if the user touches a +point 100px from the top of the physical display, the hardware will +generate a touch event with y=100. This will get sent to the main +thread, which will add the scroll position (200) and get a +document-relative touch event with y=300. This new y-value will be used +in hit detection to figure out what the user touched. If the document +had a absolute-positioned div at y=300, then that would receive the +touch event. + +Now let us add some async scrolling to this example. Say that the user +additionally scrolls the document by another 10 pixels asynchronously +(i.e. only on the compositor thread), and then does the same touch +event. The same input event is generated by the hardware, and as before, +the document will deliver the touch event to the div at y=300. However, +visually, the document is scrolled by an additional 10 pixels so this +outcome is wrong. What needs to happen is that the APZ code needs to +intercept the touch event and account for the 10 pixels of asynchronous +scroll. Therefore, the input event with y=100 gets converted to y=110 in +the APZ code before being passed on to the main thread. 
The main thread +then adds the scroll position it knows about and determines that the +user touched at a document-relative position of y=310. + +Analogous input event transformations need to be done for horizontal +scrolling and zooming. + +Content independently adjusting scrolling +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As described above, there are two copies of the scroll position in the +APZ architecture - one on the main thread and one on the compositor +thread. Usually for architectures like this, there is a single “source +of truth” value and the other value is simply a copy. However, in this +case that is not easily possible to do. The reason is that both of these +values can be legitimately modified. On the compositor side, the input +events the user is triggering modify the scroll position, which is then +propagated to the main thread. However, on the main thread, web content +might be running Javascript code that programmatically sets the scroll +position (via window.scrollTo, for example). Scroll changes driven from +the main thread are just as legitimate and need to be propagated to the +compositor thread, so that the visual display updates in response. + +Because the cross-thread messaging is asynchronous, reconciling the two +types of scroll changes is a tricky problem. Our design solves this +using various flags and generation counters. The general heuristic we +have is that content-driven scroll position changes (e.g. scrollTo from +JS) are never lost. For instance, if the user is doing an async scroll +with their finger and content does a scrollTo in the middle, then some +of the async scroll would occur before the “jump” and the rest after the +“jump”. + +Content preventing default behaviour of input events +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Another problem that we need to deal with is that web content is allowed +to intercept touch events and prevent the “default behaviour” of +scrolling. This ability is defined in web standards and is +non-negotiable. Touch event listeners in web content are allowed call +preventDefault() on the touchstart or first touchmove event for a touch +point; doing this is supposed to “consume” the event and prevent +touch-based panning. As we saw in a previous section, the input event +needs to be untransformed by the APZ code before it can be delivered to +content. But, because of the preventDefault problem, we cannot fully +process the touch event in the APZ code until content has had a chance +to handle it. Web browsers in general solve this problem by inserting a +delay of up to 300ms before processing the input - that is, web content +is allowed up to 300ms to process the event and call preventDefault on +it. If web content takes longer than 300ms, or if it completes handling +of the event without calling preventDefault, then the browser +immediately starts processing the events. + +The way the APZ implementation deals with this is that upon receiving a +touch event, it immediately returns an untransformed version that can be +dispatched to content. It also schedules a 400ms timeout (600ms on +Android) during which content is allowed to prevent scrolling. There is +an API that allows the main-thread event dispatching code to notify the +APZ as to whether or not the default action should be prevented. 
If the +APZ content response timeout expires, or if the main-thread event +dispatching code notifies the APZ of the preventDefault status, then the +APZ continues with the processing of the events (which may involve +discarding the events). + +The touch-action CSS property from the pointer-events spec is intended +to allow eliminating this 400ms delay in many cases (although for +backwards compatibility it will still be needed for a while). Note that +even with touch-action implemented, there may be cases where the APZ +code does not know the touch-action behaviour of the point the user +touched. In such cases, the APZ code will still wait up to 400ms for the +main thread to provide it with the touch-action behaviour information. + +Technical details +----------------- + +This section describes various pieces of the APZ code, and goes into +more specific detail on APIs and code than the previous sections. The +primary purpose of this section is to help people who plan on making +changes to the code, while also not going into so much detail that it +needs to be updated with every patch. + +Overall flow of input events +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This section describes how input events flow through the APZ code. + +1. Input events arrive from the hardware/widget code into the APZ via + APZCTreeManager::ReceiveInputEvent. The thread that invokes this is + called the input thread, and may or may not be the same as the Gecko + main thread. +2. Conceptually the first thing that the APZCTreeManager does is to + associate these events with “input blocks”. An input block is a set + of events that share certain properties, and generally are intended + to represent a single gesture. For example with touch events, all + events following a touchstart up to but not including the next + touchstart are in the same block. All of the events in a given block + will go to the same APZC instance and will either all be processed + or all be dropped. +3. Using the first event in the input block, the APZCTreeManager does a + hit-test to see which APZC it hits. This hit-test uses the event + regions populated on the layers, which may be larger than the true + hit area of the layer. If no APZC is hit, the events are discarded + and we jump to step 6. Otherwise, the input block is tagged with the + hit APZC as a tentative target and put into a global APZ input + queue. +4. + + i. If the input events landed outside the dispatch-to-content event + region for the layer, any available events in the input block + are processed. These may trigger behaviours like scrolling or + tap gestures. + ii. If the input events landed inside the dispatch-to-content event + region for the layer, the events are left in the queue and a + 400ms timeout is initiated. If the timeout expires before step 9 + is completed, the APZ assumes the input block was not cancelled + and the tentative target is correct, and processes them as part + of step 10. + +5. The call stack unwinds back to APZCTreeManager::ReceiveInputEvent, + which does an in-place modification of the input event so that any + async transforms are removed. +6. The call stack unwinds back to the widget code that called + ReceiveInputEvent. This code now has the event in the coordinate + space Gecko is expecting, and so can dispatch it to the Gecko main + thread. +7. Gecko performs its own usual hit-testing and event dispatching for + the event. As part of this, it records whether any touch listeners + cancelled the input block by calling preventDefault(). 
It also + activates inactive scrollframes that were hit by the input events. +8. The call stack unwinds back to the widget code, which sends two + notifications to the APZ code on the input thread. The first + notification is via APZCTreeManager::ContentReceivedInputBlock, and + informs the APZ whether the input block was cancelled. The second + notification is via APZCTreeManager::SetTargetAPZC, and informs the + APZ of the results of the Gecko hit-test during event dispatch. Note + that Gecko may report that the input event did not hit any + scrollable frame at all. The SetTargetAPZC notification happens only + once per input block, while the ContentReceivedInputBlock + notification may happen once per block, or multiple times per block, + depending on the input type. +9. + + i. If the events were processed as part of step 4(i), the + notifications from step 8 are ignored and step 10 is skipped. + ii. If events were queued as part of step 4(ii), and steps 5-8 take + less than 400ms, the arrival of both notifications from step 8 + will mark the input block ready for processing. + iii. If events were queued as part of step 4(ii), but steps 5-8 take + longer than 400ms, the notifications from step 8 will be + ignored and step 10 will already have happened. + +10. If events were queued as part of step 4(ii) they are now either + processed (if the input block was not cancelled and Gecko detected a + scrollframe under the input event, or if the timeout expired) or + dropped (all other cases). Note that the APZC that processes the + events may be different at this step than the tentative target from + step 3, depending on the SetTargetAPZC notification. Processing the + events may trigger behaviours like scrolling or tap gestures. + +If the CSS touch-action property is enabled, the above steps are +modified as follows: \* In step 4, the APZC also requires the allowed +touch-action behaviours for the input event. This might have been +determined as part of the hit-test in APZCTreeManager; if not, the +events are queued. \* In step 6, the widget code determines the content +element at the point under the input element, and notifies the APZ code +of the allowed touch-action behaviours. This notification is sent via a +call to APZCTreeManager::SetAllowedTouchBehavior on the input thread. \* +In step 9(ii), the input block will only be marked ready for processing +once all three notifications arrive. + +Threading considerations +^^^^^^^^^^^^^^^^^^^^^^^^ + +The bulk of the input processing in the APZ code happens on what we call +“the input thread”. In practice the input thread could be the Gecko main +thread, the compositor thread, or some other thread. There are obvious +downsides to using the Gecko main thread - that is, “asynchronous” +panning and zooming is not really asynchronous as input events can only +be processed while Gecko is idle. In an e10s environment, using the +Gecko main thread of the chrome process is acceptable, because the code +running in that process is more controllable and well-behaved than +arbitrary web content. Using the compositor thread as the input thread +could work on some platforms, but may be inefficient on others. For +example, on Android (Fennec) we receive input events from the system on +a dedicated UI thread. We would have to redispatch the input events to +the compositor thread if we wanted to the input thread to be the same as +the compositor thread. 
This introduces a potential for higher latency, +particularly if the compositor does any blocking operations - blocking +SwapBuffers operations, for example. As a result, the APZ code itself +does not assume that the input thread will be the same as the Gecko main +thread or the compositor thread. + +Active vs. inactive scrollframes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The number of scrollframes on a page is potentially unbounded. However, +we do not want to create a separate layer for each scrollframe right +away, as this would require large amounts of memory. Therefore, +scrollframes as designated as either “active” or “inactive”. Active +scrollframes are the ones that do have their contents put on a separate +layer (or set of layers), and inactive ones do not. + +Consider a page with a scrollframe that is initially inactive. When +layout generates the layers for this page, the content of the +scrollframe will be flattened into some other PaintedLayer (call it P). +The layout code also adds the area (or bounding region in case of weird +shapes) of the scrollframe to the dispatch-to-content region of P. + +When the user starts interacting with that content, the hit-test in the +APZ code finds the dispatch-to-content region of P. The input block +therefore has a tentative target of P when it goes into step 4(ii) in +the flow above. When gecko processes the input event, it must detect the +inactive scrollframe and activate it, as part of step 7. Finally, the +widget code sends the SetTargetAPZC notification in step 8 to notify the +APZ that the input block should really apply to this new layer. The +issue here is that the layer transaction containing the new layer must +reach the compositor and APZ before the SetTargetAPZC notification. If +this does not occur within the 400ms timeout, the APZ code will be +unable to update the tentative target, and will continue to use P for +that input block. Input blocks that start after the layer transaction +will get correctly routed to the new layer as there will now be a layer +and APZC instance for the active scrollframe. + +This model implies that when the user initially attempts to scroll an +inactive scrollframe, it may end up scrolling an ancestor scrollframe. +(This is because in the absence of the SetTargetAPZC notification, the +input events will get applied to the closest ancestor scrollframe’s +APZC.) Only after the round-trip to the gecko thread is complete is +there a layer for async scrolling to actually occur on the scrollframe +itself. At that point the scrollframe will start receiving new input +blocks and will scroll normally. + +WebRender Integration +~~~~~~~~~~~~~~~~~~~~~ + +The APZ code was originally written to work with the "layers" graphics +backend. Many of the concepts (and therefore variable/function names) +stem from the integration with the layers backend. After the WebRender +backend was added, the existing code evolved over time to integrate +with that backend as well, resulting in a bit of a hodge-podge effect. +With that cautionary note out of the way, there are three main pieces +that need to be understood to grasp the integration between the APZ +code and WebRender. These are detailed below. + +HitTestingTree +^^^^^^^^^^^^^^ + +The APZCTreeManager keeps as part of its internal state a tree of +HitTestingTreeNode instances. This is referred to as the HitTestingTree. 
+As the name implies, this was used for hit-testing purposes, so that +APZ could determine which scrollframe a particular incoming input event +would be targeting. Doing the hit-test requires access to a bunch of state, +such as CSS transforms and clip rects, as well as ancillary data like +event regions, which affect how APZ reacts to input events. + +With the layers backend, all this information was provided by a layer tree +update, and so the HitTestingTree was created to mirror the layer tree, +allowing APZ access to that information from other threads. The structure +of the tree was identical to the layer tree. But with WebRender, there +is no "layer tree" per se, and instead we "fake it" by creating a +HitTestingTree structure that is similar to what it would be like on the +equivalent layer tree. But the bigger difference is that with WebRender, +the HitTestingTree is not actually used for hit-testing at all; instead +we get WebRender to do the hit-test for us, as it can do so using its +own internal state and produce a more precise result. + +Information stored in the HitTestingTree (e.g. CSS transforms) is still +used by other pieces of APZ (e.g. some of the scrollbar manipulation code) +so it is still needed, even with the WebRender backend. For this reason, +and for consistency between the two backends, we try to populate as much +information in the HitTestingTree that we can, even with the WebRender +backend. + +With the layers backend, the way the HitTestingTree is created is by +walking the layer tree with a LayerMetricsWrapper class. This wraps +a layer tree but also expands layers with multiple ScrollMetadata into +multiple nodes. The equivalent in the WebRender world is the +WebRenderScrollDataWrapper, which wraps a WebRenderScrollData object. The +WebRenderScrollData object is roughly analogous to a layer tree, but +is something that is constructed deliberately rather than being a natural +output from the WebRender paint transaction (i.e. we create it explicitly +for APZ's consumption, rather than something that we would create anyway +for WebRender's consumption). + +The WebRenderScrollData structure contains within it a tree of +WebRenderLayerScrollData instances, which are analogous to individual +layers in a layer tree. These instances contain various fields like +CSS transforms, fixed/sticky position info, etc. that would normally be +found on individual layers in the layer tree. This allows the code +that builds the HitTestingTree to consume either a WebRenderScrollData +or a layer tree in a more-or-less unified fashion. + +Working backwards a bit more, the WebRenderLayerScrollData instances +are created as we traverse the Gecko display list and build the +WebRender display list. In the layers world, the code in FrameLayerBuilder +was responsible for building the layer tree from the Gecko display list, +but in the WebRender world, this happens primarily in WebRenderCommandBuilder. +As of this writing, the architecture for this is that, as we walk +the Gecko display list, we query it to see if it contains any information +that APZ might need to know (e.g. CSS transforms) via a call to +`nsDisplayItem::UpdateScrollData(nullptr, nullptr)`. If this call +returns true, we create a WebRenderLayerScrollData instance for the item, +and populate it with the necessary information in +`WebRenderLayerScrollData::Initialize`. 
We also create +WebRenderLayerScrollData instances if we detect (via ASR changes) that +we are now processing a Gecko display item that is in a different scrollframe +than the previous item. This is equivalent to how FrameLayerBuilder will +flatten items with different ASRs into different layers, so that it +is cheap to scroll scrollframes in the compositor. + +The main sources of complexity in this code come from: + +1. Ensuring the ScrollMetadata instances end on the proper + WebRenderLayerScrollData instances (such that every path from a leaf + WebRenderLayerScrollData node to the root has a consistent ordering of + scrollframes without duplications). +2. The deferred-transform optimization that is described in more detail + at the declaration of StackingContextHelper::mDeferredTransformItem. + +Hit-testing +^^^^^^^^^^^ + +Since the HitTestingTree is not used for actual hit-testing purposes +with the WebRender backend (see previous section), this section describes +how hit-testing actually works with WebRender. + +With both layers and WebRender, the Gecko display list contains display items +(`nsDisplayCompositorHitTestInfo`) that store hit-testing state. These +items implement the `CreateWebRenderCommands` method and generate a "hit-test +item" into the WebRender display list. This is basically just a rectangle +item in the WebRender display list that is no-op for painting purposes, +but contains information that should be returned by the hit-test (specifically +the hit info flags and the scrollId of the enclosing scrollframe). The +hit-test item gets clipped and transformed in the same way that all the other +items in the WebRender display list do, via clip chains and enclosing +reference frame/stacking context items. + +When WebRender needs to do a hit-test, it goes through its display list, +taking into account the current clips and transforms, adjusted for the +most recent async scroll/zoom, and determines which hit-test item(s) are under +the target point, and returns those items. APZ can then take the frontmost +item from that list (or skip over it if it happens to be inside a OOP +subdocument that's pointer-events:none) and use that as the hit target. +It's important to note that when APZ does hit-testing for the layers backend, +it uses the most recent available async transforms, even if those transforms +have not yet been composited. With WebRender, the hit-test uses the last +transform provided by the `SampleForWebRender` API (see next section) which +generally reflects the last composite, and doesn't take into account further +changes to the transforms that have occurred since then. + +When debugging hit-test issues, it is often useful to apply the patches +on bug 1656260, which introduce a guid on Gecko display items and propagate +it all the way through to where APZ gets the hit-test result. This allows +answering the question "which nsDisplayCompositorHitTestInfo was responsible +for this hit-test result?" which is often a very good first step in +solving the bug. From there, one can determine if there was some other +display item in front that should have generated a +nsDisplayCompositorHitTestInfo but didn't, or if display item itself had +incorrect information. The second patch on that bug further allows exposing +hand-written debug info to the APZ code, so that the WR hit-testing +mechanism itself can be more effectively debugged, in case there is a problem +with the WR display items getting improperly transformed or clipped. 
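
As a concrete but hypothetical illustration (the real types live in the
WebRender and APZ code, and the field names below are simplified stand-ins),
consuming such a hit-test result amounts to scanning the front-to-back list
and taking the first usable entry::

    // Sketch of picking a target from a WebRender hit-test result list.
    #include <cstdint>
    #include <optional>
    #include <vector>

    struct HitResult {
      uint64_t scrollId;       // scrollframe the hit item belongs to
      uint16_t hitInfoFlags;   // e.g. dispatch-to-content, scrollbar bits
      bool inInertOopSubdoc;   // inside a pointer-events:none OOP subdocument
    };

    std::optional<HitResult> PickTarget(
        const std::vector<HitResult>& aResultsFrontToBack) {
      for (const HitResult& hit : aResultsFrontToBack) {
        if (hit.inInertOopSubdoc) {
          continue;  // skip over inert out-of-process content
        }
        return hit;  // frontmost usable item wins
      }
      return std::nullopt;
    }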
+ +Sampling +^^^^^^^^ + +With both the layers and WebRender backend, the compositing step needs to +read the latest async transforms from APZ in order to ensure scrollframes +are rendered at the right position. In both cases, the API for this is +exposed via the `APZSampler` class. The difference is that with the layers +backend, the `AsyncCompositionManager` walks the layer tree and queries +the transform components for each layer individually via the various getters +on `APZSampler`. In contrast, with the WebRender backend, there is a single +`APZSampler::SampleForWebRender` API that returns all the information needed +for all the scrollframes, scrollthumbs, etc. Conceptually though, the +functionality is pretty similar, because the compositor needs the same +information from APZ regardless of which backend is in use. + +Along with sampling the APZ transforms, the compositor also triggers APZ +animations to advance to the next timestep (usually the next vsync). Again, +with both the WebRender and layers backend, this happens just before reading +the APZ transforms. The only difference is that with the layers backend, +the `AsyncCompositionManager` invokes the `APZSampler::AdvanceAnimations` API +directly, whereas with the WebRender backend this happens as part of the +`APZSampler::SampleForWebRender` implementation. + +Threading / Locking Overview +---------------------------- + +Threads +~~~~~~~ + +There are three threads relevant to APZ: the **controller thread**, +the **updater thread**, and the **sampler thread**. This table lists +which threads play these roles on each platform / configuration: + +===================== ========== =========== ============= ============== ========== ============= +APZ Thread Name Desktop Desktop+GPU Desktop+WR Desktop+WR+GPU Android Android+WR +===================== ========== =========== ============= ============== ========== ============= +**controller thread** UI main GPU main UI main GPU main Java UI Java UI +**updater thread** Compositor Compositor SceneBuilder SceneBuilder Compositor SceneBuilder +**sampler thread** Compositor Compositor RenderBackend RenderBackend Compositor RenderBackend +===================== ========== =========== ============= ============== ========== ============= + +Locks +~~~~~ + +There are also a number of locks used in APZ code: + +======================= ============================== +Lock type How many instances +======================= ============================== +APZ tree lock one per APZCTreeManager +APZC map lock one per APZCTreeManager +APZC instance lock one per AsyncPanZoomController +APZ test lock one per APZCTreeManager +Checkerboard event lock one per AsyncPanZoomController +======================= ============================== + +Thread / Lock Ordering +~~~~~~~~~~~~~~~~~~~~~~ + +To avoid deadlocks, the threads and locks have a global **ordering** +which must be respected. + +Respecting the ordering means the following: + +- Let "A < B" denote that A occurs earlier than B in the ordering +- Thread T may only acquire lock L, if T < L +- A thread may only acquire lock L2 while holding lock L1, if L1 < L2 +- A thread may only block on a response from another thread T while holding a lock L, if L < T + +**The lock ordering is as follows**: + +1. UI main +2. GPU main (only if GPU enabled) +3. Compositor thread +4. SceneBuilder thread (only if WR enabled) +5. **APZ tree lock** +6. RenderBackend thread (only if WR enabled) +7. **APZC map lock** +8. **APZC instance lock** +9. **APZ test lock** +10. 
**Checkerboard event lock** + +Example workflows +^^^^^^^^^^^^^^^^^ + +Here are some example APZ workflows. Observe how they all obey +the global thread/lock ordering. Feel free to add others: + +- **Input handling** (in WR+GPU) case: UI main -> GPU main -> APZ tree lock -> RenderBackend thread +- **Sync messages** in ``PCompositorBridge.ipdl``: UI main thread -> Compositor thread +- **GetAPZTestData**: Compositor thread -> SceneBuilder thread -> test lock +- **Scene swap**: SceneBuilder thread -> APZ tree lock -> RenderBackend thread +- **Updating hit-testing tree**: SceneBuilder thread -> APZ tree lock -> APZC instance lock +- **Updating APZC map**: SceneBuilder thread -> APZ tree lock -> APZC map lock +- **Sampling and animation deferred tasks** [1]_: RenderBackend thread -> APZC map lock -> APZC instance lock +- **Advancing animations**: RenderBackend thread -> APZC instance lock + +.. [1] It looks like there are two deferred tasks that actually need the tree lock, + ``AsyncPanZoomController::HandleSmoothScrollOverscroll`` and + ``AsyncPanZoomController::HandleFlingOverscroll``. We should be able to rewrite + these to use the map lock instead of the tree lock. + This will allow us to continue running the deferred tasks on the sampler + thread rather than having to bounce them to another thread. diff --git a/gfx/docs/AsyncPanZoomArchitecture.png b/gfx/docs/AsyncPanZoomArchitecture.png Binary files differnew file mode 100644 index 0000000000..d19dcb7c8b --- /dev/null +++ b/gfx/docs/AsyncPanZoomArchitecture.png diff --git a/gfx/docs/GraphicsOverview.rst b/gfx/docs/GraphicsOverview.rst new file mode 100644 index 0000000000..a065101a8d --- /dev/null +++ b/gfx/docs/GraphicsOverview.rst @@ -0,0 +1,159 @@ +Graphics Overview +========================= + +Work in progress. Possibly incorrect or incomplete. +--------------------------------------------------- + +Jargon +------ + +There's a lot of jargon in the graphics stack. We try to maintain a list +of common words and acronyms `here <https://wiki.mozilla.org/Platform/GFX/Jargon>`__. + +Overview +-------- + +The graphics systems is responsible for rendering (painting, drawing) +the frame tree (rendering tree) elements as created by the layout +system. Each leaf in the tree has content, either bounded by a rectangle +(or perhaps another shape, in the case of SVG.) + +The simple approach for producing the result would thus involve +traversing the frame tree, in a correct order, drawing each frame into +the resulting buffer and displaying (printing non-withstanding) that +buffer when the traversal is done. It is worth spending some time on the +“correct order” note above. If there are no overlapping frames, this is +fairly simple - any order will do, as long as there is no background. If +there is background, we just have to worry about drawing that first. +Since we do not control the content, chances are the page is more +complicated. There are overlapping frames, likely with transparency, so +we need to make sure the elements are draw “back to front”, in layers, +so to speak. Layers are an important concept, and we will revisit them +shortly, as they are central to fixing a major issue with the above +simple approach. + +While the above simple approach will work, the performance will suffer. +Each time anything changes in any of the frames, the complete process +needs to be repeated, everything needs to be redrawn. Further, there is +very little space to take advantage of the modern graphics (GPU) +hardware, or multi-core computers. 
If you recall from the previous +sections, the frame tree is only accessible from the UI thread, so while +we’re doing all this work, the UI is basically blocked. + +(Retained) Layers +~~~~~~~~~~~~~~~~~ + +Layers framework was introduced to address the above performance issues, +by having a part of the design address each item. At the high level: + +1. We create a layer tree. The leaf elements of the tree contain all + frames (possibly multiple frames per leaf). +2. We render each layer tree element and cache (retain) the result. +3. We composite (combine) all the leaf elements into the final result. + +Let’s examine each of these steps, in reverse order. + +Compositing +~~~~~~~~~~~ + +We use the term composite as it implies that the order is important. If +the elements being composited overlap, whether there is transparency +involved or not, the order in which they are combined will effect the +result. Compositing is where we can use some of the power of the modern +graphics hardware. It is optimal for doing this job. In the scenarios +where only the position of individual frames changes, without the +content inside them changing, we see why caching each layer would be +advantageous - we only need to repeat the final compositing step, +completely skipping the layer tree creation and the rendering of each +leaf, thus speeding up the process considerably. + +Another benefit is equally apparent in the context of the stated +deficiencies of the simple approach. We can use the available graphics +hardware accelerated APIs to do the compositing step. Direct3D, OpenGL +can be used on different platforms and are well suited to accelerate +this step. + +Finally, we can now envision performing the compositing step on a +separate thread, unblocking the UI thread for other work, and doing more +work in parallel. More on this below. + +It is important to note that the number of operations in this step is +proportional to the number of layer tree (leaf) elements, so there is +additional work and complexity involved, when the layer tree is large. + +Render and retain layer elements +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As we saw, the compositing step benefits from caching the intermediate +result. This does result in the extra memory usage, so needs to be +considered during the layer tree creation. Beyond the caching, we can +accelerate the rendering of each element by (indirectly) using the +available platform APIs (e.g., Direct2D, CoreGraphics, even some of the +3D APIs like OpenGL or Direct3D) as available. This is actually done +through a platform independent API (see Moz2D) below, but is important +to realize it does get accelerated appropriately. + +Creating the layer tree +~~~~~~~~~~~~~~~~~~~~~~~ + +We need to create a layer tree (from the frames tree), which will give +us the correct result while striking the right balance between a layer +per frame element and a single layer for the complete frames tree. As +was mentioned above, there is an overhead in traversing the whole tree +and caching each of the elements, balanced by the performance +improvements. Some of the performance improvements are only noticed when +something changes (e.g., one element is moving, we only need to redo the +compositing step). + +Refresh Driver +~~~~~~~~~~~~~~ + +Layers +~~~~~~ + +Rendering each layer +~~~~~~~~~~~~~~~~~~~~ + +Tiling vs. Buffer Rotation vs. 
Full paint +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Compositing for the final result +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Graphics API +~~~~~~~~~~~~ + +Moz2D +~~~~~ + +- The Moz2D graphics API, part of the Azure project, is a + cross-platform interface onto the various graphics backends that + Gecko uses for rendering such as Direct2D (1.0 and 1.1), Skia, Cairo, + Quartz, and NV Path. Adding a new graphics platform to Gecko is + accomplished by adding a backend to Moz2D. + See `Moz2D documentation on wiki <https://wiki.mozilla.org/Platform/GFX/Moz2D>`__. + +Compositing +~~~~~~~~~~~ + +Image Decoding +~~~~~~~~~~~~~~ + +Image Animation +~~~~~~~~~~~~~~~ + +`Historical Documents <http://www.youtube.com/watch?v=lLZQz26-kms>`__ +--------------------------------------------------------------------- + +A number of posts and blogs that will give you more details or more +background, or reasoning that led to different solutions and approaches. + +- 2010-01 `Layers: Cross Platform Acceleration <http://www.basschouten.com/blog1.php/layers-cross-platform-acceleration>`__ +- 2010-04 `Layers <http://robert.ocallahan.org/2010/04/layers_01.html>`__ +- 2010-07 `Retained Layers <http://robert.ocallahan.org/2010/07/retained-layers_16.html>`__ +- 2011-04 `Introduction <https://web.archive.org/web/20140604005804/https://blog.mozilla.org/joe/2011/04/26/introducing-the-azure-project/>`__ +- 2011-07 `Layers <http://chrislord.net/index.php/2011/07/25/shadow-layers-and-learning-by-failing/%20Shadow>`__ +- 2011-09 `Graphics API Design <http://robert.ocallahan.org/2011/09/graphics-api-design.html>`__ +- 2012-04 `Moz2D Canvas on OSX <http://muizelaar.blogspot.ca/2012/04/azure-canvas-on-os-x.html>`__ +- 2012-05 `Mask Layers <http://featherweightmusings.blogspot.co.uk/2012/05/mask-layers_26.html>`__ +- 2013-07 `Graphics related <http://www.basschouten.com/blog1.php>`__ diff --git a/gfx/docs/LayersHistory.rst b/gfx/docs/LayersHistory.rst new file mode 100644 index 0000000000..360df9b37d --- /dev/null +++ b/gfx/docs/LayersHistory.rst @@ -0,0 +1,63 @@ +Layers History +============== + +This is an overview of the major events in the history of our Layers +infrastructure. + +- iPhone released in July 2007 (Built on a toolkit called LayerKit) + +- Core Animation (October 2007) LayerKit was publicly renamed to OS X + 10.5 + +- Webkit CSS 3d transforms (July 2009) + +- Original layers API (March 2010) Introduced the idea of a layer + manager that would composite. One of the first use cases for this was + hardware accelerated YUV conversion for video. + +- Retained layers (July 7 2010 - Bug 564991) This was an important + concept that introduced the idea of persisting the layer content + across paints in gecko controlled buffers instead of just by the OS. + This introduced the concept of buffer rotation to deal with scrolling + instead of using the native scrolling APIs like ScrollWindowEx + +- Layers IPC (July 2010 - Bug 570294) This introduced shadow layers and + edit lists and was originally done for e10s v1 + +- 3D transforms (September 2011 - Bug 505115) + +- OMTC (December 2011 - Bug 711168) This was prototyped on OS X but + shipped first for Fennec + +- Tiling v1 (April 2012 - Bug 739679) Originally done for Fennec. This + was done to avoid situations where we had to do a bunch of work for + scrolling a small amount. i.e. buffer rotation. It allowed us to have + a variety of interesting features like progressive painting and lower + resolution painting. 
- C++ Async pan zoom controller (July 2012 - Bug 750974) The existing
  APZ code was in Java for Fennec, so this was reimplemented.

- Streaming WebGL Buffers (February 2013 - Bug 716859) Infrastructure
  to allow OMTC WebGL and avoid the need to glFinish() every frame.

- Compositor API (April 2013 - Bug 825928) The planning for this
  started around November 2012. Layers refactoring created a compositor
  API that abstracted away the differences between D3D and OpenGL.
  The main piece of the API is DrawQuad.

- Tiling v2 (Mar 7 2014 - Bug 963073) Tiling for B2G. This work is
  mainly porting tiled layers to new textures, implementing
  double-buffered tiles and implementing a texture client pool, to be
  used by tiled content clients.

  A large motivation for the pool was the very slow performance of
  allocating tiles because of the sync messages to the compositor.

  The slow performance of allocating was directly addressed by bug 959089
  which allowed us to allocate gralloc buffers without sync messages to
  the compositor thread.

- B2G WebGL performance (May 2014 - Bug 1006957, 1001417, 1024144) This
  work improved the synchronization mechanism between the compositor
  and the producer.
diff --git a/gfx/docs/OffMainThreadPainting.rst b/gfx/docs/OffMainThreadPainting.rst
new file mode 100644
index 0000000000..c5a75f6025
--- /dev/null
+++ b/gfx/docs/OffMainThreadPainting.rst
@@ -0,0 +1,237 @@
Off Main Thread Painting
========================

OMTP, or 'off main thread painting', is the component of Gecko that
allows us to perform painting of web content off of the main thread.
This gives us more time on the main thread for javascript, layout,
display list building, and other tasks, which allows us to increase our
responsiveness.

Take a look at this `blog
post <https://mozillagfx.wordpress.com/2017/12/05/off-main-thread-painting/>`__
for an introduction.

Background
----------

Painting (or rasterization) is the last operation that happens in a
layer transaction before we forward it to the compositor. At this point,
all display items have been assigned to a layer and invalid regions have
been calculated and assigned to each layer.

The painted layer uses a content client to acquire a buffer for
painting. The main purpose of the content client is to allow us to
retain already painted content when we are scrolling a layer. We have
two main strategies for this, rotated buffer and tiling.

This is implemented with two class hierarchies. ``ContentClient`` for
rotated buffer and ``TiledContentClient`` for tiling. Additionally we
have two different painted layer implementations, ``ClientPaintedLayer``
and ``ClientTiledPaintedLayer``.

The main distinction between rotated buffer and tiling is the amount of
graphics surfaces required. Rotated buffer utilizes just a single buffer
for a frame but potentially requires painting into it multiple times.
Tiling uses multiple buffers but doesn't require painting into the
buffers multiple times.

Once the painted layer has a surface (or surfaces with tiling) to paint
into, they are wrapped in a ``DrawTarget`` of some form and a callback
to ``FrameLayerBuilder`` is called. This callback uses the assigned
display items and invalid regions to trigger rasterization. Each
``nsDisplayItem`` has its ``Paint`` method called with the provided
``DrawTarget`` that represents the surface, and paints into it.
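To make the shape of this final step concrete, below is a minimal,
self-contained sketch of the idea. The types (``DrawTarget``,
``DisplayItem``, ``PaintedLayer``) are simplified stand-ins invented for
the example, not the real Gecko interfaces:

.. code-block:: cpp

   #include <memory>
   #include <vector>

   // Simplified stand-ins for the real DrawTarget, nsDisplayItem and
   // painted layer classes.
   struct Rect { int x, y, w, h; };

   struct DrawTarget {
     void ClipTo(const Rect&) { /* restrict subsequent drawing */ }
     void FillRect(const Rect&) { /* rasterize */ }
   };

   struct DisplayItem {
     Rect bounds;
     virtual ~DisplayItem() = default;
     // Mirrors the role of nsDisplayItem::Paint: draw into the target.
     virtual void Paint(DrawTarget& target) { target.FillRect(bounds); }
   };

   struct PaintedLayer {
     std::vector<std::unique_ptr<DisplayItem>> assignedItems;
     Rect invalidRegion;  // the real code uses a full region, not one rect

     // The content client hands back a buffer wrapped in a DrawTarget;
     // the layer-builder callback then walks the assigned items.
     void PaintInvalidArea(DrawTarget& buffer) {
       buffer.ClipTo(invalidRegion);   // only redraw what changed
       for (auto& item : assignedItems) {
         item->Paint(buffer);          // each item rasterizes itself
       }
     }
   };

The important point is that rasterization is driven entirely by the
display items assigned to the layer, restricted to the layer's invalid
region.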
High level
----------

The key abstraction that allows us to paint off the main thread is
``DrawTargetCapture`` [1]_. ``DrawTargetCapture`` is a special
``DrawTarget`` which records all draw commands for replaying to another
draw target in the local process. This is similar to
``DrawTargetRecording``, but it only holds a reference to resources
instead of copying them into the command stream. This allows the command
stream to be much more lightweight than ``DrawTargetRecording``.

OMTP works by instrumenting the content clients to use a capture target
for all painting [2]_ [3]_ [4]_ [5]_. This capture draw target records all
the operations that would normally be performed directly on the
surface's draw target. Once we have all of the commands, we send the
capture and surface draw target to the ``PaintThread`` [6]_ where the
commands are replayed onto the surface. Once the rasterization is done,
we forward the layer transaction to the compositor.

Tiling and parallel painting
----------------------------

We can make one additional improvement if we are using tiling as our
content client backend.

When we are tiling, the screen is subdivided into a grid of equally
sized surfaces and draw commands are performed on the tiles they affect.
Each tile is independent of the others, so we're able to parallelize
painting by using a worker thread pool and dispatching a task for each
tile individually.

This is commonly referred to as P-OMTP or parallel painting.

Main thread rasterization
-------------------------

Even with OMTP it's still possible for the main thread to perform
rasterization. A common pattern for painting code is to create a
temporary draw target, perform drawing with it, take a snapshot, and
then draw the snapshot onto the main draw target. This is done for
blurs, box shadows, text shadows, and with the basic layer manager
fallback.

If the temporary draw target is not a draw target capture, then this
will perform rasterization on the main thread. This can be bad as it
lowers our parallelism and can cause contention with content backends,
like Direct2D, that use locking around shared resources.

To work around this, we changed the main thread painting code to use a
draw target capture for these operations and added a source surface
capture [7]_ which only resolves the painting of the draw commands when
needed on the paint thread.

There are still cases where we can perform main thread rasterization,
but we try to address them as they come up.

Out of memory issues
--------------------

The web is very complex, so we can sometimes have a very large number of
draw commands for a content paint. We've observed OOM errors for capture
command lists that have grown to be 200MiB large.

We initially tried to mitigate this by lowering the overhead of capture
command lists. We do this by filtering commands that don't actually
change the draw target state and folding consecutive transform changes,
but that was not always enough. So we added the ability for our draw
target captures to flush their command lists to the surface draw target
while we are capturing on the main thread [8]_.

This flush is triggered by a configurable memory limit. Because it
introduces a new source of main thread rasterization, we try to strike a
balance between setting the limit too low (and suffering poor
performance) and setting it too high (and risking crashes).
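Below is a rough, self-contained model of the capture-and-replay idea,
including the memory-limit flush just described. It is an illustrative
sketch of the concept only; the class, method names, and signatures are
invented for the example and do not match the real ``DrawTargetCapture``
API:

.. code-block:: cpp

   #include <functional>
   #include <mutex>
   #include <vector>

   struct DrawTarget { /* the real, surface-backed draw target */ };

   // Illustrative model of a capture target: it records draw commands as
   // callables that hold references/handles to resources (keeping the
   // command stream lightweight) and replays them later on another thread.
   class CaptureTarget {
    public:
     explicit CaptureTarget(size_t flushLimitBytes)
         : mFlushLimit(flushLimitBytes) {}

     void SetSurface(DrawTarget* surface) { mSurface = surface; }

     // Called on the main thread while "painting".
     void Record(std::function<void(DrawTarget&)> cmd, size_t approxBytes) {
       std::lock_guard<std::mutex> lock(mLock);
       mCommands.push_back(std::move(cmd));
       mBytes += approxBytes;
       // Mirrors the out-of-memory mitigation: if the command list grows
       // past a configurable limit, flush synchronously to the surface.
       if (mBytes > mFlushLimit && mSurface) {
         ReplayLocked(*mSurface);
       }
     }

     // Called on the paint thread once the transaction is forwarded.
     void Replay(DrawTarget& surface) {
       std::lock_guard<std::mutex> lock(mLock);
       ReplayLocked(surface);
     }

    private:
     void ReplayLocked(DrawTarget& surface) {
       for (auto& cmd : mCommands) cmd(surface);
       mCommands.clear();
       mBytes = 0;
     }

     std::mutex mLock;
     std::vector<std::function<void(DrawTarget&)>> mCommands;
     size_t mBytes = 0;
     size_t mFlushLimit;
     DrawTarget* mSurface = nullptr;
   };

With tiling, each tile gets its own capture/surface pair, so every replay
can be dispatched as an independent task to the worker thread pool, which
is what enables parallel painting.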
Synchronization
---------------

OMTP is conceptually simple, but in practice it relies on subtle code to
ensure thread safety. This was arguably the most difficult part of the
project.

There are roughly four areas that are critical.

1. Compositor message ordering

   Immediately after we queue the async paints to be asynchronously
   completed, we have a problem. We need to forward the layer
   transaction at some point, but the compositor cannot process the
   transaction until all async paints have finished. If it did, it could
   access unfinished painted content.

   We obviously can't block on the async paints completing, as that
   would defeat the whole point of OMTP. We also can't hold off on
   sending the layer transaction to ``IPDL``, as we'd trigger race
   conditions for messages sent after the layer transaction is built but
   before it is forwarded. Reftests and other code assume that messages
   sent after a layer transaction to the compositor are processed after
   that layer transaction is processed.

   The solution is to forward the layer transaction to the compositor
   over ``IPDL``, but flag the message channel to start postponing
   messages [9]_. Then, once all async paints have completed, we unflag
   the message channel and all postponed messages are sent [10]_. This
   allows us to keep our message ordering guarantees without having to
   worry about scheduling a runnable in the future.

2. Texture clients

   The backing store for content surfaces is managed by a texture
   client. While async paints are executing, it's possible for shutdown
   or any number of things to happen that could cause the layer manager,
   all layers, all content clients, and therefore all texture clients to
   be destroyed. Therefore it's important that we keep these texture
   clients alive throughout async painting. Texture clients also manage
   IPC resources and must be destroyed on the main thread, so we are
   careful to do that [11]_.

3. Double buffering

   We currently double buffer our content painting - our content clients
   only ever have zero or one texture that is available to be painted
   into at any moment.

   This implies that we cannot start async painting a layer tree while
   previous async paints are still active, as this would lead to awful
   races. We also don't support multiple nested sets of postponed IPC
   messages, which would be needed to send the first layer transaction
   to the compositor but not the second.

   To prevent issues with this, we flush all active async paints before
   we begin to paint a new layer transaction [12]_.

   There was some initial debate about implementing triple buffering for
   content painting, but we have not seen evidence it would help us
   significantly.

4. Moz2D thread safety

   Finally, most Moz2D objects were not thread safe. We had to insert
   special locking into draw target and source surface, as they have a
   special copy-on-write relationship that must be consistent even if
   they are on different threads.

   Some platform specific resources like fonts needed locking added in
   order to be thread safe. We also did some work to make filter nodes
   work with multiple threads executing them at the same time.

Browser process
---------------

Currently only content processes are able to use OMTP.

This restriction was added because of concern about message ordering
between ``APZ`` and OMTP. It might be possible to lift it in the future.

Important bugs
--------------

1.
`OMTP Meta <https://bugzilla.mozilla.org/show_bug.cgi?id=omtp>`__ +2. `Enable on + Windows <https://bugzilla.mozilla.org/show_bug.cgi?id=1403935>`__ +3. `Enable on + OSX <https://bugzilla.mozilla.org/show_bug.cgi?id=1422392>`__ +4. `Enable on + Linux <https://bugzilla.mozilla.org/show_bug.cgi?id=1432531>`__ +5. `Parallel + painting <https://bugzilla.mozilla.org/show_bug.cgi?id=1425056>`__ + +Code links +---------- + +.. [1] `DrawTargetCapture <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/2d/DrawTargetCapture.h#22>`__ +.. [2] `Creating DrawTargetCapture for rotated + buffer <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/client/ContentClient.cpp#185>`__ +.. [3] `Dispatch DrawTargetCapture for rotated + buffer <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/client/ClientPaintedLayer.cpp#99>`__ +.. [4] `Creating DrawTargetCapture for + tiling <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/client/TiledContentClient.cpp#714>`__ +.. [5] `Dispatch DrawTargetCapture for + tiling <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/client/MultiTiledContentClient.cpp#288>`__ +.. [6] `PaintThread <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/PaintThread.h#53>`__ +.. [7] `SourceSurfaceCapture <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/2d/SourceSurfaceCapture.h#19>`__ +.. [8] `Sync flushing draw + commands <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/2d/DrawTargetCapture.h#165>`__ +.. [9] `Postponing messages for + PCompositorBridge <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/ipc/CompositorBridgeChild.cpp#1319>`__ +.. [10] `Releasing messages for + PCompositorBridge <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/ipc/CompositorBridgeChild.cpp#1303>`__ +.. [11] `Releasing texture clients on main + thread <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/ipc/CompositorBridgeChild.cpp#1170>`__ +.. [12] `Flushing async + paints <https://searchfox.org/mozilla-central/rev/dd965445ec47fbf3cee566eff93b301666bda0e1/gfx/layers/client/ClientLayerManager.cpp#289>`__ diff --git a/gfx/docs/RenderingOverview.rst b/gfx/docs/RenderingOverview.rst new file mode 100644 index 0000000000..50b146d9b9 --- /dev/null +++ b/gfx/docs/RenderingOverview.rst @@ -0,0 +1,384 @@ +Rendering Overview +================== + +This document is an overview of the steps to render a webpage, and how HTML +gets transformed and broken down, step by step, into commands that can execute +on the GPU. + +If you're coming into the graphics team with not a lot of background +in browsers, start here :) + +.. contents:: + +High level overview +------------------- + +.. image:: RenderingOverviewSimple.png + :width: 100% + +Layout +~~~~~~ +Starting at the left in the above image, we have a document +represented by a DOM - a Document Object Model. A Javascript engine +will execute JS code, either to make changes to the DOM, or to respond to +events generated by the DOM (or do both). + +The DOM is a high level description and we don't know what to draw or +where until it is combined with a Cascading Style Sheet (CSS). 
+Combining these two and figuring out what, where and how to draw +things is the responsibility of the Layout team. The +DOM is converted into a hierarchical Frame Tree, which nests visual +elements (boxes). Each element points to some node in a Style Tree +that describes what it should look like -- color, transparency, etc. +The result is that now we know exactly what to render where, what goes +on top of what (layering and blending) and at what pixel coordinate. +This is the Display List. + +The Display List is a light-weight data structure because it's shallow +-- it mostly points back to the Frame Tree. There are two problems +with this. First, we want to cross process boundaries at this point. +Everything up until now happens in a Content Process (of which there are +several). Actual GPU rendering happens in a GPU Process (on some +platforms). Second, everything up until now was written in C++; but +WebRender is written in Rust. Thus the shallow Display List needs to +be serialized in a completely self-contained binary blob that will +survive Interprocess Communication (IPC) and a language switch (C++ to +Rust). The result is the WebRender Display List. + +WebRender +~~~~~~~~~ + +The GPU process receives the WebRender Display List blob and +de-serializes it into a Scene. This Scene contains more than the +strictly visible elements; for example, to anticipate scrolling, we +might have several paragraphs of text extending past the visible page. + +For a given viewport, the Scene gets culled and stripped down to a +Frame. This is also where we start preparing data structures for GPU +rendering, for example getting some font glyphs into an atlas for +rasterizing text. + +The final step takes the Frame and submits commands to the GPU to +actually render it. The GPU will execute the commands and composite +the final page. + +Software +~~~~~~~~ + +The above is the new WebRender-enabled way to do things. But in the +schematic you'll note a second branch towards the bottom: this is the +legacy code path which does not use WebRender (nor Rust). In this +case, the Display List is converted into a Layer Tree. The purpose of +this Tree is to try and avoid having to re-render absolutely +everything when the page needs to be refreshed. For example, when +scrolling we should be able to redraw the page by mostly shifting +things around. However that requires those 'things' to still be around +from last time we drew the page. In other words, visual elements that +are likely to be static and reusable need to be drawn into their own +private "page" (a cache). Then we can recombine (composite) all of +these when redrawing the actual page. + +Figuring out which elements would be good candidates for this, and +striking a balance between good performance versus excessive memory +use, is the purpose of the Layer Tree. Each 'layer' is a cached image +of some element(s). This logic also takes occlusion into account, eg. +don't allocate and render a layer for elements that are known to be +completely obscured by something in front of them. + +Redrawing the page by combining the Layer Tree with any newly +rasterized elements is the job of the Compositor. + + +Even when a layer cannot be reused in its entirety, it is likely +that only a small part of it was invalidated. Thus there is an +elaborate system for tracking dirty rectangles, starting an update by +copying the area that can be salvaged, and then redrawing only what +cannot. 
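As a rough illustration of the dirty-rectangle idea, here is a small,
self-contained sketch: a retained layer keeps its cached pixels,
accumulates invalidation rectangles, and on the next update re-rasterizes
only those areas while everything else is salvaged from the cache. The
types and names are placeholders invented for the example, not the
actual layers or invalidation classes:

.. code-block:: cpp

   #include <functional>
   #include <vector>

   struct Rect { int x, y, w, h; };

   // Illustrative model of a retained layer with dirty-rect tracking; the
   // real code tracks full regions and also handles buffer rotation or
   // tiling, but the principle is the same.
   struct RetainedLayer {
     std::vector<unsigned char> cachedPixels;  // the retained rasterization
     std::vector<Rect> dirtyRects;             // invalidated since last paint

     void Invalidate(const Rect& r) { dirtyRects.push_back(r); }

     void Update(const std::function<void(const Rect&,
                                          std::vector<unsigned char>&)>& rasterize) {
       // Everything outside the dirty rects is salvaged from the cache;
       // only the invalidated areas are redrawn.
       for (const Rect& r : dirtyRects) {
         rasterize(r, cachedPixels);
       }
       dirtyRects.clear();
     }
   };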
+ +In fact, this idea can be extended to delta-tracking of display lists +themselves. Traversing the layout tree and building a display list is +also not cheap, so the code tries to partially invalidate and rebuild +the display list incrementally when possible. +This optimization is used both for non-WebRender and WebRender in +fact. + + +Asynchronous Panning And Zooming +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Earlier we mentioned that a Scene might contain more elements than are +strictly necessary for rendering what's visible (the Frame). The +reason for that is Asynchronous Panning and Zooming, or APZ for short. +The browser will feel much more responsive if scrolling & zooming can +short-circuit all of these data transformations and IPC boundaries, +and instead directly update an offset of some layer and recomposite. +(Think of late-latching in a VR context) + +This simple idea introduces a lot of complexity: how much extra do you +rasterize, and in which direction? How much memory can we afford? +What about Javascript that responds to scroll events and perhaps does +something 'interesting' with the page in return? What about nested +frames or nested scrollbars? What if we scroll so much that we go +past the boundaries of the Scene that we know about? + +See AsyncPanZoom.rst for all that and more. + +A Few More Details +~~~~~~~~~~~~~~~~~~ + +Here's another schematic which basically repeats the previous one, but +showing a little bit more detail. Note that the direction is reversed +-- the data flow starts at the right. Sorry about that :) + +.. image:: RenderingOverviewDetail.png + :width: 100% + +Some things to note: + +- there are multiple content processes, currently 4 of them. This is + for security reasons (sandboxing), stability (isolate crashes) and + performance (multi-core machines); +- ideally each "webpage" would run in its own process for security; + this is being developed under the term 'fission'; +- there is only a single GPU process, if there is one at all; + some platforms have it as part of the Parent; +- not shown here is the Extension process that isolates WebExtensions; +- for non-WebRender, rasterization happens in the Content Process, and + we send entire Layers to the GPU/Compositor process (via shared + memory, only using actual IPC for its metadata like width & height); +- if the GPU process crashes (a bug or a driver issue) we can simply + restart it, resend the display list, and the browser itself doesn't crash; +- the browser UI is just another set of DOM+JS, albeit one that runs + with elevated privileges. That is, its JS can do things that + normal JS cannot. It lives in the Parent Process, which then uses + IPC to get it rendered, same as regular Content. (the IPC arrow also + goes to WebRender Display List but is omitted to reduce clutter); +- UI events get routed to APZ first, to minimize latency. By running + inside the GPU process, we may have access to data such + as rasterized clipping masks that enables finer grained hit testing; +- the GPU process talks back to the content process; in particular, + when APZ scrolls out of bounds, it asks Content to enlarge/shift the + Scene with a new "display port"; +- we still use the GPU when we can for compositing even in the + non-WebRender case; + + +WebRender In Detail +------------------- + +Converting a display list into GPU commands is broken down into a +number of steps and intermediate data structures. + + +.. image:: RenderingOverviewTrees.png + :width: 75% + :align: center + +.. 
+ + *Each element in the picture tree points to exactly one node in the spatial + tree. Only a few of these links are shown for clarity (the dashed lines).* + +The Picture Tree +~~~~~~~~~~~~~~~~ + +The incoming display list uses "stacking contexts". For example, to +render some text with a drop shadow, a display list will contain three +items: + +- "enable shadow" with some parameters such as shadow color, blur size, and offset; +- the text item; +- "pop all shadows" to deactivate shadows; + +WebRender will break this down into two distinct elements, or +"pictures". The first represents the shadow, so it contains a copy of the +text item, but modified to use the shadow's color, and to shift the +text by the shadow's offset. The second picture contains the original text +to draw on top of the shadow. + +The fact that the first picture, the shadow, needs to be blurred, is a +"compositing" property of the picture which we'll deal with later. + +Thus, the stack-based display list gets converted into a list of pictures +-- or more generally, a hierarchy of pictures, since items are nested +as per the original HTML. + +Example visual elements are a TextRun, a LineDecoration, or an Image +(like a .png file). + +Compared to 3D rendering, the picture tree is similar to a scenegraph: it's a +parent/child hierarchy of all the drawable elements that make up the "scene", in +this case the webpage. One important difference is that the transformations are +stored in a separate tree, the spatial tree. + +The Spatial Tree +~~~~~~~~~~~~~~~~ + +The nodes in the spatial tree represent coordinate transforms. Every time the +DOM hierarchy needs child elements to be transformed relative to their parent, +we add a new Spatial Node to the tree. All those child elements will then point +to this node as their "local space" reference (aka coordinate frame). In +traditional 3D terms, it's a scenegraph but only containing transform nodes. + +The nodes are called frames, as in "coordinate frame": + +- a Reference Frame corresponds to a ``<div>``; +- a Scrolling Frame corresponds to a scrollable part of the page; +- a Sticky Frame corresponds to some fixed position CSS style. + +Each element in the picture tree then points to a spatial node inside this tree, +so by walking up and down the tree we can find the absolute position of where +each element should render (traversing down) and how large each element needs to +be (traversing up). Originally the transform information was part of the +picture tree, as in a traditional scenegraph, but visual elements and their +transforms were split apart for technical reasons. + +Some of these nodes are dynamic. A scroll-frame can obviously scroll, but a +Reference Frame might also use a property binding to enable a live link with +JavaScript, for dynamic updates of (currently) the transform and opacity. + +Axis-aligned transformations (scales and translations) are considered "simple", +and are conceptually combined into a single "CoordinateSystem". When we +encounter a non-axis-aligned transform, we start a new CoordinateSystem. We +start in CoordinateSystem 0 at the root, and would bump this to CoordinateSystem +1 when we encounter a Reference Frame with a rotation or 3D transform, for +example. This would then be the CoordinateSystem index for all its children, +until we run into another (nested) non-simple transform, and so on. 
Roughly +speaking, as long as we're in the same CoordinateSystem, the transform stack is +simple enough that we have a reasonable chance of being able to flatten it. That +lets us directly rasterize text at its final scale for example, optimizing +away some of the intermediate pictures (offscreen textures). + +The layout code positions elements relative to their parent. Thus to position +the element on the actual page, we need to walk the Spatial Tree all the way to +the root and apply each transform; the result is a ``LayoutToWorldTransform``. + +One final step transforms from World to Device coordinates, which deals with +DPI scaling and such. + +.. csv-table:: + :header: "WebRender term", "Rough analogy" + + Spatial Tree, Scenegraph -- transforms only + Picture Tree, Scenegraph -- drawables only (grouping) + Spatial Tree Rootnode, World Space + Layout space, Local/Object Space + Picture, RenderTarget (sort of; see RenderTask below) + Layout-To-World transform, Local-To-World transform + World-To-Device transform, World-To-Clipspace transform + + +The Clip Tree +~~~~~~~~~~~~~ + +Finally, we also have a Clip Tree, which contains Clip Shapes. For +example, a rounded corner div will produce a clip shape, and since +divs can be nested, you end up with another tree. By pointing at a Clip Shape, +visual elements will be clipped against this shape plus all parent shapes above it +in the Clip Tree. + +As with CoordinateSystems, a chain of simple 2D clip shapes can be collapsed +into something that can be handled in the vertex shader, at very little extra +cost. More complex clips must be rasterized into a mask first, which we then +sample from to ``discard`` in the pixel shader as needed. + +In summary, at the end of scene building the display list turned into +a picture tree, plus a spatial tree that tells us what goes where +relative to what, plus a clip tree. + +RenderTask Tree +~~~~~~~~~~~~~~~ + +Now in a perfect world we could simply traverse the picture tree and start +drawing things: one drawcall per picture to render its contents, plus one +drawcall to draw the picture into its parent. However, recall that the first +picture in our example is a "text shadow" that needs to be blurred. We can't +just rasterize blurry text directly, so we need a number of steps or "render +passes" to get the intended effect: + +.. image:: RenderingOverviewBlurTask.png + :align: right + :height: 400px + +- rasterize the text into an offscreen rendertarget; +- apply one or more downscaling passes until the blur radius is reasonable; +- apply a horizontal Gaussian blur; +- apply a vertical Gaussian blur; +- use the result as an input for whatever comes next, or blit it to + its final position on the page (or more generally, on the containing + parent surface/picture). + +In the general case, which passes we need and how many of them depends +on how the picture is supposed to be composited (CSS filters, SVG +filters, effects) and its parameters (very large vs. small blur +radius, say). + +Thus, we walk the picture tree and build a render task tree: each high +level abstraction like "blur me" gets broken down into the necessary +render passes to get the effect. The result is again a tree because a +render pass can have multiple input dependencies (eg. blending). + +(Cfr. games, this has echoes of the Frostbite Framegraph in that it +dynamically builds up a renderpass DAG and dynamically allocates storage +for the outputs). 
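As a small illustration, here is a self-contained sketch of the render
task chain for the blurred text-shadow example above. The types and
names are placeholders invented for the example; the real render task
graph is built inside WebRender (in Rust) during frame building:

.. code-block:: cpp

   #include <memory>
   #include <string>
   #include <vector>

   // Illustrative model of a render task tree: each task names a render
   // pass and lists the tasks whose output it consumes.
   struct RenderTask {
     std::string pass;                                 // e.g. "blur-h"
     std::vector<std::shared_ptr<RenderTask>> inputs;  // dependencies
   };

   // Build the chain of passes needed for the blurred text shadow:
   // rasterize glyphs -> downscale -> horizontal blur -> vertical blur.
   std::shared_ptr<RenderTask> BuildBlurTask() {
     auto text      = std::make_shared<RenderTask>(RenderTask{"rasterize-text", {}});
     auto downscale = std::make_shared<RenderTask>(RenderTask{"downscale", {text}});
     auto blurH     = std::make_shared<RenderTask>(RenderTask{"blur-h", {downscale}});
     auto blurV     = std::make_shared<RenderTask>(RenderTask{"blur-v", {blurH}});
     return blurV;  // the parent picture samples from this task's output
   }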
+ +If there are complicated clip shapes that need to be rasterized first, +so their output can be sampled as a texture for clip/discard +operations, that would also end up in this tree as a dependency... (I think?). + +Once we have the entire tree of dependencies, we analyze it to see +which tasks can be combined into a single pass for efficiency. We +ping-pong rendertargets when we can, but sometimes the dependencies +cut across more than one level of the rendertask tree, and some +copying is necessary. + +Once we've figured out the passes and allocated storage for anything +we wish to persist in the texture cache, we finally start rendering. + +When rasterizing the elements into the Picture's offscreen texture, we'd +position them by walking the transform hierarchy as far up as the picture's +transform node, resulting in a ``Layout To Picture`` transform. The picture +would then go onto the page using a ``Picture To World`` coordinate transform. + +Caching +``````` + +Just as with layers in the software rasterizer, it is not always necessary to +redraw absolutely everything when parts of a document change. The webrender +equivalent of layers is Slices -- a grouping of pictures that are expected to +render and update together. Slices are automatically created based on +heuristics and layout hints/flags. + +Implementation wise, slices re-use a lot of the existing machinery for Pictures; +in fact they're implemented as a "Virtual picture" of sorts. The similarities +make sense: both need to allocate offscreen textures in a cache, both will +position and render all their children into it, and both then draw themselves +into their parent as part of the parent's draw. + +If a slice isn't expected to change much, we give it a TileCacheInstance. It is +itself made up of Tiles, where each tile will track what's in it, what's +changing, and if it needs to be invalidated and redrawn or not as a result. +Thus the "damage" from changes can be localized to single tiles, while we +salvage the rest of the cache. If tiles keep seeing a lot of invalidations, +they will recursively divide themselves in a quad-tree like structure to try and +localize the invalidations. (And conversely, they'll recombine children if +nothing is invalidating them "for a while"). + +Interning +````````` + +To spot invalidated tiles, we need a fast way to compare its contents from the +previous frame with the current frame. To speed this up, we use interning; +similar to string-interning, this means that each ``TextRun``, ``Decoration``, +``Image`` and so on is registered in a repository (a ``DataStore``) and +consequently referred to by its unique ID. Cache contents can then be encoded as a +list of IDs (one such list per internable element type). Diffing is then just a +fast list comparison. + + +Callbacks +````````` +GPU text rendering assumes that the individual font-glyphs are already +available in a texture atlas. Likewise SVG is not being rendered on +the GPU. Both inputs are prepared during scene building; glyph +rasterization via a thread pool from within Rust itself, and SVG via +opaque callbacks (back to C++) that produce blobs. 
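To close with a concrete picture of the interning idea described above:
drawables are registered once in a data store and referred to by small
IDs, so a tile's contents become a list of IDs and change detection is a
cheap list comparison. The sketch below is self-contained and the names
are placeholders, not the actual WebRender types:

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>
   #include <unordered_map>
   #include <vector>

   using ItemId = uint64_t;

   // Illustrative model of a DataStore: every drawable (TextRun, Image,
   // ...) is registered once and referred to by a small ID afterwards.
   struct DataStore {
     std::unordered_map<std::size_t, ItemId> byHash;  // content hash -> ID
     ItemId next = 1;

     ItemId Intern(std::size_t contentHash) {
       auto it = byHash.find(contentHash);
       if (it != byHash.end()) return it->second;     // already interned
       return byHash[contentHash] = next++;
     }
   };

   // A tile needs repainting if the list of interned IDs it contained
   // last frame differs from the list it would contain this frame.
   bool TileNeedsRepaint(const std::vector<ItemId>& previous,
                         const std::vector<ItemId>& current) {
     return previous != current;
   }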
diff --git a/gfx/docs/RenderingOverviewBlurTask.png b/gfx/docs/RenderingOverviewBlurTask.png Binary files differnew file mode 100644 index 0000000000..baffc08f32 --- /dev/null +++ b/gfx/docs/RenderingOverviewBlurTask.png diff --git a/gfx/docs/RenderingOverviewDetail.png b/gfx/docs/RenderingOverviewDetail.png Binary files differnew file mode 100644 index 0000000000..2909a811e4 --- /dev/null +++ b/gfx/docs/RenderingOverviewDetail.png diff --git a/gfx/docs/RenderingOverviewSimple.png b/gfx/docs/RenderingOverviewSimple.png Binary files differnew file mode 100644 index 0000000000..43c0a59439 --- /dev/null +++ b/gfx/docs/RenderingOverviewSimple.png diff --git a/gfx/docs/RenderingOverviewTrees.png b/gfx/docs/RenderingOverviewTrees.png Binary files differnew file mode 100644 index 0000000000..ffdf0812fa --- /dev/null +++ b/gfx/docs/RenderingOverviewTrees.png diff --git a/gfx/docs/Silk.rst b/gfx/docs/Silk.rst new file mode 100644 index 0000000000..45ec627a1e --- /dev/null +++ b/gfx/docs/Silk.rst @@ -0,0 +1,472 @@ +Silk Overview +========================== + +.. image:: SilkArchitecture.png + +Architecture +------------ + +Our current architecture is to align three components to hardware vsync +timers: + +1. Compositor +2. RefreshDriver / Painting +3. Input Events + +The flow of our rendering engine is as follows: + +1. Hardware Vsync event occurs on an OS specific *Hardware Vsync Thread* + on a per monitor basis. +2. The *Hardware Vsync Thread* attached to the monitor notifies the + ``CompositorVsyncDispatchers`` and ``RefreshTimerVsyncDispatcher``. +3. For every Firefox window on the specific monitor, notify a + ``CompositorVsyncDispatcher``. The ``CompositorVsyncDispatcher`` is + specific to one window. +4. The ``CompositorVsyncDispatcher`` notifies a + ``CompositorWidgetVsyncObserver`` when remote compositing, or a + ``CompositorVsyncScheduler::Observer`` when compositing in-process. +5. If remote compositing, a vsync notification is sent from the + ``CompositorWidgetVsyncObserver`` to the ``VsyncBridgeChild`` on the + UI process, which sends an IPDL message to the ``VsyncBridgeParent`` + on the compositor thread of the GPU process, which then dispatches to + ``CompositorVsyncScheduler::Observer``. +6. The ``RefreshTimerVsyncDispatcher`` notifies the Chrome + ``RefreshTimer`` that a vsync has occurred. +7. The ``RefreshTimerVsyncDispatcher`` sends IPC messages to all content + processes to tick their respective active ``RefreshTimer``. +8. The ``Compositor`` dispatches input events on the *Compositor + Thread*, then composites. Input events are only dispatched on the + *Compositor Thread* on b2g. +9. The ``RefreshDriver`` paints on the *Main Thread*. + +Hardware Vsync +-------------- + +Hardware vsync events from (1), occur on a specific ``Display`` Object. +The ``Display`` object is responsible for enabling / disabling vsync on +a per connected display basis. For example, if two monitors are +connected, two ``Display`` objects will be created, each listening to +vsync events for their respective displays. We require one ``Display`` +object per monitor as each monitor may have different vsync rates. As a +fallback solution, we have one global ``Display`` object that can +synchronize across all connected displays. The global ``Display`` is +useful if a window is positioned halfway between the two monitors. Each +platform will have to implement a specific ``Display`` object to hook +and listen to vsync events. 
As of this writing, both Firefox OS and OS X +create their own hardware specific *Hardware Vsync Thread* that executes +after a vsync has occurred. OS X creates one *Hardware Vsync Thread* per +``CVDisplayLinkRef``. We do not currently support multiple displays, so +we use one global ``CVDisplayLinkRef`` that works across all active +displays. On Windows, we have to create a new platform ``thread`` that +waits for DwmFlush(), which works across all active displays. Once the +thread wakes up from DwmFlush(), the actual vsync timestamp is retrieved +from DwmGetCompositionTimingInfo(), which is the timestamp that is +actually passed into the compositor and refresh driver. + +When a vsync occurs on a ``Display``, the *Hardware Vsync Thread* +callback fetches all ``CompositorVsyncDispatchers`` associated with the +``Display``. Each ``CompositorVsyncDispatcher`` is notified that a vsync +has occurred with the vsync’s timestamp. It is the responsibility of the +``CompositorVsyncDispatcher`` to notify the ``Compositor`` that is +awaiting vsync notifications. The ``Display`` will then notify the +associated ``RefreshTimerVsyncDispatcher``, which should notify all +active ``RefreshDrivers`` to tick. + +All ``Display`` objects are encapsulated in a ``VsyncSource`` object. +The ``VsyncSource`` object lives in ``gfxPlatform`` and is instantiated +only on the parent process when ``gfxPlatform`` is created. The +``VsyncSource`` is destroyed when ``gfxPlatform`` is destroyed. It can +also be destroyed when the layout frame rate pref (or other prefs that +influence frame rate) are changed. This may mean we switch from hardware +to software vsync (or vice versa) at runtime. During the switch, there +may briefly be 2 vsync sources. Otherwise, there is only one +``VsyncSource`` object throughout the entire lifetime of Firefox. Each +platform is expected to implement their own ``VsyncSource`` to manage +vsync events. On OS X, this is through ``CVDisplayLinkRef``. On +Windows, it should be through ``DwmGetCompositionTimingInfo``. + +Compositor +---------- + +When the ``CompositorVsyncDispatcher`` is notified of the vsync event, +the ``CompositorVsyncScheduler::Observer`` associated with the +``CompositorVsyncDispatcher`` begins execution. Since the +``CompositorVsyncDispatcher`` executes on the *Hardware Vsync Thread* +and the ``Compositor`` composites on the ``CompositorThread``, the +``CompositorVsyncScheduler::Observer`` posts a task to the +``CompositorThread``. The ``CompositorBridgeParent`` then composites. +The model where the ``CompositorVsyncDispatcher`` notifies components on +the *Hardware Vsync Thread*, and the component schedules the task on the +appropriate thread is used everywhere. + +The ``CompositorVsyncScheduler::Observer`` listens to vsync events as +needed and stops listening to vsync when composites are no longer +scheduled or required. Every ``CompositorBridgeParent`` is associated +and tied to one ``CompositorVsyncScheduler::Observer``, which is +associated with the ``CompositorVsyncDispatcher``. Each +``CompositorBridgeParent`` is associated with one widget and is created +when a new platform window or ``nsBaseWidget`` is created. The +``CompositorBridgeParent``, ``CompositorVsyncDispatcher``, +``CompositorVsyncScheduler::Observer``, and ``nsBaseWidget`` all have +the same lifetimes, which are created and destroyed together. + +Out-of-process Compositors +-------------------------- + +When compositing out-of-process, this model changes slightly. 
In this +case there are effectively two observers: a UI process observer +(``CompositorWidgetVsyncObserver``), and the +``CompositorVsyncScheduler::Observer`` in the GPU process. There are +also two dispatchers: the widget dispatcher in the UI process +(``CompositorVsyncDispatcher``), and the IPDL-based dispatcher in the +GPU process (``CompositorBridgeParent::NotifyVsync``). The UI process +observer and the GPU process dispatcher are linked via an IPDL protocol +called PVsyncBridge. ``PVsyncBridge`` is a top-level protocol for +sending vsync notifications to the compositor thread in the GPU process. +The compositor controls vsync observation through a separate actor, +``PCompositorWidget``, which (as a subactor for +``CompositorBridgeChild``) links the compositor thread in the GPU +process to the main thread in the UI process. + +Out-of-process compositors do not go through +``CompositorVsyncDispatcher`` directly. Instead, the +``CompositorWidgetDelegate`` in the UI process creates one, and gives it +a ``CompositorWidgetVsyncObserver``. This observer forwards +notifications to a Vsync I/O thread, where ``VsyncBridgeChild`` then +forwards the notification again to the compositor thread in the GPU +process. The notification is received by a ``VsyncBridgeParent``. The +GPU process uses the layers ID in the notification to find the correct +compositor to dispatch the notification to. + +CompositorVsyncDispatcher +------------------------- + +The ``CompositorVsyncDispatcher`` executes on the *Hardware Vsync +Thread*. It contains references to the ``nsBaseWidget`` it is associated +with and has a lifetime equal to the ``nsBaseWidget``. The +``CompositorVsyncDispatcher`` is responsible for notifying the +``CompositorBridgeParent`` that a vsync event has occurred. There can be +multiple ``CompositorVsyncDispatchers`` per ``Display``, one +``CompositorVsyncDispatcher`` per window. The only responsibility of the +``CompositorVsyncDispatcher`` is to notify components when a vsync event +has occurred, and to stop listening to vsync when no components require +vsync events. We require one ``CompositorVsyncDispatcher`` per window so +that we can handle multiple ``Displays``. When compositing in-process, +the ``CompositorVsyncDispatcher`` is attached to the CompositorWidget +for the window. When out-of-process, it is attached to the +CompositorWidgetDelegate, which forwards observer notifications over +IPDL. In the latter case, its lifetime is tied to a CompositorSession +rather than the nsIWidget. + +Multiple Displays +----------------- + +The ``VsyncSource`` has an API to switch a ``CompositorVsyncDispatcher`` +from one ``Display`` to another ``Display``. For example, when one +window either goes into full screen mode or moves from one connected +monitor to another. When one window moves to another monitor, we expect +a platform specific notification to occur. The detection of when a +window enters full screen mode or moves is not covered by Silk itself, +but the framework is built to support this use case. The expected flow +is that the OS notification occurs on ``nsIWidget``, which retrieves the +associated ``CompositorVsyncDispatcher``. The +``CompositorVsyncDispatcher`` then notifies the ``VsyncSource`` to +switch to the correct ``Display`` the ``CompositorVsyncDispatcher`` is +connected to. Because the notification works through the ``nsIWidget``, +the actual switching of the ``CompositorVsyncDispatcher`` to the correct +``Display`` should occur on the *Main Thread*. 
The current
implementation of Silk does not handle this case and needs to be built
out.

CompositorVsyncScheduler::Observer
----------------------------------

The ``CompositorVsyncScheduler::Observer`` handles the vsync
notifications and interactions with the ``CompositorVsyncDispatcher``.
When the ``Compositor`` requires a scheduled composite, it notifies the
``CompositorVsyncScheduler::Observer`` that it needs to listen to vsync.
The ``CompositorVsyncScheduler::Observer`` then observes / unobserves
vsync as needed from the ``CompositorVsyncDispatcher`` to enable
composites.

GeckoTouchDispatcher
--------------------

The ``GeckoTouchDispatcher`` is a singleton that resamples touch events
to smooth out jank while tracking a user's finger. Because input and
composite are linked together, the
``CompositorVsyncScheduler::Observer`` has a reference to the
``GeckoTouchDispatcher`` and vice versa.

Input Events
------------

One large goal of Silk is to align touch events with vsync events. On
Firefox OS, touchscreens often have different touch scan rates than the
display refresh rate. A Flame device has a touch refresh rate of 75 Hz
and a Nexus 4 has a touch refresh rate of 100 Hz, while the device's
display refresh rate is 60 Hz. When a vsync event occurs, we resample
touch events, and then dispatch the resampled touch event to APZ. Touch
events on Firefox OS occur on a *Touch Input Thread*, whereas they are
processed by APZ on the *APZ Controller Thread*. We use `Google
Android's touch
resampling <https://web.archive.org/web/20200909082458/http://www.masonchang.com/blog/2014/8/25/androids-touch-resampling-algorithm>`__
algorithm to resample touch events.

Currently, we have a strict ordering between composites and touch
events. When a touch event occurs on the *Touch Input Thread*, we store
the touch event in a queue. When a vsync event occurs, the
``CompositorVsyncDispatcher`` notifies the ``Compositor`` of a vsync
event, which notifies the ``GeckoTouchDispatcher``. The
``GeckoTouchDispatcher`` processes the touch event first on the *APZ
Controller Thread*, which is the same as the *Compositor Thread* on b2g,
then the ``Compositor`` finishes compositing. We require this strict
ordering because if a vsync notification were dispatched to both the
``Compositor`` and the ``GeckoTouchDispatcher`` at the same time, there
would be a race between processing the touch event (and therefore the
updated position) and compositing. In practice, this creates very janky
scrolling. As of this writing, we have not analyzed input events on
desktop platforms.

One slight quirk is that input events can start a composite, for example
during a scroll, after the ``Compositor`` is no longer listening to
vsync events. In these cases, we notify the ``Compositor`` to observe
vsync so that it dispatches touch events. If touch events were not
dispatched, and since the ``Compositor`` is not listening to vsync
events, the touch events would never be dispatched. The
``GeckoTouchDispatcher`` handles this case by always forcing the
``Compositor`` to listen to vsync events while touch events are
occurring.

Widget, Compositor, CompositorVsyncDispatcher, GeckoTouchDispatcher Shutdown Procedure
----------------------------------------------------------------------------------------

When the `nsBaseWidget shuts
down <https://hg.mozilla.org/mozilla-central/file/0df249a0e4d3/widget/nsBaseWidget.cpp#l182>`__,
it calls nsBaseWidget::DestroyCompositor on the *Gecko Main Thread*.
+During nsBaseWidget::DestroyCompositor, it first destroys the +CompositorBridgeChild. CompositorBridgeChild sends a sync IPC call to +CompositorBridgeParent::RecvStop, which calls +`CompositorBridgeParent::Destroy <https://hg.mozilla.org/mozilla-central/file/ab0490972e1e/gfx/layers/ipc/CompositorParent.cpp#l509>`__. +During this time, the *main thread* is blocked on the parent process. +CompositorBridgeParent::RecvStop runs on the *Compositor thread* and +cleans up some resources, including setting the +``CompositorVsyncScheduler::Observer`` to nullptr. +CompositorBridgeParent::RecvStop also explicitly keeps the +CompositorBridgeParent alive and posts another task to run +CompositorBridgeParent::DeferredDestroy on the Compositor loop so that +all ipdl code can finish executing. The +``CompositorVsyncScheduler::Observer`` also unobserves from vsync and +cancels any pending composite tasks. Once +CompositorBridgeParent::RecvStop finishes, the *main thread* in the +parent process continues shutting down the nsBaseWidget. + +At the same time, the *Compositor thread* is executing tasks until +CompositorBridgeParent::DeferredDestroy runs, which flushes the +compositor message loop. Now we have two tasks as both the nsBaseWidget +releases a reference to the Compositor on the *main thread* during +destruction and the CompositorBridgeParent::DeferredDestroy releases a +reference to the CompositorBridgeParent on the *Compositor Thread*. +Finally, the CompositorBridgeParent itself is destroyed on the *main +thread* once both references are gone due to explicit `main thread +destruction <https://hg.mozilla.org/mozilla-central/file/50b95032152c/gfx/layers/ipc/CompositorParent.h#l148>`__. + +With the ``CompositorVsyncScheduler::Observer``, any accesses to the +widget after nsBaseWidget::DestroyCompositor executes are invalid. Any +accesses to the compositor between the time the +nsBaseWidget::DestroyCompositor runs and the +CompositorVsyncScheduler::Observer’s destructor runs aren’t safe yet a +hardware vsync event could occur between these times. Since any tasks +posted on the Compositor loop after +CompositorBridgeParent::DeferredDestroy is posted are invalid, we make +sure that no vsync tasks can be posted once +CompositorBridgeParent::RecvStop executes and DeferredDestroy is posted +on the Compositor thread. When the sync call to +CompositorBridgeParent::RecvStop executes, we explicitly set the +CompositorVsyncScheduler::Observer to null to prevent vsync +notifications from occurring. If vsync notifications were allowed to +occur, since the ``CompositorVsyncScheduler::Observer``\ ’s vsync +notification executes on the *hardware vsync thread*, it would post a +task to the Compositor loop and may execute after +CompositorBridgeParent::DeferredDestroy. Thus, we explicitly shut down +vsync events in the ``CompositorVsyncDispatcher`` and +``CompositorVsyncScheduler::Observer`` during nsBaseWidget::Shutdown to +prevent any vsync tasks from executing after +CompositorBridgeParent::DeferredDestroy. + +The ``CompositorVsyncDispatcher`` may be destroyed on either the *main +thread* or *Compositor Thread*, since both the nsBaseWidget and +``CompositorVsyncScheduler::Observer`` race to destroy on different +threads. nsBaseWidget is destroyed on the *main thread* and releases a +reference to the ``CompositorVsyncDispatcher`` during destruction. 
The +``CompositorVsyncScheduler::Observer`` has a race to be destroyed either +during CompositorBridgeParent shutdown or from the +``GeckoTouchDispatcher`` which is destroyed on the main thread with +`ClearOnShutdown <https://hg.mozilla.org/mozilla-central/file/21567e9a6e40/xpcom/base/ClearOnShutdown.h#l15>`__. +Whichever object, the CompositorBridgeParent or the +``GeckoTouchDispatcher`` is destroyed last will hold the last reference +to the ``CompositorVsyncDispatcher``, which destroys the object. + +Refresh Driver +-------------- + +The Refresh Driver is ticked from a `single active +timer <https://hg.mozilla.org/mozilla-central/file/ab0490972e1e/layout/base/nsRefreshDriver.cpp#l11>`__. +The assumption is that there are multiple ``RefreshDrivers`` connected +to a single ``RefreshTimer``. There are two ``RefreshTimers``: an active +and an inactive ``RefreshTimer``. Each Tab has its own +``RefreshDriver``, which connects to one of the global +``RefreshTimers``. The ``RefreshTimers`` execute on the *Main Thread* +and tick their connected ``RefreshDrivers``. We do not want to break +this model of multiple ``RefreshDrivers`` per a set of two global +``RefreshTimers``. Each ``RefreshDriver`` switches between the active +and inactive ``RefreshTimer``. + +Instead, we create a new ``RefreshTimer``, the ``VsyncRefreshTimer`` +which ticks based on vsync messages. We replace the current active timer +with a ``VsyncRefreshTimer``. All tabs will then tick based on this new +active timer. Since the ``RefreshTimer`` has a lifetime of the process, +we only need to create a single ``RefreshTimerVsyncDispatcher`` per +``Display`` when Firefox starts. Even if we do not have any content +processes, the Chrome process will still need a ``VsyncRefreshTimer``, +thus we can associate the ``RefreshTimerVsyncDispatcher`` with each +``Display``. + +When Firefox starts, we initially create a new ``VsyncRefreshTimer`` in +the Chrome process. The ``VsyncRefreshTimer`` will listen to vsync +notifications from ``RefreshTimerVsyncDispatcher`` on the global +``Display``. When nsRefreshDriver::Shutdown executes, it will delete the +``VsyncRefreshTimer``. This creates a problem as all the +``RefreshTimers`` are currently manually memory managed whereas +``VsyncObservers`` are ref counted. To work around this problem, we +create a new ``RefreshDriverVsyncObserver`` as an inner class to +``VsyncRefreshTimer``, which actually receives vsync notifications. It +then ticks the ``RefreshDrivers`` inside ``VsyncRefreshTimer``. + +With Content processes, the start up process is more complicated. We +send vsync IPC messages via the use of the PBackground thread on the +parent process, which allows us to send messages from the Parent +process’ without waiting on the *main thread*. This sends messages from +the Parent::\ *PBackground Thread* to the Child::\ *Main Thread*. The +*main thread* receiving IPC messages on the content process is +acceptable because ``RefreshDrivers`` must execute on the *main thread*. +However, there is some amount of time required to setup the IPC +connection upon process creation and during this time, the +``RefreshDrivers`` must tick to set up the process. To get around this, +we initially use software ``RefreshTimers`` that already exist during +content process startup and swap in the ``VsyncRefreshTimer`` once the +IPC connection is created. + +During nsRefreshDriver::ChooseTimer, we create an async PBackground IPC +open request to create a ``VsyncParent`` and ``VsyncChild``. 
At the same
time, we create a software ``RefreshTimer`` and tick the
``RefreshDrivers`` as normal. Once the PBackground callback is executed
and an IPC connection exists, we take all ``RefreshDrivers`` currently
associated with the active ``RefreshTimer`` and swap them over to the
``VsyncRefreshTimer``. Since all interactions on the content process
occur on the main thread, there is no need for locks. The ``VsyncParent``
listens to vsync events through the ``RefreshTimerVsyncDispatcher`` on
the parent side and sends vsync IPC messages to the ``VsyncChild``. The
``VsyncChild`` notifies the ``VsyncRefreshTimer`` on the content process.

During the shutdown process of the content process, ActorDestroy is
called on the ``VsyncChild`` and ``VsyncParent`` due to the normal
PBackground shutdown process. Once ActorDestroy is called, no IPC
messages should be sent across the channel. After ActorDestroy is
called, the IPDL machinery will delete the **VsyncParent/Child** pair.
The ``VsyncParent``, due to being a ``VsyncObserver``, is ref counted.
After ``VsyncParent::ActorDestroy`` is called, it unregisters itself
from the ``RefreshTimerVsyncDispatcher``, which holds the last reference
to the ``VsyncParent``, and the object will be deleted.

Thus the overall flow during normal execution is:

1. VsyncSource::Display::RefreshTimerVsyncDispatcher receives a vsync
   notification from the OS in the parent process.
2. RefreshTimerVsyncDispatcher notifies
   VsyncRefreshTimer::RefreshDriverVsyncObserver on the hardware vsync
   thread that a vsync occurred in the parent process.
3. RefreshTimerVsyncDispatcher notifies the VsyncParent on the hardware
   vsync thread that a vsync occurred.
4. The VsyncRefreshTimer::RefreshDriverVsyncObserver in the parent
   process posts a task to the main thread that ticks the refresh
   drivers.
5. VsyncParent posts a task to the PBackground thread to send a vsync
   IPC message to VsyncChild.
6. VsyncChild receives the vsync notification on the main thread of the
   content process and ticks its respective RefreshDrivers.

Compressing Vsync Messages
--------------------------

Vsync messages occur quite often and the *main thread* can be busy for
long periods of time due to JavaScript. Consistently sending vsync
messages to the refresh driver timer can flood the *main thread* with
refresh driver ticks, causing even more delays. To avoid this problem,
we compress vsync messages on both the parent and child processes.

On the parent process, newer vsync messages update a vsync timestamp but
do not actually queue any tasks on the *main thread*. Once the parent
process' *main thread* executes the refresh driver tick, it uses the
most recent vsync timestamp to tick the refresh driver. After the
refresh driver has ticked, a single vsync message is queued for another
refresh driver tick task. On the content process, the IPDL ``compress``
keyword automatically compresses IPC messages.

Multiple Monitors
-----------------

In order to have multiple monitor support for the ``RefreshDrivers``, we
have multiple active ``RefreshTimers``. Each ``RefreshTimer`` is
associated with a specific ``Display`` via an id and ticks when its
respective ``Display`` vsync occurs. We have **N RefreshTimers**, where
N is the number of connected displays. Each ``RefreshTimer`` still has
multiple ``RefreshDrivers``.

When a tab or window changes monitors, the ``nsIWidget`` receives a
display changed notification.
Based on which display the window is on, +the window switches to the correct ``RefreshTimerVsyncDispatcher`` and +``CompositorVsyncDispatcher`` on the parent process based on the display +id. Each ``TabParent`` should also send a notification to their child. +Each ``TabChild``, given the display ID, switches to the correct +``RefreshTimer`` associated with the display ID. When each display vsync +occurs, it sends one IPC message to notify vsync. The vsync message +contains a display ID, to tick the appropriate ``RefreshTimer`` on the +content process. There is still only one **VsyncParent/VsyncChild** +pair, just each vsync notification will include a display ID, which maps +to the correct ``RefreshTimer``. + +Object Lifetime +--------------- + +1. CompositorVsyncDispatcher - Lives as long as the nsBaseWidget + associated with the VsyncDispatcher +2. CompositorVsyncScheduler::Observer - Lives and dies the same time as + the CompositorBridgeParent. +3. RefreshTimerVsyncDispatcher - As long as the associated display + object, which is the lifetime of Firefox. +4. VsyncSource - Lives as long as the gfxPlatform on the chrome process, + which is the lifetime of Firefox. +5. VsyncParent/VsyncChild - Lives as long as the content process +6. RefreshTimer - Lives as long as the process + +Threads +------- + +All ``VsyncObservers`` are notified on the *Hardware Vsync Thread*. It +is the responsibility of the ``VsyncObservers`` to post tasks to their +respective correct thread. For example, the +``CompositorVsyncScheduler::Observer`` will be notified on the *Hardware +Vsync Thread*, and post a task to the *Compositor Thread* to do the +actual composition. + +1. Compositor Thread - Nothing changes +2. Main Thread - PVsyncChild receives IPC messages on the main thread. + We also enable/disable vsync on the main thread. +3. PBackground Thread - Creates a connection from the PBackground thread + on the parent process to the main thread in the content process. +4. Hardware Vsync Thread - Every platform is different, but we always + have the concept of a hardware vsync thread. Sometimes this is + actually created by the host OS. On Windows, we have to create a + separate platform thread that blocks on DwmFlush(). diff --git a/gfx/docs/SilkArchitecture.png b/gfx/docs/SilkArchitecture.png Binary files differnew file mode 100644 index 0000000000..938c585e40 --- /dev/null +++ b/gfx/docs/SilkArchitecture.png diff --git a/gfx/docs/index.rst b/gfx/docs/index.rst new file mode 100644 index 0000000000..223ae0f02a --- /dev/null +++ b/gfx/docs/index.rst @@ -0,0 +1,18 @@ +Graphics +======== + +This collection of linked pages contains design documents for the +Mozilla graphics architecture. The design documents live in gfx/docs directory. + +This `wiki page <https://wiki.mozilla.org/Platform/GFX>`__ contains +information about graphics and the graphics team at Mozilla. + +.. toctree:: + :maxdepth: 1 + + GraphicsOverview + LayersHistory + OffMainThreadPainting + AsyncPanZoom + AdvancedLayers + Silk |