From 36d22d82aa202bb199967e9512281e9a53db42c9 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 7 Apr 2024 21:33:14 +0200 Subject: Adding upstream version 115.7.0esr. Signed-off-by: Daniel Baumann --- docs/performance/Benchmarking.md | 98 ++++ docs/performance/GPU_performance.md | 42 ++ docs/performance/activity_monitor_and_top.md | 165 ++++++ ...automated_performance_testing_and_sheriffing.md | 24 + docs/performance/bestpractices.md | 578 +++++++++++++++++++++ docs/performance/build_metrics/build_metrics.md | 31 ++ docs/performance/dtrace.md | 49 ++ docs/performance/img/ActMon-Energy.png | Bin 0 -> 148867 bytes docs/performance/img/EJCrt4N.png | Bin 0 -> 331906 bytes docs/performance/img/PerfDotHTMLRedLines.png | Bin 0 -> 8383 bytes docs/performance/img/annotation.png | Bin 0 -> 150000 bytes docs/performance/img/battery-status-menu.png | Bin 0 -> 26645 bytes docs/performance/img/dominators-1.png | Bin 0 -> 28248 bytes docs/performance/img/dominators-10.png | Bin 0 -> 22011 bytes docs/performance/img/dominators-2.png | Bin 0 -> 29454 bytes docs/performance/img/dominators-3.png | Bin 0 -> 33064 bytes docs/performance/img/dominators-4.png | Bin 0 -> 40001 bytes docs/performance/img/dominators-5.png | Bin 0 -> 31844 bytes docs/performance/img/dominators-6.png | Bin 0 -> 56667 bytes docs/performance/img/dominators-7.png | Bin 0 -> 26682 bytes docs/performance/img/dominators-8.png | Bin 0 -> 26678 bytes docs/performance/img/dominators-9.png | Bin 0 -> 45509 bytes docs/performance/img/memory-1-small.png | Bin 0 -> 12408 bytes docs/performance/img/memory-2-small.png | Bin 0 -> 20870 bytes docs/performance/img/memory-3-small.png | Bin 0 -> 20962 bytes docs/performance/img/memory-4-small.png | Bin 0 -> 20940 bytes docs/performance/img/memory-5-small.png | Bin 0 -> 21140 bytes docs/performance/img/memory-6-small.png | Bin 0 -> 20922 bytes docs/performance/img/memory-7-small.png | Bin 0 -> 20555 bytes .../memory-graph-dominator-multiple-references.svg | 4 + docs/performance/img/memory-graph-dominators.svg | 4 + .../img/memory-graph-immediate-dominator.svg | 4 + docs/performance/img/memory-graph-unreachable.svg | 4 + docs/performance/img/memory-graph.svg | 4 + .../performance/img/memory-tool-aggregate-view.png | Bin 0 -> 55175 bytes .../img/memory-tool-call-stack-expanded.png | Bin 0 -> 69021 bytes docs/performance/img/memory-tool-call-stack.png | Bin 0 -> 37960 bytes docs/performance/img/memory-tool-in-group-icon.png | Bin 0 -> 6376 bytes .../img/memory-tool-in-group-retaining-paths.png | Bin 0 -> 47488 bytes docs/performance/img/memory-tool-in-group.png | Bin 0 -> 32277 bytes .../img/memory-tool-inverted-call-stack.png | Bin 0 -> 42582 bytes docs/performance/img/memory-tool-switch-view.png | Bin 0 -> 26038 bytes docs/performance/img/monsters.svg | 4 + docs/performance/img/pid.png | Bin 0 -> 43742 bytes docs/performance/img/power-planes.jpg | Bin 0 -> 85483 bytes docs/performance/img/rendering.png | Bin 0 -> 103379 bytes docs/performance/img/reportingperf1.png | Bin 0 -> 10237 bytes docs/performance/img/reportingperf2.png | Bin 0 -> 27651 bytes docs/performance/img/reportingperf3.png | Bin 0 -> 20182 bytes docs/performance/img/treemap-bbc.png | Bin 0 -> 48965 bytes docs/performance/img/treemap-domnodes.png | Bin 0 -> 10998 bytes docs/performance/img/treemap-monsters.png | Bin 0 -> 20713 bytes docs/performance/index.md | 53 ++ docs/performance/intel_power_gadget.md | 56 ++ docs/performance/jit_profiling_with_perf.md | 119 +++++ docs/performance/memory/DOM_allocation_example.md | 57 ++ 
docs/performance/memory/about_colon_memory.md | 274 ++++++++++ docs/performance/memory/aggregate_view.md | 198 +++++++ docs/performance/memory/awsy.md | 22 + docs/performance/memory/basic_operations.md | 82 +++ docs/performance/memory/bloatview.md | 245 +++++++++ docs/performance/memory/dmd.md | 489 +++++++++++++++++ docs/performance/memory/dominators.md | 90 ++++ docs/performance/memory/dominators_view.md | 221 ++++++++ docs/performance/memory/gc_and_cc_logs.md | 109 ++++ docs/performance/memory/heap_scan_mode.md | 313 +++++++++++ docs/performance/memory/leak_gauge.md | 45 ++ .../memory/leak_hunting_strategies_and_tips.md | 219 ++++++++ docs/performance/memory/memory.md | 64 +++ docs/performance/memory/monster_example.md | 79 +++ .../memory/refcount_tracing_and_balancing.md | 235 +++++++++ docs/performance/memory/tree_map_view.md | 62 +++ docs/performance/perf.md | 57 ++ docs/performance/perfstats.md | 30 ++ .../platform_microbenchmarks.md | 21 + docs/performance/power_profiling_overview.md | 326 ++++++++++++ docs/performance/powermetrics.md | 167 ++++++ .../profiling_with_concurrency_visualizer.md | 5 + docs/performance/profiling_with_instruments.md | 110 ++++ docs/performance/profiling_with_xperf.md | 180 +++++++ docs/performance/profiling_with_zoom.md | 5 + .../performance/reporting_a_performance_problem.md | 94 ++++ docs/performance/scroll-linked_effects.md | 177 +++++++ docs/performance/sorting_algorithms_comparison.md | 52 ++ docs/performance/timerfirings_logging.md | 136 +++++ docs/performance/tools_power_rapl.md | 113 ++++ docs/performance/turbostat.md | 50 ++ 87 files changed, 5566 insertions(+) create mode 100644 docs/performance/Benchmarking.md create mode 100644 docs/performance/GPU_performance.md create mode 100644 docs/performance/activity_monitor_and_top.md create mode 100644 docs/performance/automated_performance_testing_and_sheriffing.md create mode 100644 docs/performance/bestpractices.md create mode 100644 docs/performance/build_metrics/build_metrics.md create mode 100644 docs/performance/dtrace.md create mode 100644 docs/performance/img/ActMon-Energy.png create mode 100644 docs/performance/img/EJCrt4N.png create mode 100644 docs/performance/img/PerfDotHTMLRedLines.png create mode 100644 docs/performance/img/annotation.png create mode 100644 docs/performance/img/battery-status-menu.png create mode 100644 docs/performance/img/dominators-1.png create mode 100644 docs/performance/img/dominators-10.png create mode 100644 docs/performance/img/dominators-2.png create mode 100644 docs/performance/img/dominators-3.png create mode 100644 docs/performance/img/dominators-4.png create mode 100644 docs/performance/img/dominators-5.png create mode 100644 docs/performance/img/dominators-6.png create mode 100644 docs/performance/img/dominators-7.png create mode 100644 docs/performance/img/dominators-8.png create mode 100644 docs/performance/img/dominators-9.png create mode 100644 docs/performance/img/memory-1-small.png create mode 100644 docs/performance/img/memory-2-small.png create mode 100644 docs/performance/img/memory-3-small.png create mode 100644 docs/performance/img/memory-4-small.png create mode 100644 docs/performance/img/memory-5-small.png create mode 100644 docs/performance/img/memory-6-small.png create mode 100644 docs/performance/img/memory-7-small.png create mode 100644 docs/performance/img/memory-graph-dominator-multiple-references.svg create mode 100644 docs/performance/img/memory-graph-dominators.svg create mode 100644 
docs/performance/img/memory-graph-immediate-dominator.svg create mode 100644 docs/performance/img/memory-graph-unreachable.svg create mode 100644 docs/performance/img/memory-graph.svg create mode 100644 docs/performance/img/memory-tool-aggregate-view.png create mode 100644 docs/performance/img/memory-tool-call-stack-expanded.png create mode 100644 docs/performance/img/memory-tool-call-stack.png create mode 100644 docs/performance/img/memory-tool-in-group-icon.png create mode 100644 docs/performance/img/memory-tool-in-group-retaining-paths.png create mode 100644 docs/performance/img/memory-tool-in-group.png create mode 100644 docs/performance/img/memory-tool-inverted-call-stack.png create mode 100644 docs/performance/img/memory-tool-switch-view.png create mode 100644 docs/performance/img/monsters.svg create mode 100644 docs/performance/img/pid.png create mode 100644 docs/performance/img/power-planes.jpg create mode 100644 docs/performance/img/rendering.png create mode 100644 docs/performance/img/reportingperf1.png create mode 100644 docs/performance/img/reportingperf2.png create mode 100644 docs/performance/img/reportingperf3.png create mode 100644 docs/performance/img/treemap-bbc.png create mode 100644 docs/performance/img/treemap-domnodes.png create mode 100644 docs/performance/img/treemap-monsters.png create mode 100644 docs/performance/index.md create mode 100644 docs/performance/intel_power_gadget.md create mode 100644 docs/performance/jit_profiling_with_perf.md create mode 100644 docs/performance/memory/DOM_allocation_example.md create mode 100644 docs/performance/memory/about_colon_memory.md create mode 100644 docs/performance/memory/aggregate_view.md create mode 100644 docs/performance/memory/awsy.md create mode 100644 docs/performance/memory/basic_operations.md create mode 100644 docs/performance/memory/bloatview.md create mode 100644 docs/performance/memory/dmd.md create mode 100644 docs/performance/memory/dominators.md create mode 100644 docs/performance/memory/dominators_view.md create mode 100644 docs/performance/memory/gc_and_cc_logs.md create mode 100644 docs/performance/memory/heap_scan_mode.md create mode 100644 docs/performance/memory/leak_gauge.md create mode 100644 docs/performance/memory/leak_hunting_strategies_and_tips.md create mode 100644 docs/performance/memory/memory.md create mode 100644 docs/performance/memory/monster_example.md create mode 100644 docs/performance/memory/refcount_tracing_and_balancing.md create mode 100644 docs/performance/memory/tree_map_view.md create mode 100644 docs/performance/perf.md create mode 100644 docs/performance/perfstats.md create mode 100644 docs/performance/platform_microbenchmarks/platform_microbenchmarks.md create mode 100644 docs/performance/power_profiling_overview.md create mode 100644 docs/performance/powermetrics.md create mode 100644 docs/performance/profiling_with_concurrency_visualizer.md create mode 100644 docs/performance/profiling_with_instruments.md create mode 100644 docs/performance/profiling_with_xperf.md create mode 100644 docs/performance/profiling_with_zoom.md create mode 100644 docs/performance/reporting_a_performance_problem.md create mode 100644 docs/performance/scroll-linked_effects.md create mode 100644 docs/performance/sorting_algorithms_comparison.md create mode 100644 docs/performance/timerfirings_logging.md create mode 100644 docs/performance/tools_power_rapl.md create mode 100644 docs/performance/turbostat.md (limited to 'docs/performance') diff --git a/docs/performance/Benchmarking.md 
b/docs/performance/Benchmarking.md new file mode 100644 index 0000000000..3b429463f7 --- /dev/null +++ b/docs/performance/Benchmarking.md @@ -0,0 +1,98 @@ +# Benchmarking + +## Debug Builds + +Debug builds (\--enable-debug) and non-optimized builds +(\--disable-optimize) are *much* slower. Any performance metrics +gathered by such builds are largely unrelated to what would be found in +a release browser. + +## Rust optimization level + +Local optimized builds are [compiled with rust optimization level 1 by +default](https://groups.google.com/forum/#!topic/mozilla.dev.platform/pN9O5EB_1q4), +unlike Nightly builds, which use rust optimization level 2. This setting +reduces build times significantly but comes with a serious hit to +runtime performance for any rust code ([for example stylo and +webrender](https://groups.google.com/d/msg/mozilla.dev.platform/pN9O5EB_1q4/ooXNuqMECAAJ)). +Add the following to your [mozconfig] in order to build with level 2: + +``` +ac_add_options RUSTC_OPT_LEVEL=2 +``` + +## Profile Guided Optimization (PGO) +[Profile Guided +Optimization](/build/buildsystem/pgo.rst#profile-guided-optimization) is +disabled by default and may improve runtime by up to 20%. However, it takes a +long time to build. To enable, add the following to your [mozconfig]: +``` +ac_add_options MOZ_PGO=1 +``` + +## GC Poisoning + +Many Firefox builds have a diagnostic tool that causes crashes to happen +sooner and produce much more actionable information, but also slows down +regular usage substantially. In particular, \"GC poisoning\" is used in +all debug builds, and in optimized Nightly builds (but not opt Developer +Edition or Beta builds). The poisoning can be disabled by setting the +environment variable + +``` + JSGC_DISABLE_POISONING=1 +``` + +before starting the browser. + +## Async Stacks + +Async stacks no longer impact performance since **Firefox 78**, as +{{bug(1601179)}} limits async stack capturing to when DevTools is +opened. + +Another option that is on by default in non-release builds is the +preference javascript.options.asyncstack, which provides better +debugging information to developers. Set it to false to match a release +build. (This may be disabled for many situations in the future. See +{{bug(1280819)}}.) + +## Accelerated Graphics + +Especially on Linux, accelerated graphics can sometimes lead to severe +performance problems even if things look ok visually. Normally you would +want to leave acceleration enabled while profiling, but on Linux you may +wish to disable accelerated graphics (Preferences -\> Advanced -\> +General -\> Use hardware acceleration when available). + +## Flash Plugin + +If you are profiling real websites, you should disable the Adobe Flash +plugin so you are testing Firefox code and not Flash jank problems. In +about:addons \> Plugins, set Shockwave Flash to \"Never Activate\". + +## Timer Precision + +Firefox reduces the precision of the Performance APIs and other clock +and timer APIs accessible to Web Content. They are currently reduced to a +multiple of 2ms, which is controlled by the privacy.reduceTimerPrecision +about:config flag. + +The exact value of the precision is controlled by the +privacy.resistFingerprinting.reduceTimerPrecision.microseconds +about:config flag. + +## Profiling tools + +Currently the Gecko Profiler has limitations in the UI for inverted call +stack top function analysis, which is very useful for finding heavy +functions that call into a whole bunch of code.
Currently such functions +may be easy to miss looking at a profile, so feel free to *also* use +your favorite native profiler. It also lacks features such as +instruction level profiling which can be helpful in low level profiling, +or finding the hot loop inside a large function, etc. Some example tools +include Instruments on OSX (part of XCode), [RotateRight +Zoom](http://www.rotateright.com/) on Linux (uses perf underneath), and +Intel VTune on Windows or Linux. + +[mozconfig]: /setup/configuring_build_options.rst#using-a-mozconfig-configuration-file diff --git a/docs/performance/GPU_performance.md b/docs/performance/GPU_performance.md new file mode 100644 index 0000000000..25085f41c3 --- /dev/null +++ b/docs/performance/GPU_performance.md @@ -0,0 +1,42 @@ +# GPU Performance + +Doing performance work with GPUs is harder than with CPUs because of the +asynchronous and massively parallel architecture. + +## Tools + +[PIX](https://devblogs.microsoft.com/pix/introduction/) - Can do +timing of Direct3D calls. Works reasonably well with Firefox. + +NVIDIA PerfHUD - Last I checked required a special build to be used. + +NVIDIA Parallel Nsight - Haven\'t tried. + +AMD GPU ShaderAnalyzer - Will compile a shader and show the machine code +and give static pipeline estimations. Not that useful for Firefox +because all of our shaders are pretty simple. + +AMD GPU PerfStudio - I had trouble getting this to work, and can\'t +remember whether I actually did or not. + +[Intel Graphics Performance Analyzers](http://software.intel.com/en-us/articles/intel-gpa/ "http://software.intel.com/en-us/articles/intel-gpa/") +- Haven\'t tried. + +[APITrace](https://github.com/apitrace/apitrace "https://github.com/apitrace/apitrace") +- Open source, works OK. + +[PVRTrace](http://www.imgtec.com/powervr/insider/pvrtrace.asp "http://www.imgtec.com/powervr/insider/pvrtrace.asp") +- Doesn\'t seem to emit traces on android/Nexus S. Looks like it\'s +designed for X11-based linux-ARM devices, OMAP3 is mentioned a lot in +the docs \... + +## Guides + +[Accurately Profiling Direct3D API Calls (Direct3D +9)](http://msdn.microsoft.com/en-us/library/bb172234%28v=vs.85%29.aspx "http://msdn.microsoft.com/en-us/library/bb172234(v=vs.85).aspx") +Suggests avoiding normal profilers like xperf and instead measuring the +time to flush the command buffer. + +[OS X - Best Practices for Working with Texture +Data](http://developer.apple.com/library/mac/#documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_texturedata/opengl_texturedata.html "http://developer.apple.com/library/mac/#documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_texturedata/opengl_texturedata.html") +- Sort of old, but still useful. diff --git a/docs/performance/activity_monitor_and_top.md b/docs/performance/activity_monitor_and_top.md new file mode 100644 index 0000000000..4e687c4dfe --- /dev/null +++ b/docs/performance/activity_monitor_and_top.md @@ -0,0 +1,165 @@ +# Activity Monitor, Battery Status Menu and top + +This article describes the Activity Monitor, Battery Status Menu, and +`top` --- three related tools available on Mac OS X. + +**Note**: The [power profiling overview](power_profiling_overview.md) is +worth reading at this point if you haven't already. It may make parts +of this document easier to understand. + +## Activity Monitor + +This is a [built-in OS X tool](https://support.apple.com/en-au/HT201464) +that shows real-time process measurements. 
It is well-known and its +"Energy Impact" measure is likely to be consulted by users to compare +the power consumption of different programs. ([Apple support +documentation](https://support.apple.com/en-au/HT202776) specifically +recommends it for troubleshooting battery life problems.) +***Unfortunately "Energy Impact" is not a good measure for either +users or software developers and it should be avoided.*** Activity +Monitor can still be useful, however. + +### Power-related measurements + +Activity Monitor has several tabs. They can all be customized to show +any of the available measurements (by right-clicking on the column +strip) but only the "Energy" tab groups child processes with parent +processes, which is useful, so it's the best one to use. The following +screenshot shows a customized "Energy" tab. + +![](img/ActMon-Energy.png) + +The power-related columns are as follows. + +- **Energy Impact** / **Avg Energy Impact**: See the separate section + below. +- **% CPU**: CPU usage percentage. This can exceed 100% if multiple + cores are being used. +- **Idle Wake Ups**: + - In Mac OS 10.9 this measured "package idle exit" wakeups. This + is the same value as + [powermetrics](./powermetrics.md)' + "Pkg idle" measurement (i.e. + `task_power_info::task_platform_idle_wakeups` obtained from the + `task_info` function.) + - In Mac OS 10.10 it appears to have been changed to measure + interrupt-level wakeups (a superset of idle wakeups), which are + less interesting. This is the same value as + [powermetrics](./powermetrics.md)' + "Intr" measurement (i.e. + `task_power_info::task_interrupt_wakeups` obtained from the + `task_info` function.) +- **Requires High Perf GPU**: Many Macs have two GPUs: a low-power, + low-performance integrated GPU, and a high-power, high-performance + external GPU. Using the high-performance GPU can greatly increase + power consumption, and should be avoided whenever possible. This + column indicates which GPU is being used. + +Activity Monitor can be useful for cursory measurements, but for more +precise and detailed measurements other tools such as +[powermetrics](./powermetrics.md) are better. + +### What does "Energy Impact" measure? + +"Energy Impact" is a hybrid proxy measure of power consumption. +[Careful +investigation](https://blog.mozilla.org/nnethercote/2015/08/26/what-does-the-os-x-activity-monitors-energy-impact-actually-measure/) +indicates that on Mac OS 10.10 and 10.11 it is computed with a formula +that is machine model-specific, and includes the following factors: CPU +usage, wakeup frequency, [quality of service +class](https://developer.apple.com/library/prerelease/mac/documentation/Performance/Conceptual/power_efficiency_guidelines_osx/PrioritizeWorkAtTheTaskLevel.html) +usage, and disk, GPU, and network activity. The weightings of each +factor can be found in one of the files in +`/usr/share/pmenergy/Mac-.plist`, where `` can be determined +with the following command. + + ioreg -l | grep board-id + +In contrast, on Mac OS 10.9 it is computed via a simpler machine +model-independent formula that only factors in CPU usage and wakeup +frequency. + +In both cases "Energy Impact" often correlates poorly with actual +power consumption and should be avoided in favour of direct measurements +that have clear physical meanings. + +### What does "Average Energy Impact" measure? + +When the Energy tab of Activity Monitor is first opened, the "Average +Energy Impact" column is empty and the title bar says "Activity +Monitor (Processing\...)".
After 5--10 seconds, the "Average Energy +Impact" column is populated with values and the title bar changes to +"Activity Monitor (Applications in last 8 hours)". If you have `top` +open during those 5--10 seconds you'll see that `systemstats` is +running and using a lot of CPU, and so presumably the measurements are +obtained from it. + +`systemstats` is a program that runs continuously and periodically +measures, among other things, CPU usage and idle wakeups for each +running process. Tests indicate that it is almost certainly using the +same "Energy Impact" formula to compute the "Average Energy Impact", +using measurements from the past 8 hours of wake time (i.e. if a laptop +is closed for several hours and then reopened, those hours are not +included in the calculation.) + +## Battery status menu + +When you click on the battery icon in the OS X menu bar you get a +drop-down menu that includes a list of "Apps Using Significant Energy". +This is crude but prominent, and therefore worth understanding --- even +though it's not much use for profiling applications. + +![Screenshot of the OS X battery statusmenu](img/battery-status-menu.png) + +When you open this menu for the first time in a while it says +"Collecting Power Usage Information" for a few seconds, and if you have +`top` open during that time you'll see that, once again, `systemstats` +is running and using a lot of CPU. Furthermore, if you click on an +application name in the menu Activity Monitor will be opened and that +application's entry will be highlighted. Based on these facts it seems +reasonable to assume that "Energy Impact" is again being used to +determine which applications are "using significant energy". + +Testing shows that once an energy-intensive application is started it +takes less than a minute for it to show up in the battery status menu. +And once the application stops using high amounts of energy it takes a +few minutes to disappear. The exception is when the application is +closed, in which case it disappears immediately. And it appears that a +program with an "Energy Impact" of roughly 20 or more will eventually +show up as significant, and programs that have much higher "Energy +Impact" values tend to show up more quickly. + +All of these battery status menu observations are difficult to make +reliably and so should be treated with caution. It is clear, however, +that the window used by the battery status menu is measured in seconds +or minutes, which is much less than the 8 hour window used for "Average +Energy Impact" in Activity Monitor. + +## `top` + +`top` is similar to Activity Monitor, but is a command-line utility. To +see power-related measurements, invoke it as follows. + +``` +top -stats pid,command,cpu,idlew,power -o power -d +``` + +**Note**: `-a` and `-e` can be used instead of `-d` to get different +counting modes. See the man page for details. + +It will show periodically-updating data like the following. + + PID COMMAND %CPU IDLEW POWER + 50300 firefox 12.9 278 26.6 + 76256 plugin-container 3.4 159 11.3 + 151 coreaudiod 0.9 68 4.3 + 76505 top 1.5 1 1.6 + 76354 Activity Monitor 1.0 0 1.0 + +- The **PID**, **COMMAND** and **%CPU** columns are self-explanatory. +- The **IDLEW** column is the number of "package idle exit" wakeups. +- The **POWER** column's value is computed by the same formula as the + one used for "Energy Impact" by Activity Monitor in Mac OS 10.9, + and should also be avoided. + +`top` is unlikely to be much use for power profiling. 
diff --git a/docs/performance/automated_performance_testing_and_sheriffing.md b/docs/performance/automated_performance_testing_and_sheriffing.md new file mode 100644 index 0000000000..02469c65de --- /dev/null +++ b/docs/performance/automated_performance_testing_and_sheriffing.md @@ -0,0 +1,24 @@ +# Automated performance testing and sheriffing + +We have several test harnesses that test Firefox for various performance +characteristics (page load time, startup time, etc.). We also generate +some metrics as part of the build process (like installer size) that are +interesting to track over time. Currently we aggregate this information +in the [Perfherder web +application](https://wiki.mozilla.org/Auto-tools/Projects/Perfherder) +where performance sheriffs watch for significant regressions, filing +bugs as appropriate. + +Current list of automated systems we are tracking (at least to some +degree): + +- [Talos](https://wiki.mozilla.org/TestEngineering/Performance/Talos): The main + performance system, run on virtually every check-in to an + integration branch +- [build_metrics](/setup/configuring_build_options.rst): + A grab bag of performance metrics generated by the build system +- [AreWeFastYet](https://arewefastyet.com/): A generic JavaScript and + Web benchmarking system + tool +- [Platform microbenchmarks](platform_microbenchmarks/platform_microbenchmarks.md) +- [Build Metrics](build_metrics/build_metrics.md) diff --git a/docs/performance/bestpractices.md b/docs/performance/bestpractices.md new file mode 100644 index 0000000000..ee176de294 --- /dev/null +++ b/docs/performance/bestpractices.md @@ -0,0 +1,578 @@ +# Performance best practices for Firefox front-end engineers + +This guide will help Firefox developers working on front-end code +produce code which is as performant as possible—not just on its own, but +in terms of its impact on other parts of Firefox. Always keep in mind +the side effects your changes may have, from blocking other tasks, to +interfering with other user interface elements. + +## Avoid the main thread where possible + +The main thread is where we process user events and do painting. It's +also important to note that most of our JavaScript runs on the main +thread, so it's easy for script to cause delays in event processing or +painting. That means that the more code we can get off of the main +thread, the more that thread can respond to user events, paint, and +generally be responsive to the user. + +You might want to consider using a Worker if you need to do some +computation that can be done off of the main thread. If you need more +elevated privileges than a standard worker allows, consider using a +ChromeWorker, which is a Firefox-only API that lets you create +workers with more elevated privileges. + +## Use requestIdleCallback() + +If you simply cannot avoid doing some kind of long job on the main +thread, try to break it up into smaller pieces that you can run when the +browser has a free moment to spare, and the user isn't doing anything. +You can do that using **requestIdleCallback()** and the [Cooperative +Scheduling of Background Tasks API](https://developer.mozilla.org/en-US/docs/Web/API/Background_Tasks_API). + +See also the blog post [Cooperative scheduling with requestIdleCallback](https://hacks.mozilla.org/2016/11/cooperative-scheduling-with-requestidlecallback/).
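As a rough sketch of the pattern (illustrative only; the queue and the `processItem` callback here are hypothetical, not existing Firefox APIs), work can be pulled off a backlog a few items at a time, whenever the browser reports idle time:

    // Process a backlog of work during idle periods instead of in one long,
    // main-thread-blocking loop. "deadline" is the IdleDeadline object passed
    // to requestIdleCallback() callbacks.
    function processQueueDuringIdle(queue, processItem) {
      window.requestIdleCallback(function step(deadline) {
        // Keep working while this idle period has time left, or if the callback
        // was forced to run because the optional timeout expired.
        while (queue.length && (deadline.timeRemaining() > 0 || deadline.didTimeout)) {
          processItem(queue.shift());
        }
        // Anything left over waits for the next idle period.
        if (queue.length) {
          window.requestIdleCallback(step, { timeout: 2000 });
        }
      }, { timeout: 2000 });
    }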
+ +As of [bug 1353206](https://bugzilla.mozilla.org/show_bug.cgi?id=1353206), +you can also schedule idle events in non-DOM contexts by using +**Services.tm.idleDispatchToMainThread**. See the +**nsIThreadManager.idl** file for more details. + +## Hide your panels + +If you’re adding a new XUL *\* or *\* to a +document, set the **hidden** attribute to **true** by default. By doing +so, you cause the binding applied on demand rather than at load time, +which makes initial construction of the XUL document faster. + +## Get familiar with the pipeline that gets pixels to the screen + +Learn how pixels you draw make their way to the screen. Knowing the path +they will take through the various layers of the browser engine will +help you optimize your code to avoid pitfalls. + +The rendering process goes through the following steps: + +![This is the pipeline that a browser uses to get pixels to the screen](img/rendering.png) + +The above image is used under [Creative Commons Attribution 3.0](https://creativecommons.org/licenses/by/3.0/), +courtesy of [this page](https://developers.google.com/web/fundamentals/performance/rendering/avoid-large-complex-layouts-and-layout-thrashing) +from our friends at Google, which itself is well worth the read. + +For a very down-to-earth explanation of the Style, Layout, Paint and +Composite steps of the pipeline, [this Hacks blog post](https://hacks.mozilla.org/2017/08/inside-a-super-fast-css-engine-quantum-css-aka-stylo/) +does a great job of explaining it. + +To achieve a 60 FPS frame rate, all of the above has to happen in 16 +milliseconds or less, every frame. + +Note that **requestAnimationFrame()** lets you queue up JavaScript to +**run right before the style flush occurs**. This allows you to put all +of your DOM writes (most importantly, anything that could change the +size or position of things in the DOM) just before the style and layout +steps of the pipeline, combining all the style and layout calculations +into a single batch so it all happens once, in a single frame tick, +instead of across multiple frames. + +See [Detecting and avoiding synchronous reflow](#detecting-and-avoiding-synchronous-reflow) +below for more information. + +This also means that *requestAnimationFrame()* is **not a good place** +to put queries for layout or style information. + +## Detecting and avoiding synchronous style flushes + +### What are style flushes? +When CSS is applied to a document (HTML or XUL, it doesn’t matter), the +browser does calculations to figure out which CSS styles will apply to +each element. This happens the first time the page loads and the CSS is +initially applied, but can happen again if JavaScript modifies the DOM. + +JavaScript code might, for example, change DOM node attributes (either +directly or by adding or removing classes from elements), and can also +add, remove, or delete DOM nodes. Because styles are normally scoped to +the entire document, the cost of doing these style calculations is +proportional to the number of DOM nodes in the document (and the number +of styles being applied). + +It is expected that over time, script will update the DOM, requiring us +to recalculate styles. Normally, the changes to the DOM just result in +the standard style calculation occurring immediately after the +JavaScript has finished running during the 16ms window, inside the +"Style" step. That's the ideal scenario. 
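As a minimal, hypothetical sketch of that ideal scenario (the element list and class name are made up), script that only *writes* to the DOM leaves all of the style recalculation to that single per-frame "Style" step:

    // Only DOM writes happen here; nothing reads style or layout information
    // back, so no synchronous style flush is forced. The one style calculation
    // in the next frame's "Style" step picks up all of these changes at once.
    function markUrgentItems(items) {
      for (let item of items) {
        item.classList.toggle("urgent", item.dataset.priority == "high");
      }
    }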
+ +However, it's possible for script to do things that force multiple style +calculations (or **style flushes**) to occur synchronously during the +JavaScript part of the 16 ms window. The more of them there are, the +more likely they'll exceed the 16ms frame budget. If that happens, some +of them will be postponed until the next frame (or possibly multiple +frames, if necessary); this skipping of frames is called **jank**. + +Generally speaking, you force a synchronous style flush any time you +query for style information after the DOM has changed within the same +frame tick. Depending on whether or not [the style information you’re +asking for has something to do with size or position](https://gist.github.com/paulirish/5d52fb081b3570c81e3a), +you may also cause a layout recalculation (also referred to as *layout +flush* or *reflow*), which is also an expensive step; see [Detecting +and avoiding synchronous reflow](#detecting-and-avoiding-synchronous-reflow) below. + +To avoid this: avoid reading style information if you can. If you *must* +read style information, do so at the very beginning of the frame, before +any changes have been made to the DOM since the last time a style flush +occurred. + +Historically, there hasn't been an easy way of doing this - however, +[bug 1434376](https://bugzilla.mozilla.org/show_bug.cgi?id=1434376) +has landed some ChromeOnly helpers on the window binding to +make this simpler. + +If you want to queue up some JavaScript to run after the next *natural* +style and layout flush, try: + + + // Suppose we want to get the computed "display" style of some node without + // causing a style flush. We could do it this way: + async function nodeIsDisplayNone(node) { + let display = await window.promiseDocumentFlushed(() => { + // Do _not_ under any circumstances write to the DOM in one of these + // callbacks! + return window.getComputedStyle(node).display; + }); + + return display == "none"; + } + +See [Detecting and avoiding synchronous reflow](#detecting-and-avoiding-synchronous-reflow) +for a more advanced example of getting layout information, and then +setting it safely, without causing flushes. + +*promiseDocumentFlushed* is only available to privileged script, and +should be called on the inner window of a top-level frame. Calling it on +the outer window of a subframe is not supported, and calling it from +within the inner window of a subframe might cause the callback to fire +even though a style and layout flush will still be required. These +gotchas should be fixed by +[bug 1441173](https://bugzilla.mozilla.org/show_bug.cgi?id=1441173). + +For now, it is up to you as the consumer of this API to not accidentally +write to the DOM within the *promiseDocumentFlushed* callback. Doing +so might cause flushes to occur for other *promiseDocumentFlushed* +callbacks that are scheduled to fire in the same tick of the refresh +driver. +[bug 1441168](https://bugzilla.mozilla.org/show_bug.cgi?id=1441168) +tracks work to make it impossible to modify the DOM within a +*promiseDocumentFlushed* callback. + +### Writing tests to ensure you don’t add more synchronous style flushes + +Unlike reflow, there isn’t an “observer” mechanism for style +recalculations. However, as of Firefox 49, the +*nsIDOMWindowUtils.elementsRestyled* attribute records a count of how +many style calculations have occurred for a particular DOM window.
+ +It should be possible to write a test that gets the +*nsIDOMWindowUtils* for a browser window, records the number of +styleFlushes, then **synchronously calls the function** that you want to +test, and immediately after checks the styleFlushes attribute again. If +the value went up, your code caused synchronous style flushes to occur. + +Note that your test and function *must be called synchronously* in order +for this test to be accurate. If you ever go back to the event loop (by +yielding, waiting for an event, etc.), style flushes unrelated to your +code are likely to run, and your test will give you a false positive. + +## Detecting and avoiding synchronous reflow + +This is also sometimes called “sync layout”, "sync layout flushes" or +“sync layout calculations”. + +*Sync reflow* is a term bandied about a lot, and has negative +connotations. It's not unusual for an engineer to have only the vaguest +sense of what it is—and to only know to avoid it. This section will +attempt to demystify things. + +The first time a document (XUL or HTML) loads, we parse the markup, and +then apply styles. Once the styles have been calculated, we then need to +calculate where things are going to be placed on the page. This layout +step can be seen in the “16ms” pipeline graphic above, and occurs just +before we paint things to be composited for the user to see. + +It is expected that over time, script will update the DOM, requiring us +to recalculate styles, and then update layout. Normally, however, the +changes to the DOM just result in the standard style calculation that +occurs immediately after the JavaScript has finished running during the +16ms window. + +### Interruptible reflow + +Since [the early days](https://bugzilla.mozilla.org/show_bug.cgi?id=67752), Gecko has +had the notion of interruptible reflow. This is a special type of +**content-only** reflow that checks at particular points whether or not +it should be interrupted (usually to respond to user events). + +Because **interruptible reflows can only be interrupted when laying out +content, and not chrome UI**, the rest of this section is offered only +as context. + +When an interruptible reflow is interrupted, what really happens is that +certain layout operations can be skipped in order to paint and process +user events sooner. + +When an interruptible reflow is interrupted, the best-case scenario is +that all layout is skipped, and the layout operation ends. + +The worst-case scenario is that none of the layout can be skipped +despite being interrupted, and the entire layout calculation occurs. + +Reflows that are triggered "naturally" by the 16ms tick are all +considered interruptible. Despite not actually being interruptible when +laying out chrome UI, striving for interruptible layout is always good +practice because uninterruptible layout has the potential to be much +worse (see next section). + +**To repeat, only interruptible reflows in web content can be +interrupted.** + +### Uninterruptible reflow + +Uninterruptible reflow is what we want to **avoid at all costs**. +Uninterruptible reflow occurs when some DOM node’s styles have changed +such that the size or position of one or more nodes in the document will +need to be updated, and then **JavaScript asks for the size or position +of anything**. Since everything is pending a reflow, the answer isn't +available, so everything stalls until the reflow is complete and the +script can be given an answer.
Flushing layout also means that styles +must be flushed to calculate the most up-to-date state of things, so +it's a double-whammy. + +Here’s a simple example, cribbed from [this blog post by Paul +Rouget](http://paulrouget.com/e/fxoshud): + + + div1.style.margin = "200px"; // Line 1 + var height1 = div1.clientHeight; // Line 2 + div2.classList.add("foobar"); // Line 3 + var height2 = div2.clientHeight; // Line 4 + doSomething(height1, height2); // Line 5 + +At line 1, we’re setting some style information on a DOM node that’s +going to result in a reflow - but (at just line 1) it’s okay, because +that reflow will happen after the style calculation. + +Note line 2 though - we’re asking for the height of some DOM node. This +means that Gecko needs to synchronously calculate layout (and styles) +using an uninterruptible reflow in order to answer the question that +JavaScript is asking (“What is the *clientHeight* of *div1*?”). + +It’s possible for our example to avoid this synchronous, uninterruptible +reflow by moving lines 2 and 4 above line 1. Assuming there weren’t any +style changes requiring size or position recalculation above line 1, the +*clientHeight* information should be cached since the last reflow, and +will not result in a new layout calculation. + +If you can avoid querying for the size or position of things in +JavaScript, that’s the safest option—especially because it’s always +possible that something earlier in this tick of JavaScript execution +caused a style change in the DOM without you knowing it. + +Note that given the same changes to the DOM of a chrome UI document, a +single synchronous uninterruptible reflow is no more computationally +expensive than an interruptible reflow triggered by the 16ms tick. It +is, however, advantageous to strive for reflow to only occur in the one +place (the layout step of the 16ms tick) as opposed to multiple times +during the 16ms tick (which has a higher probability of running through +the 16ms budget). + +### How do I avoid triggering uninterruptible reflow? + +Here's a [list of things that JavaScript can ask for that can cause +uninterruptible reflow](https://gist.github.com/paulirish/5d52fb081b3570c81e3a), to +help you think about the problem. Note that some items in the list may +be browser-specific or subject to change, and that an item not occurring +explicitly in the list doesn't mean it doesn't cause reflow. For +instance, at time of writing accessing *event.rangeOffset* [triggers +reflow](https://searchfox.org/mozilla-central/rev/6bfadf95b4a6aaa8bb3b2a166d6c3545983e179a/dom/events/UIEvent.cpp#215-226) +in Gecko, and does not occur in the earlier link. If you're unsure +whether something causes reflow, check! + +Note how abundant the properties in that first list are. This means that +when enumerating properties on DOM objects (e.g. elements/nodes, events, +windows, etc.) **accessing the value of each enumerated property will +almost certainly (accidentally) cause uninterruptible reflow**, because +a lot of DOM objects have one or even several properties that do so. + +If you require size or position information, you have a few options. + +[bug 1434376](https://bugzilla.mozilla.org/show_bug.cgi?id=1434376) +has landed a helper in the window binding to make it easier for +privileged code to queue up JavaScript to run when we know that the DOM +is not dirty, and size, position, and style information is cheap to +query for. 
+ +Here's an example: + + async function matchWidth(elem, otherElem) { + let width = await window.promiseDocumentFlushed(() => { + // Do _not_ under any circumstances write to the DOM in one of these + // callbacks! + return elem.clientWidth; + }); + + requestAnimationFrame(() => { + otherElem.style.width = `${width}px`; + }); + } + +Please see the section on *promiseDocumentFlushed* in [Detecting and +avoiding synchronous style flushes](#detecting-and-avoiding-synchronous-style-flushes) +for more information on how to use the API. + +Note that queries for size and position information are only expensive +if the DOM has been written to. Otherwise, we're doing a cheap look-up +of cached information. If we work hard to move all DOM writes into +*requestAnimationFrame()*, then we can be sure that all size and +position queries are cheap. + +It's also possible (though less infallible than +*promiseDocumentFlushed*) to queue JavaScript to run very soon after +the frame has been painted, where the likelihood is highest that the DOM +has not been written to, and layout and style information queries are +still cheap. This can be done by using a *setTimeout* or dispatching a +runnable inside a *requestAnimationFrame* callback, for example: + + + requestAnimationFrame(() => { + setTimeout(() => { + // This code will be run ASAP after Style and Layout information have + // been calculated and the paint has occurred. Unless something else + // has dirtied the DOM very early, querying for style and layout information + // here should be cheap. + }, 0); + }); + + // Or, if you are running in privileged JavaScript and want to avoid the timer overhead, + // you could also use: + + requestAnimationFrame(() => { + Services.tm.dispatchToMainThread(() => { + // Same-ish as above. + }); + }); + +This also implies that *querying for size and position information* in +*requestAnimationFrame()* has a high probability of causing a +synchronous reflow. + +### Other useful methods + +Below you'll find some suggestions for other methods which may come in +handy when you need to do things without incurring synchronous reflow. +These methods generally return the most-recently-calculated value for +the requested value, which means the value may no longer be current, but +may still be "close enough" for your needs. Unless you need precisely +accurate information, they can be valuable tools in your performance +toolbox. + +#### nsIDOMWindowUtils.getBoundsWithoutFlushing() + +*getBoundsWithoutFlushing()* does exactly what its name suggests: it +allows you to get the bounds rectangle for a DOM node contained in a +window without flushing layout. This means that the information you get +is potentially out-of-date, but allows you to avoid a sync reflow. If +you can make do with information that may not be quite current, this can +be helpful. + +#### nsIDOMWindowUtils.getRootBounds() + +Like *getBoundsWithoutFlushing()*, *getRootBounds()* lets you get +the dimensions of the window without risking a synchronous reflow. + +#### nsIDOMWindowUtils.getScrollXY() + +Returns the window's scroll offsets without taking the chance of causing +a sync reflow. + +### Writing tests to ensure you don’t add more unintentional reflow + +The interface +[nsIReflowObserver](https://dxr.mozilla.org/mozilla-central/source/docshell/base/nsIReflowObserver.idl) +lets us detect both interruptible and uninterruptible reflows. 
A number +of tests have been written that exercise various functions of the +browser [opening tabs](http://searchfox.org/mozilla-central/rev/78cefe75fb43195e7f5aee1d8042b8d8fc79fc70/browser/base/content/test/general/browser_tabopen_reflows.js), +[opening windows](http://searchfox.org/mozilla-central/source/browser/base/content/test/general/browser_windowopen_reflows.js) +and ensure that we don’t add new uninterruptible reflows accidentally +while those actions occur. + +You should add tests like this for your feature if you happen to be +touching the DOM. + +## Detecting over-painting + +Painting is, in general, cheaper than both style calculation and layout +calculation; still, the more you can avoid, the better. Generally +speaking, the larger an area that needs to be repainted, the longer it +takes. Similarly, the more things that need to be repainted, the longer +it takes. + +If a profile says a lot of time is spent in painting or display-list building, +and you're unsure why, consider talking to our always helpful graphics team in +the [gfx room](https://chat.mozilla.org/#/room/%23gfx:mozilla.org) on +[Matrix](https://wiki.mozilla.org/Matrix), and they can probably advise you. + +Note that a significant number of the graphics team members are in the US +Eastern Time zone (UTC-5 or UTC-4 during Daylight Saving Time), so let that +information guide your timing when you ask questions in the +[gfx room](https://chat.mozilla.org/#/room/%23gfx:mozilla.org). + +## Adding nodes using DocumentFragments + +Sometimes you need to add several DOM nodes as part of an existing DOM +tree. For example, when using XUL *\s*, you often have +script which dynamically inserts *\s*. Inserting items +into the DOM has a cost. If you're adding a number of children to a DOM +node in a loop, it's often more efficient to batch them into a single +insertion by creating a *DocumentFragment*, adding the new nodes to +that, then inserting the *DocumentFragment* as a child of the desired +node. + +A *DocumentFragment* is maintained in memory outside the DOM itself, +so changes don't cause reflow. The API is straightforward: + +1. Create the *DocumentFragment* by calling + *Document.createDocumentFragment()*. + +2. Create each child element (by calling *Document.createElement()* + for example), and add each one to the fragment by calling + *DocumentFragment.appendChild()*. + +3. Once the fragment is populated, append the fragment to the DOM by + calling *appendChild()* on the parent element for the new elements. + +This example has been cribbed from [davidwalsh’s blog +post](https://davidwalsh.name/documentfragment): + + + // Create the fragment + + var frag = document.createDocumentFragment(); + + // Create numerous list items, add to fragment + + for(var x = 0; x < 10; x++) { + var li = document.createElement("li"); + li.innerHTML = "List item " + x; + frag.appendChild(li); + } + + // Mass-add the fragment nodes to the list + + listNode.appendChild(frag); + +The above is strictly cheaper than individually adding each node to the +DOM. + +## The Gecko profiler add-on is your friend + +The Gecko profiler is your best friend when diagnosing performance +problems and looking for bottlenecks. 
There’s plenty of excellent +documentation on MDN about the Gecko profiler: + +- [Basic instructions for gathering and sharing a performance profile](reporting_a_performance_problem.md) + +- [Advanced profile analysis](https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Profiling_with_the_Built-in_Profiler) + +## Don’t guess—measure. + +If you’re working on a performance improvement, this should go without +saying: ensure that what you care about is actually improving by +measuring before and after. + +Landing a speculative performance enhancement is the same thing as +landing speculative bug fixes—these things need to be tested. Even if +that means instrumenting a function with a *Date.now()* recording at +the entrance, and another *Date.now()* at the exit points in order to +measure processing time changes. + +Prove to yourself that you’ve actually improved something by measuring +before and after. + +### Use the performance API + +The [performance +API](https://developer.mozilla.org/en-US/docs/Web/API/Performance_API) +is very useful for taking high-resolution measurements. This is usually +much better than using your own hand-rolled timers to measure how long +things take. You access the API through *Window.performance*. + +Also, the Gecko profiler back-end is in the process of being modified to +expose things like markers (from *window.performance.mark()*). + +### Use the compositor for animations + +Performing animations on the main thread should be treated as +**deprecated**. Avoid doing it. Instead, animate using +*Element.animate()*. See the article [Animating like you just don't +care](https://hacks.mozilla.org/2016/08/animating-like-you-just-dont-care-with-element-animate/) +for more information on how to do this. + +### Explicitly define start and end animation values + +Some optimizations in the animation code of Gecko are based on an +expectation that the *from* (0%) and the *to* (100%) values will be +explicitly defined in the *@keyframes* definition. Even though these +values may be inferred through the use of initial values or the cascade, +the offscreen animation optimizations are dependent on the explicit +definition. See [this comment](https://bugzilla.mozilla.org/show_bug.cgi?id=1419096#c18) +and a few previous comments on that bug for more information. + +## Use IndexedDB for storage + +[AppCache](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/en-US/docs/Web/HTML/Using_the_application_cache) +and +[LocalStorage](https://developer.mozilla.org/en-US/docs/Web/API/Storage/LocalStorage) +are synchronous storage APIs that will block the main thread when you +use them. Avoid them at all costs! + +[IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API/Using_IndexedDB) +is preferable, as the API is asynchronous (all disk operations occur off +of the main thread), and can be accessed from web workers. + +IndexedDB is also arguably better than storing and retrieving JSON from +a file—particularly if the JSON encoding or decoding is occurring on the +main thread. IndexedDB will do JavaScript object serialization and +deserialization for you using the [structured clone +algorithm](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm) +meaning that you can stash [things like maps, sets, dates, blobs, and +more](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm#Supported_types) +without having to do conversions for JSON compatibility. 
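As a minimal, hypothetical sketch (the database, store, and key names are made up), here is how a Map might be stashed asynchronously so that no disk I/O ever blocks the main thread:

    // Store a Map directly; structured cloning handles serialization, so there
    // is no JSON.stringify()/JSON.parse() work to do on the main thread.
    function saveSettings(settingsMap) {
      return new Promise((resolve, reject) => {
        let openReq = indexedDB.open("my-feature-db", 1);
        openReq.onupgradeneeded = () => openReq.result.createObjectStore("settings");
        openReq.onerror = () => reject(openReq.error);
        openReq.onsuccess = () => {
          let db = openReq.result;
          let tx = db.transaction("settings", "readwrite");
          tx.objectStore("settings").put(settingsMap, "main");
          tx.oncomplete = () => { db.close(); resolve(); };
          tx.onerror = () => reject(tx.error);
        };
      });
    }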
+ +A Promise-based wrapper for IndexedDB, +[IndexedDB.sys.mjs](http://searchfox.org/mozilla-central/source/toolkit/modules/IndexedDB.sys.mjs) +is available for chrome code. + +## Test on weak hardware + +For the folks paid to work on Firefox, we tend to have pretty powerful +hardware for development. This is great, because it reduces build times, +and means we can do our work faster. + +We should remind ourselves that the majority of our user base is +unlikely to have similar hardware. Look at the [Firefox Hardware +Report](https://data.firefox.com/dashboard/hardware) to get +a sense of what our users are working with. Test on slower machines to +make it more obvious to yourself if what you’ve written impacts the +performance of the browser. + +## Consider loading scripts with the subscript loader asynchronously + +If you've ever used the subscript loader, you might not know that it can +load scripts asynchronously, and return a Promise once they're loaded. +For example: + + + Services.scriptloader.loadSubScriptWithOptions(myScriptURL, { async: true }).then(() => { + console.log("Script at " + myScriptURL + " loaded asynchronously!"); + }); diff --git a/docs/performance/build_metrics/build_metrics.md b/docs/performance/build_metrics/build_metrics.md new file mode 100644 index 0000000000..cb27a42be7 --- /dev/null +++ b/docs/performance/build_metrics/build_metrics.md @@ -0,0 +1,31 @@ +# Build Metrics + +**Build Metrics** is a catch-all term for performance measures that are +generated by the Firefox build system and tracked by Perfherder. + +## num_constructors + +Number of static constructors found by the compiler in the Firefox C++ +codebase. Lower is better. Static constructors are undesirable because +their initialization imposes an unavoidable time penalty every time +Firefox is started. + +## installer size + +Size in bytes of the Firefox installer. Lower is better here, especially +on space-restricted platforms like Android. + +## build times + +Amount of time it takes to build Firefox in automation on a specific +platform / configuration. Lower is better. + +## compiler warnings + +Number of compiler warnings detected during a build. Lower is better. + +Due to the way the build system works, compiler warnings are not +consistently detected. So the value may fluctuate from build to build +even if the number of compiler warnings didn\'t actually change. Since +Perfherder alerts are calculated based on the mean value of a range, a +regression may be reported as a fractional value. diff --git a/docs/performance/dtrace.md b/docs/performance/dtrace.md new file mode 100644 index 0000000000..68e5114297 --- /dev/null +++ b/docs/performance/dtrace.md @@ -0,0 +1,49 @@ +# dtrace + +`dtrace` is a powerful Mac OS X kernel instrumentation system that can +be used to profile wakeups. This article provides a light introduction +to it. + +::: +**Note**: The [power profiling overview](power_profiling_overview.md) is +worth reading at this point if you haven't already. It may make parts +of this document easier to understand. +::: + +## Invocation + +`dtrace` must be invoked as the super-user. A good starting command for +profiling wakeups is the following. + +``` +sudo dtrace -n 'mach_kernel::wakeup { @[ustack()] = count(); }' -p $FIREFOX_PID > $OUTPUT_FILE +``` + +Let's break that down further. + +- The` -n` option combined with the `mach_kernel::wakeup` selects a + *probe point*. `mach_kernel` is the *module name* and `wakeup` is + the *probe name*. 
You can see a complete list of probes by running + `sudo dtrace -l`. +- The code between the braces is run when the probe point is hit. The + above code counts unique stack traces when wakeups occur; `ustack` + is short for \"user stack\", i.e. the stack of the userspace program + executing. + +Run that command for a few seconds and then hit [Ctrl]{.kbd} + [C]{.kbd} +to interrupt it. `dtrace` will then print to the output file a number of +stack traces, along with a wakeup count for each one. The ordering of +the stack traces can be non-obvious, so look at them carefully. + +Sometimes the stack trace has less information than one would like. +It's unclear how to improve upon this. + +## See also + +dtrace is *very* powerful, and you can learn more about it by consulting +the following resources: + +- [The DTrace one-liner + tutorial](https://wiki.freebsd.org/DTrace/Tutorial) from FreeBSD. +- [DTrace tools](http://www.brendangregg.com/dtrace.html), by Brendan + Gregg. diff --git a/docs/performance/img/ActMon-Energy.png b/docs/performance/img/ActMon-Energy.png new file mode 100644 index 0000000000..1133ca314b Binary files /dev/null and b/docs/performance/img/ActMon-Energy.png differ diff --git a/docs/performance/img/EJCrt4N.png b/docs/performance/img/EJCrt4N.png new file mode 100644 index 0000000000..5397386f18 Binary files /dev/null and b/docs/performance/img/EJCrt4N.png differ diff --git a/docs/performance/img/PerfDotHTMLRedLines.png b/docs/performance/img/PerfDotHTMLRedLines.png new file mode 100644 index 0000000000..fbedc92b4e Binary files /dev/null and b/docs/performance/img/PerfDotHTMLRedLines.png differ diff --git a/docs/performance/img/annotation.png b/docs/performance/img/annotation.png new file mode 100644 index 0000000000..23655e0594 Binary files /dev/null and b/docs/performance/img/annotation.png differ diff --git a/docs/performance/img/battery-status-menu.png b/docs/performance/img/battery-status-menu.png new file mode 100644 index 0000000000..f8468387b7 Binary files /dev/null and b/docs/performance/img/battery-status-menu.png differ diff --git a/docs/performance/img/dominators-1.png b/docs/performance/img/dominators-1.png new file mode 100644 index 0000000000..163a80016c Binary files /dev/null and b/docs/performance/img/dominators-1.png differ diff --git a/docs/performance/img/dominators-10.png b/docs/performance/img/dominators-10.png new file mode 100644 index 0000000000..e6688060af Binary files /dev/null and b/docs/performance/img/dominators-10.png differ diff --git a/docs/performance/img/dominators-2.png b/docs/performance/img/dominators-2.png new file mode 100644 index 0000000000..99b7db7b09 Binary files /dev/null and b/docs/performance/img/dominators-2.png differ diff --git a/docs/performance/img/dominators-3.png b/docs/performance/img/dominators-3.png new file mode 100644 index 0000000000..2d380f6e21 Binary files /dev/null and b/docs/performance/img/dominators-3.png differ diff --git a/docs/performance/img/dominators-4.png b/docs/performance/img/dominators-4.png new file mode 100644 index 0000000000..d3d5eef59c Binary files /dev/null and b/docs/performance/img/dominators-4.png differ diff --git a/docs/performance/img/dominators-5.png b/docs/performance/img/dominators-5.png new file mode 100644 index 0000000000..41a03488e9 Binary files /dev/null and b/docs/performance/img/dominators-5.png differ diff --git a/docs/performance/img/dominators-6.png b/docs/performance/img/dominators-6.png new file mode 100644 index 0000000000..a3d3026eb2 Binary files /dev/null and 
b/docs/performance/img/dominators-6.png differ diff --git a/docs/performance/img/dominators-7.png b/docs/performance/img/dominators-7.png new file mode 100644 index 0000000000..160f205391 Binary files /dev/null and b/docs/performance/img/dominators-7.png differ diff --git a/docs/performance/img/dominators-8.png b/docs/performance/img/dominators-8.png new file mode 100644 index 0000000000..e9512b9b05 Binary files /dev/null and b/docs/performance/img/dominators-8.png differ diff --git a/docs/performance/img/dominators-9.png b/docs/performance/img/dominators-9.png new file mode 100644 index 0000000000..af396abc21 Binary files /dev/null and b/docs/performance/img/dominators-9.png differ diff --git a/docs/performance/img/memory-1-small.png b/docs/performance/img/memory-1-small.png new file mode 100644 index 0000000000..a2076330b8 Binary files /dev/null and b/docs/performance/img/memory-1-small.png differ diff --git a/docs/performance/img/memory-2-small.png b/docs/performance/img/memory-2-small.png new file mode 100644 index 0000000000..569b0d9d66 Binary files /dev/null and b/docs/performance/img/memory-2-small.png differ diff --git a/docs/performance/img/memory-3-small.png b/docs/performance/img/memory-3-small.png new file mode 100644 index 0000000000..5d77bd7f60 Binary files /dev/null and b/docs/performance/img/memory-3-small.png differ diff --git a/docs/performance/img/memory-4-small.png b/docs/performance/img/memory-4-small.png new file mode 100644 index 0000000000..9a1e18da6f Binary files /dev/null and b/docs/performance/img/memory-4-small.png differ diff --git a/docs/performance/img/memory-5-small.png b/docs/performance/img/memory-5-small.png new file mode 100644 index 0000000000..e3277186dc Binary files /dev/null and b/docs/performance/img/memory-5-small.png differ diff --git a/docs/performance/img/memory-6-small.png b/docs/performance/img/memory-6-small.png new file mode 100644 index 0000000000..da69b93e51 Binary files /dev/null and b/docs/performance/img/memory-6-small.png differ diff --git a/docs/performance/img/memory-7-small.png b/docs/performance/img/memory-7-small.png new file mode 100644 index 0000000000..844565a8b4 Binary files /dev/null and b/docs/performance/img/memory-7-small.png differ diff --git a/docs/performance/img/memory-graph-dominator-multiple-references.svg b/docs/performance/img/memory-graph-dominator-multiple-references.svg new file mode 100644 index 0000000000..0aaa0546ef --- /dev/null +++ b/docs/performance/img/memory-graph-dominator-multiple-references.svg @@ -0,0 +1,4 @@ + +AAs dominator diff --git a/docs/performance/img/memory-graph-dominators.svg b/docs/performance/img/memory-graph-dominators.svg new file mode 100644 index 0000000000..0525d0cb5a --- /dev/null +++ b/docs/performance/img/memory-graph-dominators.svg @@ -0,0 +1,4 @@ + +RAAs dominators diff --git a/docs/performance/img/memory-graph-immediate-dominator.svg b/docs/performance/img/memory-graph-immediate-dominator.svg new file mode 100644 index 0000000000..f88b482820 --- /dev/null +++ b/docs/performance/img/memory-graph-immediate-dominator.svg @@ -0,0 +1,4 @@ + +RAAs immediatedominator diff --git a/docs/performance/img/memory-graph-unreachable.svg b/docs/performance/img/memory-graph-unreachable.svg new file mode 100644 index 0000000000..5bc29d6163 --- /dev/null +++ b/docs/performance/img/memory-graph-unreachable.svg @@ -0,0 +1,4 @@ + +R diff --git a/docs/performance/img/memory-graph.svg b/docs/performance/img/memory-graph.svg new file mode 100644 index 0000000000..e39168c11c --- /dev/null +++ 
b/docs/performance/img/memory-graph.svg @@ -0,0 +1,4 @@ + +R diff --git a/docs/performance/img/memory-tool-aggregate-view.png b/docs/performance/img/memory-tool-aggregate-view.png new file mode 100644 index 0000000000..653710979f Binary files /dev/null and b/docs/performance/img/memory-tool-aggregate-view.png differ diff --git a/docs/performance/img/memory-tool-call-stack-expanded.png b/docs/performance/img/memory-tool-call-stack-expanded.png new file mode 100644 index 0000000000..fe2364da58 Binary files /dev/null and b/docs/performance/img/memory-tool-call-stack-expanded.png differ diff --git a/docs/performance/img/memory-tool-call-stack.png b/docs/performance/img/memory-tool-call-stack.png new file mode 100644 index 0000000000..52a96015da Binary files /dev/null and b/docs/performance/img/memory-tool-call-stack.png differ diff --git a/docs/performance/img/memory-tool-in-group-icon.png b/docs/performance/img/memory-tool-in-group-icon.png new file mode 100644 index 0000000000..6354a3d377 Binary files /dev/null and b/docs/performance/img/memory-tool-in-group-icon.png differ diff --git a/docs/performance/img/memory-tool-in-group-retaining-paths.png b/docs/performance/img/memory-tool-in-group-retaining-paths.png new file mode 100644 index 0000000000..191115f041 Binary files /dev/null and b/docs/performance/img/memory-tool-in-group-retaining-paths.png differ diff --git a/docs/performance/img/memory-tool-in-group.png b/docs/performance/img/memory-tool-in-group.png new file mode 100644 index 0000000000..88aac55e9e Binary files /dev/null and b/docs/performance/img/memory-tool-in-group.png differ diff --git a/docs/performance/img/memory-tool-inverted-call-stack.png b/docs/performance/img/memory-tool-inverted-call-stack.png new file mode 100644 index 0000000000..5a951c2e8c Binary files /dev/null and b/docs/performance/img/memory-tool-inverted-call-stack.png differ diff --git a/docs/performance/img/memory-tool-switch-view.png b/docs/performance/img/memory-tool-switch-view.png new file mode 100644 index 0000000000..bb3cb0cdb3 Binary files /dev/null and b/docs/performance/img/memory-tool-switch-view.png differ diff --git a/docs/performance/img/monsters.svg b/docs/performance/img/monsters.svg new file mode 100644 index 0000000000..2f12ef43e8 --- /dev/null +++ b/docs/performance/img/monsters.svg @@ -0,0 +1,4 @@ + +ObjectArrayMonsterMonsterStringStringArrayMonsterMonsterStringStringArrayMonsterMonsterStringString diff --git a/docs/performance/img/pid.png b/docs/performance/img/pid.png new file mode 100644 index 0000000000..bdad5d2cb8 Binary files /dev/null and b/docs/performance/img/pid.png differ diff --git a/docs/performance/img/power-planes.jpg b/docs/performance/img/power-planes.jpg new file mode 100644 index 0000000000..a564fae248 Binary files /dev/null and b/docs/performance/img/power-planes.jpg differ diff --git a/docs/performance/img/rendering.png b/docs/performance/img/rendering.png new file mode 100644 index 0000000000..c4995dbef8 Binary files /dev/null and b/docs/performance/img/rendering.png differ diff --git a/docs/performance/img/reportingperf1.png b/docs/performance/img/reportingperf1.png new file mode 100644 index 0000000000..e2285280af Binary files /dev/null and b/docs/performance/img/reportingperf1.png differ diff --git a/docs/performance/img/reportingperf2.png b/docs/performance/img/reportingperf2.png new file mode 100644 index 0000000000..c43eba2342 Binary files /dev/null and b/docs/performance/img/reportingperf2.png differ diff --git a/docs/performance/img/reportingperf3.png 
b/docs/performance/img/reportingperf3.png new file mode 100644 index 0000000000..5eb3b58fb7 Binary files /dev/null and b/docs/performance/img/reportingperf3.png differ diff --git a/docs/performance/img/treemap-bbc.png b/docs/performance/img/treemap-bbc.png new file mode 100644 index 0000000000..55552b8382 Binary files /dev/null and b/docs/performance/img/treemap-bbc.png differ diff --git a/docs/performance/img/treemap-domnodes.png b/docs/performance/img/treemap-domnodes.png new file mode 100644 index 0000000000..1192e390da Binary files /dev/null and b/docs/performance/img/treemap-domnodes.png differ diff --git a/docs/performance/img/treemap-monsters.png b/docs/performance/img/treemap-monsters.png new file mode 100644 index 0000000000..513adab923 Binary files /dev/null and b/docs/performance/img/treemap-monsters.png differ diff --git a/docs/performance/index.md b/docs/performance/index.md new file mode 100644 index 0000000000..70e57c89e9 --- /dev/null +++ b/docs/performance/index.md @@ -0,0 +1,53 @@ +# Performance + +This page explains how to optimize the performance of the Firefox code base. + +The [test documentation](/testing/perfdocs/index.rst) +explains how to test for performance in Firefox. +The [profiler documentation](/tools/profiler/index.rst) +explains how to use the Gecko profiler. + +## General Performance references +* Tips on generating valid performance metrics by [benchmarking](Benchmarking.md) +* [GPU Performance](GPU_performance.md) Tips for reducing impacts on browser performance in front-end code. +* [Automated Performance testing and Sheriffing](automated_performance_testing_and_sheriffing.md) Information on automated performance testing and sheriffing at Mozilla. +* [Performance best practices for Firefox front-end engineers](bestpractices.md) Tips for reducing impacts on browser performance in front-end code. +* [Reporting a performance problem](reporting_a_performance_problem.md) A user friendly guide to reporting a performance problem. A development environment is not required. +* [Scroll Linked Effects](scroll-linked_effects.md) Information on scroll-linked effects, their effect on performance, related tools, and possible mitigation techniques. + +## Memory profiling and leak detection tools +* The [Developer Tools Memory panel](memory/memory.md) supports taking heap snapshots, diffing them, computing dominator trees to surface "heavy retainers", and recording allocation stacks. +* [About:memory](memory/about_colon_memory.md) about:memory is the easiest-to-use tool for measuring memory usage in Mozilla code, and is the best place to start. It also lets you do other memory-related operations like trigger GC and CC, dump GC & CC logs, and dump DMD reports. about:memory is built on top of Firefox's memory reporting infrastructure. +* [DMD](memory/dmd.md) is a tool that identifies shortcomings in about:memory's measurements, and can also do multiple kinds of general heap profiling. +* [AWSY](memory/awsy.md) (are we slim yet?) is a memory usage and regression tracker. +* [Bloatview](memory/bloatview.md) prints per-class statistics on allocations and refcounts, and provides gross numbers on the amount of memory being leaked broken down by class. It is used as part of Mozilla's continuous integration testing. +* [Refcount Tracing and Balancing](memory/refcount_tracing_and_balancing.md) are ways to track down leaks caused by incorrect uses of reference counting. They are slow and not particular easy to use, and thus most suitable for use by expert developers. 
+* [GC and CC Logs](memory/gc_and_cc_logs.md) can be generated and analyzed in various ways. In particular, they can help you understand why a particular object is being kept alive.
+* [Leak Gauge](memory/leak_gauge.md) is a simple logging-based tool for detecting leaks of certain kinds of objects, such as DOM windows and documents.
+* [LogAlloc](https://searchfox.org/mozilla-central/source/memory/replace/logalloc/README) is a tool that dumps a log of memory allocations in Gecko. That log can then be replayed against Firefox's default memory allocator independently or through another replace-malloc library, allowing the testing of other allocators under the exact same workload.
+* [See also the documentation on Leak-hunting strategies and tips.](memory/leak_hunting_strategies_and_tips.md)
+
+## Profiling and performance tools
+
+* [JIT Profiling with perf](jit_profiling_with_perf.md) Using perf to collect JIT profiles.
+* [Profiling with Instruments](profiling_with_instruments.md) How to use Apple's Instruments tool to profile Mozilla code.
+* [Profiling with xperf](profiling_with_xperf.md) How to use Microsoft's Xperf tool to profile Mozilla code.
+* [Profiling with Concurrency Visualizer](profiling_with_concurrency_visualizer.md) How to use Visual Studio's Concurrency Visualizer tool to profile Mozilla code.
+* [Profiling with Zoom](profiling_with_zoom.md) Zoom is a profiler for Linux done by the people who made Shark.
+* [Adding a new telemetry probe](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/start/adding-a-new-probe.html) Information on how to add a new measurement to the Telemetry performance-reporting system.
+
+## Power Profiling
+
+* [An overview of power profiling](power_profiling_overview.md). It includes details about hardware, what can be measured, and recommended approaches. It should be the starting point for anybody new to power profiling.
+* **(Mac, Linux)** [tools/power/rapl](tools_power_rapl.md) is a command-line utility in the Mozilla codebase that uses the Intel RAPL interface to gather direct power estimates for the package, cores, GPU and memory.
+* **(Mac-only)** [powermetrics](powermetrics.md) is a command-line utility that gathers and displays a wide range of global and per-process measurements, including CPU usage, GPU usage, and various wakeup frequencies.
+* **(All platforms)** [TimerFirings](timerfirings_logging.md) logging is a built-in logging mechanism that prints data on every timer that fires.
+* **(Mac-only)** [Activity Monitor and top](activity_monitor_and_top.md) The battery status menu, Activity Monitor and top are three related Mac tools that have major flaws but are often consulted by users, and so are worth understanding.
+* **(Windows, Mac and Linux)** [Intel Power Gadget](intel_power_gadget.md) Intel Power Gadget provides real-time graphs for package and processor RAPL estimates. It also provides an API through which those estimates can be obtained.
+* **(Linux-only)** [perf](perf.md) is a powerful command-line utility that can measure many different things, including energy estimates and high-context measurements of things such as wakeups.
+* **(Linux-only)** [turbostat](turbostat.md) is a command-line utility that gathers and displays various power-related measurements, with a focus on per-CPU measurements such as frequencies and C-states.
+* **(Linux-only)** [powertop](https://01.org/powertop) is an interactive command-line utility that gathers and displays various power-related measurements.
+
+## Performance Metrics
+
+* [PerfStats](perfstats.md) - A framework for low-overhead collection of internal performance metrics.
diff --git a/docs/performance/intel_power_gadget.md b/docs/performance/intel_power_gadget.md new file mode 100644 index 0000000000..74f7801cff --- /dev/null +++ b/docs/performance/intel_power_gadget.md @@ -0,0 +1,56 @@ +# Intel Power Gadget + +[Intel Power Gadget](https://software.intel.com/en-us/articles/intel-power-gadget/) +provides real-time graphs of various power-related measures and +estimates, all taken from the Intel RAPL MSRs. This article provides a +basic introduction. + +**Note**: The [power profiling +overview](power_profiling_overview.md) is +worth reading at this point if you haven\'t already. It may make parts +of this document easier to understand. + +The main strengths of this tool are (a) it works on Windows, unlike most +other power-related tools, and (b) it shows this data in graph form, +which is occasionally useful. On Mac and Linux, `tools/power/rapl` +[](tools_power_rapl.md) is probably a better tool +to use. + +## Understanding the Power Gadget output + +The following screenshot (from the Mac version) demonstrates the +available measurements. + +![](https://mdn.mozillademos.org/files/11365/Intel-Power-Gadget.png) + +The three panes display the following information: + +- **Power**: Shows power estimates for the package and the cores + (\"IA\"). These are reasonably useful for power profiling purposes, + but Mozilla\'s `rapl` utility provides these along with GPU and RAM + estimates, and in a command-line format that is often easier to use. +- **Frequency**: Shows operating frequency measurements for the cores + (\"IA\") and the GPU (\"GT\"). These measurements aren\'t + particularly useful for power profiling purposes. +- **Temperature**: Shows the package temperature. This is interesting, + but again not useful for power profiling purposes. Specifically, + the temperature is a proxy measurement that is *affected by* + processor power consumption, rather than one that *affects* it, + which makes it even less useful than most proxy measurements. + +Intel Power Gadget can also log these results to a file. This feature +has been used in [energia](https://github.com/mozilla/energia), Roberto +Vitillo\'s tool for systematically measuring differential power usage +between different browsers. (An energia dashboard can be seen +[here](http://people.mozilla.org/~rvitillo/dashboard/); please note that +the data has not been updated since early 2014.) + +Version 3.0 (available on Mac and Windows, but not on Linux) also +exposes an API from which the same measurements can be extracted +programmatically. At one point the Gecko Profiler [used this +API](https://benoitgirard.wordpress.com/2012/06/29/correlating-power-usage-with-performance-data-using-the-gecko-profiler-and-intel-sandy-bridge/) +on Windows to implement experimental package power estimates. +Unfortunately, the Gecko profiler takes 1000 samples per second on +desktop and is CPU intensive and so is likely to skew the RAPL estimates +significantly, so the API integration was removed. The API is otherwise +unlikely to be of interest to Mozilla developers. 
diff --git a/docs/performance/jit_profiling_with_perf.md b/docs/performance/jit_profiling_with_perf.md new file mode 100644 index 0000000000..81feac2733 --- /dev/null +++ b/docs/performance/jit_profiling_with_perf.md @@ -0,0 +1,119 @@
+# JIT Profiling with perf
+
+perf is a performance profiling tool available on Linux that is capable of measuring performance events such as cycles, instructions executed, and cache misses, and of providing assembly and source code annotation.
+It is possible to collect performance profiles of the SpiderMonkey JIT using perf on Linux and also annotate the generated assembly with the IR opcodes that were used during compilation, as shown below.
+
+![](img/annotation.png)
+
+## Build setup
+
+To enable JIT profiling with perf jitdump, you must build Firefox or the JS shell with the following flag:
+
+```
+ac_add_options --enable-perf
+```
+
+## Environment Variables
+
+Environment variables that must be defined for perf JIT profiling:
+
+`PERF_SPEW_DIR`: Location of jitdump output files. Making this directory a tmpfs filesystem could help reduce overhead.\
+`IONPERF`: Valid options include: `func`, `src`, `ir`, `ir-ops`.
+
+`IONPERF=func` will disable all annotation and only function names will be available. It is the fastest option.\
+`IONPERF=ir` will enable IR annotation.\
+`IONPERF=ir-ops` will enable IR annotation with operand support. **Requires --enable-jitspew** and adds additional overhead to "ir".\
+`IONPERF=src` will enable source code annotation **only if** perf can read the source file locally. Only really works well in the JS shell.
+
+## Profiling the JS shell
+
+Profiling the JS shell requires the following commands but is very straightforward.
+
+Begin by removing any pre-existing jitdump files:
+
+`rm -rf output` or `rm -f jitted-*.so jit.data perf.data jit-*.dump jitdump-*.txt`
+
+Next define environment variables:
+```
+export IONPERF=ir
+export PERF_SPEW_DIR=output
+```
+
+Run your test case with perf attached:
+```
+perf record -g -k 1 /home/denis/src/mozilla-central/obj-js/dist/bin/js test.js
+```
+
+Inject the jitdump files into your perf.data file:
+```
+perf inject -j -i perf.data -o jit.data
+```
+
+View the profile:
+```
+perf report --no-children --call-graph=graph,0 -i jit.data
+```
+
+All of the above commands can be put into a single shell script.
+
+## Profiling the Browser
+
+Profiling the browser is less straightforward than the shell, but the only main difference is that perf must attach to the content process while it is running.
+
+Begin by removing any pre-existing jitdump files:
+
+`rm -rf output` or `rm -f jitted-*.so jit.data perf.data jit-*.dump jitdump-*.txt`
+
+Next define environment variables:
+```
+export IONPERF=ir
+export PERF_SPEW_DIR=output
+export MOZ_DISABLE_CONTENT_SANDBOX=1
+```
+
+Run the Firefox browser:
+```
+~/mozilla-central/obj-opt64/dist/bin/firefox -no-remote -profile ~/mozilla-central/obj-opt64/tmp/profile-default &
+```
+
+Navigate to the test case, but do not start it yet. Then hover over the tab to get the content process PID.
+
+![](img/pid.png)
+
+Attach perf to begin profiling:
+```
+perf record -g -k 1 -p <PID>
+```
+
+Close the browser when finished benchmarking.
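+
+For reference, here is what the attach step might look like with a value filled in. The PID below is purely hypothetical; use the content process PID shown in the tab tooltip:
+```
+# Attach to the (hypothetical) content process and record until Ctrl+C or exit.
+perf record -g -k 1 -p 12345
+```
+Recording stops when you interrupt perf with Ctrl+C or when the profiled process exits.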
+ +Inject the jitdump files into your perf.data file: +``` +perf inject -j -i perf.data -o jit.data +``` + +View the profile (--call-graph=graph,0 shows all call stacks instead of the default threshold of >= 0.5%): +``` +perf report --no-children --call-graph=graph,0 -i jit.data +``` + +## Additional Information + +Some Linux distributions offer a "libc6-prof" package that includes frame pointers. This can help resolve symbols and call stacks that involve libc calls. + +On Ubuntu, you can install this with: +``` +sudo apt-get install libc6-prof +``` + +libc6-prof can be used with `LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu` + +It may also be useful to have access to kernel addresses during profiling. These can be exposed with: +``` +sudo sh -c "echo 0 > /proc/sys/kernel/kptr_restrict" +``` + +The max stack depth is 127 by default. This is often too few. It can be increased with: +``` +sudo sh -c "echo 4000 > /proc/sys/kernel/perf_event_max_stack" +``` diff --git a/docs/performance/memory/DOM_allocation_example.md b/docs/performance/memory/DOM_allocation_example.md new file mode 100644 index 0000000000..db9a1f2c71 --- /dev/null +++ b/docs/performance/memory/DOM_allocation_example.md @@ -0,0 +1,57 @@ +# DOM allocation example + +This article describes a very simple web page that we\'ll use to +illustrate some features of the Memory tool. + +You can try out the site at +. + +It just contains a script that creates a large number of DOM nodes: + +```js +var toolbarButtonCount = 20; +var toolbarCount = 200; + +function getRandomInt(min, max) { + return Math.floor(Math.random() * (max - min + 1)) + min; +} + +function createToolbarButton() { + var toolbarButton = document.createElement("span"); + toolbarButton.classList.add("toolbarbutton"); + // stop Spidermonkey from sharing instances + toolbarButton[getRandomInt(0,5000)] = "foo"; + return toolbarButton; +} + +function createToolbar() { + var toolbar = document.createElement("div"); + // stop Spidermonkey from sharing instances + toolbar[getRandomInt(0,5000)] = "foo"; + for (var i = 0; i < toolbarButtonCount; i++) { + var toolbarButton = createToolbarButton(); + toolbar.appendChild(toolbarButton); + } + return toolbar; +} + +function createToolbars() { + var container = document.getElementById("container"); + for (var i = 0; i < toolbarCount; i++) { + var toolbar = createToolbar(); + container.appendChild(toolbar); + } +} + +createToolbars(); +``` + +A simple pseudocode representation of how this code operates looks like +this: + + createToolbars() + -> createToolbar() // called 200 times, creates 1 DIV element each time + -> createToolbarButton() // called 20 times per toolbar, creates 1 SPAN element each time + +In total, then, it creates 200 `HTMLDivElement` objects, and 4000 +`HTMLSpanElement` objects. diff --git a/docs/performance/memory/about_colon_memory.md b/docs/performance/memory/about_colon_memory.md new file mode 100644 index 0000000000..ab9dc81062 --- /dev/null +++ b/docs/performance/memory/about_colon_memory.md @@ -0,0 +1,274 @@ +# about:memory + +about:memory is a special page within Firefox that lets you view, save, +load, and diff detailed measurements of Firefox's memory usage. It also +lets you do other memory-related operations like trigger GC and CC, dump +GC & CC logs, and dump DMD reports. It is present in all builds and does +not require any preparation to be used. + +## How to generate memory reports + +Let's assume that you want to measure Firefox's memory usage. 
Perhaps +you want to investigate it yourself, or perhaps someone has asked you to +use about:memory to generate "memory reports" so they can investigate +a problem you are having. Follow these steps. + +- At the moment of interest (e.g. once Firefox's memory usage has + gotten high) open a new tab and type "about:memory" into the + address bar and hit "Enter". +- If you are using a communication channel where files can be sent, + such as Bugzilla or email, click on the "Measure and save..." + button. This will open a file dialog that lets you save the memory + reports to a file of your choosing. (The filename will have a + `.json.gz` suffix.) You can then attach or upload the file + appropriately. The recipients will be able to view the contents of + this file within about:memory in their own Firefox instance. +- If you are using a communication channel where only text can be + sent, such as a comment thread on a website, click on the + "Measure..." button. This will cause a tree-like structure to be + generated text within about:memory. This structure is just text, so + you can copy and paste some or all of this text into any kind of + text buffer. (You don't need to take a screenshot.) This text + contains fewer measurements than a memory reports file, but is often + good enough to diagnose problems. Don't click "Measure..." + repeatedly, because that will cause the memory usage of about:memory + itself to rise, due to it discarding and regenerating large numbers + of DOM nodes. + +Note that in both cases the generated data contains privacy-sensitive +details such as the full list of the web pages you have open in other +tabs. If you do not wish to share this information, you can select the +"anonymize" checkbox before clicking on "Measure and save..." or +"Measure...". This will cause the privacy-sensitive data to be +stripped out, but it may also make it harder for others to investigate +the memory usage. + +## Loading memory reports from file + +The easiest way to load memory reports from file is to use the +"Load..." button. You can also use the "Load and diff..." button +to get the difference between two memory report files. + +Single memory report files can also be loaded automatically when +about:memory is loaded by appending a `file` query string, for example: + + about:memory?file=/home/username/reports.json.gz + +This is most useful when loading memory reports files obtained from a +Firefox OS device. + +Memory reports are saved to file as gzipped JSON. These files can be +loaded as is, but they can also be loaded after unzipping. + +## Interpreting memory reports + +Almost everything you see in about:memory has an explanatory tool-tip. +Hover over any button to see a description of what it does. Hover over +any measurement to see a description of what it means. + +### [Measurement basics] + +Most measurements use bytes as their unit, but some are counts or +percentages. + +Most measurements are presented within trees. For example: + + 585 (100.0%) -- preference-service + └──585 (100.0%) -- referent + ├──493 (84.27%) ── strong + └───92 (15.73%) -- weak + ├──92 (15.73%) ── alive + └───0 (00.00%) ── dead + +Leaf nodes represent actual measurements; the value of each internal +node is the sum of all its children. + +The use of trees allows measurements to be broken down into further +categories, sub-categories, sub-sub-categories, etc., to arbitrary +depth, as needed. All the measurements within a single tree are +non-overlapping. 
+ +Tree paths can be written using \'/\' as a separator. For example, +`preference/referent/weak/dead` represents the path to the final leaf +node in the example tree above. + +Sub-trees can be collapsed or expanded by clicking on them. If you find +any particular tree overwhelming, it can be helpful to collapse all the +sub-trees immediately below the root, and then gradually expand the +sub-trees of interest. + +### [Sections] + +Memory reports are displayed on a per-process basis, with one process +per section. Within each process's measurements, there are the +following subsections. + +#### Explicit Allocations + +This section contains a single tree, called "explicit", that measures +all the memory allocated via explicit calls to heap allocation functions +(such as `malloc` and `new`) and to non-heap allocations functions (such +as `mmap` and `VirtualAlloc`). + +Here is an example for a browser session where tabs were open to +cnn.com, techcrunch.com, and arstechnica.com. Various sub-trees have +been expanded and others collapsed for the sake of presentation. + + 191.89 MB (100.0%) -- explicit + ├───63.15 MB (32.91%) -- window-objects + │ ├──24.57 MB (12.80%) -- top(http://edition.cnn.com/, id=8) + │ │ ├──20.18 MB (10.52%) -- active + │ │ │ ├──10.57 MB (05.51%) -- window(http://edition.cnn.com/) + │ │ │ │ ├───4.55 MB (02.37%) ++ js-compartment(http://edition.cnn.com/) + │ │ │ │ ├───2.60 MB (01.36%) ++ layout + │ │ │ │ ├───1.94 MB (01.01%) ── style-sheets + │ │ │ │ └───1.48 MB (00.77%) -- (2 tiny) + │ │ │ │ ├──1.43 MB (00.75%) ++ dom + │ │ │ │ └──0.05 MB (00.02%) ── property-tables + │ │ │ └───9.61 MB (05.01%) ++ (18 tiny) + │ │ └───4.39 MB (02.29%) -- js-zone(0x7f69425b5800) + │ ├──15.75 MB (08.21%) ++ top(http://techcrunch.com/, id=20) + │ ├──12.85 MB (06.69%) ++ top(http://arstechnica.com/, id=14) + │ ├───6.40 MB (03.33%) ++ top(chrome://browser/content/browser.xul, id=3) + │ └───3.59 MB (01.87%) ++ (4 tiny) + ├───45.74 MB (23.84%) ++ js-non-window + ├───33.73 MB (17.58%) ── heap-unclassified + ├───22.51 MB (11.73%) ++ heap-overhead + ├────6.62 MB (03.45%) ++ images + ├────5.82 MB (03.03%) ++ workers/workers(chrome) + ├────5.36 MB (02.80%) ++ (16 tiny) + ├────4.07 MB (02.12%) ++ storage + ├────2.74 MB (01.43%) ++ startup-cache + └────2.16 MB (01.12%) ++ xpconnect + +Some expertise is required to understand the full details here, but +there are various things worth pointing out. + +- This "explicit" value at the root of the tree represents all the + memory allocated via explicit calls to allocation functions. +- The "window-objects" sub-tree represents all JavaScript `window` + objects, which includes the browser tabs and UI windows. For + example, the "top(http://edition.cnn.com/, id=8)" sub-tree + represents the tab open to cnn.com, and + "top(chrome://browser/content/browser.xul, id=3)" represents the + main browser UI window. +- Within each window's measurements are sub-trees for JavaScript + ("js-compartment(...)" and "js-zone(...)"), layout, + style-sheets, the DOM, and other things. +- It's clear that the cnn.com tab is using more memory than the + techcrunch.com tab, which is using more than the arstechnica.com + tab. +- Sub-trees with names like "(2 tiny)" are artificial nodes inserted + to allow insignificant sub-trees to be collapsed by default. If you + select the "verbose" checkbox before measuring, all trees will be + shown fully expanded and no artificial nodes will be inserted. 
+- The "js-non-window" sub-tree represents JavaScript memory usage + that doesn't come from windows, but from the browser core. +- The "heap-unclassified" value represents heap-allocated memory + that is not measured by any memory reporter. This is typically + 10--20% of "explicit". If it gets higher, it indicates that + additional memory reporters should be added. + [DMD](./dmd.md) + can be used to determine where these memory reporters should be + added. +- There are measurements for other content such as images and workers, + and for browser subsystems such as the startup cache and XPConnect. + +Some add-on memory usage is identified, as the following example shows. + + ├───40,214,384 B (04.17%) -- add-ons + │ ├──21,184,320 B (02.20%) ++ {d10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d}/js-non-window/zones/zone(0x100496800)/compartment([System Principal], jar:file:///Users/njn/Library/Application%20Support/Firefox/Profiles/puna0zr8.new/extensions/%7Bd10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d%7D.xpi!/bootstrap.js (from: resource://gre/modules/addons/XPIProvider.jsm:4307)) + │ ├──11,583,312 B (01.20%) ++ jid1-xUfzOsOFlzSOXg@jetpack/js-non-window/zones/zone(0x100496800) + │ ├───5,574,608 B (00.58%) -- {59c81df5-4b7a-477b-912d-4e0fdf64e5f2} + │ │ ├──5,529,280 B (00.57%) -- window-objects + │ │ │ ├──4,175,584 B (00.43%) ++ top(chrome://chatzilla/content/chatzilla.xul, id=4293) + │ │ │ └──1,353,696 B (00.14%) ++ top(chrome://chatzilla/content/output-window.html, id=4298) + │ │ └─────45,328 B (00.00%) ++ js-non-window/zones/zone(0x100496800)/compartment([System Principal], file:///Users/njn/Library/Application%20Support/Firefox/Profiles/puna0zr8.new/extensions/%7B59c81df5-4b7a-477b-912d-4e0fdf64e5f2%7D/components/chatzilla-service.js) + │ └───1,872,144 B (00.19%) ++ treestyletab@piro.sakura.ne.jp/js-non-window/zones/zone(0x100496800) + +More things worth pointing out are as follows. + +- Some add-ons are identified by a name, such as Tree Style Tab. + Others are identified only by a hexadecimal identifier. You can look + in about:support to see which add-on a particular identifier belongs + to. For example, `59c81df5-4b7a-477b-912d-4e0fdf64e5f2` is + Chatzilla. +- All JavaScript memory usage for an add-on is measured separately and + shown in this sub-tree. +- For add-ons that use separate windows, such as Chatzilla, the memory + usage of those windows will show up in this sub-tree. +- For add-ons that use XUL overlays, such as AdBlock Plus, the memory + usage of those overlays will not show up in this sub-tree; it will + instead be in the non-add-on sub-trees and won't be identifiable as + being caused by the add-on. + +#### Other Measurements + +This section contains multiple trees, includes many that cross-cut the +measurements in the "explicit" tree. For example, in the "explicit" +tree all DOM and layout measurements are broken down by window by +window, but in "Other Measurements" those measurements are aggregated +into totals for the whole browser, as the following example shows. 
+ + 26.77 MB (100.0%) -- window-objects + ├──14.59 MB (54.52%) -- layout + │ ├───6.22 MB (23.24%) ── style-sets + │ ├───4.00 MB (14.95%) ── pres-shell + │ ├───1.79 MB (06.68%) ── frames + │ ├───0.89 MB (03.33%) ── style-contexts + │ ├───0.62 MB (02.33%) ── rule-nodes + │ ├───0.56 MB (02.10%) ── pres-contexts + │ ├───0.47 MB (01.75%) ── line-boxes + │ └───0.04 MB (00.14%) ── text-runs + ├───6.53 MB (24.39%) ── style-sheets + ├───5.59 MB (20.89%) -- dom + │ ├──3.39 MB (12.66%) ── element-nodes + │ ├──1.56 MB (05.84%) ── text-nodes + │ ├──0.54 MB (02.03%) ── other + │ └──0.10 MB (00.36%) ++ (4 tiny) + └───0.06 MB (00.21%) ── property-tables + +Some of the trees in this section measure things that do not cross-cut +the measurements in the "explicit" tree, such as those in the +"preference-service" example above. + +Finally, at the end of this section are individual measurements, as the +following example shows. + + 0.00 MB ── canvas-2d-pixels + 5.38 MB ── gfx-surface-xlib + 0.00 MB ── gfx-textures + 0.00 MB ── gfx-tiles-waste + 0 ── ghost-windows + 109.22 MB ── heap-allocated + 164 ── heap-chunks + 1.00 MB ── heap-chunksize + 114.51 MB ── heap-committed + 164.00 MB ── heap-mapped + 4.84% ── heap-overhead-ratio + 1 ── host-object-urls + 0.00 MB ── imagelib-surface-cache + 5.27 MB ── js-main-runtime-temporary-peak + 0 ── page-faults-hard + 203,349 ── page-faults-soft + 274.99 MB ── resident + 251.47 MB ── resident-unique + 1,103.64 MB ── vsize + +Some measurements of note are as follows. + +- "resident". Physical memory usage. If you want a single + measurement to summarize memory usage, this is probably the best + one. +- "vsize". Virtual memory usage. This is often much higher than any + other measurement (particularly on Mac). It only really matters on + 32-bit platforms such as Win32. There is also + "vsize-max-contiguous" (not measured on all platforms, and not + shown in this example), which indicates the largest single chunk of + available virtual address space. If this number is low, it's likely + that memory allocations will fail due to lack of virtual address + space quite soon. +- Various graphics-related measurements ("gfx-*"). The measurements + taken vary between platforms. Graphics is often a source of high + memory usage, and so these measurements can be helpful for detecting + such cases. diff --git a/docs/performance/memory/aggregate_view.md b/docs/performance/memory/aggregate_view.md new file mode 100644 index 0000000000..9a4f01e01e --- /dev/null +++ b/docs/performance/memory/aggregate_view.md @@ -0,0 +1,198 @@ +# Aggregate view + +Before Firefox 48, this was the default view of a heap snapshot. After +Firefox 48, the default view is the [Tree map +view](tree_map_view.md), and you can switch to the +Aggregate view using the dropdown labeled \"View:\": + +![](../img/memory-tool-switch-view.png) + +The Aggregate view looks something like this: + +![](../img/memory-tool-aggregate-view.png) + +It presents a breakdown of the heap\'s contents, as a table. There are +three main ways to group the data: + +- Type +- Call Stack +- Inverted Call Stack + +You can switch between them using the dropdown menu labeled \"Group +by:\" located at the top of the panel: + +There\'s also a box labeled \"Filter\" at the top-right of the pane. You +can use this to filter the contents of the snapshot that are displayed, +so you can quickly see, for example, how many objects of a specific +class were allocated. 
+ +## Type + +This is the default view, which looks something like this: + +![](../img/memory-tool-aggregate-view.png) + +It groups the things on the heap into types, including: + +- **JavaScript objects:** such as `Function` or `Array` +- **DOM elements:** such as `HTMLSpanElement` or `Window` +- **Strings:** listed as `"strings"` +- **JavaScript sources:** listed as \"`JSScript"` +- **Internal objects:** such as \"`js::Shape`\". These are prefixed + with `"js::"`. + +Each type gets a row in the table, and rows are ordered by the amount of +memory occupied by objects of that type. For example, in the screenshot +above you can see that JavaScript `Object`s account for most memory, +followed by strings. + +- The \"Total Count\" column shows you the number of objects of each + category that are currently allocated. +- The \"Total Bytes\" column shows you the number of bytes occupied by + objects in each category, and that number as a percentage of the + whole heap size for that tab. + +The screenshots in this section are taken from a snapshot of the +[monster example page](monster_example.md). + +For example, in the screenshot above, you can see that: + +- there are four `Array` objects +- that account for 15% of the total heap. + +Next to the type\'s name, there\'s an icon that contains three stars +arranged in a triangle: + +![](../img/memory-tool-in-group-icon.png) + +Click this to see every instance of that type. For example, the entry +for `Array` tells us that there are four `Array` objects in the +snapshot. If we click the star-triangle, we\'ll see all four `Array` +instances: + +![](../img/memory-tool-in-group.png) + +For each instance, you can see the [retained size and shallow +size](dominators.md#shallow_and_retained_size) of +that instance. In this case, you can see that the first three arrays +have a fairly large shallow size (5% of the total heap usage) and a much +larger retained size (26% of the total). + +On the right-hand side is a pane that just says \"Select an item to view +its retaining paths\". If you select an item, you\'ll see the [Retaining +paths +panel](dominators_view.md#retaining_paths_panel) +for that item: + +![](../img/memory-tool-in-group-retaining-paths.png) + + + + +## Call Stack + +The Call Stack shows you exactly where in your code you are making heap +allocations. + +Because tracing allocations has a runtime cost, it must be explicitly +enabled by checking \"Record call stacks\" *before* you allocate the +memory in the snapshot. + +You\'ll then see a list of all the functions that allocated objects, +ordered by the size of the allocations they made: + +![](../img/memory-tool-call-stack.png) +\ +The first entry says that: + +- 4,832,592 bytes, comprising 93% of the total heap usage, were + allocated in a function at line 35 of \"alloc.js\", **or in + functions called by that function** + +We can use the disclosure triangle to drill down the call tree, to find +the exact place your code made those allocations. + +It\'s easier to explain this with reference to a simple example. For +this we\'ll use the [DOM allocation +example](DOM_allocation_example.md). This page +runs a script that creates a large number of DOM nodes (200 +`HTMLDivElement` objects and 4000 `HTMLSpanElement` objects). + +Let\'s get an allocation trace: + +1. open the Memory tool +2. check \"Record call stacks\" +3. load + +4. take a snapshot +5. select \"View/Aggregate\" +6. 
select \"Group by/Call Stack\" + + + +You should see something like this: + +![](../img/memory-tool-call-stack.png) + +This is telling us that 93% of the total heap snapshot was allocated in +functions called from \"alloc.js\", line 35 (our initial +`createToolbars()` call). + +We can use the disclosure arrow to expand the tree to find out exactly +where we\'re allocating memory: + +![](../img/memory-tool-call-stack-expanded.png) + +This is where the \"Bytes\" and \"Count\" columns are useful: they show +allocation size and number of allocations at that exact point. + +So in the example above, we can see that we made 4002 allocations, +accounting for 89% of the total heap, in `createToolbarButton()`, at +[alloc.js line 9, position +23](https://github.com/mdn/performance-scenarios/blob/gh-pages/dom-allocs/scripts/alloc.js#L9): +that is, the exact point where we create the span +elements. + +The file name and line number is a link: if we click it, we go directly +to that line in the debugger: + + + + +## Inverted Call Stack + +The Call Stack view is top-down: it shows allocations that happen at +that point **or points deeper in the call tree**. So it\'s good for +getting an overview of where your program is memory-hungry. However, +this view means you have to drill a long way down to find the exact +place where the allocations are happening. + +The \"Inverted Call Stack\" view helps with that. It gives you the +bottom-up view of the program showing the exact places where allocations +are happening, ranked by the size of allocation at each place. The +disclosure arrow then walks you back up the call tree towards the top +level. + +Let\'s see what the example looks like when we select \"Inverted Call +Stack\": + +![](../img/memory-tool-inverted-call-stack.png) + +Now at the top we can immediately see the `createToolbarButton()` call +accounting for 89% of the heap usage in our page. + +## no stack available + +In the example above you\'ll note that 7% of the heap is marked \"(no +stack available)\". This is because not all heap usage results from your +JavaScript. + +For example: + +- any scripts the page loads occupy heap space +- sometimes an object is allocated when there is no JavaScript on the + stack. For example, DOM Event objects are allocated + before the JavaScript is run and event handlers are called. + +Many real-world pages will have a much higher \"(no stack available)\" +share than 7%. diff --git a/docs/performance/memory/awsy.md b/docs/performance/memory/awsy.md new file mode 100644 index 0000000000..5026f055aa --- /dev/null +++ b/docs/performance/memory/awsy.md @@ -0,0 +1,22 @@ +# Are We Slim Yet (AWSY) + +The Are We Slim Yet project (commonly known as AWSY) for several years +tracked memory usage across builds on the (now defunct) website. +It used the same +infrastructure as +[about:memory](about_colon_memory.md) to measure +memory usage on a predefined snapshot of Alexa top 100 pages known as +tp5. + +Since Firefox transitioned to using multiple processes by default, we +[moved AWSY into the +TaskCluster](https://bugzilla.mozilla.org/show_bug.cgi?id=1272113) +infrastructure. This allowed us to run measurements on all branches and +platforms. The results are posted to +[perfherder](https://treeherder.mozilla.org/perf.html) where we can +track regressions automatically. + +As new processes are added to Firefox we want to make sure their memory +usage is also tracked by AWSY. 
To this end we request that memory
+reporting be integrated into any new process before it is enabled on
+Nightly. diff --git a/docs/performance/memory/basic_operations.md b/docs/performance/memory/basic_operations.md new file mode 100644 index 0000000000..276c38bc2e --- /dev/null +++ b/docs/performance/memory/basic_operations.md @@ -0,0 +1,82 @@
+# Basic operations
+
+## Opening the Memory tool
+
+Before Firefox 50, the Memory tool is not enabled by default. To enable
+it, open the developer tool settings, and check the "Memory" box under
+"Default Firefox Developer Tools":
+
+
+
+From Firefox 50 onwards, the Memory tool is enabled by default.
+
+## Taking a heap snapshot
+
+To take a snapshot of the heap, click the "Take snapshot" button, or
+the camera icon on the left:
+
+![memoryimage1](../img/memory-1-small.png)
+
+The snapshot will occupy the large pane on the right-hand side. On the
+left, you'll see an entry for the new snapshot, including its
+timestamp, size, and controls to save or clear this snapshot:
+
+![memoryimage2](../img/memory-2-small.png)
+
+## Clearing a snapshot
+
+To remove a snapshot, click the "X" icon:
+
+![memoryimage3](../img/memory-3-small.png)
+
+## Saving and loading snapshots
+
+If you close the Memory tool, all unsaved snapshots will be discarded.
+To save a snapshot click "Save":
+
+![memoryimage4](../img/memory-4-small.png)
+
+You'll be prompted for a name and location, and the file will be saved
+with an `.fxsnapshot` extension.
+
+To load a snapshot from an existing `.fxsnapshot` file, click the import
+button, which looks like a rectangle with an arrow rising from it
+(before Firefox 49, this button was labeled with the text
+"Import\...\"):
+
+![memoryimage5](../img/memory-5-small.png)
+
+You'll be prompted to find a snapshot file on disk.
+
+## Comparing snapshots
+
+Starting in Firefox 45, you can diff two heap snapshots. The diff shows
+you where memory was allocated or freed between the two snapshots.
+
+To create a diff, click the button that looks like a Venn diagram next
+to the camera icon (before Firefox 47, this looked like a \"+/-\" icon):
+
+![memoryimage6](../img/memory-6-small.png)
+
+You'll be prompted to select the snapshot to use as a baseline, then
+the snapshot to compare. The tool then shows you the differences between
+the two snapshots:
+
+
+
+
+::: {.note}
+When you're looking at a comparison, you can't use the Dominators view
+or the Tree Map view.
+:::
+
+## Recording call stacks
+
+The Memory tool can tell you exactly where in your code you are
+allocating memory. However, recording this information has a run-time
+cost, so you must ask the tool to record memory calls *before* the
+memory is allocated, if you want to see memory call sites in the
+snapshot. To do this, check "Record call stacks" (before Firefox 49
+this was labeled "Record allocation stacks"):
+
+![memoryimage7](../img/memory-7-small.png) diff --git a/docs/performance/memory/bloatview.md b/docs/performance/memory/bloatview.md new file mode 100644 index 0000000000..9e290011b1 --- /dev/null +++ b/docs/performance/memory/bloatview.md @@ -0,0 +1,245 @@
+# BloatView
+
+BloatView is a tool that shows information about cumulative memory usage
+and leaks. If it finds leaks, you can use [refcount tracing and balancing](refcount_tracing_and_balancing.md)
+to discover the root cause.
+
+## How to build with BloatView
+
+Build with `--enable-debug` or `--enable-logrefcnt`.
+
+## How to run with BloatView
+
+There are two environment variables that can be used.
+ + XPCOM_MEM_BLOAT_LOG + +If set, this causes a *bloat log* to be printed on program exit, and +each time `nsTraceRefcnt::DumpStatistics` is called. This log contains +data on leaks and bloat (a.k.a. usage). + + XPCOM_MEM_LEAK_LOG + +This is similar to `XPCOM_MEM_BLOAT_LOG`, but restricts the log to only +show data on leaks. + +You can set these environment variables to any of the following values. + +- **1** - log to stdout. +- **2** - log to stderr. +- ***filename*** - write log to a file. + +## Reading individual bloat logs + +Full BloatView output contains per-class statistics on allocations and +refcounts, and provides gross numbers on the amount of memory being +leaked broken down by class. Here's a sample of the BloatView output. + + == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 1862 + |<----------------Class--------------->|<-----Bytes------>|<----Objects---->| + | | Per-Inst Leaked| Total Rem| + 0 |TOTAL | 17 2484|253953338 38| + 17 |AsyncTransactionTrackersHolder | 40 40| 10594 1| + 78 |CompositorChild | 472 472| 1 1| + 79 |CondVar | 24 48| 3086 2| + 279 |MessagePump | 8 8| 30 1| + 285 |Mutex | 20 60| 89987 3| + 302 |PCompositorChild | 412 412| 1 1| + 308 |PImageBridgeChild | 416 416| 1 1| + +The first line tells you the pid of the leaking process, along with the +type of process. + +Here's how you interpret the columns. + +- The first, numerical column [is the index](https://searchfox.org/mozilla-central/source/xpcom/base/nsTraceRefcnt.cpp#365) + of the leaking class. +- **Class** - The name of the class in question (truncated to 20 + characters). +- **Bytes Per-Inst** - The number of bytes returned if you were to + write `sizeof(Class)`. Note that this number does not reflect any + memory held onto by the class, such as internal buffers, etc. (E.g. + for `nsString` you'll see the size of the header struct, not the + size of the string contents!) +- **Bytes Leaked** - The number of bytes per instance times the number + of objects leaked: (Bytes Per-Inst) x (Objects Rem). Use this number + to look for the worst offenders. (Should be zero!) +- **Objects Total** - The total count of objects allocated of a given + class. +- **Objects Rem** - The number of objects allocated of a given class + that weren't deleted. (Should be zero!) + +Interesting things to look for: + +- **Are your classes in the list?** - Look! If they aren't, then + you're not using the `NS_IMPL_ADDREF` and `NS_IMPL_RELEASE` (or + `NS_IMPL_ISUPPORTS` which calls them) for xpcom objects, or + `MOZ_COUNT_CTOR` and `MOZ_COUNT_DTOR` for non-xpcom objects. Not + having your classes in the list is *not* ok. That means no one is + looking at them, and we won't be able to tell if someone introduces + a leak. (See + [below](#how-to-instrument-your-objects-for-bloatview) + for how to fix this.) +- **The Bytes Leaked for your classes should be zero!** - Need I say + more? If it isn't, you should use the other tools to fix it. +- **The number of objects remaining might not be equal to the total + number of objects.** This could indicate a hand-written Release + method (that doesn't use the `NS_LOG_RELEASE` macro from + nsTraceRefcnt.h), or perhaps you're just not freeing any of the + instances you've allocated. These sorts of leaks are easy to fix. +- **The total number of objects might be 1.** This might indicate a + global variable or service. Usually this will have a large number of + refcounts. 
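+
+To generate a log like the sample above, a debug build can be launched with `XPCOM_MEM_BLOAT_LOG` pointing at a file. This is only a sketch: the log filename is arbitrary, and it assumes you are launching a local debug build with `mach`.
+
+    # Dump cumulative bloat/leak statistics to bloat.log when the browser exits.
+    XPCOM_MEM_BLOAT_LOG=bloat.log ./mach run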
+ +If you find leaks, you can use [refcount tracing and balancing](refcount_tracing_and_balancing.md) +to discover the root cause. + +## Combining and sorting bloat logs + +You can view one or more bloat logs in your browser by running the +following program. + + perl tools/bloatview/bloattable.pl *log1* *log2* \... *logn* > + *htmlfile* + +This will produce an HTML file that contains a table similar to the +following (but with added JavaScript so you can sort the data by +column). + + Byte Bloats + + ---------- ---------------- -------------------------- + Name File Date + blank `blank.txt` Tue Aug 29 14:17:40 2000 + mozilla `mozilla.txt` Tue Aug 29 14:18:42 2000 + yahoo `yahoo.txt` Tue Aug 29 14:19:32 2000 + netscape `netscape.txt` Tue Aug 29 14:20:14 2000 + ---------- ---------------- -------------------------- + +The numbers do not include malloc'd data such as string contents. + +Click on a column heading to sort by that column. Click on a class name +to see details for that class. + + -------------------- --------------- ----------------- --------- --------- ---------- ---------- ------------------------------- --------- -------- ---------- --------- + Class Name Instance Size Bytes allocated Bytes allocated but not freed + blank mozilla yahoo netscape Total blank mozilla yahoo netscape Total + TOTAL 1754408 432556 179828 404184 2770976 + nsStr 20 6261600 3781900 1120920 1791340 12955760 222760 48760 13280 76160 360960 + nsHashKey 8 610568 1842400 2457872 1134592 6045432 32000 536 568 1216 34320 + nsTextTransformer 548 8220 469088 1414936 1532756 3425000 0 0 0 0 0 + nsStyleContextData 736 259808 325312 489440 338560 1413120 141312 220800 -11040 94944 446016 + nsLineLayout 1100 2200 225500 402600 562100 1192400 0 0 0 0 0 + nsLocalFile 424 558832 19928 1696 1272 581728 72080 1272 424 -424 73352 + -------------------- --------------- ----------------- --------- --------- ---------- ---------- ------------------------------- --------- -------- ---------- --------- + +The first set of columns, **Bytes allocated**, shows the amount of +memory allocated for the first log file (`blank.txt`), the difference +between the first log file and the second (`mozilla.txt`), the +difference between the second log file and the third (`yahoo.txt`), the +difference between the third log file and the fourth (`netscape.txt`), +and the total amount of memory allocated in the fourth log file. These +columns provide an idea of how hard the memory allocator has to work, +but they do not indicate the size of the working set. + +The second set of columns, **Bytes allocated but not freed**, shows the +net memory gain or loss by subtracting the amount of memory freed from +the amount allocated. + +The **Show Objects** and **Show References** buttons show the same +statistics but counting objects or `AddRef`'d references rather than +bytes. + +## Comparing Bloat Logs + +You can also compare any two bloat logs (either those produced when the +program shuts down, or written to the bloatlogs directory) by running +the following program. 
+ + `perl tools/bloatview/bloatdiff.pl` + +This will give you output of the form: + + Bloat/Leak Delta Report + Current file: dist/win32_D.OBJ/bin/bloatlogs/all-1999-10-22-133450.txt + Previous file: dist/win32_D.OBJ/bin/bloatlogs/all-1999-10-16-010302.txt + -------------------------------------------------------------------------- + CLASS LEAKS delta BLOAT delta + -------------------------------------------------------------------------- + TOTAL 6113530 2.79% 67064808 9.18% + StyleContextImpl 265440 81.19% 283584 -26.99% + CToken 236500 17.32% 306676 20.64% + nsStr 217760 14.94% 5817060 7.63% + nsXULAttribute 113048 -70.92% 113568 -71.16% + LiteralImpl 53280 26.62% 75840 19.40% + nsXULElement 51648 0.00% 51648 0.00% + nsProfile 51224 0.00% 51224 0.00% + nsFrame 47568 -26.15% 48096 -50.49% + CSSDeclarationImpl 42984 0.67% 43488 0.67% + +This "delta report" shows the leak offenders, sorted from most leaks +to fewest. The delta numbers show the percentage change between runs for +the amount of leaks and amount of bloat (negative numbers are better!). +The bloat number is a metric determined by multiplying the total number +of objects allocated of a given class by the class size. Note that +although this isn't necessarily the amount of memory consumed at any +given time, it does give an indication of how much memory we're +consuming. The more memory in general, the worse the performance and +footprint. The percentage 99999.99% will show up indicating an +"infinite" amount of leakage. This happens when something that didn't +leak before is now leaking. + +## BloatView and continuous integration + +BloatView runs on debug builds for many of the test suites Mozilla has +running under continuous integration. If a new leak occurs, it will +trigger a test job failure. + +BloatView's output file can also show you where the leaked objects are +allocated. To do so, the `XPCOM_MEM_LOG_CLASSES` environment variable +should be set to the name of the class from the BloatView table: + + XPCOM_MEM_LOG_CLASSES=MyClass mach mochitest [options] + +Multiple class names can be specified by setting `XPCOM_MEM_LOG_CLASSES` +to a comma-separated list of names: + + XPCOM_MEM_LOG_CLASSES=MyClass,MyOtherClass,DeliberatelyLeakedClass mach mochitest [options] + +Test harness scripts typically accept a `--setenv` option for specifying +environment variables, which may be more convenient in some cases: + + mach mochitest --setenv=XPCOM_MEM_LOG_CLASSES=MyClass [options] + +For getting allocation stacks in automation, you can add the appropriate +`--setenv` options to the test configurations for the platforms you're +interested in. Those configurations are located in +`testing/mozharness/configs/`. The most likely configs you'll want to +modify are listed below: + +- Linux: `unittests/linux_unittest.py` +- Mac: `unittests/mac_unittest.py` +- Windows: `unittests/win_unittest.py` +- Android: `android/androidarm.py` + +## How to instrument your objects for BloatView + +First, if your object is an xpcom object and you use the +`NS_IMPL_ADDREF` and `NS_IMPL_RELEASE` (or a variation thereof) macro to +implement your `AddRef` and `Release` methods, then there is nothing you +need do. By default, those macros support refcnt logging directly. + +If your object is not an xpcom object then some manual editing is in +order. The following sample code shows what must be done: + + MyType::MyType() + { + MOZ_COUNT_CTOR(MyType); + ... + } + + MyType::~MyType() + { + MOZ_COUNT_DTOR(MyType); + ... 
+ } diff --git a/docs/performance/memory/dmd.md b/docs/performance/memory/dmd.md new file mode 100644 index 0000000000..ebd6b5a2f8 --- /dev/null +++ b/docs/performance/memory/dmd.md @@ -0,0 +1,489 @@ +# Dark Matter Detector (DMD) + +DMD (short for "dark matter detector") is a heap profiler within +Firefox. It has four modes. + +- "Dark Matter" mode. In this mode, DMD tracks the contents of the + heap, including which heap blocks have been reported by memory + reporters. It helps us reduce the "heap-unclassified" value in + Firefox's about:memory page, and also detects if any heap blocks + are reported twice. Originally, this was the only mode that DMD had, + which explains DMD's name. This is the default mode. +- "Live" mode. In this mode, DMD tracks the current contents of the + heap. You can dump that information to file, giving a profile of the + live heap blocks at that point in time. This is good for + understanding how memory is used at an interesting point in time, + such as peak memory usage. +- "Cumulative" mode. In this mode, DMD tracks both the past and + current contents of the heap. You can dump that information to file, + giving a profile of the heap usage for the entire session. This is + good for finding parts of the code that cause high heap churn, e.g. + by allocating many short-lived allocations. +- "Heap scanning" mode. This mode is like live mode, but it also + records the contents of every live block in the log. This can be + used to investigate leaks by figuring out which objects might be + holding references to other objects. + +## Building and Running + +### Nightly Firefox + +The easiest way to use DMD is with the normal Nightly Firefox build, +which has DMD already enabled in the build. To have DMD active while +running it, you just need to set the environment variable `DMD=1` when +running. For instance, on OSX, you can run something like: + + DMD=1 /Applications/Firefox\ Nightly.app/Contents/MacOS/firefox + +You can tell it is working by going to `about:memory` and looking for +"Save DMD Output". If DMD has been properly enabled, the "Save" +button won't be grayed out. Look at the "Trigger" section below to +see the full list of ways to get a DMD report once you have it +activated. Note that the stack information you get will likely be less +detailed, due to being unable to symbolicate. You will be able to get +function names, but not line numbers. + +### Desktop Firefox + +#### Build + +Build Firefox with this option added to your mozconfig: + + ac_add_options --enable-dmd + +If building via try server, modify +`browser/config/mozconfigs/linux64/common-opt` or a similar file before +pushing. + +#### Launch + +Use `mach run --dmd`; use `--mode` to choose the mode. + +On a Windows build done by the try server, [these +instructions](https://bugzilla.mozilla.org/show_bug.cgi?id=936784#c69) from +2013 may or may not be useful. + +#### Trigger + +There are a few ways to trigger a DMD snapshot. Most of these will also +first get a memory report. When DMD is working on writing its output, it +will print logging like this: + + DMD[5222] opened /tmp/dmd-1414556492-5222.json.gz for writing + DMD[5222] Dump 1 { + DMD[5222] Constructing the heap block list... + DMD[5222] Constructing the stack trace table... + DMD[5222] Constructing the stack frame table... + DMD[5222] } + +You'll see separate output for each process. This step can take 10 or +more seconds and may make Firefox freeze temporarily. + +If you see the "opened" line, it tells you where the file was saved. 
+It's always in a temp directory, and the filenames are always of the +form dmd-. + +The ways to trigger a DMD snapshot are: + +1. Visit about:memory and click the "Save" button under "Save DMD output". + The button won't be present in non-DMD builds, and will be grayed out + in DMD builds if DMD isn't enabled at start-up. + +2. If you wish to trigger DMD dumps from within C++ or JavaScript code, + you can use `nsIMemoryInfoDumper.dumpMemoryInfoToTempDir`. For example, + from JavaScript code you can do the following. + + const Cc = Components.classes; + let mydumper = Cc["@mozilla.org/memory-info-dumper;1"] + .getService(Ci.nsIMemoryInfoDumper); + mydumper.dumpMemoryInfoToTempDir(identifier, anonymize, minimize); + + This will dump memory reports and DMD output to the temporary + directory. `identifier` is a string that will be used for part of + the filename (or a timestamp will be used if it is an empty string); + `anonymize` is a boolean that indicates if the memory reports should + be anonymized; and `minimize` is a boolean that indicates if memory + usage should be minimized first. + +3. On Linux, you can send signal 34 to the firefox process, e.g. + with the following command. + + $ killall -34 firefox + +4. The `MOZ_DMD_SHUTDOWN_LOG` environment variable, if set, triggers a DMD + run at shutdown; its value must be a directory where the logs will be + placed. This is mostly useful for debugging leaks. Which processes get + logged is controlled by the `MOZ_DMD_LOG_PROCESS` environment variable. + If this is not set, it will log all processes. It can be set to any valid + value of `XRE_GetProcessTypeString()` and will log only those processes. + For instance, if set to `default` it will only log the parent process. If + set to `tab`, it will log content processes only. + + For example, if you have + + MOZ_DMD_SHUTDOWN_LOG=~/dmdlogs/ MOZ_DMD_LOG_PROCESS=tab + + then DMD will create logs at shutdown for content processes and save them to + `~/dmdlogs/`. + +**NOTE:** + +- To dump DMD data from content processes, you'll need to disable the + sandbox with `MOZ_DISABLE_CONTENT_SANDBOX=1`. +- MOZ_DMD_SHUTDOWN_LOG must (currently) include the trailing separator + (\'\'/\") + + +### Fennec + +**NOTE:** + +You'll note from the name of this section being "Fennec" that these instructions +are very old. Hopefully they'll be more useful than not having them. + +**NOTE:** + +In order to use DMD on Fennec you will need root access on the Android +device. Instructions on how to root your device is outside the scope of +this document. + + +#### Build + +Build with these options: + + ac_add_options --enable-dmd + +#### Prep + +In order to prepare your device for running Fennec with DMD enabled, you +will need to do a few things. First, you will need to push the libdmd.so +library to the device so that it can by dynamically loaded by Fennec. +You can do this by running: + + adb push $OBJDIR/dist/bin/libdmd.so /sdcard/ + +Second, you will need to make an executable wrapper for Fennec which +sets an environment variable before launching it. (If you are familiar +with the recommended "--es env0" method for setting environment +variables when launching Fennec, note that you cannot use this method +here because those are processed too late in the startup process. If you +are not familiar with that method, you can ignore this parenthetical +note.) First make the executable wrapper on your host machine using the +editor of your choice. 
Name the file dmd_fennec and enter this as the +contents: + + #!/system/bin/sh + export MOZ_REPLACE_MALLOC_LIB=/sdcard/libdmd.so + exec "$@" + +If you want to use other DMD options, you can enter additional +environment variables above. You will need to push this to the device +and make it executable. Since you cannot mark files in /sdcard/ as +executable, we will use /data/local/tmp for this purpose: + + adb push dmd_fennec /data/local/tmp + adb shell + cd /data/local/tmp + chmod 755 dmd_fennec + +The final step is to make Android use the above wrapper script while +launching Fennec, so that the environment variable is present when +Fennec starts up. Assuming you have done a local build, the app +identifier will be `org.mozilla.fennec_$USERNAME` (`$USERNAME` is your +username on the host machine) and so we do this as shown below. If you +are using a DMD-enabled try build, or build from other source, adjust +the app identifier as necessary. + + adb shell + su # You need root access for the setprop command to take effect + setprop wrap.org.mozilla.fennec_$USERNAME "/data/local/tmp/dmd_fennec" + +Once this is set up, starting the `org.mozilla.fennec_$USERNAME` app +will use the wrapper script. + +#### Launch + +Launch Fennec either by tapping on the icon as usual, or from the +command line (as before, be sure to replace +`org.mozilla.fennec_$USERNAME` with the app identifier as appropriate). + + adb shell am start -n org.mozilla.fennec_$USERNAME/.App + +#### Trigger + +Use the existing memory-report dumping hook: + + adb shell am broadcast -a org.mozilla.gecko.MEMORY_DUMP + +In logcat, you should see output similar to this: + + I/DMD (20731): opened /storage/emulated/0/Download/memory-reports/dmd-default-20731.json.gz for writing + ... + I/GeckoConsole(20731): nsIMemoryInfoDumper dumped reports to /storage/emulated/0/Download/memory-reports/unified-memory-report-default-20731.json.gz + +The path is where the memory reports and DMD reports get dumped to. You +can pull them like so: + + adb pull /sdcard/Download/memory-reports/dmd-default-20731.json.gz + adb pull /sdcard/Download/memory-reports/unified-memory-report-default-20731.json.gz + +## Processing the output + +DMD outputs one gzipped JSON file per process that contains a +description of that process's heap. You can analyze these files (either +gzipped or not) using `dmd.py`. On Nightly Firefox, `dmd.py` is included +in the distribution. For instance on OS X, it is located in the +directory `/Applications/Firefox Nightly.app/Contents/Resources/`. For +Nightly, symbolication will fail, but you can at least get some +information. In a local build, `dmd.py` will be located in the directory +`$OBJDIR/dist/bin/`. + +Some platforms (Linux, Mac, Android) require stack fixing, which adds +missing filenames, function names and line number information. This will +occur automatically the first time you run `dmd.py` on the output file. +This can take 10s of seconds or more to complete. (This will fail if +your build does not contain symbols. However, if you have crash reporter +symbols for your build -- as tryserver builds do -- you can use [this +script](https://github.com/mstange/analyze-tryserver-profiles/blob/master/resymbolicate_dmd.py) +instead: clone the whole repo, edit the paths at the top of +`resymbolicate_dmd.py` and run it.) The simplest way to do this is to +just run the `dmd.py` script on your DMD report while your working +directory is `$OBJDIR/dist/bin`. This will allow the local libraries to +be found and used. 
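+
+For example, with a local build this might look something like the
+following (the log path here is just an illustration; use the path that
+DMD printed when it wrote the log):
+
+    cd $OBJDIR/dist/bin
+    ./dmd.py /tmp/dmd-1414556492-5222.json.gz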
+ +If you invoke `dmd.py` without arguments you will get output appropriate +for the mode in which DMD was invoked. + +### "Dark matter" mode output + +For "dark matter" mode, `dmd.py`'s output describes how the live heap +blocks are covered by memory reports. This output is broken into +multiple sections. + +1. "Invocation". This tells you how DMD was invoked, i.e. what + options were used. +2. "Twice-reported stack trace records". This tells you which heap + blocks were reported twice or more. The presence of any such records + indicates bugs in one or more memory reporters. +3. "Unreported stack trace records". This tells you which heap blocks + were not reported, which indicate where additional memory reporters + would be most helpful. +4. "Once-reported stack trace records": like the "Unreported stack + trace records" section, but for blocks reported once. +5. "Summary": gives measurements of the total heap, and the + unreported/once-reported/twice-reported portions of it. + +The "Twice-reported stack trace records" and "Unreported stack trace +records" sections are the most important, because they indicate ways in +which the memory reporters can be improved. + +Here's an example stack trace record from the "Unreported stack trace +records" section. + + Unreported { + 150 blocks in heap block record 283 of 5,495 + 21,600 bytes (20,400 requested / 1,200 slop) + Individual block sizes: 144 x 150 + 0.00% of the heap (16.85% cumulative) + 0.02% of unreported (94.68% cumulative) + Allocated at { + #01: replace_malloc (/home/njn/moz/mi5/go64dmd/memory/replace/dmd/../../../../memory/replace/dmd/DMD.cpp:1286) + #02: malloc (/home/njn/moz/mi5/go64dmd/memory/build/../../../memory/build/replace_malloc.c:153) + #03: moz_xmalloc (/home/njn/moz/mi5/memory/mozalloc/mozalloc.cpp:84) + #04: nsCycleCollectingAutoRefCnt::incr(void*, nsCycleCollectionParticipant*) (/home/njn/moz/mi5/go64dmd/dom/xul/../../dist/include/nsISupportsImpl.h:250) + #05: nsXULElement::Create(nsXULPrototypeElement*, nsIDocument*, bool, bool,mozilla::dom::Element**) (/home/njn/moz/mi5/dom/xul/nsXULElement.cpp:287) + #06: nsXBLContentSink::CreateElement(char16_t const**, unsigned int, mozilla::dom::NodeInfo*, unsigned int, nsIContent**, bool*, mozilla::dom::FromParser) (/home/njn/moz/mi5/dom/xbl/nsXBLContentSink.cpp:874) + #07: nsCOMPtr::StartAssignment() (/home/njn/moz/mi5/go64dmd/dom/xml/../../dist/include/nsCOMPtr.h:753) + #08: nsXMLContentSink::HandleStartElement(char16_t const*, char16_t const**, unsigned int, unsigned int, bool) (/home/njn/moz/mi5/dom/xml/nsXMLContentSink.cpp:1007) + } + } + +It tells you that there were 150 heap blocks that were allocated from +the program point indicated by the "Allocated at" stack trace, that +these blocks took up 21,600 bytes, that all 150 blocks had a size of 144 +bytes, and that 1,200 of those bytes were "slop" (wasted space caused +by the heap allocator rounding up request sizes). It also indicates what +percentage of the total heap size and the unreported portion of the heap +these blocks represent. + +Within each section, records are listed from largest to smallest. + +Once-reported and twice-reported stack trace records also have stack +traces for the report point(s). 
For example: + + Reported at { + #01: mozilla::dmd::Report(void const*) (/home/njn/moz/mi2/memory/replace/dmd/DMD.cpp:1740) 0x7f68652581ca + #02: CycleCollectorMallocSizeOf(void const*) (/home/njn/moz/mi2/xpcom/base/nsCycleCollector.cpp:3008) 0x7f6860fdfe02 + #03: nsPurpleBuffer::SizeOfExcludingThis(unsigned long (*)(void const*)) const (/home/njn/moz/mi2/xpcom/base/nsCycleCollector.cpp:933) 0x7f6860fdb7af + #04: nsCycleCollector::SizeOfIncludingThis(unsigned long (*)(void const*), unsigned long*, unsigned long*, unsigned long*, unsigned long*, unsigned long*) const (/home/njn/moz/mi2/xpcom/base/nsCycleCollector.cpp:3029) 0x7f6860fdb6b1 + #05: CycleCollectorMultiReporter::CollectReports(nsIMemoryMultiReporterCallback*, nsISupports*) (/home/njn/moz/mi2/xpcom/base/nsCycleCollector.cpp:3075) 0x7f6860fde432 + #06: nsMemoryInfoDumper::DumpMemoryReportsToFileImpl(nsAString_internal const&) (/home/njn/moz/mi2/xpcom/base/nsMemoryInfoDumper.cpp:626) 0x7f6860fece79 + #07: nsMemoryInfoDumper::DumpMemoryReportsToFile(nsAString_internal const&, bool, bool) (/home/njn/moz/mi2/xpcom/base/nsMemoryInfoDumper.cpp:344) 0x7f6860febaf9 + #08: mozilla::(anonymous namespace)::DumpMemoryReportsRunnable::Run() (/home/njn/moz/mi2/xpcom/base/nsMemoryInfoDumper.cpp:58) 0x7f6860fefe03 + } + +You can tell which memory reporter made the report by the name of the +`MallocSizeOf` function near the top of the stack trace. In this case it +was the cycle collector's reporter. + +By default, DMD does not record an allocation stack trace for most +blocks, to make it run faster. The decision on whether to record is done +probabilistically, and larger blocks are more likely to have an +allocation stack trace recorded. All unreported blocks that lack an +allocation stack trace will end up in a single record. 
For example: + + Unreported { + 420,010 blocks in heap block record 2 of 5,495 + 29,203,408 bytes (27,777,288 requested / 1,426,120 slop) + Individual block sizes: 2,048 x 3; 1,024 x 103; 512 x 147; 496 x 7; 480 x 31; 464 x 6; 448 x 50; 432 x 41; 416 x 28; 400 x 53; 384 x 43; 368 x 216; 352 x 141; 336 x 58; 320 x 104; 304 x 5,130; 288 x 150; 272 x 591; 256 x 6,017; 240 x 1,372; 224 x 93; 208 x 488; 192 x 1,919; 176 x 18,903; 160 x 1,754; 144 x 5,041; 128 x 36,709; 112 x 5,571; 96 x 6,280; 80 x 40,738; 64 x 37,925; 48 x 78,392; 32 x 136,199; 16 x 31,001; 8 x 4,706 + 3.78% of the heap (10.24% cumulative) + 21.24% of unreported (57.53% cumulative) + Allocated at { + #01: (no stack trace recorded due to --stacks=partial) + } + } + +In contrast, stack traces are always recorded when a block is reported, +which means you can end up with records like this where the allocation +point is unknown but the reporting point *is* known: + + Once-reported { + 104,491 blocks in heap block record 13 of 4,689 + 10,392,000 bytes (10,392,000 requested / 0 slop) + Individual block sizes: 512 x 124; 256 x 242; 192 x 813; 128 x 54,664; 64 x 48,648 + 1.35% of the heap (48.65% cumulative) + 1.64% of once-reported (59.18% cumulative) + Allocated at { + #01: (no stack trace recorded due to --stacks=partial) + } + Reported at { + #01: mozilla::dmd::DMDFuncs::Report(void const*) (/home/njn/moz/mi5/go64dmd/memory/replace/dmd/../../../../memory/replace/dmd/DMD.cpp:1646) + #02: WindowsMallocSizeOf(void const*) (/home/njn/moz/mi5/dom/base/nsWindowMemoryReporter.cpp:189) + #03: nsAttrAndChildArray::SizeOfExcludingThis(unsigned long (*)(void const*)) const (/home/njn/moz/mi5/dom/base/nsAttrAndChildArray.cpp:880) + #04: mozilla::dom::FragmentOrElement::SizeOfExcludingThis(unsigned long (*)(void const*)) const (/home/njn/moz/mi5/dom/base/FragmentOrElement.cpp:2337) + #05: nsINode::SizeOfIncludingThis(unsigned long (*)(void const*)) const (/home/njn/moz/mi5/go64dmd/parser/html/../../../dom/base/nsINode.h:307) + #06: mozilla::dom::NodeInfo::NodeType() const (/home/njn/moz/mi5/go64dmd/dom/base/../../dist/include/mozilla/dom/NodeInfo.h:127) + #07: nsHTMLDocument::DocAddSizeOfExcludingThis(nsWindowSizes*) const (/home/njn/moz/mi5/dom/html/nsHTMLDocument.cpp:3710) + #08: nsIDocument::DocAddSizeOfIncludingThis(nsWindowSizes*) const (/home/njn/moz/mi5/dom/base/nsDocument.cpp:12820) + } + } + +The choice of whether to record an allocation stack trace for all blocks +is controlled by an option (see below). + +### "Live" mode output + + +For "live" mode, dmd.py's output describes what live heap blocks are +present. This output is broken into multiple sections. + +1. "Invocation". This tells you how DMD was invoked, i.e. what + options were used. +2. "Live stack trace records". This tells you which heap blocks were + present. +3. "Summary": gives measurements of the total heap. + +The individual records are similar to those output in "dark matter" +mode. + +### "Cumulative" mode output + +For "cumulative" mode, dmd.py's output describes how the live heap +blocks are covered by memory reports. This output is broken into +multiple sections. + +1. "Invocation". This tells you how DMD was invoked, i.e. what + options were used. +2. "Cumulative stack trace records". This tells you which heap blocks + were allocated during the session. +3. "Summary": gives measurements of the total (cumulative) heap. + +The individual records are similar to those output in "dark matter" +mode. 
+ +### "Scan" mode output + +For "scan" mode, the output of `dmd.py` is the same as "live" mode. +A separate script, `block_analyzer.py`, can be used to find out +information about which blocks refer to a particular block. +`dmd.py --clamp-contents` needs to be run on the log first. See [this +other page](heap_scan_mode.md) for an +overview of how to use heap scan mode to fix a leak involving refcounted +objects. + +## Options + +### Runtime + +When you run `mach run --dmd` you can specify additional options to +control how DMD runs. Run `mach help run` for documentation on these. + +The most interesting one is `--mode`. Acceptable values are +`dark-matter` (the default), `live`, `cumulative`, and `scan`. + +Another interesting one is `--stacks`. Acceptable values are `partial` +(the default) and `full`. In the former case most blocks will not have +an allocation stack trace recorded. However, because larger blocks are +more likely to have one recorded, most allocated bytes should have an +allocation stack trace even though most allocated blocks do not. Use +`--stacks=full` if you want complete information, but note that DMD will +run substantially slower in that case. + +The options may also be put in the environment variable DMD, or set DMD +to 1 to enable DMD with default options (dark-matter and partial +stacks). + +### Post-processing + +`dmd.py` also takes options that control how it works. Run `dmd.py -h` +for documentation. The following options are the most interesting ones. + +- `-f` / `--max-frames`. By default, records show up to 8 stack + frames. You can choose a smaller number, in which case more + allocations will be aggregated into each record, but you'll have + less context. Or you can choose a larger number, in which cases + allocations will be split across more records, but you will have + more context. There is no single best value, but values in the range + 2..10 are often good. The maximum is 24. + +- `-a` / `--ignore-alloc-fns`. Many allocation stack traces start + with multiple frames that mention allocation wrapper functions, e.g. + `js_calloc()` calls `replace_calloc()`. This option filters these + out. It often helps improve the quality of the output when using a + small `--max-frames` value. + +- `-s` / `--sort-by`. This controls how records are sorted. Acceptable + values are `usable` (the default), `req`, `slop` and `num-blocks`. + +- `--clamp-contents`. For a heap scan log, this performs a + conservative pointer analysis on the contents of each block, + changing any value that is a pointer into the middle of a live block + into a pointer to the start of that block. All other values are + changes to null. In addition, all trailing nulls are removed from + the block contents. + +As an example that combines multiple options, if you apply the following +command to a profile obtained in "live" mode: + + dmd.py -r -f 2 -a -s slop + +it will give you a good idea of where the major sources of slop are. + +`dmd.py` can also compute the difference between two DMD output files, +so long as those files were produced in the same mode. Simply pass it +two filenames instead of one to get the difference. + +## Which heap blocks are reported? + +At this stage you might wonder how DMD knows, in "dark matter" mode, +which allocations have been reported and which haven't. 
DMD only knows +about heap blocks that are measured via a function created with one of +the following two macros: + + MOZ_DEFINE_MALLOC_SIZE_OF + MOZ_DEFINE_MALLOC_SIZE_OF_ON_ALLOC + +Fortunately, most of the existing memory reporters do this. See +[Performance/Memory_Reporting](https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Memory_reporting "Platform/Memory Reporting") +for more details about how memory reporters are written. diff --git a/docs/performance/memory/dominators.md b/docs/performance/memory/dominators.md new file mode 100644 index 0000000000..e64c465e62 --- /dev/null +++ b/docs/performance/memory/dominators.md @@ -0,0 +1,90 @@ +# Dominators + +This article provides an introduction to the concepts of *Reachability*, +*Shallow* versus *Retained* size, and *Dominators*, as they apply in +garbage-collected languages like JavaScript. + +These concepts matter in memory analysis, because often an object may +itself be small, but may hold references to other much larger objects, +and by doing this will prevent the garbage collector from freeing that +extra memory. + +You can see the dominators in a page using the [Dominators +view](dominators_view.md) in the Memory tool. + +With a garbage-collected language, like JavaScript, the programmer +doesn\'t generally have to worry about deallocating memory. They can +just create and use objects, and when the objects are no longer needed, +the runtime takes care of cleaning up, and frees the memory the objects +occupied. + +## Reachability + +In modern JavaScript implementations, the runtime decides whether an +object is no longer needed based on *reachability*. In this system the +heap is represented as one or more graphs. Each node in the graph +represents an object, and each connection between nodes (edge) +represents a reference from one object to another. The graph starts at a +root node, indicated in these diagrams with \"R\". + +![](../img/memory-graph.svg) + +During garbage collection, the runtime traverses the graph, starting at +the root, and marks every object it finds. Any objects it doesn\'t find +are unreachable, and can be deallocated. + +So when an object becomes unreachable (for example, because it is only +referenced by a single local variable which goes out of scope) then any +objects it references also become unreachable, as long as no other +objects reference them: + +![](../img/memory-graph-unreachable.svg) + +Conversely, this means that objects are kept alive as long as some other +reachable object is holding a reference to them. + +## Shallow and retained size + +This gives rise to a distinction between two ways to look at the size of +an object: + +- *shallow size*: the size of the object itself +- *retained size*: the size of the object itself, plus the size of + other objects that are kept alive by this object + +Often, objects will have a small shallow size but a much larger retained +size, through the references they contain to other objects. Retained +size is an important concept in analyzing memory usage, because it +answers the question \"if this object ceases to exist, what\'s the total +amount of memory freed?\". + +## Dominators + +A related concept is that of the *dominator*. Node B is said to dominate +node A if every path from the root to A passes through B: + +![](../img/memory-graph-dominators.svg) + +If any of node A\'s dominators are freed, then node A itself becomes +eligible for garbage collection. 
+
+If node B dominates node A, but does not dominate any of A\'s other
+dominators, then B is the *immediate dominator* of A:
+
+![](../img/memory-graph-immediate-dominator.svg)
+
+One slight subtlety here is that if an object A is referenced by two
+other objects B and C, then neither object is its
+dominator, because you could remove either B or C from
+the graph, and A would still be retained by its other referrer. Instead,
+the immediate dominator of A would be its first common ancestor:
+
+![](../img/memory-graph-dominator-multiple-references.svg)
+
+## See also
+
+[Dominators in graph
+theory](https://en.wikipedia.org/wiki/Dominator_%28graph_theory%29).
+
+[Tracing garbage
+collection](https://en.wikipedia.org/wiki/Tracing_garbage_collection).
diff --git a/docs/performance/memory/dominators_view.md b/docs/performance/memory/dominators_view.md
new file mode 100644
index 0000000000..05de01fa4e
--- /dev/null
+++ b/docs/performance/memory/dominators_view.md
@@ -0,0 +1,221 @@
+# Dominators view
+
+The Dominators view is new in Firefox 46.
+
+Starting in Firefox 46, the Memory tool includes a new view called the
+Dominators view. This is useful for understanding the \"retained size\"
+of objects allocated by your site: that is, the size of the objects
+themselves plus the size of the objects that they keep alive through
+references.
+
+If you already know what shallow size, retained size, and dominators
+are, skip to the Dominators UI section. Otherwise, you might want to
+review the article on [Dominators
+concepts](dominators.md).
+
+## Dominators UI
+
+To see the Dominators view for a snapshot, select \"Dominators\" in the
+\"View\" drop-down list. It looks something like this:
+
+![](../img/dominators-1.png)
+
+The Dominators view consists of two panels:
+
+- the [Dominators Tree
+  panel](#dominators-tree-panel)
+  shows you which nodes in the snapshot are retaining the most memory
+- the [Retaining Paths
+  panel](#retaining-paths-panel)
+  (new in Firefox 47) shows the 5 shortest retaining paths for a
+  single node.
+
+![](../img/dominators-2.png)
+
+### Dominators Tree panel
+
+The Dominators Tree tells you which objects in the snapshot are
+retaining the most memory.
+
+In the main part of the UI, the first row is labeled \"GC Roots\".
+Immediately underneath that is an entry for:
+
+- Every GC root node. In Gecko, there is more than one memory graph,
+  and therefore more than one root. There may be many (often
+  temporary) roots. For example: variables allocated on the stack need
+  to be rooted, or internal caches may need to root their elements.
+- Any other node that\'s referenced from two different roots (since in
+  this case, neither root dominates it).
+
+Each entry displays:
+
+- the retained size of the node, as bytes and as a percentage of the
+  total
+- the shallow size of the node, as bytes and as a percentage of the
+  total
+- the node\'s name and address in memory.
+
+Entries are ordered by the amount of memory that they retain. For
+example:
+
+![](../img/dominators-3.png)
+
+In this screenshot we can see five entries under \"GC Roots\". The first
+two are Call and Window objects, and retain about 21% and 8% of the
+total size of the memory snapshot, respectively. You can also see that
+these objects have a relatively tiny \"Shallow Size\", so almost all of
+the retained size is in the objects that they dominate.
+
+Immediately under each GC root, you\'ll see all the nodes for which this
+root is the [immediate
+dominator](/dominators.html#immediate_dominator).
+These nodes are also ordered by their retained size.
+
+For example, if we click on the first Window object:
+
+![](../img/dominators-4.png)
+
+We can see that this Window dominates a CSS2Properties object, whose
+retained size is 2% of the total snapshot size. Again the shallow size
+is very small: almost all of its retained size is in the nodes that it
+dominates. By clicking on the disclosure arrow next to the Function, we
+can see those nodes.
+
+In this way you can quickly get a sense of which objects retain the most
+memory in the snapshot.
+
+You can use Alt + click to expand the whole graph under a node.
+
+#### Call Stack {#Call_Stack}
+
+In the toolbar at the top of the tool is a dropdown called \"Label by\":
+
+![](../img/dominators-5.png)
+
+By default, this is set to \"Type\". However, you can set it instead to
+\"Call Stack\" to see exactly where in your code the objects are being
+allocated.
+
+::: {.note}
+This option is called \"Allocation Stack\" in Firefox 46.
+:::
+
+To enable this, you must check the box labeled \"Record call stacks\"
+*before* you run the code that allocates the objects. Then take a
+snapshot, then select \"Call Stack\" in the \"Label by\" drop-down.
+
+Now the node\'s name will contain the name of the function that
+allocated it, and the file, line number and character position of the
+exact spot where the function allocated it. Clicking the file name will
+take you to that spot in the Debugger.
+
+::: {.note}
+Sometimes you\'ll see \"(no stack available)\" here. In particular,
+allocation stacks are currently only recorded for objects, not for
+arrays, strings, or internal structures.
+:::
+
+### Retaining Paths panel
+
+::: {.geckoVersionNote}
+The Retaining Paths panel is new in Firefox 47.
+:::
+
+The Retaining Paths panel shows you, for a given node, the 5 shortest
+paths back from this node to a GC root. This enables you to see all the
+nodes that are keeping the given node from being garbage-collected. If
+you suspect that an object is being leaked, this will show you exactly
+which objects are holding a reference to it.
+
+To see the retaining paths for a node, you have to select the node in
+the Dominators Tree panel:
+
+![](../img/dominators-6.png)
+
+Here, we\'ve selected an object, and can see a single path back to a GC
+root.
+
+The `Window` GC root holds a reference to an `HTMLDivElement` object,
+and that holds a reference to an `Object`, and so on. If you look in the
+Dominators Tree panel, you can trace the same path there. If either of
+these references were removed, the items below them could be
+garbage-collected.
+
+Each connection in the graph is labeled with the variable name for the
+referenced object.
+
+Sometimes there\'s more than one retaining path back from a node:
+
+![](../img/dominators-7.png)
+
+Here there are three paths back from the `DocumentPrototype` node to a
+GC root. If one were removed, then the `DocumentPrototype` would still
+not be garbage-collected, because it\'s still retained by the other two
+paths.
+
+## Example {#Example}
+
+Let\'s see how some simple code is reflected in the Dominators view.
+
+We\'ll use the [monster allocation
+example](monster_example.md), which creates three
+arrays, each containing 5000 monsters, each monster having a
+randomly-generated name.
+ +### Taking a snapshot + +To see what it looks like in the Dominators view: + +- load the page +- enable the Memory tool in the + [Settings](https://developer.mozilla.org/en-US/docs/Tools/Tools_Toolbox#settings), if you + haven\'t already +- open the Memory tool +- check \"Record call stacks\" +- press the button labeled \"Make monsters!\" +- take a snapshot +- switch to the \"Dominators\" view + + + +### Analyzing the Dominators Tree + +You\'ll see the three arrays as the top three GC roots, each retaining +about 23% of the total memory usage: + +![](../img/dominators-8.png) + +If you expand an array, you\'ll see the objects (monsters) it contains. +Each monster has a relatively small shallow size of 160 bytes. This +includes the integer eye- and tentacle-counts. Each monster has a bigger +retained size, which is accounted for by the string used for the +monster\'s name: + +![](../img/dominators-9.png) + +All this maps closely to the [memory graph we were expecting to +see](/monster_example.html#allocation-graph). One +thing you might be wondering, though, is: where\'s the top-level object +that retains all three arrays? If we look at the Retaining Paths panel +for one of the arrays, we\'ll see it: + +![](../img/dominators-10.png) + +Here we can see the retaining object, and even that this particular +array is the array of `fierce` monsters. But the array is also rooted +directly, so if the object were to stop referencing the array, it would +still not be eligible for garbage collection. + +This means that the object does not dominate the array, and is therefore +not shown in the Dominators Tree view. [See the relevant section of the +Dominators concepts +article](dominators.html#multiple-paths). + +### Using the Call Stack view {#Using_the_Call_Stack_view} + +Finally, you can switch to the Call Stack view, see where the objects +are being allocated, and jump to that point in the Debugger: + + diff --git a/docs/performance/memory/gc_and_cc_logs.md b/docs/performance/memory/gc_and_cc_logs.md new file mode 100644 index 0000000000..62e151dff4 --- /dev/null +++ b/docs/performance/memory/gc_and_cc_logs.md @@ -0,0 +1,109 @@ +# GC and CC logs + +Garbage collector (GC) and cycle collector (CC) logs give information +about why various JS and C++ objects are alive in the heap. Garbage +collector logs and cycle collector logs can be analyzed in various ways. +In particular, CC logs can be used to understand why the cycle collector +is keeping an object alive. These logs can either be manually or +automatically generated, and they can be generated in both debug and +non-debug builds. + +This logs the contents of the Javascript heap to a file named +`gc-edges-NNNN.log`. It also creates a file named `cc-edges-NNNN.log` to +which it dumps the parts of the heap visible to the cycle collector, +which includes native C++ objects that participate in cycle collection, +as well as JS objects being held alive by those C++ objects. + +## Generating logs + +### From within Firefox + +To manually generate GC and CC logs, navigate to `about:memory` and use +the buttons under \"Save GC & CC logs.\" \"Save concise\" will generate +a smaller CC log, \"Save verbose\" will provide a more detailed CC log. +(The GC log will be the same size in either case.) + +With multiprocess Firefox, you can't record logs from the content +process, due to sandboxing. You'll need to disable sandboxing by +setting `MOZ_DISABLE_CONTENT_SANDBOX=t` when you run Firefox. 
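+
+For example, with a local build you might launch it along these lines
+(a sketch that assumes a `mach`-based build):
+
+    MOZ_DISABLE_CONTENT_SANDBOX=t ./mach run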
+
+### From the commandline
+
+TLDR: if you just want shutdown GC/CC logs to debug leaks that happen in
+our automated tests, you probably want something along the lines of:
+
+    MOZ_DISABLE_CONTENT_SANDBOX=t MOZ_CC_LOG_DIRECTORY=/full/path/to/log/directory/ \
+    MOZ_CC_LOG_SHUTDOWN=1 MOZ_CC_ALL_TRACES=shutdown ./mach ...
+
+As noted in the previous section, with multiprocess Firefox, you can’t
+record logs from the content process, due to sandboxing. You’ll need to
+disable sandboxing by setting `MOZ_DISABLE_CONTENT_SANDBOX=t` when you
+run Firefox.
+
+On desktop Firefox you can override the default location of the log
+files by setting the `MOZ_CC_LOG_DIRECTORY` environment variable. By
+default, they go to a temporary directory which differs per OS - it’s
+`/tmp/` on Linux/BSD, `$LOCALAPPDATA\Temp\` on Windows, and somewhere in
+`/var/folders/` on Mac (whatever the directory service returns for
+`TmpD`/`NS_OS_TEMP_DIR`). Note that just `MOZ_CC_LOG_DIRECTORY=.` won’t
+work - you need to specify a full path. On Firefox for Android you can
+use the cc-dump.xpi
+extension to save the files to `/sdcard`. By default, the file is
+created in some temp directory, and the path to the file is printed to
+the Error Console.
+
+To log every cycle collection, set the `MOZ_CC_LOG_ALL` environment
+variable. To log only shutdown collections, set `MOZ_CC_LOG_SHUTDOWN`.
+To make all CCs verbose, set `MOZ_CC_ALL_TRACES` to \"`all`\", or to
+\"`shutdown`\" to make only shutdown CCs verbose.
+
+Live GC logging can be enabled with the pref
+`javascript.options.mem.log`. Output to a file can be controlled with
+the `MOZ_GCTIMER` environment variable. See the [Statistics
+API](https://developer.mozilla.org/en-US/docs/SpiderMonkey/Internals/GC/Statistics_API) page for
+details on values.
+
+Set the environment variable `MOZ_CC_LOG_THREAD` to `main` to only log
+main thread CCs, or to `worker` to only log worker CCs. The default
+value is `all`, which will log all CCs.
+
+To get cycle collector logs on Try server, set `MOZ_CC_LOG_DIRECTORY` to
+`MOZ_UPLOAD_DIR`, then set the other variables appropriately to generate
+CC logs. The way to set environment variables depends on the test
+harness, or you can modify the code in nsCycleCollector to set that
+directly. To find the CC logs once the try run has finished, click on
+the particular job, then click on \"Job Details\" in the bottom pane in
+TreeHerder, and you should see download links.
+
+To set the environment variable, find the `buildBrowserEnv` method in
+the Python file for the test suite you are interested in, and add
+something like this code to the file:
+
+    browserEnv["MOZ_CC_LOG_DIRECTORY"] = os.environ["MOZ_UPLOAD_DIR"]
+    browserEnv["MOZ_CC_LOG_SHUTDOWN"] = "1"
+
+## Analyzing GC and CC logs
+
+There are numerous scripts that analyze GC and CC logs on
+[GitHub](https://github.com/amccreight/heapgraph/).
+
+To find out why an object is being kept alive, you should use `find_roots.py`
+in the root of the github repository. Calling `find_roots.py` on a CC log
+with a specific object or kind of object will produce paths from rooting
+objects to the specified objects. Most big leaks include an `nsGlobalWindow`,
+so that’s a good class to try if you don’t have any better idea.
+
+To fix a leak, the next step is to figure out why the rooting object is
+alive. For a C++ object, you need to figure out where the missing
+references are from. For a JS object, you need to figure out why the JS
+object is reachable from a JS root. 
For the latter, you can use the +corresponding [`find_roots.py` for +JS](https://github.com/amccreight/heapgraph/tree/master/g) +on the GC log. + +## Alternatives + +There are two add-ons that can be used to create and analyze CC graphs. + +- [about:cc](https://bugzilla.mozilla.org/show_bug.cgi?id=726346) + is simple, ugly, but rather powerful. diff --git a/docs/performance/memory/heap_scan_mode.md b/docs/performance/memory/heap_scan_mode.md new file mode 100644 index 0000000000..ea5a45016a --- /dev/null +++ b/docs/performance/memory/heap_scan_mode.md @@ -0,0 +1,313 @@ +# DMD heap scan mode + +Firefox's DMD heap scan mode tracks the set of all live blocks of +malloc-allocated memory and their allocation stacks, and allows you to +log these blocks, and the values stored in them, to a file. When +combined with cycle collector logging, this can be used to investigate +leaks of refcounted cycle collected objects, by figuring out what holds +a strong reference to a leaked object. + +**When should you use this?** DMD heap scan mode is intended to be used +to investigate leaks of cycle collected (CCed) objects. DMD heap scan +mode is a "tool of last resort" that should only be used when all +other avenues have been tried and failed, except possibly [ref count +logging](refcount_tracing_and_balancing.md). +It is particularly useful if you have no idea what is causing the leak. +If you have a patch that introduces a leak, you are probably better off +auditing all of the strong references that your patch creates before +trying this. + +The particular steps given below are intended for the case where the +leaked object is alive all the way through shutdown. You could modify +these steps for leaks that go away in shutdown by collecting a CC and +DMD log prior to shutdown. However, in that case it may be easier to use +refcount logging, or rr with a conditional breakpoint set on calls to +`Release()` for the leaking object, to see what object actually does the +release that causes the leaked object to go away. + +## Prerequisites + +- A debug DMD build of Firefox. [This + page](dmd.md) + describes how to do that. This should probably be an optimized + build. Non-optimized DMD builds will generate better stack traces, + but they can be so slow as to be useless. +- The build is going to be very slow, so you may need to disable some + shutdown checks. First, in + `toolkit/components/terminator/nsTerminator.cpp`, delete everything + in `RunWatchDog` but the call to `NS_SetCurrentThreadName`. This + will keep the watch dog from killing the browser when shut down + takes multiple minutes. Secondly, you may need to comment out the + call to `MOZ_CRASH("NSS_Shutdown failed");` in + `xpcom/build/XPCOMInit.cpp`, as this also seems to trigger when + shutdown is extremely slow. +- You need the cycle collector analysis script `find_roots.py`, which + can be downloaded as part of [this repo on + Github](https://github.com/amccreight/heapgraph). + +## Generating Logs + +The next step is to generate a number of log files. You need to get a +shutdown CC log and a DMD log, for a single run. + +**Definitions** I'll write `$objdir` for the object directory for your +Firefox DMD build, `$srcdir` for the top level of the Firefox source +directory, and `$heapgraph` for the location of the heapgraph repo, and +`$logdir` for the location you want logs to go to. `$logdir` should end +in a path separator. For instance, `~/logs/leak/`. 
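+
+For example, you could define these in your shell before following the
+steps below (all of the paths here are illustrative placeholders):
+
+    objdir=$HOME/mozilla/objdir-dmd
+    srcdir=$HOME/mozilla/mozilla-central
+    heapgraph=$HOME/heapgraph
+    logdir=$HOME/logs/leak/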
+ +The command you need to run Firefox will look something like this: + + XPCOM_MEM_BLOAT_LOG=1 MOZ_CC_LOG_SHUTDOWN=1 MOZ_DISABLE_CONTENT_SANDBOX=t MOZ_CC_LOG_DIRECTORY=$logdir + MOZ_CC_LOG_PROCESS=content MOZ_CC_LOG_THREAD=main MOZ_DMD_SHUTDOWN_LOG=$logdir MOZ_DMD_LOG_PROCESS=tab ./mach run --dmd --mode=scan + +Breaking this down: + +- `XPCOM_MEM_BLOAT_LOG=1`: This reports a list of the counts of every + object created and destroyed and tracked by the XPCOM leak tracking + system. From this chart, you can see how many objects of a + particular type were leaked through shutdown. This can come in handy + during the manual analysis phase later, to get evidence to support + your hunches. For instance, if you think that an `nsFoo` object + might be holding your leaking object alive, you can use this to + easily see if we leaked an `nsFoo` object. +- `MOZ_CC_LOG_SHUTDOWN=1`: This generates a cycle collector log during + shutdown. Creating this log during shutdown is nice because there + are less things unrelated to the leak in the log, and various cycle + collector optimizations are disabled. A garbage collector log will + also be created, which you may not need. +- `MOZ_DISABLE_CONTENT_SANDBOX=t`: This disables the content process + sandbox, which is needed because the DMD and CC log files are + created directly by the child processes. +- `MOZ_CC_LOG_DIRECTORY=$logdir`: This selects the location for cycle + collector logs to be saved. +- `MOZ_CC_LOG_PROCESS=content MOZ_CC_LOG_THREAD=main`: These options + specify that we only want CC logs for the main thread of content + processes, to make shutdown less slow. If your leak is happening in + a different process or thread, change the options, which are listed + in `xpcom/base/nsCycleCollector.cpp`. +- `MOZ_DMD_SHUTDOWN_LOG=$logdir`: This option specifies that we want a + DMD log to be taken very late in XPCOM shutdown, and the location + for that log to be saved. Like with the CC log, we want this log + very late to avoid as many non-leaking things as possible. +- `MOZ_DMD_LOG_PROCESS=tab`: As with the CC, this means that we only + want these logs in content processes, in order to make shutdown + faster. The allowed values here are the same as those returned by + `XRE_GetProcessType()`, so adjust as needed. +- Finally, the `--dmd` option need to be passed in so that DMD will be + run. `--mode=scan` is needed so that when we get a DMD log the + entire contents of each block of memory is saved for later analysis. + +With that command line in hand, you can start Firefox. Be aware that +this may take multiple minutes if you have optimization disabled. + +Once it has started, go through the steps you need to reproduce your +leak. If your leak is a ghost window, it can be handy to get an +`about:memory` report and write down the PID of the leaking process. You +may want to wait 10 or so seconds after this to make sure as much as +possible is cleaned up. + +Next, exit the browser. This will cause a lot of logs to be written out, +so it can take a while. + +## Analyzing the Logs + +### Getting the PID and address of the leaking object + +The first step is to figure out the **PID** of the leaking process. The +second step is to figure out **the address of the leaking object**, +usually a window. Conveniently, you can usually do both at once using +the cycle collector log. If you are investigating a leak of +`www.example.com`, then from `$logdir` you can do +`"grep nsGlobalWindow cc-edges* | grep example.com"`. 
This looks through +all of the windows in all of the CC logs (which may leaked, this late in +shutdown), and then filters out windows where the URL contains +`example.com`. + +The result of that grep will contain output that looks something like +this: + + cc-edges.15873.log:0x7f0897082c00 [rc=1285] nsGlobalWindowInner # 2147483662 inner https://www.example.com/ + +cc-edges.15873.log: The first part is the file name where it was +found. `15873` is the PID of the process that leaked. You'll want to +write down the name of the file and the PID. Let's call the file +`$cclog` and the pid `$pid`. + +0x7f0897082c00: This is the address of the leaking window. You'll +also want to write that down. Let's call this `$winaddr`. + +If there are multiple files, you'll end up with one that looks like +`cc-edges.$pid.log` and one or more that look like +`cc-edges.$pid-$n.log` for various values of `$n`. You want the one with +the largest `$n`, as this was recorded the latest, and so it will +contain the least non-garbage. + +### Identifying the root in the cycle collector log + +The next step is to figure out why the cycle collector could not collect +the window, using the `find_roots.py` script from the heapgraph +repository. The command to invoke this looks like this: + + python $heapgraph/find_roots.py $cclog $winaddr + +This may take a few seconds. It will eventually produce some output. +You'll want to save a copy of this output for later. + +The output will look something like this, after a message about loading +progress: + + 0x7f0882fe3230 [FragmentOrElement (xhtml) script https://www.example.com] + --[[via hash] mListenerManager]--> 0x7f0899b4e550 [EventListenerManager] + --[mListeners event=onload listenerType=3 [i]]--> 0x7f0882ff8f80 [CallbackObject] + --[mIncumbentGlobal]--> 0x7f0897082c00 [nsGlobalWindowInner # 2147483662 inner https://www.example.com] + +Root 0x7f0882fe3230 is a ref counted object with 1 unknown edge(s). + known edges: + 0x7f08975a24c0 [FragmentOrElement (xhtml) head https://www.example.com] --[mAttrsAndChildren[i]]--> 0x7f0882fe3230 + 0x7f08967e7b20 [JS Object (HTMLScriptElement)] --[UnwrapDOMObject(obj)]--> 0x7f0882fe3230 + +The first two lines mean that the script element `0x7f0882fe3230` +contains a strong reference to the EventListenerManager +`0x7f0899b4e550`. "[via hash] mListenerManager" is a description of +that strong reference. Together, these lines show a chain of strong +references from an object the cycle collector thinks needs to be kept +alive, `0x7f0899b4e550`, to the object` 0x7f0897082c00` that you asked +about. Most of the time, the actual chain is not important, because the +cycle collector can only tell us about what went right. Let us call the +address of the leaking object (`0x7f0882fe3230` in this case) +`$leakaddr`. + +Besides `$leakaddr`, the other interesting part is the chunk at the +bottom. It tells us that there is 1 unknown edge, and 2 known edges. +What this means is that the leaking object has a refcount of 3, but the +cycle collector was only told about these two references. In this case, +a head element and a JS object (the JS reflector of the script element). +We need to figure out what the unknown reference is from, as that is +where our leak really is. + +### Figure out what is holding the leaking object alive. + +Now we need to use the DMD heap scan logs. These contain the contents of +every live block of memory. + +The first step to using the DMD heap scan logs is to do some +pre-processing for the DMD log. 
Stacks need to be symbolicated, and we
+need to clamp the values contained in the heap. Clamping is the same
+kind of analysis that a conservative GC does: if a word-aligned value in
+a heap block points to somewhere within another heap block, replace that
+value with the address of the block.
+
+Both kinds of preprocessing are done by the `dmd.py` script, which can
+be invoked like this:
+
+    $objdir/dist/bin/dmd.py --clamp-contents dmd-$pid.log.gz
+
+This can take a few minutes due to symbolication, but you only need to
+run it once on a log file.
+
+You can also locally symbolicate stacks from DMD logs generated on TreeHerder,
+but it will [take a few extra steps](/contributing/debugging/local_symbols.rst)
+that you need to do before running `dmd.py`.
+
+After that is done, we can finally find out which objects (possibly)
+point to other objects, using the `block_analyzer.py` script:
+
+    python $srcdir/memory/replace/dmd/block_analyzer.py dmd-$pid.log.gz $leakaddr
+
+This will look through every block of memory in the log, and give some
+basic information about any block of memory that (possibly) contains a
+pointer to that object. You can pass some additional options to affect
+how the results are displayed. "-sfl 10000 -a" is useful. The -sfl 10000
+tells it to not truncate stack frames, and -a tells it to not display
+generic frames related to the allocator.
+
+Caveat: I think block_analyzer.py does not attempt to clamp the address
+you pass into it, so if it is an offset from the start of the block, it
+won’t find it.
+
+`block_analyzer.py` will return a series of entries that look like this
+(with the [...] indicating where I have removed things):
+
+    0x7f089306b000 size = 4096 bytes at byte offset 2168
+    nsAttrAndChildArray::GrowBy[...]
+    nsAttrAndChildArray::InsertChildAt[...]
+    [...]
+
+`0x7f089306b000` is the address of the block that contains `$leakaddr`.
+4096 bytes is the size of that block. That can be useful for confirming
+your guess about what class the block actually is. The byte offset tells
+you where in the block the pointer is. This is mostly useful for larger
+objects, and you can potentially combine this with debugging information
+to figure out exactly what field this is. The rest of the entry is the
+stack trace for the allocation of the block, which is the most useful
+piece of information.
+
+What you need to do now is to go through every one of these entries and
+place it into three categories: strong reference known to the cycle
+collector, weak reference, or something else! The goal is to eventually
+shrink down the "something else" category until there are only as many
+things in it as there are unknown references to the leaking object, and
+then you have your leaker.
+
+To place an entry into one of the categories, you must look at the code
+locations given in the stack trace, and see if you can tell what the
+object is based on that, then compare that to what `find_roots.py` told
+you.
+
+For instance, one of the strong references in the CC log is from a head
+element to its child via `mAttrsAndChildren`, and that sounds a lot like
+this, so we can mark it as being a strong known reference.
+
+This is an iterative process, where you first go through and mark off
+the things that are easily categorizable, and repeat until you have a
+small list of things to analyze.
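+
+As a concrete starting point, an invocation that combines the display
+options mentioned above might look like this (a sketch; it assumes the
+options can be passed before the positional arguments):
+
+    python $srcdir/memory/replace/dmd/block_analyzer.py -sfl 10000 -a dmd-$pid.log.gz $leakaddr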
+ +### Example analysis of block_analyzer.py results + +In one debugging session where I was investigating the leak from bug +1451985, I eventually reduced the list of entries until this was the +most suspicious looking entry: + + 0x7f0892f29630 size = 392 bytes at byte offset 56 + mozilla::dom::ScriptLoader::ProcessExternalScript[...] + [...] + +I went to that line of `ScriptLoader::ProcessExternalScript()`, and it +contained a call to ScriptLoader::CreateLoadRequest(). Fortunately, this +method mostly just contains two calls to `new`, one for +`ScriptLoadRequest` and one for `ModuleLoadRequest`. (This is where an +unoptimized build comes in handy, as it would have pointed out the exact +line. Unfortunately, in this particular case, the unoptimized build was +so slow I wasn't getting any logs.) I then looked through the list of +leaked objects generated by `XPCOM_MEM_BLOAT_LOG` and saw that we were +leaking a `ScriptLoadRequest`, so I went and looked at its class +definition, where I noticed that `ScriptLoadRequest` had a strong +reference to an element that it wasn't telling the cycle collector +about, which seemed suspicious. + +The first thing I did to try to confirm that this was the source of the +leak was pass the address of this object into the cycle collector +analysis log, `find_roots.py`, that we used at an earlier step. That +gave a result that contained this: + + 0x7f0882fe3230 [FragmentOrElement (xhtml) script [...] + --[mNodeInfo]--> 0x7f0897431f00 [NodeInfo (xhtml) script] + [...] + --[mLoadingAsyncRequests]--> 0x7f0892f29630 [ScriptLoadRequest] + +This confirms that this block is actually a ScriptLoadRequest. Secondly, +notice that the load request is being held alive by the very same script +element that is causing the window leak! This strongly suggests that +there is a cycle of strong references between the script element and the +load request. I then added the missing field to the traverse and unlink +methods of ScriptLoadRequest, and confirmed that I couldn't reproduce +the leak. + +Keep in mind that you may need to run `block_analyzer.py` multiple +times. For instance, if the script element was being held alive by some +container being held alive by a runnable, we'd first need to figure out +that the container was holding the element. If it isn't possible to +figure out what is holding that alive, you'd have to run block_analyzer +again. This isn't too bad, because unlike ref count logging, we have the +full state of memory in our existing log, so we don't need to run the +browser again. diff --git a/docs/performance/memory/leak_gauge.md b/docs/performance/memory/leak_gauge.md new file mode 100644 index 0000000000..153303549e --- /dev/null +++ b/docs/performance/memory/leak_gauge.md @@ -0,0 +1,45 @@ +# Leak Gauge + +Leak Gauge is a tool that can be used to detect certain kinds of leaks +in Gecko, including those involving documents, window objects, and +docshells. It has two parts: instrumentation in Gecko that produces a +log file, and a script to post-process the log file. + +## Getting a log file + +To get a log file, run the browser with these settings: + + NSPR_LOG_MODULES=DOMLeak:5,DocumentLeak:5,nsDocShellLeak:5,NodeInfoManagerLeak:5 + NSPR_LOG_FILE=nspr.log # or any other filename of your choice + +This overwrites any existing file named `nspr.log`. The browser runs +with a negligible slowdown. For reliable results, exit the browser +before post-processing the log file. 
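+
+For example, one way to launch a local build with these settings is
+something like the following (a sketch that assumes a `mach`-based
+build; adjust the log filename as you like):
+
+    NSPR_LOG_MODULES=DOMLeak:5,DocumentLeak:5,nsDocShellLeak:5,NodeInfoManagerLeak:5 \
+    NSPR_LOG_FILE=nspr.log ./mach run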
+ +## Post-processing the log file + +Post-process the log file with +[tools/leak-gauge/leak-gauge.pl](https://searchfox.org/mozilla-central/source/tools/leak-gauge/leak-gauge.html) + +If there are no leaks, the output looks like this: + + Results of processing log leak.log : + Summary: + Leaked 0 out of 11 DOM Windows + Leaked 0 out of 44 documents + Leaked 0 out of 3 docshells + Leaked content nodes in 0 out of 0 documents + +If there are leaks, the output looks like this: + + Results of processing log leak2.log : + Leaked outer window 2c6e410 at address 2c6e410. + Leaked outer window 2c6ead0 at address 2c6ead0. + Leaked inner window 2c6ec80 (outer 2c6ead0) at address 2c6ec80. + Summary: + Leaked 13 out of 15 DOM Windows + Leaked 35 out of 46 documents + Leaked 4 out of 4 docshells + Leaked content nodes in 42 out of 53 documents + +If you find leaks, please file a bug report. diff --git a/docs/performance/memory/leak_hunting_strategies_and_tips.md b/docs/performance/memory/leak_hunting_strategies_and_tips.md new file mode 100644 index 0000000000..a5689223ea --- /dev/null +++ b/docs/performance/memory/leak_hunting_strategies_and_tips.md @@ -0,0 +1,219 @@ +# Leak hunting strategies and tips + +This document is old and some of the information is out-of-date. Use +with caution. + +## Strategy for finding leaks + +When trying to make a particular testcase not leak, I recommend focusing +first on the largest object graphs (since these entrain many smaller +objects), then on smaller reference-counted object graphs, and then on +any remaining individual objects or small object graphs that don't +entrain other objects. + +Because (1) large graphs of leaked objects tend to include some objects +pointed to by global variables that confuse GC-based leak detectors, +which can make leaks look smaller (as in [bug +99180](https://bugzilla.mozilla.org/show_bug.cgi?id=99180){.external +.text}) or hide them completely and (2) large graphs of leaked objects +tend to hide smaller ones, it's much better to go after the large +graphs of leaks first. + +A good general pattern for finding and fixing leaks is to start with a +task that you want not to leak (for example, reading email). Start +finding and fixing leaks by running part of the task under nsTraceRefcnt +logging, gradually building up from as little as possible to the +complete task, and fixing most of the leaks in the first steps before +adding additional steps. (By most of the leaks, I mean the leaks of +large numbers of different types of objects or leaks of objects that are +known to entrain many non-logged objects such as JS objects. Seeing a +leaked `GlobalWindowImpl`, `nsXULPDGlobalObject`, +`nsXBLDocGlobalObject`, or `nsXPCWrappedJS` is a sign that there could +be significant numbers of JS objects leaked.) + +For example, start with bringing up the mail window and closing the +window without doing anything. Then go on to selecting a folder, then +selecting a message, and then other activities one does while reading +mail. + +Once you've done this, and it doesn't leak much, then try the action +under trace-malloc or LSAN or Valgrind to find the leaks of smaller +graphs of objects. (When I refer to the size of a graph of objects, I'm +referring to the number of objects, not the size in bytes. Leaking many +copies of a string could be a very large leak, but the object graphs are +small and easy to identify using GC-based leak detection.) + +## What leak tools do we have? 
+ +| Tool | Finds | Platforms | Requires | +|------------------------------------------|------------------------------------------------------|---------------------|--------------| +| Leak tools for large object graphs | | | | +| [Leak Gauge](leak_gauge.md) | Windows, documents, and docshells only | All platforms | Any build | +| [GC and CC logs](gc_and_cc_logs.md) | JS objects, DOM objects, many other kinds of objects | All platforms | Any build | +| Leak tools for medium-size object graphs | | | | +| [BloatView](bloatview.md), [refcount tracing and balancing](refcount_tracing_and_balancing.md) | Objects that implement `nsISupports` or use `MOZ_COUNT_{CTOR,DTOR}` | All tier 1 platforms | Debug build (or build opt with `--enable-logrefcnt`)| +| Leak tools for debugging memory growth that is cleaned up on shutdown | | | + +## Common leak patterns + +When trying to find a leak of reference-counted objects, there are a +number of patterns that could cause the leak: + +1. Ownership cycles. The most common source of hard-to-fix leaks is + ownership cycles. If you can avoid creating cycles in the first + place, please do, since it's often hard to be sure to break the + cycle in every last case. Sometimes these cycles extend through JS + objects (discussed further below), and since JS is + garbage-collected, every pointer acts like an owning pointer and the + potential for fan-out is larger. See [bug + 106860](https://bugzilla.mozilla.org/show_bug.cgi?id=106860){.external + .text} and [bug + 84136](https://bugzilla.mozilla.org/show_bug.cgi?id=84136){.external + .text} for examples. (Is this advice still accurate now that we have + a cycle collector? \--Jesse) +2. Dropping a reference on the floor by: + 1. Forgetting to release (because you weren't using `nsCOMPtr` + when you should have been): See [bug + 99180](https://bugzilla.mozilla.org/show_bug.cgi?id=99180){.external + .text} or [bug + 93087](https://bugzilla.mozilla.org/show_bug.cgi?id=93087){.external + .text} for an example or [bug + 28555](https://bugzilla.mozilla.org/show_bug.cgi?id=28555){.external + .text} for a slightly more interesting one. This is also a + frequent problem around early returns when not using `nsCOMPtr`. + 2. Double-AddRef: This happens most often when assigning the result + of a function that returns an AddRefed pointer (bad!) into an + `nsCOMPtr` without using `dont_AddRef()`. See [bug + 76091](https://bugzilla.mozilla.org/show_bug.cgi?id=76091){.external + .text} or [bug + 49648](https://bugzilla.mozilla.org/show_bug.cgi?id=49648){.external + .text} for an example. + 3. \[Obscure\] Double-assignment into the same variable: If you + release a member variable and then assign into it by calling + another function that does the same thing, you can leak the + object assigned into the variable by the inner function. (This + can happen equally with or without `nsCOMPtr`.) See [bug + 38586](https://bugzilla.mozilla.org/show_bug.cgi?id=38586){.external + .text} and [bug + 287847](https://bugzilla.mozilla.org/show_bug.cgi?id=287847){.external + .text} for examples. +3. Dropping a non-refcounted object on the floor (especially one that + owns references to reference counted objects). See [bug + 109671](https://bugzilla.mozilla.org/show_bug.cgi?id=109671){.external + .text} for an example. +4. 
Destructors that should have been virtual: If you expect to override + an object's destructor (which includes giving a derived class of it + an `nsCOMPtr` member variable) and delete that object through a + pointer to the base class using delete, its destructor better be + virtual. (But we have many virtual destructors in the codebase that + don't need to be -- don't do that.) + +## Debugging leaks that go through XPConnect + +Many large object graphs that leak go through +[XPConnect](http://www.mozilla.org/scriptable/){.external .text}. This +can mean there will be XPConnect wrapper objects showing up as owning +the leaked objects, but it doesn't mean it's XPConnect's fault +(although that [has been known to +happen](https://bugzilla.mozilla.org/show_bug.cgi?id=76102){.external +.text}, it's rare). Debugging leaks that go through XPConnect requires +a basic understanding of what XPConnect does. XPConnect allows an XPCOM +object to be exposed to JavaScript, and it allows certain JavaScript +objects to be exposed to C++ code as normal XPCOM objects. + +When a C++ object is exposed to JavaScript (the more common of the two), +an XPCWrappedNative object is created. This wrapper owns a reference to +the native object until the corresponding JavaScript object is +garbage-collected. This means that if there are leaked GC roots from +which the wrapper is reachable, the wrapper will never release its +reference on the native object. While this can be debugged in detail, +the quickest way to solve these problems is often to simply debug the +leaked JS roots. These roots are printed on shutdown in DEBUG builds, +and the name of the root should give the type of object it is associated +with. + +One of the most common ways one could leak a JS root is by leaking an +`nsXPCWrappedJS` object. This is the wrapper object in the reverse +direction \-- when a JS object is used to implement an XPCOM interface +and be used transparently by native code. The `nsXPCWrappedJS` object +creates a GC root that exists as long as the wrapper does. The wrapper +itself is just a normal reference-counted object, so a leaked +`nsXPCWrappedJS` can be debugged using the normal refcount-balancer +tools. + +If you really need to debug leaks that involve JS objects closely, you +can get detailed printouts of the paths JS uses to mark objects when it +is determining the set of live objects by using the functions added in +[bug +378261](https://bugzilla.mozilla.org/show_bug.cgi?id=378261){.external +.text} and [bug +378255](https://bugzilla.mozilla.org/show_bug.cgi?id=378255){.external +.text}. (More documentation of this replacement for GC_MARK_DEBUG, the +old way of doing it, would be useful. It may just involve setting the +`XPC_SHUTDOWN_HEAP_DUMP` environment variable to a file name, but I +haven't tested that.) + +## Post-processing of stack traces + +On Mac and Linux, the stack traces generated by our internal debugging +tools don't have very good symbol information (since they just show the +results of `dladdr`). The stacks can be significantly improved (better +symbols, and file name / line number information) by post-processing. +Stacks can be piped through the script `tools/rb/fix_stacks.py` to do +this. 
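+
+For example, a log of raw stacks can be post-processed like this (a
+sketch; the file names are placeholders):
+
+    python3 tools/rb/fix_stacks.py < raw-stacks.log > fixed-stacks.log
+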
These scripts are designed to be run on balance trees in addition +to raw stacks; since they are rather slow, it is often **much faster** +to generate balance trees (e.g., using `make-tree.pl` for the refcount +balancer or `diffbloatdump.pl --use-address` for trace-malloc) and*then* +run the balance trees (which are much smaller) through the +post-processing. + +## Getting symbol information for system libraries + +### Windows + +Setting the environment variable `_NT_SYMBOL_PATH` to something like +`symsrv*symsrv.dll*f:\localsymbols*http://msdl.microsoft.com/download/symbols` +as described in [Microsoft's +article](http://support.microsoft.com/kb/311503){.external .text}. This +needs to be done when running, since we do the address to symbol mapping +at runtime. + +### Linux + +Many Linux distros provide packages containing external debugging +symbols for system libraries. `fix_stacks.py` uses this debugging +information (although it does not verify that they match the library +versions on the system). + +For example, on Fedora, these are in \*-debuginfo RPMs (which are +available in yum repositories that are disabled by default, but easily +enabled by editing the system configuration). + +## Tips + +### Disabling Arena Allocation + +With many lower-level leak tools (particularly trace-malloc based ones, +like leaksoup) it can be helpful to disable arena allocation of objects +that you're interested in, when possible, so that each object is +allocated with a separate call to malloc. Some places you can do this +are: + +layout engine +: Define `DEBUG_TRACEMALLOC_FRAMEARENA` where it is commented out in + `layout/base/nsPresShell.cpp` + +glib +: Set the environment variable `G_SLICE=always-malloc` + +## Other References + +- [Performance + tools](https://wiki.mozilla.org/Performance:Tools "Performance:Tools") +- [Leak Debugging Screencasts](https://dbaron.org/mozilla/leak-screencasts/){.external + .text} +- [LeakingPages](https://wiki.mozilla.org/LeakingPages "LeakingPages") - + a list of pages known to leak +- [mdc:Performance](https://developer.mozilla.org/en/Performance "mdc:Performance"){.extiw} - + contains documentation for all of our memory profiling and leak + detection tools diff --git a/docs/performance/memory/memory.md b/docs/performance/memory/memory.md new file mode 100644 index 0000000000..d571fb6b9c --- /dev/null +++ b/docs/performance/memory/memory.md @@ -0,0 +1,64 @@ +# Memory Tools + +The Memory tool lets you take a snapshot of the current tab's memory +[heap](https://en.wikipedia.org/wiki/Memory_management#HEAP). +It then provides a number of views of the heap that can +show you which objects account for memory usage and exactly where in +your code you are allocating memory. + + + +------------------------------------------------------------------------ + +## The basics +- Opening [the memory + tool](basic_operations.md#opening-the-memory-tool) +- [Taking a heap + snapshot](basic_operations.md#saving-and-loading-snapshots) +- [Comparing two + snapshots](basic_operations.md#comparing-snapshots) +- [Deleting + snapshots](basic_operations.md#clearing-a-snapshot) +- [Saving and loading + snapshots](basic_operations.md#saving-and-loading-snapshots) +- [Recording call + stacks](basic_operations.md#recording-call-stacks) + +------------------------------------------------------------------------ + +## Analyzing snapshots + +The Tree map view is new in Firefox 48, and the Dominators view is new +in Firefox 46. 
+
+Once you've taken a snapshot, there are three main views the Memory
+tool provides:
+
+- [the Tree map view](tree_map_view.md) shows
+  memory usage as a
+  [treemap](https://en.wikipedia.org/wiki/Treemapping).
+- [the Aggregate view](aggregate_view.md) shows
+  memory usage as a table of allocated types.
+- [the Dominators view](dominators_view.md)
+  shows the "retained size" of objects: that is, the size of objects
+  plus the size of other objects that they keep alive through
+  references.
+
+If you've opted to record allocation stacks for the snapshot, the
+Aggregate and Dominators views can show you exactly where in your code
+allocations are happening.
+
+------------------------------------------------------------------------
+
+## Concepts
+
+- What are [Dominators](dominators.md)?
+
+------------------------------------------------------------------------
+
+## Example pages
+
+Examples used in the Memory tool documentation.
+
+- The [Monster example](monster_example.md)
+- The [DOM allocation example](DOM_allocation_example.md)
diff --git a/docs/performance/memory/monster_example.md b/docs/performance/memory/monster_example.md
new file mode 100644
index 0000000000..d351803a8d
--- /dev/null
+++ b/docs/performance/memory/monster_example.md
@@ -0,0 +1,79 @@
+# Monster example
+
+This article describes a very simple web page that we'll use to
+illustrate some features of the Memory tool.
+
+You can try the site at
+.
+Here's the code:
+
+```js
+var MONSTER_COUNT = 5000;
+var MIN_NAME_LENGTH = 2;
+var MAX_NAME_LENGTH = 48;
+
+function Monster() {
+
+  function randomInt(min, max) {
+    return Math.floor(Math.random() * (max - min + 1)) + min;
+  }
+
+  function randomName() {
+    var chars = "abcdefghijklmnopqrstuvwxyz";
+    var nameLength = randomInt(MIN_NAME_LENGTH, MAX_NAME_LENGTH);
+    var name = "";
+    for (var j = 0; j < nameLength; j++) {
+      name += chars[randomInt(0, chars.length-1)];
+    }
+    return name;
+  }
+
+  this.name = randomName();
+  this.eyeCount = randomInt(0, 25);
+  this.tentacleCount = randomInt(0, 250);
+}
+
+function makeMonsters() {
+  var monsters = {
+    "friendly": [],
+    "fierce": [],
+    "undecided": []
+  };
+
+  for (var i = 0; i < MONSTER_COUNT; i++) {
+    monsters.friendly.push(new Monster());
+  }
+
+  for (var i = 0; i < MONSTER_COUNT; i++) {
+    monsters.fierce.push(new Monster());
+  }
+
+  for (var i = 0; i < MONSTER_COUNT; i++) {
+    monsters.undecided.push(new Monster());
+  }
+
+  console.log(monsters);
+}
+
+var makeMonstersButton = document.getElementById("make-monsters");
+makeMonstersButton.addEventListener("click", makeMonsters);
+```
+
+The page contains a button: when you push the button, the code creates
+some monsters. Specifically:
+
+- the code creates an object with three properties, each an array:
+  - one for fierce monsters
+  - one for friendly monsters
+  - one for monsters who haven't decided yet.
+- for each array, the code creates and appends 5000
+  randomly-initialized monsters. Each monster has:
+  - a string, for the monster's name
+  - a number representing the number of eyes it has
+  - a number representing the number of tentacles it has.
+
+ +So the structure of the memory allocated on the JavaScript heap is an +object containing three arrays, each containing 5000 objects (monsters), +each object containing a string and two integers: + +[![](../img/monsters.svg)] diff --git a/docs/performance/memory/refcount_tracing_and_balancing.md b/docs/performance/memory/refcount_tracing_and_balancing.md new file mode 100644 index 0000000000..fe3e3c7ae4 --- /dev/null +++ b/docs/performance/memory/refcount_tracing_and_balancing.md @@ -0,0 +1,235 @@ +# Refcount Tracing and Balancing + +Refcount tracing and balancing are advanced techniques for tracking down +leak of refcounted objects found with +[BloatView](bloatview.md). The first step +is to run Firefox with refcount tracing enabled, which produces one or +more log files. Refcount tracing logs calls to `Addref` and `Release`, +preferably for a particular set of classes, including call-stacks in +symbolic form (on platforms that support this). Refcount balancing is a +follow-up step that analyzes the resulting log to help a developer +figure out where refcounting went wrong. + +## How to build for refcount tracing + +Build with `--enable-debug` or `--enable-logrefcnt`. + +## How to run with refcount tracing on + +There are several environment variables that can be used. + +First, you select one of three environment variables to choose what kind +of logging you want. You almost certainly want `XPCOM_MEM_REFCNT_LOG`. + +NOTE: Due to an issue with the sandbox on Windows (bug +[1345568](https://bugzilla.mozilla.org/show_bug.cgi?id=1345568) +refcount logging currently requires the MOZ_DISABLE_CONTENT_SANDBOX +environment variable to be set. + +`XPCOM_MEM_REFCNT_LOG` + +Setting this environment variable enables refcount tracing. If you set +this environment variable to the name of a file, the log will be output +to that file. You can also set it to 1 to log to stdout or 2 to log to +stderr, but these logs are large and expensive to capture, so you +probably don't want to do that. **WARNING**: you should never use this +without `XPCOM_MEM_LOG_CLASSES` and/or `XPCOM_MEM_LOG_OBJECTS`, because +without some filtering the logging will be completely useless due to how +slow the browser will run and how large the logs it produces will be. + +`XPCOM_MEM_COMPTR_LOG` + +This environment variable enables logging of additions and releases of +objects into `nsCOMPtr`s. This requires C++ dynamic casts, so it is not +supported on all platforms. However, having an nsCOMPtr log and using it +in the creation of the balance tree allows AddRef and Release calls that +we know are matched to be eliminated from the tree, so it makes it much +easier to debug reference count leaks of objects that have a large +amount of reference counting traffic. + +`XPCOM_MEM_ALLOC_LOG` + +For platforms that don't have stack-crawl support, XPCOM supports +logging at the call site to `AddRef`/`Release` using the usual cpp +`__FILE__` and __LINE__ number macro expansion hackery. This results +in slower code, but at least you get some data about where the leaks +might be occurring from. + +You must also set one or two additional environment variables, +`XPCOM_MEM_LOG_CLASSES` and `XPCOM_MEM_LOG_OBJECTS,` to reduce the set +of objects being logged, in order to improve performance to something +vaguely tolerable. + +`XPCOM_MEM_LOG_CLASSES` + +This variable should contain a comma-separated list of names which will +be used to compare against the types of the objects being logged. 
For example:
+
+    env XPCOM_MEM_LOG_CLASSES=nsDocShell XPCOM_MEM_REFCNT_LOG=./refcounts.log ./mach run
+
+This will log the `AddRef` and `Release` calls only for instances of
+`nsDocShell` while running the browser using `mach`, to a file
+`refcounts.log`. Note that setting `XPCOM_MEM_LOG_CLASSES` will also
+list the *serial number* of each object that leaked in the "bloat log"
+(that is, the file specified by the `XPCOM_MEM_BLOAT_LOG` variable; see
+[the BloatView documentation](bloatview.md)
+for more details). An object's serial number is simply a unique number,
+starting at one, that is assigned to the object when it is allocated.
+
+You may use an object's serial number with the following variable to
+further restrict the reference count tracing:
+
+    XPCOM_MEM_LOG_OBJECTS
+
+Set this variable to a comma-separated list of object *serial numbers*
+or ranges of *serial numbers*, e.g., `1,37-42,73,165` (serial numbers
+start from 1, not 0). When this is set, along with
+`XPCOM_MEM_LOG_CLASSES` and `XPCOM_MEM_REFCNT_LOG`, a stack trace will
+be generated for *only* the specific objects that you list. For example,
+
+    env XPCOM_MEM_LOG_CLASSES=nsDocShell XPCOM_MEM_LOG_OBJECTS=2 XPCOM_MEM_REFCNT_LOG=./refcounts.log ./mach run
+
+will log stack traces to `refcounts.log` for the 2nd `nsDocShell` object
+that gets allocated, and nothing else.
+
+## **Post-processing step 1: finding the leakers**
+
+First you have to figure out which objects leaked. The script
+`tools/rb/find_leakers.py` does this. It grovels through the log file
+and figures out which objects got allocated (it knows an object was just
+allocated because it got `AddRef()`-ed and its refcount became 1). It
+adds them to a list. When it finds an object that got freed (it knows
+because its refcount goes to 0), it removes it from the list. Anything
+left over is leaked.
+
+The script's output looks like the following.
+
+    0x00253ab0 (1)
+    0x00253ae0 (2)
+    0x00253bd0 (4)
+
+The number in parentheses indicates the order in which it was allocated,
+if you care. Pick one of these pointers for use with Step 2.
+
+## Post-processing step 2: filtering the log
+
+Once you've picked an object that leaked, you can use
+`tools/rb/filter-log.pl` to filter the log file and drop the call
+stacks for other objects. This reduces the size of the log file and
+also improves performance.
+
+    perl -w tools/rb/filter-log.pl --object 0x00253ab0 < ./refcounts.log > my-leak.log
+
+### Symbolicating stacks
+
+The log files often lack function names, file
+names and line numbers. You'll need to run a script to fix the call
+stack.
+
+    python3 tools/rb/fix_stacks.py < ./refcounts.log > fixed_stack.log
+
+Also, it is possible to [locally symbolicate](/contributing/debugging/local_symbols.rst)
+logs generated on TreeHerder.
+
+## **Post-processing step 3: building the balance tree**
+
+Now that you have the log file fully prepared, you can build a *balance
+tree*. This process takes all the `AddRef()` and `Release()` stack
+traces and munges them into a call graph. Each node in the graph
+represents a call site. Each call site has a *balance factor*, which is
+positive if more `AddRef()`s than `Release()`es have happened at the
+site, zero if the number of `AddRef()`s and `Release()`es are equal, and
+negative if more `Release()`es than `AddRef()`s have happened at the
+site.
+
+To build the balance tree, run `tools/rb/make-tree.pl`, specifying the
+object of interest.
For example: + + perl -w tools/rb/make-tree.pl --object 0x00253ab0 < my-leak.log + +This will build an indented tree that looks something like this (except +probably a lot larger and leafier): + + .root: bal=1 + main: bal=1 + DoSomethingWithFooAndReturnItToo: bal=2 + NS_NewFoo: bal=1 + +Let's pretend in our toy example that `NS_NewFoo()` is a factory method +that makes a new foo and returns it. +`DoSomethingWithFooAndReturnItToo()` is a method that munges the foo +before returning it to `main()`, the main program. + +What this little tree is telling you is that you leak *one refcount* +overall on object `0x00253ab0`. But, more specifically, it shows you +that: + +- `NS_NewFoo()` "leaks" a refcount. This is probably "okay" + because it's a factory method that creates an `AddRef()`-ed object. +- `DoSomethingWithFooAndReturnItToo()` leaks *two* refcounts. + Hmm...this probably isn't okay, especially because... +- `main()` is back down to leaking *one* refcount. + +So from this, we can deduce that `main()` is correctly releasing the +refcount that it got on the object returned from +`DoSomethingWithFooAndReturnItToo()`, so the leak *must* be somewhere in +that function. + +So now say we go fix the leak in `DoSomethingWithFooAndReturnItToo()`, +re-run our trace, grovel through the log by hand to find the object that +corresponds to `0x00253ab0` in the new run, and run `make-tree.pl`. What +we'd hope to see is a tree that looks like: + + .root: bal=0 + main: bal=0 + DoSomethingWithFooAndReturnItToo: bal=1 + NS_NewFoo: bal=1 + +That is, `NS_NewFoo()` "leaks" a single reference count; this leak is +"inherited" by `DoSomethingWithFooAndReturnItToo()`; but is finally +balanced by a `Release()` in `main()`. + +## Hints + +Clearly, this is an iterative and analytical process. Here are some +tricks that make it easier. + +**Check for leaks from smart pointers.** If the leak comes from a smart +pointer that is logged in the XPCOM_MEM_COMPTR_LOG, then +find-comptr-leakers.pl will find the exact stack for you, and you don't +have to look at trees. + +**Ignore balanced trees**. The `make-tree.pl` script accepts an option +`--ignore-balanced`, which tells it *not* to bother printing out the +children of a node whose balance factor is zero. This can help remove +some of the clutter from an otherwise noisy tree. + +**Ignore matched releases from smart pointers.** If you've checked (see +above) that the leak wasn't from a smart pointer, you can ignore the +references that came from smart pointers (where we can use the pointer +identity of the smart pointer to match the AddRef and the Release). This +requires using an XPCOM_MEM_REFCNT_LOG and an XPCOM_MEM_COMPTR_LOG that +were collected at the same time. For more details, see the [old +documentation](http://www-archive.mozilla.org/performance/leak-tutorial.html) +(which should probably be incorporated here). This is best used with +`--ignore-balanced` + +**Play Mah Jongg**. An unbalanced tree is not necessarily an evil thing. +More likely, it indicates that one `AddRef()` is cancelled by another +`Release()` somewhere else in the code. So the game is to try to match +them with one another. + +**Exclude Functions.** To aid in this process, you can create an +"excludes file", that lists the name of functions that you want to +exclude from the tree building process (presumably because you've +matched them). 
`make-tree.pl` has an option `--exclude [file]`, where +`[file]` is a newline-separated list of function names that will be +*excluded* from consideration while building the tree. Specifically, any +call stack that contains that call site will not contribute to the +computation of balance factors in the tree. + +## How to instrument your objects for refcount tracing and balancing + +The process is the same as instrumenting them for BloatView because BloatView +and refcount tracing share underlying infrastructure. diff --git a/docs/performance/memory/tree_map_view.md b/docs/performance/memory/tree_map_view.md new file mode 100644 index 0000000000..30d9968db6 --- /dev/null +++ b/docs/performance/memory/tree_map_view.md @@ -0,0 +1,62 @@ +# Tree map view + +The Tree map view is new in Firefox 48. + +The Tree map view provides a visual representation of the snapshot, that +helps you quickly get an idea of which objects are using the most +memory. + +A treemap displays [\"hierarchical (tree-structured) data as a set of +nested rectangles\"](https://en.wikipedia.org/wiki/Treemapping). The +size of the rectangles corresponds to some quantitative aspect of the +data. + +For the treemaps shown in the Memory tool, things on the heap are +divided at the top level into four categories: + +- **objects**: JavaScript and DOM objects, such as `Function`, + `Object`, or `Array`, and DOM types like `Window` and + `HTMLDivElement`. +- **scripts**: JavaScript sources loaded by the page. +- **strings** +- **other**: this includes internal + [SpiderMonkey](https://developer.mozilla.org/en-US/docs/Tools/Tools_Toolbox#settings/en-US/docs/Mozilla/Projects/SpiderMonkey) objects. + +Each category is represented with a rectangle, and the size of the +rectangle corresponds to the proportion of the heap occupied by items in +that category. This means you can quickly get an idea of roughly what +sorts of things allocated by your site are using the most memory. + +Within top-level categories: + +- **objects** is further divided by the object's type. +- **scripts** is further subdivided by the script's origin. It also + includes a separate rectangle for code that can't be correlated + with a file, such as JIT-optimized code. +- **other** is further subdivided by the object's type. + +Here are some example snapshots, as they appear in the Tree map view: + +![](../img/treemap-domnodes.png) + +This treemap is from the [DOM allocation +example](DOM_allocation_example.md), which runs a +script that creates a large number of DOM nodes (200 `HTMLDivElement` +objects and 4000 `HTMLSpanElement` objects). You can see how almost all +the heap usage is from the `HTMLSpanElement` objects that it creates. + +![](../img/treemap-monsters.png) + +This treemap is from the [monster allocation +example](monster_example.md), which creates three +arrays, each containing 5000 monsters, each monster having a +randomly-generated name. You can see that most of the heap is occupied +by the strings used for the monsters' names, and the objects used to +contain the monsters' other attributes. + +![](../img/treemap-bbc.png) + +This treemap is from , and is probably more +representative of real life than the examples. You can see the much +larger proportion of the heap occupied by scripts, that are loaded from +a large number of origins. 
diff --git a/docs/performance/perf.md b/docs/performance/perf.md new file mode 100644 index 0000000000..47177c3cfd --- /dev/null +++ b/docs/performance/perf.md @@ -0,0 +1,57 @@ +# Perf + +`perf` is a powerful system-wide instrumentation service that is part of +Linux. This article discusses how it can be relevant to power profiling. + +**Note**: The [power profiling +overview](power_profiling_overview.md) is +worth reading at this point if you haven't already. It may make parts +of this document easier to understand. + +## Energy estimates + +`perf` can access the Intel RAPL energy estimates. The following example +shows how to invoke it for this purpose. + +``` +sudo perf stat -a -r 1 \ + -e "power/energy-pkg/" \ + -e "power/energy-cores/" \ + -e "power/energy-gpu/" \ + -e "power/energy-ram/" \ + +``` + +The `-a` is necessary; it means \"all cores\", and without it all the +measurements will be zero. The `-r 1` means `` is executed +once; higher values can be used to get variations. + +The output will look like the following. + +``` +Performance counter stats for 'system wide': + + 51.58 Joules power/energy-pkg/ [100.00%] + 14.80 Joules power/energy-cores/ [100.00%] + 9.93 Joules power/energy-gpu/ [100.00%] + 27.38 Joules power/energy-ram/ [100.00%] + +5.003049064 seconds time elapsed +``` + +It's not clear from the output, but the following relationship holds. + +``` +energy-pkg >= energy-cores + energy-gpu +``` + +The measurement is in Joules, which is usually less useful than Watts. + +For these reasons +[rapl](tools_power_rapl.md) is usually a +better tool for measuring power consumption on Linux. + +## Wakeups {#Wakeups} + +`perf` can also be used to do [high-context profiling of +wakeups](http://robertovitillo.com/2014/02/04/idle-wakeups-are-evil/). diff --git a/docs/performance/perfstats.md b/docs/performance/perfstats.md new file mode 100644 index 0000000000..6ffaa55da9 --- /dev/null +++ b/docs/performance/perfstats.md @@ -0,0 +1,30 @@ +# PerfStats + +PerfStats is a framework for the low-overhead selective collection of internal performance metrics. +The results are accessible through ChromeUtils, Browsertime output, and in select performance tests. + +## Adding a new PerfStat +Define the new PerfStat by adding it to [this list](https://searchfox.org/mozilla-central/rev/b1e5f2c7c96be36974262551978d54f457db2cae/tools/performance/PerfStats.h#34-53) in [`PerfStats.h`](https://searchfox.org/mozilla-central/rev/52da19becaa3805e7f64088e91e9dade7dec43c8/tools/performance/PerfStats.h). +Then, in C++ code, wrap execution in an RAII object, e.g. +``` +PerfStats::AutoMetricRecording() +``` +or call the following function manually: +``` +PerfStats::RecordMeasurement(PerfStats::Metric::MyMetric, Start, End) +``` +For incrementing counters, use the following: +``` +PerfStats::RecordMeasurementCount(PerfStats::Metric::MyMetric, incrementCount) +``` + +[Here's an example of a patch where a new PerfStat was added and used.](https://hg.mozilla.org/mozilla-central/rev/3e85a73d1fa5c816fdaead66ecee603b38f9b725) + +## Enabling collection +To enable collection, use `ChromeUtils.SetPerfStatsCollectionMask(MetricMask mask)`, where `mask=0` disables all metrics and `mask=0xFFFFFFFF` enables all of them. +`MetricMask` is a bitmask based on `Metric`, i.e. `Metric::LayerBuilding (2)` is synonymous to `1 << 2` in `MetricMask`. + +## Accessing results +Results can be accessed with `ChromeUtils.CollectPerfStats()`. +The Browsertime test framework will sum results across processes and report them in its output. 
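+
+For instance, from a privileged (chrome) context such as the Browser
+Console, collection might be driven roughly like this (a sketch that
+uses the method names given above; the exact shape of the returned data
+is not spelled out here):
+
+```
+// Enable every metric (a mask of 0 would disable them all).
+ChromeUtils.SetPerfStatsCollectionMask(0xFFFFFFFF);
+// ... exercise the browser, then collect the accumulated results.
+let results = await ChromeUtils.CollectPerfStats();
+console.log(results);
+```
+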
+The raptor-browsertime Windows essential pageload tests also collect all PerfStats. diff --git a/docs/performance/platform_microbenchmarks/platform_microbenchmarks.md b/docs/performance/platform_microbenchmarks/platform_microbenchmarks.md new file mode 100644 index 0000000000..761cd0f30d --- /dev/null +++ b/docs/performance/platform_microbenchmarks/platform_microbenchmarks.md @@ -0,0 +1,21 @@ +# Platform microbenchmarks + +Platform microbenchmarks benchmarks specific low-level operations used +by the gecko platform. If a test regresses, it could result in the +degradation in the performance of some user-visible feature. + +The list of tests and their descriptions is currently incomplete. If +something is missing, please search for it in the gecko source and +update this page (or ask the original author to do so, if you're still +not sure). + +## String tests + +* PerfStripWhitespace +* PerfCompressWhitespace +* PerfStripCharsWhitespace +* PerfStripCRLF +* PerfStripCharsCRLF + +These tests measure the amount of time it takes to perform a large +number of operations on low-level strings. diff --git a/docs/performance/power_profiling_overview.md b/docs/performance/power_profiling_overview.md new file mode 100644 index 0000000000..bb8f511fe2 --- /dev/null +++ b/docs/performance/power_profiling_overview.md @@ -0,0 +1,326 @@ +# Power profiling + +This article covers important background information about power +profiling, with an emphasis on Intel processors used in desktop and +laptop machines. It serves as a starting point for anybody doing power +profiling for the first time. + +## Basic physics concepts + +In physics, *[power](https://en.wikipedia.org/wiki/Power_%28physics%29)* +is the rate of doing +*[work](https://en.wikipedia.org/wiki/Work_%28physics%29 "Work (physics)")*. +It is equivalent to an amount of +*[energy](https://en.wikipedia.org/wiki/Energy_%28physics%29 "Energy (physics)"){.mw-redirect}* +consumed per unit time. In SI units, energy is measured in Joules, and +power is measured in Watts, which is equivalent to Joules per second. + +Although power is an instantaneous concept, in practice measurements of +it are determined in a non-instantaneous fashion, i.e. by dividing an +energy amount by a non-infinitesimal time period. Strictly speaking, +such a computation gives the *average power* but this is often referred +to as just the *power* when context makes it clear. + +In the context of computing, a fully-charged mobile device battery (as +found in a laptop or smartphone) holds a certain amount of energy, and +the speed at which that stored energy is depleted depends on the power +consumption of the mobile device. That in turn depends on the software +running on the device. Web browsers are popular applications and can be +power-intensive, and therefore can significantly affect battery life. As +a result, it is worth optimizing (i.e. reducing) the power consumption +caused by Firefox and Firefox OS. + +## Intel processor basics + +### Processor layout + +The following diagram (from the [Intel Power Governor +documentation)](https://software.intel.com/en-us/articles/intel-power-governor) +shows how machines using recent Intel processors are constructed. + +![](img/power-planes.jpg) + +The important points are as follows. + +- The processor has one or more *packages*. These are part of the + actual processor that you buy from Intel. Client processors (e.g. + Core i3/i5/i7) have one package. Server processors (e.g. Xeon) + typically have two or more packages. 
+- Each package contains multiple *cores*. +- Each core typically has + [hyper-threading](https://en.wikipedia.org/wiki/Hyper-threading), + which means it contains two logical *CPUs*. +- The part of the package outside the cores is called the [*uncore* or + *system agent*](https://en.wikipedia.org/wiki/Uncore)*.* It includes + various components including the L3 cache, memory controller, and, + for processors that have one, the integrated GPU. +- RAM is separate from the processor. + +### C-states + +Intel processors have aggressive power-saving features. The first is the +ability to switch frequently (thousands of times per second) between +active and idle states, and there are actually several different kinds +of idle states. These different states are called +*[C-states](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Processor_states).* +C0 is the active/busy state, where instructions are being executed. The +other states have higher numbers and reflect increasing deeper idle +states. The deeper an idle state is, the less power it uses, but the +longer it takes to wake up from. + +Note: the [ACPI +standard](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) +specifies four states, C0, C1, C2 and C3. Intel maps these to +processor-specific states such as C0, C1, C2, C6 and C7. and many tools +report C-states using the latter names. The exact relationship is +confusing, and chapter 13 of the [Intel optimization +manual](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html) +has more details. The important thing is that C0 is always the active +state, and for the idle states a higher number always means less power +consumption. + +The other thing to note about C-states is that they apply both to cores +and the entire package --- i.e. if all cores are idle then the entire +package can also become idle, which reduces power consumption even +further. + +The fraction of time that a package or core spends in an idle C-state is +called the *C-state residency*. This is a misleading term --- the active +state, C0, is also a C-state --- but one that is nonetheless common. + +Intel processors have model-specific registers (MSRs) containing +measurements of how much time is spent in different C-states, and tools +such as [powermetrics](powermetrics.md) +(Mac), powertop and +[turbostat](turbostat.md) (Linux) can +expose this information. + +A *wakeup* occurs when a core or package transitions from an idle state +to the active state. This happens when the OS schedules a process to run +due to some kind of event. Common causes of wakeups include scheduled +timers going off and blocked I/O system calls receiving data. +Maintaining C-state residency is crucial to keep power consumption low, +and so reducing wakeup frequency is one of the best ways to reduce power +consumption. + +One consequence of the existence of C-states is that observations made +during power profiling --- even more than with other kinds of profiling +--- can disturb what is being observed. For example, the Gecko Profiler +takes samples at 1000Hz using a timer. Each of these samples can trigger +a wakeup, which consumes power and obscures Firefox's natural wakeup +patterns. For this reason, integrating power measurements into the Gecko +Profiler is unlikely to be useful, and other power profiling tools +typically use much lower sampling rates (e.g. 1Hz.) + +### P-states + +Intel processors also support multiple *P-states*. 
P0 is the state where +the processor is operating at maximum frequency and voltage, and +higher-numbered P-states operate at a lower frequency and voltage to +reduce power consumption. Processors can have dozens of P-states, but +the transitions are controlled by the hardware and OS and so P-states +are of less interest to application developers than C-states. + +## Power and power-related measurements + +There are several kinds of power and power-related measurements. Some +are global (whole-system) and some are per-process. The following +sections list them from best to worst. + +### Power measurements + +The best measurements are measured in joules and/or watts, and are taken +by measuring the actual hardware in some fashion. These are global +(whole-system) measurements that are affected by running programs but +also by other things such as (for laptops) how bright the monitor +backlight is. + +- Devices such as ammeters give the best results, but these can be + expensive and difficult to set up. +- A cruder technique that works with mobile machines and devices is to + run a program for a long time and simply time how long it takes for + the battery to drain. The long measurement times required are a + disadvantage, though. + +### Power estimates + +The next best measurements come from recent (Sandy Bridge and later) +Intel processors that implement the *RAPL* (Running Average Power Limit) +interface that provides MSRs containing energy consumption estimates for +up to four *power planes* or *domains* of a machine, as seen in the +diagram above. + +- PKG: The entire package. + - PP0: The cores. + - PP1: An uncore device, usually the GPU (not available on all + processor models.) +- DRAM: main memory (not available on all processor models.) + +The following relationship holds: PP0 + PP1 \<= PKG. DRAM is independent +of the other three domains. + +These values are computed using a power model that uses +processor-internal counts as inputs, and they have been +[verified](http://www.computer.org/csdl/proceedings/ispass/2013/5776/00/06557170.pdf) +as being fairly accurate. They are also updated frequently, at +approximately 1,000 Hz, though the variability in their update latency +means that they are probably only accurate at lower frequencies, e.g. up +to 20 Hz or so. See section 14.9 of Volume 3 of the [Intel Software +Developer's +Manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) +for more details about RAPL. + +Tools that can take RAPL readings include the following. + +- `tools/power/rapl`: all planes; Linux and Mac. +- [Intel Power + Gadget](intel_power_gadget.md): PKG and + PP0 planes; Windows, Mac and Linux. +- [powermetrics](powermetrics.md): PKG + plane; Mac. +- [perf](perf.md): all planes; Linux. +- [turbostat](turbostat.md): PKG, PP0 and + PP1 planes; Linux. + +Of these, +[tools/power/rapl](tools_power_rapl.md) is +generally the easiest and best to use because it reads all power planes, +it's a command line utility, and it doesn't measure anything else. + +### Proxy measurements + +The next best measurements are proxy measurements, i.e. measurements of +things that affect power consumption such as CPU activity, GPU activity, +wakeup frequency, C-state residency, disk activity, and network +activity. Some of these are measured on a global basis, and some can be +measured on a per-process basis. Some can also be measured via +instrumentation within Firefox itself. 
+
+The correlation between each proxy measure and power consumption is hard
+to know and can vary greatly. When used carefully, however, proxy
+measurements can still be useful. This is because they can often be
+measured in a more fine-grained fashion than power measurements and
+estimates, which is vital for gaining insight into how a program can
+reduce power consumption.
+
+Most profiling tools provide at least some proxy measurements.
+
+### Hybrid proxy measurements
+
+These are combinations of proxy measurements. The combinations are
+semi-arbitrary, they amplify the unreliability of proxy measurements,
+and unlike non-hybrid proxy measurements, they don't have a clear
+physical meaning. Avoid them.
+
+The most notable example of a hybrid proxy measurement is the "Energy
+Impact" used by [OS X's Activity
+Monitor](activity_monitor_and_top.md#What-does-Energy-Impact-measure).
+
+## Ways to use power-related measurements
+
+### Low-context measurements
+
+Most power-related measurements are global or per-process. Such
+low-context measurements are typically good for understanding *if* power
+consumption is good or bad, but in the latter case they often don't
+provide much insight into why the problem is occurring, which part of
+the code is at fault, or how it can be fixed. Nonetheless, they can
+still help improve understanding of a problem by using *differential
+profiling*.
+
+- Compare browsers to see if Firefox is doing better or worse than
+  another browser on a particular workload.
+- Compare different versions of Firefox to see if Firefox has improved
+  or worsened over time on a particular workload. This can identify
+  specific changes that caused regressions, for example.
+- Compare different configurations of Firefox to see if a particular
+  feature is affecting things.
+- Compare different workloads. This can be particularly useful if the
+  workloads only vary slightly. For example, it can be useful to
+  gradually remove elements from a web page and see how the
+  power-related measurements change. Even just switching a tab from
+  the foreground to the background can make a difference.
+
+### High-context measurements
+
+A few power-related measurements can be obtained in a high-context
+fashion, e.g. with stack traces that clearly pinpoint specific parts of
+the code as being responsible.
+
+- Standard performance profiling tools that measure CPU usage or
+  proxies of CPU usage (such as instruction counts) typically provide
+  high-context measurements. This is useful because high CPU usage
+  typically causes high power consumption.
+- Some tools can provide high-context wakeup measurements:
+  [dtrace](dtrace.md) (on Mac) and
+  [perf](perf.md) (on Linux).
+- Source-level instrumentation, such as [TimerFirings
+  logging](timerfirings_logging.md), can
+  identify which timers are firing frequently.
+
+## Power profiling how-to
+
+This section aims to put together all the above information and provide
+a set of strategies for finding, diagnosing and fixing cases of high
+power consumption.
+
+- First of all, all measurements are best done on a quiet machine that
+  is running little other than the program of interest. Global
+  measurements in particular can be completely skewed and unreliable
+  if this is not the case.
+- Find or confirm a test case where Firefox's power consumption is
+  high. "High" can most easily be gauged by comparing against other
+  browsers. Use power measurements or estimates (e.g.
via + [tools/power/rapl](tools_power_rapl.md), + or `mach power` on Mac, or [Intel Power + Gadget](intel_power_gadget.md) on + Windows) for the comparisons. Avoid lower-quality measurements, + especially Activity Monitor's "Energy Impact". +- Try using differential profiling to narrow down the cause. + - Try turning hardware acceleration on or off; e10s on or off; + Flash on or off. + - Try putting the relevant tab in the foreground vs. in the + background. + - If the problem manifests on a particular website, try saving a + local copy of the site and then manually removing HTML elements + to see if a particular page feature is causing the problem +- Many power problems are caused by either high CPU usage or high + wakeup frequency. Use one of the low-context tools to determine if + this is the case (e.g. on Mac use + [powermetrics](powermetrics.md).) If + so, follow that up by using a tool that gives high-context + measurements, which hopefully will identify the cause of the + problem. + - For high CPU usage, many profilers can be used: Firefox's dev + tools profiler, the Gecko Profiler, or generic performance + profilers. + - For high wakeup counts, use + [dtrace](dtrace.md) or + [perf](perf.md) or [TimerFirings logging](timerfirings_logging.md). +- On Mac workloads that use graphics, Activity Monitor's "Energy" + tab can tell you if the high-performance GPU is being used, which + uses more power than the integrated GPU. +- If neither CPU usage nor wakeup frequency identifies the problem, + more ingenuity may be needed. Looking at other measurements (C-state + residency, GPU usage, etc.) may be helpful. +- Animations are sometimes the cause of high power consumption. The + [animation + inspector](/devtools-user/page_inspector/how_to/work_with_animations/index.rst#animation-inspector) + in the Firefox Devtools can identify them. Alternatively, [here is + an + explanation](https://bugzilla.mozilla.org/show_bug.cgi?id=1190721#c10) + of how one developer diagnosed two animation-related problems the + hard way (which required genuine platform expertise). +- The approximate cause of power problems often isn't that hard to + find. Fixing them is often the hard part. Good luck. +- If you do fix a problem by improving a proxy measurement, you should + verify that it also improves a power measurement or estimate. That + way you know the fix had a genuine effect. + +## Further reading + +Chapter 13 of the [Intel optimization +manual](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html) +has many details about optimizing for power consumption. Section 13.5 +("Tuning Software for Intelligent Power Consumption") in particular is +worth reading. diff --git a/docs/performance/powermetrics.md b/docs/performance/powermetrics.md new file mode 100644 index 0000000000..44df0eda9c --- /dev/null +++ b/docs/performance/powermetrics.md @@ -0,0 +1,167 @@ +# powermetrics + +`powermetrics` is a Mac-only command-line utility that provides many +high-quality power-related measurements. It is most useful for getting +CPU, GPU and wakeup measurements in a precise and easily scriptable +fashion (unlike [Activity Monitor and +top](activity_monitor_and_top.md)) +especially in combination with +[rapl](tools_power_rapl.md) via the +`mach power` command. This document describes the version of +`powermetrics` that comes with Mac OS 10.10. The one that comes with +10.9 is less powerful. 
+ +**Note**: The [power profiling +overview](power_profiling_overview.md) is +worth reading at this point if you haven\'t already. It may make parts +of this document easier to understand. + +## Quick start + +`powermetrics` provides a vast number of measurements. The following +command encompasses the most useful ones: + +sudo powermetrics --samplers tasks --show-process-coalition --show-process-gpu -n 1 -i 5000 + +- `--samplers tasks` tells it to just do per-process measurements. +- `--show-process-coalition`` `tells it to group *coalitions* of + related processes, e.g. the Firefox parent process and child + processes. +- `--show-process-gpu` tells it to show per-process GPU measurements. +- `-n 1` tells it to take one sample and then stop. +- `-i 5000` tells it to use a sample length of 5 seconds (5000 ms). + Change this number to get shorter or longer samples. + +The following is example output from such an invocation: + + *** Sampled system activity (Fri Sep 4 17:15:14 2015 +1000) (5009.63ms elapsed) *** + + *** Running tasks *** + + Name ID CPU ms/s User% Deadlines (<2 ms, 2-5 ms) Wakeups (Intr, Pkg idle) GPU ms/s + com.apple.Terminal 293 447.66 274.83 120.35 221.74 + firefox 84627 77.59 55.55 15.37 2.59 91.42 42.12 204.47 + plugin-container 84628 377.22 37.18 43.91 18.56 178.65 75.85 17.29 + Terminal 694 9.86 79.94 0.00 0.00 4.39 2.20 0.00 + powermetrics 84694 1.21 31.53 0.00 0.00 0.20 0.20 0.00 + com.google.Chrome 489 233.83 48.10 25.95 0.00 + Google Chrome Helper 84688 181.57 92.81 0.00 0.00 23.95 12.77 0.00 + Google Chrome 84681 57.26 76.07 4.39 0.00 23.75 12.97 0.00 + Google Chrome Helper 84685 0.13 48.08 0.00 0.00 0.40 0.20 0.00 + kernel_coalition 1 128.64 780.19 330.52 0.00 + kernel_task 0 109.97 0.00 0.20 0.00 779.47 330.35 0.00 + launchd 1 18.88 2.44 0.00 0.00 0.40 0.20 0.00 + com.apple.Safari 488 90.60 108.58 56.48 26.65 + com.apple.WebKit.WebContent 84679 64.21 84.69 0.00 0.00 104.19 54.89 26.66 + com.apple.WebKit.Networking 84678 26.89 58.89 0.40 0.00 1.60 0.00 0.00 + Safari 84676 1.56 55.74 0.00 0.00 2.59 1.40 0.00 + com.apple.Safari.SearchHelper 84690 0.15 49.49 0.00 0.00 0.20 0.20 0.00 + org.mozilla.firefox 482 76.56 124.34 63.47 0.00 + firefox 84496 76.70 89.18 10.58 5.59 124.55 63.48 0.00 + +This sample was taken while the following programs were running: + +- Firefox Beta (single process, invoked from the Mac OS dock, shown in + the `org.mozilla.firefox` coalition.) +- Firefox Nightly (multi-process, invoked from the command line, shown + in the `com.apple.Terminal` coalition.) +- Google Chrome. +- Safari. + +The grouping of parent and child processes (in coalitions) is obvious. +The meaning of the columns is as follows. + +- **Name**: Coalition/process name. Process names within coalitions + are indented. +- **ID**: Coalition/process ID number. +- **CPU ms/s**: CPU time used by the coalition/process, per second, + during the sample period. The sum of the process values typically + exceeds the coalition value slightly, for unknown reasons. +- **User%**: Percentage of that CPU time spent in user space (as + opposed to kernel mode.) +- **Deadlines (\<2 ms, 2-5 ms)**: These two columns count how many + \"short\" timers woke up threads in the process, per second, during + the sample period. High frequency timers, which typically have short + time-to-deadlines, can cause high power consumption and should be + avoided if possible. +- **Wakeups (Intr, Pkg idle)**: These two columns count how many + wakeups occurred, per second, during the sample period. 
The first column counts interrupt-level wakeups that resulted in a
+  thread being dispatched in the process. The second column counts
+  "package idle exit" wakeups, which wake up the entire package as
+  opposed to just a single core; such wakeups are particularly
+  expensive, and this count is a subset of the first column's count.
+- **GPU ms/s**: GPU time used by the coalition/process, per second,
+  during the sample period.
+
+Other things to note.
+
+- Smaller is better --- i.e. results in lower power consumption ---
+  for all of these measurements.
+- There is some overlap between the two "Deadlines" columns and the
+  two "Wakeups" columns. For example, firing a single sub-2ms
+  deadline can also cause a package idle exit wakeup.
+- Many of these measurements are also obtainable by passing the
+  `TASK_POWER_INFO` flag and a `task_power_info` struct to the
+  `task_info` function.
+- By default, the coalitions/processes are sorted by a composite value
+  computed from several factors, though this can be changed via
+  command-line options.
+
+## Other measurements
+
+`powermetrics` can also report measurements of backlight usage, network
+activity, disk activity, interrupt distribution, device power states,
+C-state residency, P-state residency, quality of service classes, and
+thermal pressure. These are less likely to be useful for profiling
+Firefox, however. Run with the `--show-all` option to see all of these
+at once, but note that you'll need a very wide window to see all the
+data.
+
+Also note that `powermetrics -h` is a better guide to the
+command-line options than `man powermetrics`.
+
+## mach power
+
+You can use the `mach power` command to run `powermetrics` in
+combination with `rapl` in a way that gives the most useful summary
+measurements for each of Firefox, Chrome and Safari. The following is
+sample output.
+ + total W = _pkg_ (cores + _gpu_ + other) + _ram_ W + #01 17.14 W = 14.98 ( 5.50 + 1.19 + 8.29) + 2.16 W + + 1 sample taken over a period of 30.000 seconds + + Name ID CPU ms/s User% Deadlines (<2 ms, 2-5 ms) Wakeups (Intr, Pkg idle) GPU ms/s + com.google.Chrome 500 439.64 585.35 218.62 19.17 + Google Chrome Helper 67319 284.75 83.03 296.67 0.00 454.05 172.74 0.00 + Google Chrome Helper 67304 55.23 64.83 0.03 0.00 9.43 4.33 19.17 + Google Chrome 67301 63.77 68.09 29.46 0.13 76.11 22.26 0.00 + Google Chrome Helper 67320 38.30 66.70 17.83 0.00 45.78 19.29 0.00 + com.apple.WindowServer 68 102.58 112.36 43.15 80.52 + WindowServer 141 103.03 58.19 60.48 6.40 112.36 43.15 80.53 + com.apple.Safari 499 267.19 110.53 46.05 1.69 + com.apple.WebKit.WebContent 67372 190.15 79.34 2.02 0.14 129.28 53.79 2.33 + com.apple.WebKit.Networking 67292 65.23 52.74 0.07 0.00 4.33 1.40 0.00 + Safari 67290 29.09 77.65 0.23 0.00 7.13 3.37 0.00 + com.apple.Safari.SearchHelper 67371 13.88 91.18 0.00 0.00 0.36 0.05 0.00 + com.apple.WebKit.WebContent 67297 0.81 56.84 0.10 0.00 2.20 1.30 0.00 + com.apple.WebKit.WebContent 67293 0.46 76.40 0.03 0.00 0.57 0.20 0.00 + com.apple.WebKit.WebContent 67295 0.24 67.72 0.00 0.00 0.90 0.37 0.00 + com.apple.WebKit.WebContent 67298 0.17 59.88 0.00 0.00 0.50 0.13 0.00 + com.apple.WebKit.WebContent 67296 0.07 43.51 0.00 0.00 0.10 0.03 0.00 + kernel_coalition 1 111.76 724.80 213.09 0.12 + kernel_task 0 107.06 0.00 5.86 0.00 724.46 212.99 0.12 + org.mozilla.firefox 498 92.17 212.69 75.67 1.81 + firefox 63865 61.00 87.18 1.00 0.87 25.79 9.00 1.81 + plugin-container 67269 31.49 72.46 1.80 0.00 186.90 66.68 0.00 + com.apple.WebKit.Plugin.64 67373 55.55 74.38 0.74 0.00 9.51 3.13 0.02 + com.apple.Terminal 109 6.22 0.40 0.23 0.00 + Terminal 208 6.25 92.99 0.00 0.00 0.33 0.20 0.00 + +The `rapl` output is first, then the `powermetrics` output. As well as +the browser processes, the `WindowServer` and kernel tasks are shown +because browsers often trigger significant load in them. + +The default sample period is 30,000 milliseconds (30 seconds), but that +can be changed with the `-i` option. diff --git a/docs/performance/profiling_with_concurrency_visualizer.md b/docs/performance/profiling_with_concurrency_visualizer.md new file mode 100644 index 0000000000..495fa15538 --- /dev/null +++ b/docs/performance/profiling_with_concurrency_visualizer.md @@ -0,0 +1,5 @@ +# Profiling with Concurrency Visualizer + +Concurrency Visualizer is an excellent alternative to xperf. In newer versions of Visual Studio, it is an addon that needs to be downloaded. + +Here are some scripts that you can be used for manipulating the profiles that have been exported to CSV: [https://github.com/jrmuizel/concurrency-visualizer-scripts](https://github.com/jrmuizel/concurrency-visualizer-scripts) diff --git a/docs/performance/profiling_with_instruments.md b/docs/performance/profiling_with_instruments.md new file mode 100644 index 0000000000..ac37bb2660 --- /dev/null +++ b/docs/performance/profiling_with_instruments.md @@ -0,0 +1,110 @@ +# Profiling with Instruments + +Instruments can be used for memory profiling and for statistical +profiling. 
+ +## Official Apple documentation + +- [Instruments User + Guide](https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/InstrumentsUserGuide/) +- [Instruments User + Reference](https://developer.apple.com/library/mac/documentation/AnalysisTools/Reference/Instruments_User_Reference/) +- [Instruments Help + Articles](https://developer.apple.com/library/mac/recipes/Instruments_help_articles/) +- [Instruments + Help](https://developer.apple.com/library/mac/recipes/instruments_help-collection/) +- [Performance + Overview](https://developer.apple.com/library/mac/documentation/Performance/Conceptual/PerformanceOverview/) + +### Basic Usage + +- Select \"Time Profiler\" from the \"Choose a profiling template + for:\" dialog. +- In the top left, next to the record and pause button, there will be + a \"\[machine name\] \> All Processes\". Click \"All Processes\" and + select \"firefox\" from the \"Running Applications\" section. +- Click the record button (red circle in top left) +- Wait for the amount of time that you want to profile +- Click the stop button + +## Command line tools + +There are +[instruments](https://developer.apple.com/library/mac/documentation/Darwin/Reference/Manpages/man1/instruments.1.html) +and +[iprofiler](https://developer.apple.com/library/mac/documentation/Darwin/Reference/Manpages/man1/iprofiler.1.html). + +Instruments has a \"Counters\" instrument that can monitor performance +counters (cache misses, etc.). + +## Memory profiling + +Instruments will record a call stack at each allocation point. The call +tree view can be quite helpful here; switch to it from the \"Statistics\" view. This +`malloc` profiling is done using the `malloc_logger` infrastructure +(similar to `MallocStackLogging`). Currently this means you need to +build with jemalloc disabled (`ac_add_options --disable-jemalloc`). You +also need the fix from [Bug +719427](https://bugzilla.mozilla.org/show_bug.cgi?id=719427 "https://bugzilla.mozilla.org/show_bug.cgi?id=719427"). + +## Kernel stacks + +Under "File" -> "Recording Options" you can enable "Record Kernel Callstacks". +To get full symbols and not just the exported ones, you'll need to install the matching +[Kernel Debug Kit](https://developer.apple.com/download/all/?q=Kernel%20Debug%20Kit). +Make sure you install the one whose macOS version exactly matches your version, +including the identifier at the end (e.g. "12.3.1 (21E258)"). + +### Allow Instruments to find kernel symbols + +Installing the KDK is often not enough for Instruments to find the symbols. +Instruments uses Spotlight to find the dSYMs with the matching UUID, so you +need to put the dSYM in a place where Spotlight will index it. + +First, check the UUID of your macOS installation's kernel. To do so, run the +following: + +``` +% dwarfdump --uuid /System/Library/Kernels/kernel.release.t6000 +UUID: C342869F-FFB9-3CCE-A5A3-EA711C1E87F6 (arm64e) /System/Library/Kernels/kernel.release.t6000 +``` + +Then, find the corresponding dSYM file in the KDK that you installed, and +run `mdls` on it. For example: + +``` +% mdls /Library/Developer/KDKs/KDK_12.3.1_21E258.kdk/System/Library/Kernels/kernel.release.t6000.dSYM +``` + +(Make sure you use the `.release` variant, not the `.development` variant +or any of the others.) + +If the output from `mdls` contains the string `com_apple_xcode_dsym_uuids` +and the UUID matches, you're done. + +Otherwise, try copying the `kernel.release.t6000.dSYM` bundle to your home +directory, and then run `mdls` on the copied file.
For example: + +``` +% cp -r /Library/Developer/KDKs/KDK_12.3.1_21E258.kdk/System/Library/Kernels/kernel.release.t6000.dSYM ~/ +% mdls ~/kernel.release.t6000.dSYM +_kMDItemDisplayNameWithExtensions = "kernel.release.t6000.dSYM" +com_apple_xcode_dsym_paths = ( + "Contents/Resources/DWARF/kernel.release.t6000" +) +com_apple_xcode_dsym_uuids = ( + "C342869F-FFB9-3CCE-A5A3-EA711C1E87F6" +) +kMDItemContentCreationDate = 2022-03-21 15:25:57 +0000 +[...] +``` + +Now Instruments should be able to pick up the kernel symbols. + +## Misc + +The `DTPerformanceSession` API can be used to control profiling from +applications, similar to the old CHUD API we used in Shark builds. See [Bug +667036](https://bugzilla.mozilla.org/show_bug.cgi?id=667036 "https://bugzilla.mozilla.org/show_bug.cgi?id=667036"). + +System Trace might be useful. diff --git a/docs/performance/profiling_with_xperf.md b/docs/performance/profiling_with_xperf.md new file mode 100644 index 0000000000..030dae7c68 --- /dev/null +++ b/docs/performance/profiling_with_xperf.md @@ -0,0 +1,180 @@ +# Profiling with xperf + +Xperf is part of the Microsoft Windows Performance Toolkit, and has +functionality similar to that of Shark, oprofile, and (for some things) +dtrace/Instruments. For stack walking, Windows Vista or higher is +required; I haven't tested it at all on XP. + +This page applies to xperf version **4.8.7701 or newer**. To see your +xperf version, either run '`xperf`' on a command line with no +arguments, or start '`xperfview`' and look at Help -\> About +Performance Analyzer. (Note that it's not the first version number in +the About window; that's the Windows version.) + +If you have an older version, you will experience bugs, especially +around symbol loading for local builds. + +## Installation + +For all versions, the tools are part of the latest [Windows 7 SDK (SDK +Version +7.1)](http://www.microsoft.com/downloads/details.aspx?FamilyID=6b6c21d2-2006-4afa-9702-529fa782d63b&displaylang=en "http://www.microsoft.com/downloads/details.aspx?FamilyID=6b6c21d2-2006-4afa-9702-529fa782d63b&displaylang=en"){.external}. +Use the web installer to install at least the \"Win32 Development +Tools\". Once the SDK installs, execute either `wpt_x86.msi` or +`wpt_x64.msi` in the `Redist/Windows Performance Toolkit` folder of the +SDK's install location (typically Program Files/Microsoft +SDKs/Windows/v7.1/Redist/Windows Performance Toolkit) to actually +install the Windows Performance Toolkit tools. + +It might already be installed by the Windows SDK. Check if C:\\Program +Files\\Microsoft Windows Performance Toolkit already exists. + +For 64-bit Windows 7 or Vista, you'll need to do a registry tweak and +then restart to enable stack walking:\ +\ +`REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f` + +## Symbol Server Setup + +With the latest versions of the Windows Performance Toolkit, you can +modify the symbol path directly from within the program via the Trace +menu. Just make sure you set the symbol paths before enabling \"Load +Symbols\" and before opening a summary view. You can also modify the +`_NT_SYMBOL_PATH` and `_NT_SYMCACHE_PATH` environment variables to make +these changes permanent.
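+For example, one way to persist them is with `setx` from an elevated
+command prompt (a sketch; the standard values are described just below,
+and you can use whatever cache directory you prefer):
+
+```
+setx _NT_SYMCACHE_PATH "C:\symbols"
+setx _NT_SYMBOL_PATH "srv*c:\symbols*http://msdl.microsoft.com/download/symbols;srv*c:\symbols*http://symbols.mozilla.org/firefox/"
+```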
+ +The standard symbol path that includes both Mozilla's and Microsoft's +symbol server configuration is as follows: + +`_NT_SYMCACHE_PATH: C:\symbols` + `_NT_SYMBOL_PATH: srv*c:\symbols*http://msdl.microsoft.com/download/symbols;srv*c:\symbols*http://symbols.mozilla.org/firefox/` + +To add symbols **from your own builds**, add +`C:\path\to\objdir\dist\bin` to `_NT_SYMBOL_PATH`. As with all Windows +paths, the symbol path uses semicolons (`;`) as separators. + +Make sure you select the Trace -\> Load Symbols menu option in the +Windows Performance Analyzer (xperfview). + +There seems to be a bug in xperf's symbol handling; it is very sensitive to +when the symbol path is edited. If you change it within the program, +you'll have to close all summary tables and reopen them for it to pick +up the new symbol path data. + +You'll have to agree to a EULA for the Microsoft symbols \-- if you're +not prompted for this, then something isn't configured right in your +symbol path. (Again, make sure that the directories exist; if they +don't, it's a silent error.) + +## Quick Start + +All these tools will live, by default, in C:\\Program Files\\Microsoft +Windows Performance Toolkit. Either run these commands from there, or +add the directory to your path. You will need to use an elevated command +prompt to start or stop profiling. + +Start recording data: + +`xperf -on latency -stackwalk profile` + +\"Latency\" is a special provider name that turns on a few predefined +kernel providers; run \"xperf -providers k\" to view a full list of +providers and groups. You can combine providers, e.g., \"xperf -on +DiagEasy+FILE_IO\". \"-stackwalk profile\" tells xperf to capture a +stack for each PROFILE event; you could also do \"-stackwalk +profile+file_io\" to capture a stack on each CPU profile tick and each +file I/O completion event. + +Stop: + +`xperf -d out.etl` + +View: + +`xperfview out.etl` + +The MSDN +\"[Quickstart](http://msdn.microsoft.com/en-us/library/ff190971%28v=VS.85%29.aspx){.external}\" +page goes over this in more detail, and also has good explanations of +how to use xperfview. I'm not going to repeat it here, because I'd be +using essentially the same screenshots, so go look there. + +The 'stack' view will give results similar to Shark. + +## Heap Profiling + +xperf has good tools for heap allocation profiling, but they have one +major limitation: you can't build with jemalloc and get heap events +generated. The stock Windows CRT allocator is horrible about +fragmentation, and causes memory usage to rise drastically even if only +a small fraction of that memory is in use. Despite this, +it's still a useful way to track allocations/deallocations. + +### Capturing Heap Data + +The \"-heap\" option is used to set up heap tracing. Firefox generates +lots of events, so you may want to play with the +BufferSize/MinBuffers/MaxBuffers options as well to ensure that you +don't get dropped events. Also, when recording the stack, I've found +that a heap trace is often missing module information (I believe this is +a bug in xperf). It's possible to get around that by doing a +simultaneous capture of non-heap data.
+ +To start a trace session, launching a new Firefox instance: + +`xperf -on base` + `xperf -start heapsession -heap -PidNewProcess "./firefox.exe -P test -no-remote" -stackwalk HeapAlloc+HeapRealloc -BufferSize 512 -MinBuffers 128 -MaxBuffers 512` + +To stop a session and merge the resulting files: + +`xperf -stop heapsession -d heap.etl` + `xperf -d main.etl` + `xperf -merge main.etl heap.etl result.etl` + +\"result.etl\" will contain your merged data; you can delete main.etl +and heap.etl. Note that it's possible to capture even more data for the +non-heap profile; for example, you might want to be able to correlate +heap events with performance data, so you can do +\"`xperf -on base -stackwalk profile`\". + +In the viewer, when summary data is viewed for heap events (Heap +Allocations Outstanding, etc. all lead to the same summary graphs), 3 +types of allocations are listed \-- AIFI, AIFO, AOFI. This is shorthand +for \"Allocated Inside, Freed Inside\", \"Allocated Inside, Freed +Outside\", \"Allocated Outside, Freed Inside\". These refer to the time +range that was selected for the summary graph; for example, something +that's in the AOFI category was allocated before the start of the +selected time range, but the free event happened inside. + +## Tips + +- In the summary views, the yellow bar can be dragged left and right + to change the grouping \-- for example, drag it to the left of the + Module column to have grouping happen only by process (stuff that's + to the left), so that you get symbols in order of weight, regardless + of what module they're in. +- Dragging the columns around will change grouping in various ways; + experiment to get the data that you're looking for. Also experiment + with turning columns on and off; removing a column will allow data + to be aggregated without considering that column's contributions. +- Disabling all but one core will make the numbers add up to 100%. + This can be done by running 'msconfig' and going to Advanced + Options from the \"Boot\" tab. + +## Building Firefox + +To get good data from a Firefox build, it is important to build with the +following options in your mozconfig: + +`export CFLAGS="-Oy-"` + `export CXXFLAGS="-Oy-"` + +This disables frame-pointer optimization, which lets xperf do a much +better job unwinding the stack. Traces can be captured fine without this +option (for example, from nightlies), but the stack information will not +be useful. + +`ac_add_options --enable-debug-symbols` + +This gives us symbols. + +## For More Information + +Microsoft's [documentation for xperf](http://msdn.microsoft.com/en-us/library/ff191077.aspx "http://msdn.microsoft.com/en-us/library/ff191077.aspx") +is pretty good; there is a lot of depth to this tool, and you should +look there for more details. diff --git a/docs/performance/profiling_with_zoom.md b/docs/performance/profiling_with_zoom.md new file mode 100644 index 0000000000..053fa0cbce --- /dev/null +++ b/docs/performance/profiling_with_zoom.md @@ -0,0 +1,5 @@ +# Profiling with Zoom + +Zoom is a profiler for Linux that is very similar to Shark. + +You can get the profiler from here: diff --git a/docs/performance/reporting_a_performance_problem.md b/docs/performance/reporting_a_performance_problem.md new file mode 100644 index 0000000000..efe4f09f9c --- /dev/null +++ b/docs/performance/reporting_a_performance_problem.md @@ -0,0 +1,94 @@ +# Reporting a Performance Problem + +This article will guide you in reporting a performance problem using the +built-in Gecko Profiler tool.
+ +## Enabling the Profiler toolbar button + +These steps only work in Firefox 75+. + +1. Visit [https://profiler.firefox.com/](https://profiler.firefox.com/) +2. Click on *Enable Profiler Menu Button* +3. The profiler toolbar button will show up in the top right of the URL + bar as a small stopwatch icon. + +![image1](img/reportingperf1.png) + +4. You can right-click on the button and remove it from the toolbar + when you're done with it. + +## Using the Profiler + +When enabled, the profiler toolbar button is not recording by default. +To record, click on the toolbar icon to open its panel, +choose an appropriate setting for the recording (if you're +not sure, choose Firefox Platform), and then choose **Start +Recording**. The toolbar icon turns blue when it is recording. + +The profiler uses a fixed-size buffer to store sample data. When it runs +out of space in its buffer, it discards old entries, so you may want to +increase the buffer size if you find you are unable to capture the +profile quickly enough after you notice a performance problem. If you +choose Custom Settings (and then click Edit Settings) for the +profiler, you can adjust the size of the buffer (presently defaults to +90 MB) and the time interval between data collection (presently defaults +to 1 ms). Note that increasing the buffer size uses more memory and can +make capturing a profile take longer. + +![image2](img/reportingperf2.png) + +Using the keyboard shortcuts is often more convenient than using the +mouse to interact with the UI: + +* Ctrl+Shift+1 - Start/Stop the profiler +* Ctrl+Shift+2 - Take a profile and launch the viewer to view it + +## Capturing and sharing a profile + +1. While the profiler is recording, reproduce the performance problem. + If possible, let the problem manifest itself for 5-10 seconds. +2. Press **Ctrl+Shift+2** or click on the profiler toolbar icon in the + top right and select **Capture**. Try to do this within a few + seconds of reproducing the performance problem, as only the last + few seconds are recorded. If the timeline has a large red block, + that's a good sign. ![Jank markers appearing in the Perf.html profile analysis tool.](img/PerfDotHTMLRedLines.png) +3. The data will open in a new tab. Wait until the \"Symbolicating call + stacks\" notification disappears before sharing the profile. +4. There will be a button in the top right labeled **Upload Local Profile**, which + will allow you to upload this profile; once the upload completes, a link + will be written out. Before uploading, the Upload button asks you what data + you'd like to publish to our servers. +5. Note that while it's possible to strip profiles of potentially + privacy-sensitive information, the less information a profile + contains, *the harder it is to analyze and turn into actionable + data.* +6. Once uploaded, copy the permalink URL to your clipboard by right-clicking + it, and [add the profile URL to a bug](https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Performance) + for your performance problem and/or send it to the appropriate + person. Try to give some context about what you were doing when the + performance problem arose, such as the URL you were viewing and what + actions you were performing (e.g. scrolling on gmail.com). + +![image3](img/reportingperf3.png) + +## Viewing addon performance in GeckoView + +Sometimes one or more addons are slowing down Firefox. These addons might +be using the extension API in ways it was not meant to be used.
You can see +which of these addons are causing problems by adding the +**moz-extension** filter. + +![moz-extension filter print screen](img/EJCrt4N.png) + +Make sure you are selecting the process that is using up the CPU since +all of the processes are shown. You might have a content process using +up the CPU and not the main one. + +Make sure you are doing whatever it is that slows down Firefox while +recording the profile. For example, you might have one addon that slows down page load +and another one that slows down tab switching. + +Your first reflex once you find which addon is slowing down Firefox +might be to disable it and search for alternatives. Before you do this, +please share the performance profile with the addon authors through a +bug report. The Gecko Profiler allows you to share a link to the profile. diff --git a/docs/performance/scroll-linked_effects.md b/docs/performance/scroll-linked_effects.md new file mode 100644 index 0000000000..90d3c33ed1 --- /dev/null +++ b/docs/performance/scroll-linked_effects.md @@ -0,0 +1,177 @@ +# Scroll-linked effects + +A scroll-linked effect is an effect implemented on a +webpage where something changes based on the scroll position, for +example updating a positioning property with the aim of producing a +parallax scrolling effect. This article discusses scroll-linked effects, +their effect on performance, related tools, and possible mitigation +techniques. + +## Scrolling effects explained + +Often scrolling effects are implemented by listening for the `scroll` +event and then updating elements on the page in some way (usually the +CSS +[`position`](https://developer.mozilla.org/en-US/docs/Web/CSS/position "The position CSS property sets how an element is positioned in a document. The top, right, bottom, and left properties determine the final location of positioned elements.") +or +[`transform`](https://developer.mozilla.org/en-US/docs/Web/CSS/transform "The transform CSS property lets you rotate, scale, skew, or translate an element. It modifies the coordinate space of the CSS visual formatting model.") +property). You can find a sampling of such effects at [CSS Scroll API: +Use +Cases](https://github.com/RByers/css-houdini-drafts/blob/master/css-scroll-api/UseCases.md). + +These effects work well in browsers where the scrolling is done +synchronously on the browser's main thread. However, most browsers now +support some sort of asynchronous scrolling in order to provide a +consistent 60 frames per second experience to the user. In the +asynchronous scrolling model, the visual scroll position is updated in +the compositor thread and is visible to the user before the `scroll` +event is updated in the DOM and fired on the main thread. This means +that the effects implemented will lag a little bit behind what the user +sees the scroll position to be. This can cause the effect to be laggy, +janky, or jittery --- in short, something we want to avoid. + +Below are a couple of examples of effects that would not work well with +asynchronous scrolling, along with equivalent versions that would work +well: + +### Example 1: Sticky positioning + +Here is an implementation of a sticky-positioning effect, where the +\"toolbar\" div will stick to the top of the screen as you scroll down. + +```html + +<div
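+     id="toolbar">Toolbar</div>
+<div id="content">Long scrollable content goes here...</div>
+
+<!-- NOTE: the original example code was truncated in this copy of the
+     document. The markup above and the style/script below are a minimal
+     illustrative sketch of the technique described in the surrounding
+     text; the ids, styles and offsets are invented. -->
+<style>
+  #toolbar { position: absolute; top: 0; left: 0; width: 100%; background: #eee; }
+  #content { height: 3000px; }
+</style>
+<script>
+  // Reposition the "toolbar" div on every scroll event so that it appears
+  // to stay fixed at the top of the screen as the user scrolls down.
+  window.addEventListener("scroll", function () {
+    document.getElementById("toolbar").style.top = window.scrollY + "px";
+  });
+</script>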
+ +``` + +This implementation of sticky positioning relies on the scroll event +listener to reposition the "toolbar" div. As the scroll event listener +runs in JavaScript on the browser's main thread, it will be +asynchronous relative to the user-visible scrolling. Therefore, with +asynchronous scrolling, the event handler will be delayed relative to +the user-visible scroll, and so the div will not stay visually fixed as +intended. Instead, it will move with the user's scrolling, and then +\"snap\" back into position when the scroll event handler runs. This +constant moving and snapping will result in a jittery visual effect. One +way to implement this without the scroll event listener is to use the +CSS `position: sticky` feature, which is designed for this purpose: + +```html + +<div
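+     id="toolbar">Toolbar</div>
+<div id="content">Long scrollable content goes here...</div>
+
+<!-- NOTE: the original example code was truncated in this copy of the
+     document. The markup above and the style below are a minimal
+     illustrative sketch using position: sticky; the ids and styles are
+     invented. -->
+<style>
+  /* The browser keeps the toolbar pinned to the top of the scrollport,
+     with no main-thread JavaScript involved. */
+  #toolbar { position: sticky; top: 0; background: #eee; }
+  #content { height: 3000px; }
+</style>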
+ +``` + +This version works well with asynchronous scrolling because the position of +the \"toolbar\" div is updated by the browser as the user scrolls. + +### Example 2: Scroll snapping + +Below is an implementation of scroll snapping, where the scroll position +snaps to a particular destination when the user's scrolling stops near +that destination. + +```html + + +<div
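+     id="snaptarget">Snap target</div>
+
+<!-- NOTE: the original example code was truncated in this copy of the
+     document. The script below is a minimal illustrative sketch of
+     main-thread scroll snapping; the ids, thresholds and timings are
+     invented. -->
+<script>
+  // Animate the scroll position ourselves, on the main thread, with
+  // requestAnimationFrame; this is the kind of animation that other
+  // running JavaScript can interrupt.
+  function snapTo(targetY) {
+    var startY = window.scrollY;
+    var startTime = null;
+    function step(now) {
+      if (startTime === null) startTime = now;
+      var t = Math.min((now - startTime) / 200, 1); // 200 ms animation
+      window.scrollTo(0, startY + (targetY - startY) * t);
+      if (t < 1) requestAnimationFrame(step);
+    }
+    requestAnimationFrame(step);
+  }
+
+  var timer = null;
+  window.addEventListener("scroll", function () {
+    clearTimeout(timer);
+    // Treat 200 ms without a scroll event as "scrolling has stopped".
+    timer = setTimeout(function () {
+      var top = document.getElementById("snaptarget").getBoundingClientRect().top;
+      // If we stopped within 200 pixels of the top of the div, snap to it.
+      if (top !== 0 && Math.abs(top) < 200) {
+        snapTo(window.scrollY + top);
+      }
+    }, 200);
+  });
+</script>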
+ +``` + +In this example, there is a scroll event listener which detects if the +scroll position is within 200 pixels of the top of the \"snaptarget\" +div. If it is, then it triggers an animation to \"snap\" the scroll +position to the top of the div. As this animation is driven by +JavaScript on the browser's main thread, it can be interrupted by other +JavaScript running in other tabs or other windows. Therefore, the +animation can end up looking janky and not as smooth as intended. +Instead, using the CSS snap-points property will allow the browser to +run the animation asynchronously, providing a smooth visual effect to +the user. + +```html + + +
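+     id="snaptarget" class="snap">Snap target</div>
+
+<!-- NOTE: the original example code was truncated in this copy of the
+     document. The style below is a minimal illustrative sketch using the
+     current CSS scroll snap properties (rather than the older snap-points
+     syntax mentioned in the text); the ids, class names and styles are
+     invented. -->
+<style>
+  html {
+    /* Snap vertically when scrolling ends near a snap point. */
+    scroll-snap-type: y proximity;
+  }
+  .snap {
+    /* Align the top of this element with the top of the scrollport. */
+    scroll-snap-align: start;
+  }
+</style>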
+ +``` + +This version can work smoothly in the browser even if there is +slow-running Javascript on the browser's main thread. + +### Other effects + +In many cases, scroll-linked effects can be reimplemented using CSS and +made to run on the compositor thread. However, in some cases the current +APIs offered by the browser do not allow this. In all cases, however, +Firefox will display a warning to the developer console (starting in +version 46) if it detects the presence of a scroll-linked effect on a +page. Pages that use scrolling effects without listening for scroll +events in JavaScript will not get this warning. See the [Asynchronous +scrolling in Firefox](https://staktrace.com/spout/entry.php?id=834) blog +post for some more examples of effects that can be implemented using CSS +to avoid jank. + +## Future improvements + +Going forward, we would like to support more effects in the compositor. +In order to do so, we need you (yes, you!) to tell us more about the +kinds of scroll-linked effects you are trying to implement, so that we +can find good ways to support them in the compositor. Currently there +are a few proposals for APIs that would allow such effects, and they all +have their advantages and disadvantages. The proposals currently under +consideration are: + +- [Web Animations](https://w3c.github.io/web-animations/): A new API + for precisely controlling web animations in JavaScript, with an + [additional + proposal](https://wiki.mozilla.org/Platform/Layout/Extended_Timelines) + to map scroll position to time and use that as a timeline for the + animation. +- [CompositorWorker](https://docs.google.com/document/d/18GGuTRGnafai17PDWjCHHAvFRsCfYUDYsi720sVPkws/edit?pli=1#heading=h.iy9r1phg1ux4): + Allows JavaScript to be run on the compositor thread in small + chunks, provided it doesn't cause the framerate to drop. +- [Scroll + Customization](https://docs.google.com/document/d/1VnvAqeWFG9JFZfgG5evBqrLGDZYRE5w6G5jEDORekPY/edit?pli=1): + Introduces a new API for content to dictate how a scroll delta is + applied and consumed. As of this writing, Mozilla does not plan to + support this proposal, but it is included for completeness. + +### Call to action + +If you have thoughts or opinions on: + +- Any of the above proposals in the context of scroll-linked effects. +- Scroll-linked effects you are trying to implement. +- Any other related issues or ideas. + +Please get in touch with us! You can join the discussion on the +[public-houdini](https://lists.w3.org/Archives/Public/public-houdini/) +mailing list. diff --git a/docs/performance/sorting_algorithms_comparison.md b/docs/performance/sorting_algorithms_comparison.md new file mode 100644 index 0000000000..8450d116e0 --- /dev/null +++ b/docs/performance/sorting_algorithms_comparison.md @@ -0,0 +1,52 @@ +# Sorting algorithms comparison + +This program compares the performance of three different sorting +algorithms: + +- bubble sort +- selection sort +- quicksort + +It consists of the following functions: + + ----------------------- --------------------------------------------------------------------------------------------------- + **`sortAll()`** Top-level function. Iteratively (200 iterations) generates a randomized array and calls `sort()`. + **`sort()`** Calls each of `bubbleSort()`, `selectionSort()`, `quickSort()` in turn and logs the result. + **`bubbleSort()`** Implements a bubble sort, returning the sorted array. + **`selectionSort()`** Implements a selection sort, returning the sorted array. 
+ **`quickSort()`** Implements quicksort, returning the sorted array. + **`swap()`** Helper function for `bubbleSort()` and `selectionSort()`. + **`partition()`** Helper function for `quickSort()`. + ----------------------- --------------------------------------------------------------------------------------------------- + +Its call graph looks like this: + + sortAll() // (generate random array, then call sort) x 200 + + -> sort() // sort with each algorithm, log the result + + -> bubbleSort() + + -> swap() + + -> selectionSort() + + -> swap() + + -> quickSort() + + -> partition() + +The implementations of the sorting algorithms in the program are taken +from and are +used under the MIT license. + +You can try out the example program +[here](https://mdn.github.io/performance-scenarios/js-call-tree-1/index.html) +and clone the code [here](https://github.com/mdn/performance-scenarios) +(be sure to check out the gh-pages branch). You can also [download the +specific profile we +discuss](https://github.com/mdn/performance-scenarios/tree/gh-pages/js-call-tree-1/profile) +- just import it into the Performance tool if you want to follow along. Of +course, you can generate your own profile, too, but the numbers will be +a little different. diff --git a/docs/performance/timerfirings_logging.md b/docs/performance/timerfirings_logging.md new file mode 100644 index 0000000000..dfbe8dca93 --- /dev/null +++ b/docs/performance/timerfirings_logging.md @@ -0,0 +1,136 @@ +# TimerFirings Logging + +TimerFirings logging is a feature built into Gecko that prints a line of +data for every timer fired. This is useful because timer firings are a +major cause of wakeups, which can cause high power consumption. + +**Note**: The [power profiling +overview](power_profiling_overview.md) +is worth reading at this point if you haven\'t already. It may make +parts of this document easier to understand. + +## Invocation + +TimerFirings logging uses Gecko\'s own logging mechanism, and so can +be used in any build. Set the following environment variable to +enable it. + + NSPR_LOG_MODULES=TimerFirings:4 + +## Output + +Once enabled, TimerFirings will print one line of logging output per +timer fired. It\'s best to redirect this output to a file. + +The following sample shows the basics of this output. + + -991946880[7f46c365ba00]: [6775] fn timer (SLACK 100 ms): LayerActivityTracker + -991946880[7f46c365ba00]: [6775] fn timer (ONE_SHOT 250 ms): PresShell::sPaintSuppressionCallback + -991946880[7f46c365ba00]: [6775] fn timer (ONE_SHOT 160 ms): nsBrowserStatusFilter::TimeoutHandler + -991946880[7f46c365ba00]: [6775] iface timer (ONE_SHOT 200 ms): 7f46964d7f80 + -1340643584[7f46c365ec00]: [6775] obs timer (SLACK 1000 ms): 7f46a95a0200 + +Each line has the following information. + +- The first two values identify the thread. This is not especially + useful. +- The next value is the process ID (pid). This is useful in a + multi-process scenario. +- Next is the timer kind, one of `fn` (function), `iface` (interface) + or `obs` (observer), which are the three kinds of timers that Gecko + supports. +- Then comes the function kind, one of `ONE_SHOT` (a single-use + timer), `SLACK` or `PRECISE` (repeating timers of differing + precision). +- Then comes the timer period, measured in milliseconds. +- Finally there is the identifying information for the timer. Function + timers have an informative label.
Interface and observer timers only + have an address, which is less useful, but they are uncommon enough + that this usually doesn\'t matter much. + +The above example shows only timers from C++ within Gecko. There are +also timers for `setTimeout` or `setInterval` calls in JavaScript code, as +the following sample shows. + + -991946880[7f46c365ba00]: [6775] fn timer (ONE_SHOT 0 ms): [content] chrome://browser/content/tabbrowser.xml:1816:0 + 711637568[7f3219c48000]: [6835] fn timer (ONE_SHOT 100 ms): [content] http://edition.cnn.com/:5:7231 + 711637568[7f3219c48000]: [6835] fn timer (ONE_SHOT 100 ms): [content] http://a.visualrevenue.com/vrs.js:6:9423 + +These JS timers are annotated with `[content]` and show the JavaScript +source location where they were created. They can come from chrome code +within Firefox, or from web content. + +The informative labels are only present on function timers that have had +their creation site annotated. For unannotated function timers, there +are three possible behaviours. + +First, on Mac the code uses `dladdr` to get the name immediately, and +the output will look like the following. + + 2082435840[100445640]: [81190] fn timer (ONE_SHOT 8000 ms): [from dladdr] gfxFontInfoLoader::DelayedStartCallback(nsITimer*, void*) + +Second, on Linux the code uses `dladdr` to get the symbol library and +address, which can be post-processed by `tools/rb/fix_stacks.py`. The +following two lines show the output before and after being +post-processed by that script. + + 2088737280[7f606bf68140]: [30710] fn timer (ONE_SHOT 16 ms): [from dladdr] #0: ???[/home/njn/moz/mi1/o64/dist/bin/libxul.so +0x2144f94] + 2088737280[7f606bf68140]: [30710] fn timer (ONE_SHOT 16 ms): [from dladdr] #0: mozilla::RefreshDriverTimer::TimerTick(nsITimer*, void*) (/home/njn/moz/mi1/o64/layout/b + +Third, on other platforms `dladdr` is not implemented or doesn\'t work +well, and the output will look like the following. + + 711637568[7f3219c48000]: [6835] fn timer (ONE_SHOT 16 ms): ???[dladdr is unimplemented or doesn't work well on this OS] + +The `???` indicates that the function timer lacks an explicit name, and +the comment within the square brackets explains why the fallback +mechanism wasn\'t used. + +If an unannotated timer function appears frequently, it is worth +explicitly annotating it so that it will be usefully identified on other +platforms. (Running on Mac or Linux is obviously necessary to learn the +timer function\'s name.) This is done by initializing it with +`initWithNamedFuncCallback` or `initWithNameableFuncCallback` instead of +`initWithFuncCallback`. + +## Post-processing + +TimerFirings logging quickly produces thousands of lines of output. This +output needs post-processing for it to be useful. If the output is +redirected to a file called *`out`*, then the following command will +pull out the timer-related lines, count how many times each unique line +appears, and then print them with the most common ones first. + + cat out | grep timer | sort | uniq -c | sort -r -n + +The following is sample output from this command.
+ + 204 801266240[7f7c1f248000]: [7163] fn timer (ONE_SHOT 50 ms): [content] http://widgets.outbrain.com/outbrain.js:20:330 + 135 -495057024[7f74e105ba00]: [7108] fn timer (ONE_SHOT 4 ms): [content] https://self-repair.mozilla.org/en-US/repair/:7:13669 + 118 801266240[7f7c1f248000]: [7163] fn timer (ONE_SHOT 100 ms): [content] http://a.visualrevenue.com/vrs.js:6:9423 + 103 801266240[7f7c1f248000]: [7163] fn timer (ONE_SHOT 50 ms): [content] http://static.dynamicyield.com/scripts/12086/dy-min.js?v=12086:3:3389 + 94 801266240[7f7c1f248000]: [7163] fn timer (ONE_SHOT 50 ms): [content] https://ad.double-click.net/ddm/adi/N7921.1283839CADREON.COM.AU/B9038144.122190976;sz=300x600;click=http://pixel.mathtag.com/click/img?mt_aid=2744535504761193354&mt_id=1895890&mt_adid=148611&mt_sid=973379&mt_exid=9&mt_inapp=0&mt_uuid=353d5460-19f6-4400-9bbd-d0fcc3bcf595&mt_3pck=http%3A//beacon-apac-hkg1.rubiconproject.com/beacon/t/d1f9921d-4e47-448f-b6ba-36cae1c31b65/&redirect=;ord=2744535504761193354?:83:0 + 94 801266240[7f7c1f248000]: [7163] fn timer (ONE_SHOT 160 ms): nsBrowserStatusFilter::TimeoutHandler + 92 -495057024[7f74e105ba00]: [7108] fn timer (ONE_SHOT 160 ms): nsBrowserStatusFilter::TimeoutHandler + +The first column shows how many times the particular line appeared. + +It is sometimes useful to pre-process the output by stripping out +certain parts of each line before doing this aggregation step, for +example, by inserting one or more of the following commands into the +command pipeline. + + sed 's/^[^:]\+: //' # strip thread IDs + sed 's/\[[0-9]\+\] //' # strip process IDs + sed 's/ \+[0-9]\+ ms//' # strip timer periods + +The following is the previous sample output with all three of these +commands added into the pipeline. + + 204 fn timer (ONE_SHOT): [content] http://widgets.outbrain.com/outbrain.js:20:330 + 186 fn timer (ONE_SHOT): nsBrowserStatusFilter::TimeoutHandler + 138 fn timer (ONE_SHOT): [content] https://self-repair.mozilla.org/en-US/repair/:7:13669 + 118 fn timer (ONE_SHOT): [content] http://a.visualrevenue.com/vrs.js:6:9423 + 108 fn timer (SLACK): LayerActivityTracker + 104 fn timer (SLACK): nsIDocument::SelectorCache + 104 fn timer (SLACK): CCTimerFired diff --git a/docs/performance/tools_power_rapl.md b/docs/performance/tools_power_rapl.md new file mode 100644 index 0000000000..3bf2555bd6 --- /dev/null +++ b/docs/performance/tools_power_rapl.md @@ -0,0 +1,113 @@ +# tools/power/rapl + +`tools/power/rapl` (or `rapl` for short) is a command-line utility in +the Mozilla tree that periodically reads and prints all available Intel +RAPL power estimates. These are machine-wide estimates, so if you want +to estimate the power consumption of a single program you should +minimize other activity on the machine while measuring. + +**Note**: The [power profiling overview](power_profiling_overview.md) is +worth reading at this point if you haven't already. It may make parts +of this document easier to understand. + +## Invocation + +First, do a [standard build of Firefox](/setup/index.rst). + +### Mac + +On Mac, `rapl` can be run as follows. + +```bash +$OBJDIR/dist/bin/rapl +``` + +### Linux + +On Linux, `rapl` can be run as root, as follows. + + sudo $OBJDIR/dist/bin/rapl + +Alternatively, it can be run without root privileges by setting the +contents of +[/proc/sys/kernel/perf_event_paranoid](http://unix.stackexchange.com/questions/14227/do-i-need-root-admin-permissions-to-run-userspace-perf-tool-perf-events-ar) +to 0. 
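+For example, one way to do this is:
+
+```bash
+echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
+```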
Note that if you do change this file, its contents may reset when +the machine is next rebooted. + +You must be running Linux kernel version 3.14 or later for `rapl` to +work. Otherwise, it will fail with an error message explaining this +requirement. + +### Windows + +Unfortunately, `rapl` does not work on Windows, and porting it would be +difficult because Windows does not have APIs that allow easy access to +the relevant model-specific registers. + +## Output + +The following is 10 seconds of output from a default invocation of +`rapl`. + +```bash + total W = _pkg_ (cores + _gpu_ + other) + _ram_ W +#01 5.17 W = 1.78 ( 0.12 + 0.10 + 1.56) + 3.39 W +#02 9.43 W = 5.44 ( 1.44 + 1.20 + 2.80) + 3.98 W +#03 14.26 W = 10.21 ( 5.47 + 0.19 + 4.55) + 4.04 W +#04 10.02 W = 6.15 ( 2.62 + 0.43 + 3.10) + 3.86 W +#05 14.63 W = 10.43 ( 4.41 + 0.81 + 5.22) + 4.19 W +#06 11.16 W = 6.90 ( 1.91 + 1.68 + 3.31) + 4.26 W +#07 5.40 W = 1.97 ( 0.20 + 0.10 + 1.67) + 3.44 W +#08 5.17 W = 1.76 ( 0.07 + 0.08 + 1.60) + 3.41 W +#09 5.17 W = 1.76 ( 0.09 + 0.08 + 1.58) + 3.42 W +#10 8.13 W = 4.40 ( 1.55 + 0.11 + 2.74) + 3.73 W +``` + +Things to note include the following. + +- All measurements are in Watts. +- The first line indicates the meaning of each column. +- The underscores in `_pkg_`, `_gpu_` and `_ram_` are present so that + each column's name has five characters. +- The total power is the sum of the package power and the RAM power. +- The package estimate is divided into three parts: cores, GPU, and + \"other\". \"Other\" is computed as the package power minus the + cores power and GPU power. +- If the processor does not support GPU or RAM estimates then + \"` n/a `\" will be printed in the relevant column instead of a + number, and it will contribute zero to the total. + +Once sampling is finished --- either because the user interrupted it, or +because the requested number of samples has been taken --- the following +summary data is shown: + +```bash +10 samples taken over a period of 10.000 seconds + +Distribution of 'total' values: + mean = 8.85 W + std dev = 3.50 W + 0th percentile = 5.17 W (min) + 5th percentile = 5.17 W + 25th percentile = 5.17 W + 50th percentile = 8.13 W + 75th percentile = 11.16 W + 95th percentile = 14.63 W +100th percentile = 14.63 W (max) +``` + +The distribution data is omitted if there was zero or one samples taken. + +## Options + +- `-i --sample-interval`. The length of each sample in milliseconds. + Defaults to 1000. A warning is given if you set it below 50 because + that is likely to lead to inaccurate estimates. +- `-n --sample-count`. The number of samples to take. The default is + 0, which is interpreted as \"unlimited\". + +## Combining with `powermetrics` + +On Mac, you can use the [mach power](powermetrics.md#mach-power) command +to run `rapl` in combination with `powermetrics` in a way that gives the +most useful summary measurements for each of Firefox, Chrome and Safari. diff --git a/docs/performance/turbostat.md b/docs/performance/turbostat.md new file mode 100644 index 0000000000..3eac89c086 --- /dev/null +++ b/docs/performance/turbostat.md @@ -0,0 +1,50 @@ +# Turbostat + +`turbostat` is a Linux command-line utility that prints various +measurements, including numerous per-CPU measurements. This article +provides an introduction to using it. + +**Note**: The [power profiling overview](power_profiling_overview.md) is +worth reading at this point if you haven't already. It may make parts +of this document easier to understand. 
+ +## Invocation + +`turbostat` must be invoked as the super-user: + +```bash +sudo turbostat +``` + +If you get an error saying `"turbostat: no /dev/cpu/0/msr"`, you need to +run the following command: + +```bash +sudo modprobe msr +``` + +The output is as follows: + +``` + Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt + - - 799 21.63 3694 3398 0 12.02 3.16 1.71 61.48 49 49 0.00 0.00 0.00 0.00 22.68 15.13 1.13 + 0 0 821 22.44 3657 3398 0 9.92 2.43 2.25 62.96 39 49 0.00 0.00 0.00 0.00 22.68 15.13 1.13 + 0 4 708 19.14 3698 3398 0 13.22 + 1 1 743 20.26 3666 3398 0 21.40 4.01 1.42 52.90 49 + 1 5 1206 31.98 3770 3398 0 9.69 + 2 2 784 21.29 3681 3398 0 11.78 3.10 1.13 62.70 40 + 2 6 782 21.15 3698 3398 0 11.92 + 3 3 702 19.14 3670 3398 0 8.39 3.09 2.03 67.36 39 + 3 7 648 17.67 3667 3398 0 9.85 +``` + +The man page has good explanations of what each column measures. The +various "Watt" measurements come from the Intel RAPL MSRs. + +If you run with the `-S` option you get a smaller range of measurements +that fit on a single line, like the following: + +``` + Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt + 3614 97.83 3694 3399 0 2.17 0.00 0.00 0.00 77 77 0.00 0.00 0.00 0.00 67.50 57.77 0.46 +``` -- cgit v1.2.3