From 26a029d407be480d791972afb5975cf62c9360a6 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Fri, 19 Apr 2024 02:47:55 +0200 Subject: Adding upstream version 124.0.1. Signed-off-by: Daniel Baumann --- testing/docs/ci-configs/index.md | 64 +++++++++++++++++++++++++++++++++++++ testing/docs/ci-configs/schedule.md | 50 +++++++++++++++++++++++++++++ 2 files changed, 114 insertions(+) create mode 100644 testing/docs/ci-configs/index.md create mode 100644 testing/docs/ci-configs/schedule.md (limited to 'testing/docs/ci-configs') diff --git a/testing/docs/ci-configs/index.md b/testing/docs/ci-configs/index.md new file mode 100644 index 0000000000..9987cec7e1 --- /dev/null +++ b/testing/docs/ci-configs/index.md @@ -0,0 +1,64 @@ +# Configuration Changes + +This process outlines how Mozilla will handle configuration changes. For a list of configuration changes, please see the [schedule](schedule.md) + +## Infrastructure setup (2-4 weeks) + +This is behind the scenes, when there is a need for a configuration change (upgrade or addition of a new platform), the first step +is to build a machine and work to get the OS working with taskcluster. This is work for hardware/cloud is done by IT. Sometimes +this is as simple as installing a package or changing an OS setting on an existing machine, but this requires automation and documentation. + +In some cases there is little to no work as the CI change is running tests with different runtime settings (environment variables or preferences). + + +## Setting up a pool on try server (1 week) + +The next step is getting some machines available on try server. This is where we add some code in tree to support the new config +(a new worker type, test variant, etc.) and validate any setup done by IT works with taskcluster client. Then Releng ensures the target tests +can run at a basic level (mozharness, testharness, os environment, logging, something passes). + + +## Green up tests (1 week) + +This is a stage where Releng will run all the target tests on try server and disable, skip, fail-if all tests that are not passing or frequently +intermittent. Typically there are a dozen or so iterations of this because a crash on one test means we don't run the rest of the tests in the +manifest. + + +## Turn on new config as tier-2 (1/2 week) + +We will time this at the start of a new release. + +Releng will land changes to manifests for all non passing tests and then schedule the new jobs by default. This will be tier-2 for a couple reasons: + * it is a new config with a lot of tests that still need attention + * in many cases there is a previous config (lets say upgrading windows 10 from 1803 -> 1903) which is still running in parallel as tier-1 + +This will now run on central and integration and be available on try server. In a few cases where there are limited machines (android phones), +there will be needs to turn off the old config, or make the try server access hidden behind `./mach try --full` + + +## Turn on new backstop jobs which run the skipped tests (1/2 week) + +Releng will turn on a new temporary job that will run the tests which are not green by default. These will run as tier-2 on mozilla-central and be sheriffed. + +The goal here is to find tests that are now passing and should be run by default. By doing this we are effectively running all the tests instead of +disabling dozens of tests and forgetting about them. + + +## Handoff to developers (1 week) + +Releng will file bugs for all failing tests (one bug per manifest) and needinfo the triage owner to raise awareness that one or more tests in their area need +attention. At this point, Releng is done and will move onto other work. Developers can reproduce the failures on try server and when fixed edit the manifest +as appropriate. + +There will be at least 6 weeks to investigate and fix the tests before they are promoted to tier-1. + + +## move config to tier-1 (6-7 weeks later) + +After the config has been running as tier-2 makes it to beta and then to the release branch (i.e. 2 new releases later), Releng will: + * turn off the old tier-1 tests (if applicable) + * promote the tier-2 jobs to tier-1 + * turn off the backstop jobs + +This allows developers to schedule time in a 6 weeks period to investigate and fix any test failures. diff --git a/testing/docs/ci-configs/schedule.md b/testing/docs/ci-configs/schedule.md new file mode 100644 index 0000000000..6b95b589ab --- /dev/null +++ b/testing/docs/ci-configs/schedule.md @@ -0,0 +1,50 @@ +# Schedule + +For each CI config change, we need to follow: + * scope of work (what will run, how frequently) + * capacity planning (cost, physical space limitations) + * will this replace anything or is this 100% new + * puppet/deployment scripts or documentation + * setup pool on try server + * documented updated on this page, communicate with release management and others as appropriate + + +## Current / Future CI config changes + +Start Date | Completed | Tracking Bug | Description +--- | --- | --- | --- +TBD | TBD | TBD | Upgrade Ubuntu 18.04 -> Ubuntu 22.04 X11 +TBD | TBD | TBD | Add Ubuntu 22.04 Wayland +TBD | TBD | TBD | Upgrade Mac M1 from 11.2.3 -> 13.2.1 +TBD | TBD | TBD | replace 2017 acer perf laptops with lower end NUCs +TBD | TBD | TBD | replace windows moonshots with mid level NUCs +TBD | TBD | TBD | Upgrade android emulators to modern version + + +## Completed CI config changes + +Start Date | Completed | Tracking Bug | Description +--- | --- | --- | --- +October 2022 | March 2023 | [Bug 1794900](https://bugzilla.mozilla.org/show_bug.cgi?id=1794900) | Migrate from win10 -> win11 +November 2022 | February 2023 | [Bug 1804790](https://bugzilla.mozilla.org/show_bug.cgi?id=1804790) | Migrate Win7 unittests from AWS -> Azure +October 2022 | February 2023 | [Bug 1794895](https://bugzilla.mozilla.org/show_bug.cgi?id=1794895) | Migrate unittests from pixel2 -> pixel5 +November 2020 | August 2021 | [Bug 1676850](https://bugzilla.mozilla.org/show_bug.cgi?id=1676850) | Windows tests migrate from AWS -> Datacenter/Azure and 1803 -> 20.04 +May 2022 | July 2022 | [Bug 1767486](https://bugzilla.mozilla.org/show_bug.cgi?id=1767486) | Migrate perftests from Moto G5 phones to Samsung A51 phones +March 2021 | October 2021 | [Bug 1699541](https://bugzilla.mozilla.org/show_bug.cgi?id=1699541) | Migrate from OSX 10.14 -> 10.15 +July 2020 | March 2021 | [Bug 1572739](https://bugzilla.mozilla.org/show_bug.cgi?id=1572739) | upgrade datacenter linux perf machines from ubuntu 16.04 to 18.04 +September 2020 | January 2021 | [Bug 1665012](https://bugzilla.mozilla.org/show_bug.cgi?id=1665012) | Android phones upgrade from version 7 -> 10 +October 2020 | February 2021 | [Bug 1673067](https://bugzilla.mozilla.org/show_bug.cgi?id=1673067) | Run tests on MacOSX Aarch64 (subset in parallel) +September 2020 | March 2021 | [Bug 1548264](https://bugzilla.mozilla.org/show_bug.cgi?id=1548264) | Python 2.7 -> 3.6 migration in CI +July 2020 | October 2020| [Bug 1653344](https://bugzilla.mozilla.org/show_bug.cgi?id=1653344) | Remove EDID dongles from MacOSX machines +August 2020 | September 2020 | [Bug 1643689](https://bugzilla.mozilla.org/show_bug.cgi?id=1643689) | Schedule tests by test selection/manifest +June 2020 | August 2020 | [Bug 1486004](https://bugzilla.mozilla.org/show_bug.cgi?id=1486004) | Android hardware tests running without rooted phones +August 2019 | January 2020 | [Bug 1572242](https://bugzilla.mozilla.org/show_bug.cgi?id=1572242) | Upgrade Ubuntu from 16.04 to 18.04 (finished in January) + + +## Appendix: + * *OS*: base operating system such as Android, Linux, Mac OSX, Windows + * *Hardware*: specific cpu/memory/disk/graphics/display/inputs that we are using, could be physical hardware we own or manage, or it could be a cloud provider. + * *Platform*: a combination of hardware and OS + * *Configuration*: what we change on a platform (can be runtime with flags), installed OS software updates (service pack), tools (python/node/etc.), hardware or OS settings (anti aliasing, display resolution, background processes, clipboard), environment variables, + * *Test Failure*: a test doesn’t report the expected result (if we expect fail and we crash, that is unexpected). Typically this is a failure, but it can be a timeout, crash, not run, or even pass + * *Greening up*: Assuming all tests return expected results (passing), they are green. When tests fail, they are orange. We need to find a way to get all tests green by investigating test failures. -- cgit v1.2.3