1 files changed, 522 insertions, 0 deletions
diff --git a/doc/dev/developer_guide/tests-integration-tests.rst b/doc/dev/developer_guide/tests-integration-tests.rst
new file mode 100644
index 000000000..c8e6dbcd4
--- /dev/null
+++ b/doc/dev/developer_guide/tests-integration-tests.rst
@@ -0,0 +1,522 @@
+.. _testing-integration-tests:
+
+Testing - Integration Tests
+===========================
+
+Ceph has two types of tests: :ref:`make check <make-check>` tests and integration tests.
+When a test requires multiple machines, root access or lasts for a
+longer time (for example, to simulate a realistic Ceph deployment), it
+is deemed to be an integration test. Integration tests are organized into
+"suites", which are defined in the `ceph/qa sub-directory`_ and run with
+the ``teuthology-suite`` command.
+
+The ``teuthology-suite`` command is part of the `teuthology framework`_.
+In the sections that follow we attempt to provide a detailed introduction
+to that framework from the perspective of a beginning Ceph developer.
+
+Teuthology consumes packages
+----------------------------
+
+It may take some time to understand the significance of this fact, but it
+is `very` significant. It means that automated tests can be conducted on
+multiple platforms using the same packages (RPM, DEB) that can be
+installed on any machine running those platforms.
+
+Teuthology has a `list of platforms that it supports
+<https://github.com/ceph/ceph/tree/master/qa/distros/supported>`_ (as
+of September 2020 the list consisted of "RHEL/CentOS 8" and "Ubuntu 18.04").  It
+expects to be provided pre-built Ceph packages for these platforms.
+Teuthology deploys these platforms on machines (bare-metal or
+cloud-provisioned), installs the packages on them, and deploys Ceph
+clusters on them - all as called for by the test.
+
+The Nightlies
+-------------
+
+A number of integration tests are run on a regular basis in the `Sepia
+lab`_ against the official Ceph repositories (on the ``master`` development
+branch and the stable branches). Traditionally, these tests are called "the
+nightlies" because the Ceph core developers used to live and work in
+the same time zone and from their perspective the tests were run overnight.
+
+The results of the nightlies are published at http://pulpito.ceph.com/. The
+developer nick shows in the
+test results URL and in the first column of the Pulpito dashboard.  The
+results are also reported on the `ceph-qa mailing list
+<https://ceph.com/irc/>`_ for analysis.
+
+Testing Priority
+----------------
+
+The ``teuthology-suite`` command includes an almost mandatory option ``-p <N>``
+which specifies the priority of the jobs submitted to the queue. The lower
+the value of ``N``, the higher the priority. The option is almost mandatory
+because the default is ``1000`` which matches the priority of the nightlies.
+Nightlies are often half-finished and cancelled due to the volume of testing
+done so your jobs may never finish. Therefore, it is common to select a
+priority less than 1000.
+
+Job priority should be selected based on the following recommendations:
+
+* **Priority < 10:** Use this if the sky is falling and some group of tests
+  must be run ASAP.
+
+* **10 <= Priority < 50:** Use this if your tests are urgent and blocking
+  other important development.
+
+* **50 <= Priority < 75:** Use this if you are testing a particular
+  feature/fix and running fewer than about 25 jobs. This range can also be
+  used for urgent release testing.
+
+* **75 <= Priority < 100:** Tech Leads will regularly schedule integration
+  tests with this priority to verify pull requests against master.
+
+* **100 <= Priority < 150:** This priority is to be used for QE validation of
+  point releases.
+
+* **150 <= Priority < 200:** Use this priority for 100 jobs or fewer of a
+  particular feature/fix that you'd like results on in a day or so.
+
+* **200 <= Priority < 1000:** Use this priority for large test runs that can
+  be done over the course of a week.
+
+In case you don't know how many jobs would be triggered by
+``teuthology-suite`` command, use ``--dry-run`` to get a count first and then
+issue ``teuthology-suite`` command again, this time without ``--dry-run`` and
+with ``-p`` and an appropriate number as an argument to it.
+
+To skip the priority check, use ``--force-priority``. In order to be sensitive
+to the runs of other developers who also need to do testing, please use it in
+emergency only.
+
+Suites Inventory
+----------------
+
+The ``suites`` directory of the `ceph/qa sub-directory`_ contains
+all the integration tests, for all the Ceph components.
+
+`ceph-deploy <https://github.com/ceph/ceph/tree/master/qa/suites/ceph-deploy>`_
+  install a Ceph cluster with ``ceph-deploy`` (:ref:`ceph-deploy man page <ceph-deploy>`)
+
+`dummy <https://github.com/ceph/ceph/tree/master/qa/suites/dummy>`_
+  get a machine, do nothing and return success (commonly used to
+  verify the :ref:`testing-integration-tests` infrastructure works as expected)
+
+`fs <https://github.com/ceph/ceph/tree/master/qa/suites/fs>`_
+  test CephFS mounted using kernel and FUSE clients, also with multiple MDSs.
+
+`krbd <https://github.com/ceph/ceph/tree/master/qa/suites/krbd>`_
+  test the RBD kernel module
+
+`powercycle <https://github.com/ceph/ceph/tree/master/qa/suites/powercycle>`_
+  verify the Ceph cluster behaves when machines are powered off
+  and on again
+
+`rados <https://github.com/ceph/ceph/tree/master/qa/suites/rados>`_
+  run Ceph clusters including OSDs and MONs, under various conditions of
+  stress
+
+`rbd <https://github.com/ceph/ceph/tree/master/qa/suites/rbd>`_
+  run RBD tests using actual Ceph clusters, with and without qemu
+
+`rgw <https://github.com/ceph/ceph/tree/master/qa/suites/rgw>`_
+  run RGW tests using actual Ceph clusters
+
+`smoke <https://github.com/ceph/ceph/tree/master/qa/suites/smoke>`_
+  run tests that exercise the Ceph API with an actual Ceph cluster
+
+`teuthology <https://github.com/ceph/ceph/tree/master/qa/suites/teuthology>`_
+  verify that teuthology can run integration tests, with and without OpenStack
+
+`upgrade <https://github.com/ceph/ceph/tree/master/qa/suites/upgrade>`_
+  for various versions of Ceph, verify that upgrades can happen
+  without disrupting an ongoing workload
+
+.. _`ceph-deploy man page`: ../../man/8/ceph-deploy
+
+teuthology-describe-tests
+-------------------------
+
+In February 2016, a new feature called ``teuthology-describe-tests`` was
+added to the `teuthology framework`_ to facilitate documentation and better
+understanding of integration tests (`feature announcement
+<http://article.gmane.org/gmane.comp.file-systems.ceph.devel/29287>`_).
+
+The upshot is that tests can be documented by embedding ``meta:``
+annotations in the yaml files used to define the tests. The results can be
+seen in the `ceph-qa-suite wiki
+<http://tracker.ceph.com/projects/ceph-qa-suite/wiki/>`_.
+
+Since this is a new feature, many yaml files have yet to be annotated.
+Developers are encouraged to improve the documentation, in terms of both
+coverage and quality.
+
+How integration tests are run
+-----------------------------
+
+Given that - as a new Ceph developer - you will typically not have access
+to the `Sepia lab`_, you may rightly ask how you can run the integration
+tests in your own environment.
+
+One option is to set up a teuthology cluster on bare metal. Though this is
+a non-trivial task, it `is` possible. Here are `some notes
+<http://docs.ceph.com/teuthology/docs/LAB_SETUP.html>`_ to get you started
+if you decide to go this route.
+
+If you have access to an OpenStack tenant, you have another option: the
+`teuthology framework`_ has an OpenStack backend, which is documented `here
+<https://github.com/dachary/teuthology/tree/openstack#openstack-backend>`__.
+This OpenStack backend can build packages from a given git commit or
+branch, provision VMs, install the packages and run integration tests
+on those VMs. This process is controlled using a tool called
+``ceph-workbench ceph-qa-suite``. This tool also automates publishing of
+test results at http://teuthology-logs.public.ceph.com.
+
+Running integration tests on your code contributions and publishing the
+results allows reviewers to verify that changes to the code base do not
+cause regressions, or to analyze test failures when they do occur.
+
+Every teuthology cluster, whether bare-metal or cloud-provisioned, has a
+so-called "teuthology machine" from which tests suites are triggered using the
+``teuthology-suite`` command.
+
+A detailed and up-to-date description of each `teuthology-suite`_ option is
+available by running the following command on the teuthology machine
+
+.. prompt:: bash $
+
+   teuthology-suite --help
+
+.. _teuthology-suite: http://docs.ceph.com/teuthology/docs/teuthology.suite.html
+
+How integration tests are defined
+---------------------------------
+
+Integration tests are defined by yaml files found in the ``suites``
+subdirectory of the `ceph/qa sub-directory`_ and implemented by python
+code found in the ``tasks`` subdirectory. Some tests ("standalone tests")
+are defined in a single yaml file, while other tests are defined by a
+directory tree containing yaml files that are combined, at runtime, into a
+larger yaml file.
+
+Reading a standalone test
+-------------------------
+
+Let us first examine a standalone test, or "singleton".
+
+Here is a commented example using the integration test
+`rados/singleton/all/admin-socket.yaml
+<https://github.com/ceph/ceph/blob/master/qa/suites/rados/singleton/all/admin-socket.yaml>`_
+
+.. code-block:: yaml
+
+      roles:
+      - - mon.a
+        - osd.0
+        - osd.1
+      tasks:
+      - install:
+      - ceph:
+      - admin_socket:
+          osd.0:
+            version:
+            git_version:
+            help:
+            config show:
+            config set filestore_dump_file /tmp/foo:
+            perf dump:
+            perf schema:
+
+The ``roles`` array determines the composition of the cluster (how
+many MONs, OSDs, etc.) on which this test is designed to run, as well
+as how these roles will be distributed over the machines in the
+testing cluster. In this case, there is only one element in the
+top-level array: therefore, only one machine is allocated to the
+test. The nested array declares that this machine shall run a MON with
+id ``a`` (that is the ``mon.a`` in the list of roles) and two OSDs
+(``osd.0`` and ``osd.1``).
+
+The body of the test is in the ``tasks`` array: each element is
+evaluated in order, causing the corresponding python file found in the
+``tasks`` subdirectory of the `teuthology repository`_ or
+`ceph/qa sub-directory`_ to be run. "Running" in this case means calling
+the ``task()`` function defined in that file.
+
+In this case, the `install
+<https://github.com/ceph/teuthology/blob/master/teuthology/task/install/__init__.py>`_
+task comes first. It installs the Ceph packages on each machine (as
+defined by the ``roles`` array). A full description of the ``install``
+task is `found in the python file
+<https://github.com/ceph/teuthology/blob/master/teuthology/task/install/__init__.py>`_
+(search for "def task").
+
+The ``ceph`` task, which is documented `here
+<https://github.com/ceph/ceph/blob/master/qa/tasks/ceph.py>`__ (again,
+search for "def task"), starts OSDs and MONs (and possibly MDSs as well)
+as required by the ``roles`` array. In this example, it will start one MON
+(``mon.a``) and two OSDs (``osd.0`` and ``osd.1``), all on the same
+machine. Control moves to the next task when the Ceph cluster reaches
+``HEALTH_OK`` state.
+
+The next task is ``admin_socket`` (`source code
+<https://github.com/ceph/ceph/blob/master/qa/tasks/admin_socket.py>`_).
+The parameter of the ``admin_socket`` task (and any other task) is a
+structure which is interpreted as documented in the task. In this example
+the parameter is a set of commands to be sent to the admin socket of
+``osd.0``. The task verifies that each of them returns on success (i.e.
+exit code zero).
+
+This test can be run with
+
+.. prompt:: bash $
+
+   teuthology-suite --machine-type smithi --suite rados/singleton/all/admin-socket.yaml fs/ext4.yaml
+
+Test descriptions
+-----------------
+
+Each test has a "test description", which is similar to a directory path,
+but not the same. In the case of a standalone test, like the one in
+`Reading a standalone test`_, the test description is identical to the
+relative path (starting from the ``suites/`` directory of the
+`ceph/qa sub-directory`_) of the yaml file defining the test.
+
+Much more commonly, tests are defined not by a single yaml file, but by a
+`directory tree of yaml files`. At runtime, the tree is walked and all yaml
+files (facets) are combined into larger yaml "programs" that define the
+tests. A full listing of the yaml defining the test is included at the
+beginning of every test log.
+
+In these cases, the description of each test consists of the
+subdirectory under `suites/
+<https://github.com/ceph/ceph/tree/master/qa/suites>`_ containing the
+yaml facets, followed by an expression in curly braces (``{}``) consisting of
+a list of yaml facets in order of concatenation. For instance the
+test description::
+
+  ceph-deploy/basic/{distros/centos_7.0.yaml tasks/ceph-deploy.yaml}
+
+signifies the concatenation of two files:
+
+* ceph-deploy/basic/distros/centos_7.0.yaml
+* ceph-deploy/basic/tasks/ceph-deploy.yaml
+
+How tests are built from directories
+------------------------------------
+
+As noted in the previous section, most tests are not defined in a single
+yaml file, but rather as a `combination` of files collected from a
+directory tree within the ``suites/`` subdirectory of the `ceph/qa sub-directory`_.
+
+The set of all tests defined by a given subdirectory of ``suites/`` is
+called an "integration test suite", or a "teuthology suite".
+
+Combination of yaml facets is controlled by special files (``%`` and
+``+``) that are placed within the directory tree and can be thought of as
+operators.  The ``%`` file is the "convolution" operator and ``+``
+signifies concatenation.
+
+Convolution operator
+^^^^^^^^^^^^^^^^^^^^
+
+The convolution operator, implemented as an empty file called ``%``, tells
+teuthology to construct a test matrix from yaml facets found in
+subdirectories below the directory containing the operator.
+
+For example, the `ceph-deploy suite
+<https://github.com/ceph/ceph/tree/master/qa/suites/ceph-deploy/>`_ is
+defined by the ``suites/ceph-deploy/`` tree, which consists of the files and
+subdirectories in the following structure
+
+.. code-block:: none
+
+  qa/suites/ceph-deploy
+  ├── %
+  ├── distros
+  │   ├── centos_latest.yaml
+  │   └── ubuntu_latest.yaml
+  └── tasks
+      ├── ceph-admin-commands.yaml
+      └── rbd_import_export.yaml
+
+This is interpreted as a 2x1 matrix consisting of two tests:
+
+1. ceph-deploy/basic/{distros/centos_7.0.yaml tasks/ceph-deploy.yaml}
+2. ceph-deploy/basic/{distros/ubuntu_16.04.yaml tasks/ceph-deploy.yaml}
+
+i.e. the concatenation of centos_7.0.yaml and ceph-deploy.yaml and
+the concatenation of ubuntu_16.04.yaml and ceph-deploy.yaml, respectively.
+In human terms, this means that the task found in ``ceph-deploy.yaml`` is
+intended to run on both CentOS 7.0 and Ubuntu 16.04.
+
+Without the file percent, the ``ceph-deploy`` tree would be interpreted as
+three standalone tests:
+
+* ceph-deploy/basic/distros/centos_7.0.yaml
+* ceph-deploy/basic/distros/ubuntu_16.04.yaml
+* ceph-deploy/basic/tasks/ceph-deploy.yaml
+
+(which would of course be wrong in this case).
+
+Referring to the `ceph/qa sub-directory`_, you will notice that the
+``centos_7.0.yaml`` and ``ubuntu_16.04.yaml`` files in the
+``suites/ceph-deploy/basic/distros/`` directory are implemented as symlinks.
+By using symlinks instead of copying, a single file can appear in multiple
+suites. This eases the maintenance of the test framework as a whole.
+
+All the tests generated from the ``suites/ceph-deploy/`` directory tree
+(also known as the "ceph-deploy suite") can be run with
+
+.. prompt:: bash $
+
+   teuthology-suite --machine-type smithi --suite ceph-deploy
+
+An individual test from the `ceph-deploy suite`_ can be run by adding the
+``--filter`` option
+
+.. prompt:: bash $
+
+   teuthology-suite \
+      --machine-type smithi \
+      --suite ceph-deploy/basic \
+      --filter 'ceph-deploy/basic/{distros/ubuntu_16.04.yaml tasks/ceph-deploy.yaml}'
+
+.. note:: To run a standalone test like the one in `Reading a standalone
+   test`_, ``--suite`` alone is sufficient. If you want to run a single
+   test from a suite that is defined as a directory tree, ``--suite`` must
+   be combined with ``--filter``. This is because the ``--suite`` option
+   understands POSIX relative paths only.
+
+Concatenation operator
+^^^^^^^^^^^^^^^^^^^^^^
+
+For even greater flexibility in sharing yaml files between suites, the
+special file plus (``+``) can be used to concatenate files within a
+directory. For instance, consider the `suites/rbd/thrash
+<https://github.com/ceph/ceph/tree/master/qa/suites/rbd/thrash>`_
+tree
+
+.. code-block:: none
+
+  qa/suites/rbd/thrash
+  ├── %
+  ├── clusters
+  │   ├── +
+  │   ├── fixed-2.yaml
+  │   └── openstack.yaml
+  └── workloads
+      ├── rbd_api_tests_copy_on_read.yaml
+      ├── rbd_api_tests.yaml
+      └── rbd_fsx_rate_limit.yaml
+
+This creates two tests:
+
+* rbd/thrash/{clusters/fixed-2.yaml clusters/openstack.yaml workloads/rbd_api_tests_copy_on_read.yaml}
+* rbd/thrash/{clusters/fixed-2.yaml clusters/openstack.yaml workloads/rbd_api_tests.yaml}
+
+Because the ``clusters/`` subdirectory contains the special file plus
+(``+``), all the other files in that subdirectory (``fixed-2.yaml`` and
+``openstack.yaml`` in this case) are concatenated together
+and treated as a single file. Without the special file plus, they would
+have been convolved with the files from the workloads directory to create
+a 2x2 matrix:
+
+* rbd/thrash/{clusters/openstack.yaml workloads/rbd_api_tests_copy_on_read.yaml}
+* rbd/thrash/{clusters/openstack.yaml workloads/rbd_api_tests.yaml}
+* rbd/thrash/{clusters/fixed-2.yaml workloads/rbd_api_tests_copy_on_read.yaml}
+* rbd/thrash/{clusters/fixed-2.yaml workloads/rbd_api_tests.yaml}
+
+The ``clusters/fixed-2.yaml`` file is shared among many suites to
+define the following ``roles``
+
+.. code-block:: yaml
+
+  roles:
+  - [mon.a, mon.c, osd.0, osd.1, osd.2, client.0]
+  - [mon.b, osd.3, osd.4, osd.5, client.1]
+
+The ``rbd/thrash`` suite as defined above, consisting of two tests,
+can be run with
+
+.. prompt:: bash $
+
+   teuthology-suite --machine-type smithi --suite rbd/thrash
+
+A single test from the rbd/thrash suite can be run by adding the
+``--filter`` option
+
+.. prompt:: bash $
+
+   teuthology-suite \
+      --machine-type smithi \
+      --suite rbd/thrash \
+      --filter 'rbd/thrash/{clusters/fixed-2.yaml clusters/openstack.yaml workloads/rbd_api_tests_copy_on_read.yaml}'
+
+Filtering tests by their description
+------------------------------------
+
+When a few jobs fail and need to be run again, the ``--filter`` option
+can be used to select tests with a matching description. For instance, if the
+``rados`` suite fails the `all/peer.yaml <https://github.com/ceph/ceph/blob/master/qa/suites/rados/singleton/all/peer.yaml>`_ test, the following will only
+run the tests that contain this file
+
+.. prompt:: bash $
+
+   teuthology-suite --machine-type smithi --suite rados --filter all/peer.yaml
+
+The ``--filter-out`` option does the opposite (it matches tests that do `not`
+contain a given string), and can be combined with the ``--filter`` option.
+
+Both ``--filter`` and ``--filter-out`` take a comma-separated list of strings
+(which means the comma character is implicitly forbidden in filenames found in
+the `ceph/qa sub-directory`_). For instance
+
+.. prompt:: bash $
+
+  teuthology-suite --machine-type smithi --suite rados --filter all/peer.yaml,all/rest-api.yaml
+
+will run tests that contain either
+`all/peer.yaml <https://github.com/ceph/ceph/blob/master/qa/suites/rados/singleton/all/peer.yaml>`_
+or
+`all/rest-api.yaml <https://github.com/ceph/ceph/blob/master/qa/suites/rados/singleton/all/rest-api.yaml>`_
+
+Each string is looked up anywhere in the test description and has to
+be an exact match: they are not regular expressions.
+
+Reducing the number of tests
+----------------------------
+
+The ``rados`` suite generates tens or even hundreds of thousands of tests out
+of a few hundred files. This happens because teuthology constructs test
+matrices from subdirectories wherever it encounters a file named ``%``. For
+instance, all tests in the `rados/basic suite
+<https://github.com/ceph/ceph/tree/master/qa/suites/rados/basic>`_ run with
+different messenger types: ``simple``, ``async`` and ``random``, because they
+are combined (via the special file ``%``) with the `msgr directory
+<https://github.com/ceph/ceph/tree/master/qa/suites/rados/basic/msgr>`_
+
+All integration tests are required to be run before a Ceph release is
+published. When merely verifying whether a contribution can be merged without
+risking a trivial regression, it is enough to run a subset. The ``--subset``
+option can be used to reduce the number of tests that are triggered. For
+instance
+
+.. prompt:: bash $
+
+   teuthology-suite --machine-type smithi --suite rados --subset 0/4000
+
+will run as few tests as possible. The tradeoff in this case is that
+not all combinations of test variations will together,
+but no matter how small a ratio is provided in the ``--subset``,
+teuthology will still ensure that all files in the suite are in at
+least one test. Understanding the actual logic that drives this
+requires reading the teuthology source code.
+
+The ``--limit`` option only runs the first ``N`` tests in the suite:
+this is rarely useful, however, because there is no way to control which
+test will be first.
+
+.. _ceph/qa sub-directory: https://github.com/ceph/ceph/tree/master/qa
+.. _Sepia Lab: https://wiki.sepia.ceph.com/doku.php
+.. _teuthology repository: https://github.com/ceph/teuthology
+.. _teuthology framework: https://github.com/ceph/teuthology