doc/dev/developer_guide/testing_integration_tests/tests-integration-testing-teuthology-debugging-tips.rst


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158

.. _tests-integration-testing-teuthology-debugging-tips:

Analyzing and Debugging A Teuthology Job
========================================

To learn more about how to schedule an integration test, refer to `Scheduling
Test Run`_.

Viewing Test Results
--------------------

When a teuthology run has been completed successfully, use `pulpito`_ dashboard
to view the results::

   http://pulpito.front.sepia.ceph.com/<job-name>/<job-id>/

.. _pulpito: https://pulpito.ceph.com

or ssh into the teuthology server to view the results of the integration test:

  .. prompt:: bash $

    ssh <username>@teuthology.front.sepia.ceph.com

and access `teuthology archives`_, as in this example:

  .. prompt:: bash $

     nano /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/

.. note:: This requires you to have access to the Sepia lab. To learn how to
          request access to the Sepia lab, see: 
          https://ceph.github.io/sepia/adding_users/

Identifying Failed Jobs
-----------------------

On pulpito, a job in red means either a failed job or a dead job. A job is
combination of daemons and configurations defined in the yaml fragments in
`qa/suites`_ . Teuthology uses these configurations and runs the tasks listed
in `qa/tasks`_, which are commands that set up the test environment and test
Ceph's components. These tasks cover a large subset of use cases and help to
expose bugs not exposed by `make check`_ testing.

.. _make check: ../tests-integration-testing-teuthology-intro/#make-check

A job failure might be caused by one or more of the following reasons:

* environment setup (`testing on varied
  systems <https://github.com/ceph/ceph/tree/master/qa/distros/supported>`_):
  testing compatibility with stable releases for supported versions.

* permutation of config values: for instance, `qa/suites/rados/thrash
  <https://github.com/ceph/ceph/tree/master/qa/suites/rados/thrash>`_ ensures
  that we run thrashing tests against Ceph under stressful workloads so that we
  can catch corner-case bugs. The final setup config yaml file used for testing
  can be accessed at::

  /a/<job-name>/<job-id>/orig.config.yaml

More details about config.yaml can be found at `detailed test config`_

Triaging the cause of failure
------------------------------

When a job fails, you will need to read its teuthology log in order to triage
the cause of its failure. Use the job's name and id from pulpito to locate your
failed job's teuthology log::

   http://qa-proxy.ceph.com/<job-name>/<job-id>/teuthology.log

Open the log file::

   /a/<job-name>/<job-id>/teuthology.log

For example:

  .. prompt:: bash $ 

     nano /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/teuthology.log

Every job failure is recorded in the teuthology log as a Traceback and is 
added to the job summary.

Find the ``Traceback`` keyword and search the call stack and the logs for
issues that caused the failure. Usually the traceback will include the command
that failed.

.. note:: The teuthology logs are deleted from time to time. If you are unable
          to access the link in this example, just use any other case from
          http://pulpito.front.sepia.ceph.com/

Reporting the Issue
-------------------

In short: first check to see if your job failure was caused by a known issue,
and if it wasn't, raise a tracker ticket. 

After you have triaged the cause of the failure and you have determined that it
wasn't caused by the changes that you made to the code, this might indicate
that you have encountered a known failure in the upstream branch (in the
example we're considering in this section, the upstream branch is "octopus").
If the failure was not caused by the changes you made to the code, go to
https://tracker.ceph.com and look for tracker issues related to the failure by
using keywords spotted in the failure under investigation.

If you find a similar issue on https://tracker.ceph.com, leave a comment on
that issue explaining the failure as you understand it and make sure to
include a link to your recent test run. If you don't find a similar issue,
create a new tracker ticket for this issue and explain the cause of your job's
failure as thoroughly as you can. If you're not sure what caused the job's
failure, ask one of the team members for help.

Debugging an issue using interactive-on-error
---------------------------------------------

When you encounter a job failure during testing, you should attempt to
reproduce it. This is where ``--interactive-on-error`` comes in. This
section explains how to use ``interactive-on-error`` and what it does. 

When you have verified that a job has failed, run the same job again in
teuthology but add the `interactive-on-error`_ flag::

    ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block $<your-config-yaml> --interactive-on-error

Use either `custom config.yaml`_ or the yaml file from the failed job. If
you use the yaml file from the failed job, copy ``orig.config.yaml`` to
your local directory::

    ideepika@teuthology:~/teuthology$ cp /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/orig.config.yaml test.yaml
    ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block test.yaml --interactive-on-error

If a job fails when the ``interactive-on-error`` flag is used, teuthology
will lock the machines required by ``config.yaml``. Teuthology will halt
the testing machines and hold them in the state that they were in at the
time of the job failure. You will be put into an interactive python
session. From there, you can ssh into the system to investigate the cause
of the job failure.  

After you have investigated the failure, just terminate the session.
Teuthology will then clean up the session and unlock the machines.

Suggested Resources
--------------------

  * `Testing Ceph: Pains & Pleasures <https://www.youtube.com/watch?v=gj1OXrKdSrs>`_
  * `Teuthology Training <https://www.youtube.com/playlist?list=PLrBUGiINAakNsOwHaIM27OBGKezQbUdM->`_
  * `Intro to Teuthology <https://www.youtube.com/watch?v=WiEUzoS6Nc4>`_

.. _Scheduling Test Run: ../tests-integration-testing-teuthology-workflow/#scheduling-test-run
.. _detailed test config: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html
.. _teuthology archives: ../tests-integration-testing-teuthology-workflow/#teuthology-archives
.. _qa/suites: https://github.com/ceph/ceph/tree/master/qa/suites
.. _qa/tasks: https://github.com/ceph/ceph/tree/master/qa/tasks
.. _interactive-on-error: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#troubleshooting
.. _custom config.yaml: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#test-configuration
.. _testing priority: ../tests-integration-testing-teuthology-intro/#testing-priority
.. _thrash: https://github.com/ceph/ceph/tree/master/qa/suites/rados/thrash