diff --git a/doc/dnssec-guide/troubleshooting.rst b/doc/dnssec-guide/troubleshooting.rst new file mode 100644 index 0000000..cdc40cc --- /dev/null +++ b/doc/dnssec-guide/troubleshooting.rst @@ -0,0 +1,589 @@ +.. Copyright (C) Internet Systems Consortium, Inc. ("ISC") +.. +.. SPDX-License-Identifier: MPL-2.0 +.. +.. This Source Code Form is subject to the terms of the Mozilla Public +.. License, v. 2.0. If a copy of the MPL was not distributed with this +.. file, you can obtain one at https://mozilla.org/MPL/2.0/. +.. +.. See the COPYRIGHT file distributed with this work for additional +.. information regarding copyright ownership. + +.. _dnssec_troubleshooting: + +Basic DNSSEC Troubleshooting +---------------------------- + +In this chapter, we cover some basic troubleshooting +techniques, some common DNSSEC symptoms, and their causes and solutions. This +is not a comprehensive "how to troubleshoot any DNS or DNSSEC problem" +guide, because that could easily be an entire book by itself. + +.. _troubleshooting_query_path: + +Query Path +~~~~~~~~~~ + +The first step in troubleshooting DNS or DNSSEC should be to +determine the query path. Whenever you are working with a DNS-related issue, it is +always a good idea to determine the exact query path to identify the +origin of the problem. + +End clients, such as laptop computers or mobile phones, are configured +to talk to a recursive name server, and the recursive name server may in +turn forward requests on to other recursive name servers before arriving at the +authoritative name server. The giveaway is the presence of the +Authoritative Answer (``aa``) flag in a query response: when present, we know we are talking +to the authoritative server; when missing, we are talking to a recursive +server. 
The example below shows an answer to a query for +``www.example.com`` without the Authoritative Answer flag: + +:: + + $ dig @10.53.0.3 www.example.com A + + ; <<>> DiG 9.16.0 <<>> @10.53.0.3 www.example.com a + ; (1 server found) + ;; global options: +cmd + ;; Got answer: + ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62714 + ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 + + ;; OPT PSEUDOSECTION: + ; EDNS: version: 0, flags:; udp: 4096 + ; COOKIE: c823fe302625db5b010000005e722b504d81bb01c2227259 (good) + ;; QUESTION SECTION: + ;www.example.com. IN A + + ;; ANSWER SECTION: + www.example.com. 60 IN A 10.1.0.1 + + ;; Query time: 3 msec + ;; SERVER: 10.53.0.3#53(10.53.0.3) + ;; WHEN: Wed Mar 18 14:08:16 GMT 2020 + ;; MSG SIZE rcvd: 88 + +Not only do we not see the ``aa`` flag, we see an ``ra`` +flag, which indicates Recursion Available. This indicates that the +server we are talking to (10.53.0.3 in this example) is a recursive name +server: although we were able to get an answer for +``www.example.com``, we know that the answer came from somewhere else. + +If we query the authoritative server directly, we get: + +:: + + $ dig @10.53.0.2 www.example.com A + + ; <<>> DiG 9.16.0 <<>> @10.53.0.2 www.example.com a + ; (1 server found) + ;; global options: +cmd + ;; Got answer: + ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39542 + ;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 + ;; WARNING: recursion requested but not available + ... + +The ``aa`` flag tells us that we are now talking to the +authoritative name server for ``www.example.com``, and that this is not a +cached answer it obtained from some other name server; it served this +answer to us right from its own database. In fact, +the Recursion Available (``ra``) flag is not present, which means this +name server is not configured to perform recursion (at least not for +this client), so it could not have queried another name server to get +cached results. 
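The flag check described above can also be scripted when many servers need to be classified. The following is a minimal sketch, assuming the ``dig`` output has already been captured as text; the function names ``dig_flags`` and ``describe_server`` are hypothetical and not part of BIND. It only applies the heuristic from this section (``aa`` present means authoritative; otherwise ``ra`` present means recursive).

```python
import re


def dig_flags(dig_output):
    """Return the set of header flags from captured dig output.

    Looks for the ';; flags: ...;' header line. The EDNS line
    ('; EDNS: version: 0, flags:; ...') starts with a single ';'
    and is therefore not matched.
    """
    match = re.search(r"^\s*;; flags:([^;]*);", dig_output, re.MULTILINE)
    if match is None:
        raise ValueError("no header flags line found in dig output")
    return set(match.group(1).split())


def describe_server(dig_output):
    """Classify the responding server using the aa/ra heuristic."""
    flags = dig_flags(dig_output)
    if "aa" in flags:
        return "authoritative"
    if "ra" in flags:
        return "recursive"
    return "unknown"


if __name__ == "__main__":
    # Header lines copied from the two dig examples in this section.
    recursive = ";; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1"
    authoritative = ";; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1"
    print(describe_server(recursive))
    print(describe_server(authoritative))
```

This is only a convenience for reading many responses; a server can be both authoritative for some zones and recursive for others, so the result applies to the specific query that produced the output.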
+ +.. _troubleshooting_visible_symptoms: + +Visible DNSSEC Validation Symptoms +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +After determining the query path, it is necessary to +determine whether the problem is actually related to DNSSEC +validation. You can use the ``+cd`` flag in ``dig`` to disable +validation, as described in +:ref:`how_do_i_know_validation_problem`. + +When there is indeed a DNSSEC validation problem, the visible symptoms, +unfortunately, are very limited. With DNSSEC validation enabled, if a +DNS response is not fully validated, it results in a generic +SERVFAIL message, as shown below when querying against a recursive name +server at 10.53.0.3: + +:: + + $ dig @10.53.0.3 www.example.org. A + + ; <<>> DiG 9.16.0 <<>> @10.53.0.3 www.example.org A + ; (1 server found) + ;; global options: +cmd + ;; Got answer: + ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 28947 + ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1 + + ;; OPT PSEUDOSECTION: + ; EDNS: version: 0, flags:; udp: 4096 + ; COOKIE: d1301968aca086ad010000005e723a7113603c01916d136b (good) + ;; QUESTION SECTION: + ;www.example.org. IN A + + ;; Query time: 3 msec + ;; SERVER: 10.53.0.3#53(10.53.0.3) + ;; WHEN: Wed Mar 18 15:12:49 GMT 2020 + ;; MSG SIZE rcvd: 72 + +With ``delv``, a "resolution failed" message is output instead: + +:: + + $ delv @10.53.0.3 www.example.org. A +rtrace + ;; fetch: www.example.org/A + ;; resolution failed: SERVFAIL + +BIND 9 logging features may be useful when trying to identify +DNSSEC errors. + +.. _troubleshooting_logging: + +Basic Logging +~~~~~~~~~~~~~ + +DNSSEC validation error messages show up in ``syslog`` as a +query error by default. Here is an example of what it may look like: + +:: + + validating www.example.org/A: no valid signature found + RRSIG failed to verify resolving 'www.example.org/A/IN': 10.53.0.2#53 + +Usually, this level of error logging is sufficient. 
+Debug logging, described in +:ref:`troubleshooting_logging_debug`, provides more details +about why DNSSEC validation may have +failed. + +.. _troubleshooting_logging_debug: + +BIND DNSSEC Debug Logging +~~~~~~~~~~~~~~~~~~~~~~~~~ + +A word of caution: before you enable debug logging, be aware that this +may dramatically increase the load on your name servers. Enabling debug +logging is thus not recommended for production servers. + +With that said, sometimes it may become necessary to temporarily enable +BIND debug logging to see more details of how and whether DNSSEC is +validating. DNSSEC-related messages are not recorded in ``syslog`` by default, +even if query logging is enabled; only DNSSEC errors show up in ``syslog``. + +The example below shows how to enable debug level 3 (to see full DNSSEC +validation messages) in BIND 9 and have them sent to ``syslog``: + +:: + + logging { + channel dnssec_log { + syslog daemon; + severity debug 3; + print-category yes; + }; + category dnssec { dnssec_log; }; + }; + +The example below shows how to log DNSSEC messages to their own file +(here, ``/var/log/dnssec.log``): + +:: + + logging { + channel dnssec_log { + file "/var/log/dnssec.log"; + severity debug 3; + }; + category dnssec { dnssec_log; }; + }; + +After turning on debug logging and restarting BIND, a large +number of log messages appear in +``syslog``. The example below shows the log messages as a result of +successfully looking up and validating the domain name ``ftp.isc.org``. 
+ +:: + + validating ./NS: starting + validating ./NS: attempting positive response validation + validating ./DNSKEY: starting + validating ./DNSKEY: attempting positive response validation + validating ./DNSKEY: verify rdataset (keyid=20326): success + validating ./DNSKEY: marking as secure (DS) + validating ./NS: in validator_callback_dnskey + validating ./NS: keyset with trust secure + validating ./NS: resuming validate + validating ./NS: verify rdataset (keyid=33853): success + validating ./NS: marking as secure, noqname proof not needed + validating ftp.isc.org/A: starting + validating ftp.isc.org/A: attempting positive response validation + validating isc.org/DNSKEY: starting + validating isc.org/DNSKEY: attempting positive response validation + validating isc.org/DS: starting + validating isc.org/DS: attempting positive response validation + validating org/DNSKEY: starting + validating org/DNSKEY: attempting positive response validation + validating org/DS: starting + validating org/DS: attempting positive response validation + validating org/DS: keyset with trust secure + validating org/DS: verify rdataset (keyid=33853): success + validating org/DS: marking as secure, noqname proof not needed + validating org/DNSKEY: in validator_callback_ds + validating org/DNSKEY: dsset with trust secure + validating org/DNSKEY: verify rdataset (keyid=9795): success + validating org/DNSKEY: marking as secure (DS) + validating isc.org/DS: in fetch_callback_dnskey + validating isc.org/DS: keyset with trust secure + validating isc.org/DS: resuming validate + validating isc.org/DS: verify rdataset (keyid=33209): success + validating isc.org/DS: marking as secure, noqname proof not needed + validating isc.org/DNSKEY: in validator_callback_ds + validating isc.org/DNSKEY: dsset with trust secure + validating isc.org/DNSKEY: verify rdataset (keyid=7250): success + validating isc.org/DNSKEY: marking as secure (DS) + validating ftp.isc.org/A: in fetch_callback_dnskey + validating 
ftp.isc.org/A: keyset with trust secure + validating ftp.isc.org/A: resuming validate + validating ftp.isc.org/A: verify rdataset (keyid=27566): success + validating ftp.isc.org/A: marking as secure, noqname proof not needed + +Note that these log messages indicate that the chain of trust has been +established and ``ftp.isc.org`` has been successfully validated. + +If validation had failed, you would see log messages indicating errors. +We cover some of the most common validation problems in the next section. + +.. _troubleshooting_common_problems: + +Common Problems +~~~~~~~~~~~~~~~ + +.. _troubleshooting_security_lameness: + +Security Lameness +^^^^^^^^^^^^^^^^^ + +Similar to lame delegation in traditional DNS, security lameness refers to the +condition when the parent zone holds a set of DS records that point to +something that does not exist in the child zone. As a result, +the entire child zone may "disappear," having been marked as bogus by +validating resolvers. + +Below is an example attempting to resolve the A record for a test domain +name ``www.example.net``. From the user's perspective, as described in +:ref:`how_do_i_know_validation_problem`, only a SERVFAIL +message is returned. On the validating resolver, we see the +following messages in ``syslog``: + +:: + + named[126063]: validating example.net/DNSKEY: no valid signature found (DS) + named[126063]: no valid RRSIG resolving 'example.net/DNSKEY/IN': 10.53.0.2#53 + named[126063]: broken trust chain resolving 'www.example.net/A/IN': 10.53.0.2#53 + +This gives us a hint that it is a broken trust chain issue. Let's take a +look at the DS records that are published for the zone (with the keys +shortened for ease of display): + +:: + + $ dig @10.53.0.3 example.net. 
DS + + ; <<>> DiG 9.16.0 <<>> @10.53.0.3 example.net DS + ; (1 server found) + ;; global options: +cmd + ;; Got answer: + ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59602 + ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 + + ;; OPT PSEUDOSECTION: + ; EDNS: version: 0, flags:; udp: 4096 + ; COOKIE: 7026d8f7c6e77e2a010000005e735d7c9d038d061b2d24da (good) + ;; QUESTION SECTION: + ;example.net. IN DS + + ;; ANSWER SECTION: + example.net. 256 IN DS 14956 8 2 9F3CACD...D3E3A396 + + ;; Query time: 0 msec + ;; SERVER: 10.53.0.3#53(10.53.0.3) + ;; WHEN: Thu Mar 19 11:54:36 GMT 2020 + ;; MSG SIZE rcvd: 116 + +Next, we query for the DNSKEY and RRSIG of ``example.net`` to see if +there's anything wrong. Since we are having trouble validating, we +can use the ``+cd`` option to temporarily disable checking and return +results, even though they do not pass the validation tests. The +``+multiline`` option tells ``dig`` to print the type, algorithm type, +and key id for DNSKEY records. Again, +some long strings are shortened for ease of display: + +:: + + $ dig @10.53.0.3 example.net. DNSKEY +dnssec +cd +multiline + + ; <<>> DiG 9.16.0 <<>> @10.53.0.3 example.net DNSKEY +cd +multiline +dnssec + ; (1 server found) + ;; global options: +cmd + ;; Got answer: + ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42980 + ;; flags: qr rd ra cd; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1 + + ;; OPT PSEUDOSECTION: + ; EDNS: version: 0, flags: do; udp: 4096 + ; COOKIE: 4b5e7c88b3680c35010000005e73722057551f9f8be1990e (good) + ;; QUESTION SECTION: + ;example.net. IN DNSKEY + + ;; ANSWER SECTION: + example.net. 287 IN DNSKEY 256 3 8 ( + AwEAAbu3NX...ADU/D7xjFFDu+8WRIn + ) ; ZSK; alg = RSASHA256 ; key id = 35328 + example.net. 287 IN DNSKEY 257 3 8 ( + AwEAAbKtU1...PPP4aQZTybk75ZW+uL + 6OJMAF63NO0s1nAZM2EWAVasbnn/X+J4N2rLuhk= + ) ; KSK; alg = RSASHA256 ; key id = 27247 + example.net. 
287 IN RRSIG DNSKEY 8 2 300 ( + 20811123173143 20180101000000 27247 example.net. + Fz1sjClIoF...YEjzpAWuAj9peQ== ) + example.net. 287 IN RRSIG DNSKEY 8 2 300 ( + 20811123173143 20180101000000 35328 example.net. + seKtUeJ4/l...YtDc1rcXTVlWIOw= ) + + ;; Query time: 0 msec + ;; SERVER: 10.53.0.3#53(10.53.0.3) + ;; WHEN: Thu Mar 19 13:22:40 GMT 2020 + ;; MSG SIZE rcvd: 962 + +Here is the problem: the parent zone is telling the world that +``example.net`` is using the key 14956, but the authoritative server +indicates that it is using keys 27247 and 35328. There are several +potential causes for this mismatch: one possibility is that a malicious +attacker has compromised one side and changed the data. A more likely +scenario is that the DNS administrator for the child zone did not upload +the correct key information to the parent zone. + +.. _troubleshooting_incorrect_time: + +Incorrect Time +^^^^^^^^^^^^^^ + +In DNSSEC, every record comes with at least one RRSIG, and each RRSIG +contains two timestamps: one indicating when it becomes valid, and +one when it expires. If the validating resolver's current system time does +not fall within the two RRSIG timestamps, error messages +appear in the BIND debug log. + +The example below shows a log message when the RRSIG appears to have +expired. This could mean the validating resolver's system time is +incorrectly set too far in the future, or the zone administrator has not +kept up with RRSIG maintenance. + +:: + + validating example.com/DNSKEY: verify failed due to bad signature (keyid=19036): RRSIG has expired + +The log below shows that the RRSIG validity period has not yet begun. This could mean +the validating resolver's system time is incorrectly set too far in the past, or +the zone administrator has incorrectly generated signatures for this +domain name. + +:: + + validating example.com/DNSKEY: verify failed due to bad signature (keyid=4521): RRSIG validity period has not begun + +.. 
_troubleshooting_unable_to_load_keys: + +Unable to Load Keys +^^^^^^^^^^^^^^^^^^^ + +This is a simple yet common issue. If the key files are present but +unreadable by ``named`` for some reason, ``syslog`` shows clear error +messages, as shown below: + +:: + + named[32447]: zone example.com/IN (signed): reconfiguring zone keys + named[32447]: dns_dnssec_findmatchingkeys: error reading key file Kexample.com.+008+06817.private: permission denied + named[32447]: dns_dnssec_findmatchingkeys: error reading key file Kexample.com.+008+17694.private: permission denied + named[32447]: zone example.com/IN (signed): next key event: 27-Nov-2014 20:04:36.521 + +However, if no keys are found, the error is not as obvious. The example below shows +the ``syslog`` messages after executing ``rndc +reload`` with the key files missing from the key directory: + +:: + + named[32516]: received control channel command 'reload' + named[32516]: loading configuration from '/etc/bind/named.conf' + named[32516]: reading built-in trusted keys from file '/etc/bind/bind.keys' + named[32516]: using default UDP/IPv4 port range: [1024, 65535] + named[32516]: using default UDP/IPv6 port range: [1024, 65535] + named[32516]: sizing zone task pool based on 6 zones + named[32516]: the working directory is not writable + named[32516]: reloading configuration succeeded + named[32516]: reloading zones succeeded + named[32516]: all zones loaded + named[32516]: running + named[32516]: zone example.com/IN (signed): reconfiguring zone keys + named[32516]: zone example.com/IN (signed): next key event: 27-Nov-2014 20:07:09.292 + +This output looks exactly the same as if the keys were present and +readable, and appears to indicate that ``named`` loaded the keys and signed the zone. 
It +even generates the internal (raw) files: + +:: + + # cd /etc/bind/db + # ls + example.com.db example.com.db.jbk example.com.db.signed + +If ``named`` really loaded the keys and signed the zone, you should see +the following files: + +:: + + # cd /etc/bind/db + # ls + example.com.db example.com.db.jbk example.com.db.signed example.com.db.signed.jnl + +So, unless you see the ``*.signed.jnl`` file, your zone has not been +signed. + +.. _troubleshooting_invalid_trust_anchors: + +Invalid Trust Anchors +^^^^^^^^^^^^^^^^^^^^^ + +In most cases, you never need to explicitly configure trust +anchors. ``named`` supplies the current root trust anchor and, +with the default setting of ``dnssec-validation``, updates it on the +infrequent occasions when it is changed. + +However, in some circumstances you may need to explicitly configure +your own trust anchor. As we saw in the :ref:`trust_anchors_description` +section, whenever a DNSKEY is received by the validating resolver, it is +compared to the list of keys the resolver explicitly trusts to see if +further action is needed. If the two keys match, the validating resolver +stops performing further verification and returns the answer(s) as +validated. + +But what if the key file on the validating resolver is misconfigured or +missing? Below we show some examples of log messages when things are not +working properly. + +First of all, if the key you copied is malformed, BIND does not even +start and you will likely find this error message in syslog: + +:: + + named[18235]: /etc/bind/named.conf.options:29: bad base64 encoding + named[18235]: loading configuration: failure + +If the key is a valid base64 string but the key algorithm is incorrect, +or if the wrong key is installed, the first thing you will notice is +that virtually all of your DNS lookups result in SERVFAIL, even when +you are looking up domain names that have not been DNSSEC-enabled. 
Below +is an example of querying the recursive server at 10.53.0.3: + +:: + + $ dig @10.53.0.3 www.example.org. A +dnssec + + ; <<>> DiG 9.16.0 <<>> @10.53.0.3 www.example.org A +dnssec + ; (1 server found) + ;; global options: +cmd + ;; Got answer: + ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 29586 + ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1 + + ;; OPT PSEUDOSECTION: + ; EDNS: version: 0, flags: do; udp: 4096 + ; COOKIE: ee078fc321fa1367010000005e73a58bf5f205ca47e04bed (good) + ;; QUESTION SECTION: + ;www.example.org. IN A + +``delv`` shows a similar result: + +:: + + $ delv @10.53.0.3 www.example.com. +rtrace + ;; fetch: www.example.com/A + ;; resolution failed: SERVFAIL + +The next symptom you see is in the DNSSEC log messages: + +:: + + managed-keys-zone: DNSKEY set for zone '.' could not be verified with current keys + validating ./DNSKEY: starting + validating ./DNSKEY: attempting positive response validation + validating ./DNSKEY: no DNSKEY matching DS + validating ./DNSKEY: no DNSKEY matching DS + validating ./DNSKEY: no valid signature found (DS) + +These errors are indications that there are problems with the trust +anchor. + +.. _troubleshooting_nta: + +Negative Trust Anchors +~~~~~~~~~~~~~~~~~~~~~~ + +BIND 9.11 introduced Negative Trust Anchors (NTAs) as a means to +*temporarily* disable DNSSEC validation for a zone when you know that +the zone's DNSSEC is misconfigured. + +NTAs are added using the ``rndc`` command, e.g.: + +:: + + $ rndc nta example.com + Negative trust anchor added: example.com/_default, expires 19-Mar-2020 19:57:42.000 + + +The list of currently configured NTAs can also be examined using +``rndc``, e.g.: + +:: + + $ rndc nta -dump + example.com/_default: expiry 19-Mar-2020 19:57:42.000 + + +The default lifetime of an NTA is one hour, although by default, BIND +polls the zone every five minutes to see if the zone correctly +validates, at which point the NTA automatically expires. 
Both the +default lifetime and the polling interval may be configured via +``named.conf``, and the lifetime can be overridden on a per-zone basis +using the ``-lifetime duration`` parameter to ``rndc nta``. Both timer +values have a permitted maximum value of one week. + +.. _troubleshooting_nsec3: + +NSEC3 Troubleshooting +~~~~~~~~~~~~~~~~~~~~~ + +BIND includes a tool called ``nsec3hash`` that runs through the same +steps as a validating resolver, to generate the correct hashed name +based on NSEC3PARAM parameters. The command takes the following +parameters in order: salt, algorithm, iterations, and domain. For +example, if the salt is 1234567890ABCEDF, the hash algorithm is 1, and +the iteration count is 10, to get the NSEC3-hashed name for ``www.example.com`` we +would execute a command like this: + +:: + + $ nsec3hash 1234567890ABCEDF 1 10 www.example.com + RN7I9ME6E1I6BDKIP91B9TCE4FHJ7LKF (salt=1234567890ABCEDF, hash=1, iterations=10) + +Zero-length salt can be specified as ``-``. + +While it is unlikely you would construct a rainbow table of your own +zone data, this tool may be useful when troubleshooting NSEC3 problems.
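To make the hashing steps concrete, here is a minimal sketch of the NSEC3 hash calculation for hash algorithm 1 (SHA-1) as specified in RFC 5155. It is an illustration, not the ``nsec3hash`` implementation; the function name ``nsec3_hash`` is hypothetical, and the sketch assumes plain ASCII labels with no escaped characters. The owner name is converted to lowercase wire format, hashed together with the salt, re-hashed ``iterations`` more times, and encoded in base32hex.

```python
import base64
import hashlib

# Translation table from the standard base32 alphabet to base32hex.
_B32_TO_B32HEX = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567",
    "0123456789ABCDEFGHIJKLMNOPQRSTUV",
)


def nsec3_hash(name, salt_hex, iterations):
    """Return the base32hex NSEC3 hash (SHA-1) of a domain name.

    salt_hex is the salt as a hex string; '-' means a zero-length salt,
    mirroring the nsec3hash convention described above.
    """
    salt = b"" if salt_hex == "-" else bytes.fromhex(salt_hex)

    # Canonical wire format: lowercase, length-prefixed labels, root label.
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    wire = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels)
    wire += b"\x00"

    # Initial hash, then 'iterations' additional rounds, salt appended each time.
    digest = hashlib.sha1(wire + salt).digest()
    for _ in range(iterations):
        digest = hashlib.sha1(digest + salt).digest()

    # A 20-byte SHA-1 digest encodes to exactly 32 characters, so no padding.
    return base64.b32encode(digest).decode("ascii").translate(_B32_TO_B32HEX)


if __name__ == "__main__":
    # Parameters taken from the nsec3hash command shown above.
    print(nsec3_hash("www.example.com", "1234567890ABCEDF", 10))
```

If the sketch follows the same steps as ``nsec3hash``, the call in the ``__main__`` block should reproduce the hashed name shown in the command output above; comparing the two is a quick way to check that the salt and iteration count being used match the zone's NSEC3PARAM record.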