(or: the good, the bad and the ugly)
Due to a bug in zone generation, all updates for the EU.ORG zone were stuck from 2020-08-29 02:19 UTC to 2020-09-04 14:40 UTC. Then an incorrect fix was made, resulting in the publication of incorrect DNSSEC signatures for the zone from 2020-09-04 14:40 UTC to 2020-09-04 19:37 UTC. Then the final, correct fix was implemented.
This episode, unoriginal but humbling, nevertheless yielded useful lessons.
All times in the rest of this document are UTC times.
The software setup at eu.org
The primary DNS server for EU.ORG runs ISC’s BIND. The zone is currently generated by Python and shell scripts from a PostgreSQL database. This does not include DNSSEC records for the zone (except DS records for delegations). DNSSEC records are generated and refreshed by dnssec-signzone, one of the tools provided with bind. Once the zone file has been updated, it is reloaded using rndc reload, another command-line tool provided with bind.
Zone key rotation is handled by custom scripts which periodically check for key age and schedule key generation, pre-publication, activation and de-activation as needed, calling dnssec-keygen to manage the key files.
Setup for the failure: blocked updates
2020-08-29 02:19: due to a race condition in the zone generation process (issue #1), the EU.ORG zone file disappeared.
The last good and published version of the EU.ORG zone file, still loaded in the primary server, had serial number 2020082907, generated at 2020-08-29 01:12. In the case of a missing file, the reload obviously fails, but bind behaves nicely and keeps serving its older in-memory version of the file.
However, the disappearance of the zone file caused all subsequent zone file generation processes to fail (issue #2), as they were accessing the current version of the file to fetch the currently published serial number.
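A minimal sketch of how issue #2 could be avoided, under stated assumptions: serial numbers follow the YYYYMMDDNN convention (as 2020082907 suggests), and the last published serial is also saved in a small state file. The function names, the state file and the SOA regular expression are hypothetical and are not the actual EU.ORG scripts.

```python
import re
from datetime import date
from pathlib import Path

# Assumption: the SOA record has its serial right after the two names,
# optionally preceded by an opening parenthesis.
SOA_SERIAL = re.compile(r"\bSOA\s+\S+\s+\S+\s+\(?\s*(\d+)")


def current_serial(zone_file: Path, state_file: Path) -> int:
    """Return the serial from the zone file, or from the saved state if the file is gone."""
    try:
        match = SOA_SERIAL.search(zone_file.read_text())
        if match:
            return int(match.group(1))
    except FileNotFoundError:
        pass  # the zone file disappeared: fall back to the saved serial
    return int(state_file.read_text().strip())


def next_serial(published: int, today: date) -> int:
    """Next YYYYMMDDNN serial, never lower than or equal to the published one."""
    base = int(today.strftime("%Y%m%d")) * 100
    return max(published + 1, base)
```

With this fallback, a missing zone file no longer blocks generation: next_serial(2020082907, date(2020, 8, 29)) gives 2020082908, and the same call on 2020-09-04 gives 2020090400.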
The problem remained unnoticed (issue #3: incomplete monitoring) until 2020-09-04, when a user notified us that his new domain was still undelegated.
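As an illustration of the kind of check that was missing, here is a hedged sketch of a staleness test on the published serial. It assumes that the serial starts with the generation date (YYYYMMDDNN) and that the zone is normally regenerated at least every two days. Both the threshold and the function names are hypothetical, and fetching the serial from the server (for instance with an SOA query) is not shown.

```python
from datetime import datetime, timedelta, timezone


def serial_age(serial: int, now: datetime) -> timedelta:
    """Time elapsed since the start of the day encoded in a YYYYMMDDNN serial."""
    day = datetime.strptime(str(serial)[:8], "%Y%m%d").replace(tzinfo=timezone.utc)
    return now - day


def is_stale(serial: int, now: datetime, max_age: timedelta = timedelta(days=2)) -> bool:
    """True when the published serial is older than max_age (hypothetical threshold)."""
    return serial_age(serial, now) > max_age
```

Run against serial 2020082907, such a check would have started alerting during 2020-08-31 rather than leaving the problem to be reported by a user on 2020-09-04.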
Around 2020-09-04 14:40, a first fix was attempted: a known good version of the zone file was reinstalled to allow the zone generation process to succeed, then a new zone was generated, freshly DNSSEC-signed, and loaded.
However, the above timeline conflicted with a scheduled key rotation of the zone-signing keys. The theoretical key rotation schedule was as follows:
The new key (14716) was due to be published from 2020-08-29 05:37, a few hours after the zone update process failed. About 24 hours later, it should have been present in the caches of the resolvers concerned, alongside the previous key (22810), ready to be used to check the signatures (RRSIG records) of the zone, which were supposed to be published from 2020-09-03 05:37.
However, due to the zone update suspension, this happened instead. The skipped steps are shown in gray.
The zone went directly from the key configuration of the 2020-08-14 to 2020-08-29 period to the one in effect at 2020-09-04 14:40.
A few minutes after 2020-09-04 14:40, it was apparent that something was amiss: the resolution of EU.ORG domains failed for people using resolvers with DNSSEC validation.
The cause was quickly identified: since pre-publication for DNSKEY 14716 was missed, most resolvers only had the unexpired DNSKEY 22810 in their cache, while the only RRSIG records available in the zone servers required key 14716.
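The pre-publication rule that was violated here can be written as a small check. This is a sketch under assumptions, not the actual rotation scripts: it takes the time a key was really first served (not the scheduled Publish date) and the DNSKEY TTL of 86400 seconds mentioned elsewhere in this text, and considers a key safe to sign with alone only once every cached DNSKEY set that lacks it has expired. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_signing(first_served: datetime, dnskey_ttl: int) -> datetime:
    """Earliest time at which all caches holding the old DNSKEY set have expired."""
    return first_served + timedelta(seconds=dnskey_ttl)


def safe_to_sign_alone(first_served: datetime, dnskey_ttl: int, now: datetime) -> bool:
    """True if signatures made only with this key can be validated by every resolver."""
    return now >= earliest_safe_signing(first_served, dnskey_ttl)


# Key 14716 was first actually served at 2020-09-04 14:40, not on 2020-08-29.
first_served = datetime(2020, 9, 4, 14, 40, tzinfo=timezone.utc)
print(safe_to_sign_alone(first_served, 86400, first_served + timedelta(minutes=5)))
print(earliest_safe_signing(first_served, 86400))
```

A rotation script that bases its decision on the real publication time rather than on the calendar would have kept key 22810 active in this situation.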
The obvious fix was to reactivate key 22810 and regenerate the zone signatures (RRSIG records) with dnssec-signzone. This also leaves in place the signatures with key 14716 (keeping the latter was needed for resolvers which had begun to cache key 14716).
As a side note, it helped that the EU.ORG zone switched a few months ago to NSEC3 “opt-out” mode. This saves a lot of space (especially in nameserver memory) for zones with many delegations, which is especially useful if you temporarily need double signatures such as in this episode.
A first implementation attempt was made at 2020-09-04 14:52 by updating the dates in the public key file (.key) for key 22810, pushing the inactivation date to 2020-09-07 05:37:00 and the deletion date to 2020-09-09 05:37:00.
; Created: 20200808100738 (Sat Aug 8 12:07:38 202)
; Publish: 20200809053700 (Sun Aug 9 07:37:00 202)
; Activate: 20200814053700 (Fri Aug 14 07:37:00 202)
; Inactive: 20200903053700 (Thu Sep 3 07:37:00 202)
; Delete: 20200905053700 (Sat Sep 5 07:37:00 202)
EU.ORG. 172800 IN DNSKEY 256 3 8 AwEAAcHAqfeFzQqo9vFq8ZziaQs2...
- the TTL value above is ignored by dnssec-signzone, which by default reuses the TTL in the zone file. The actual TTL is 86400.
- note the weird year 202 instead of 2020
; Created: 20200808100738 (Sat Aug 8 12:07:38 202)
; Publish: 20200809053700 (Sun Aug 9 07:37:00 202)
; Activate: 20200814053700 (Fri Aug 14 07:37:00 202)
; Inactive: 20200907053700
; Delete: 20200909053700
EU.ORG. 172800 IN DNSKEY 256 3 8 AwEAAcHAqfeFzQqo9vFq8ZziaQs2...
However… (issue #4: when working in a hurry, expect stupid mistakes), this fix was wrong, albeit harmless. As should have been obvious from the “;” prefix, the above lines are informational. The change above was without any effect, but this was initially unnoticed for lack of adequate testing. (issue #5: don’t reset resolver caches too early, it may hamper testing; if you are expecting specific RRSIG records, test this explicitly).
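Issue #5 suggests testing explicitly for the expected RRSIG records. A hedged sketch of such a test follows: it extracts the key tags of the RRSIG records covering a given type from the text output of a DNSSEC-enabled query (for example as printed by dig with DNSSEC records requested). The field positions follow the RRSIG presentation format of RFC 4034; the function name is hypothetical, and the sample records are made up for illustration, with dummy names, dates and signatures.

```python
from typing import Set


def rrsig_key_tags(answer_text: str, covered: str = "SOA") -> Set[int]:
    """Key tags of the RRSIG records covering the given type in a textual answer."""
    tags = set()
    for line in answer_text.splitlines():
        fields = line.split()
        if line.startswith(";") or "RRSIG" not in fields:
            continue  # skip comments and non-RRSIG records
        # RDATA order: type covered, algorithm, labels, original TTL,
        # expiration, inception, key tag, signer name, signature
        rdata = fields[fields.index("RRSIG") + 1:]
        if len(rdata) >= 8 and rdata[0] == covered:
            tags.add(int(rdata[6]))
    return tags


# Made-up sample, NOT captured from the real zone.
SAMPLE = """\
; illustrative answer section
eu.org. 86400 IN SOA ns.example. hostmaster.example. 2020090401 3600 1800 604800 86400
eu.org. 86400 IN RRSIG SOA 8 2 86400 20201004000000 20200904000000 22810 eu.org. AAAA
eu.org. 86400 IN RRSIG SOA 8 2 86400 20201004000000 20200904000000 14716 eu.org. BBBB
"""

print(sorted(rrsig_key_tags(SAMPLE)))  # [14716, 22810]
```

After the first attempt, a check that 22810 is among the returned key tags would have shown immediately that the change had no effect.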
The actual dates are in the adjoining .private file, which was finally updated as follows:
Private-key-format: v1.3
Algorithm: 8 (RSASHA256)
...
Successor: 14716
Created: 20200808100738
Publish: 20200809053700
Activate: 20200814053700
Inactive: 20200907053700
Delete: 20200909053700
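To make the effect of this change concrete, here is a small sketch which parses the timing fields of a .private file and tells whether the key is in its signing window at a given time. It assumes the timestamps are UTC in YYYYMMDDHHMMSS form, and that the .private file previously carried the same Inactive date as the comments of the .key file (20200903053700). The function names are hypothetical; the actual decision is of course made by dnssec-signzone.

```python
from datetime import datetime, timezone

TIMING_FIELDS = ("Created", "Publish", "Activate", "Inactive", "Delete")


def parse_timing(private_text: str) -> dict:
    """Map each timing field found in a .private file to a UTC datetime."""
    timing = {}
    for line in private_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in TIMING_FIELDS:
            stamp = datetime.strptime(value.strip(), "%Y%m%d%H%M%S")
            timing[name.strip()] = stamp.replace(tzinfo=timezone.utc)
    return timing


def is_signing(timing: dict, now: datetime) -> bool:
    """True when now falls between Activate (included) and Inactive (excluded)."""
    return timing["Activate"] <= now < timing["Inactive"]


AFTER = """Private-key-format: v1.3
Algorithm: 8 (RSASHA256)
Created: 20200808100738
Publish: 20200809053700
Activate: 20200814053700
Inactive: 20200907053700
Delete: 20200909053700
"""
BEFORE = AFTER.replace("Inactive: 20200907053700", "Inactive: 20200903053700")

now = datetime(2020, 9, 4, 19, 37, tzinfo=timezone.utc)
print(is_signing(parse_timing(BEFORE), now), is_signing(parse_timing(AFTER), now))
```

With the original dates, key 22810 is no longer in its signing window on 2020-09-04; with the updated ones it is, so a new signing run produces RRSIG records with it again.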
This resulted in the following key rotation schedule, implemented from 2020-09-04 19:37, which finally fixed the issue and probably reduced the zone downtime by almost 19 hours.
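One plausible reading of the 19-hour estimate, given here as an assumption rather than as the exact reasoning used at the time: without the fix, a resolver which had cached the old DNSKEY set just before 2020-09-04 14:40 would have kept it for a full TTL (86400 seconds), whereas the fix went live at 19:37 the same day, as stated in the summary at the top of this text.

```python
from datetime import datetime, timedelta, timezone

DNSKEY_TTL = timedelta(seconds=86400)

# Key 14716 first appears in the zone; older cached DNSKEY sets lack it.
first_published = datetime(2020, 9, 4, 14, 40, tzinfo=timezone.utc)
# Signatures with key 22810 are back in the zone.
fix_deployed = datetime(2020, 9, 4, 19, 37, tzinfo=timezone.utc)

# Worst case without the fix: the last stale cache entry expires one TTL later.
worst_case_recovery = first_published + DNSKEY_TTL
saved = worst_case_recovery - fix_deployed
print(saved)  # 19:03:00
```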
It was tested on an untouched resolver which failed EU.ORG requests and recovered from the update (hypothesis: is this because of heuristics on RRSIG records when no cached DNSKEY matches the cached
The above incident will result in several procedural changes on the EU.ORG servers. Some of these are marked as issue #n; others are being considered, like using bind’s automated signature mode, coupled with dynamic zone updates, which would have made the whole episode moot (but would introduce a strong dependency on bind). Writing this post-mortem text helped make the most of the incident.
Thanks to Stéphane Bortzmeyer, always vigilant when it comes to DNS and DNSSEC bugs, who noticed and notified us that the zone was still broken after the initial incorrect fix, and who read and commented on an initial version of this text.