Post-mortem of a DNSSEC incident at eu.org

(or: the good, the bad and the ugly)

Abstract

Due to a bug in zone generation, all updates for the EU.ORG zone were stuck from 2020-08-29 02:19 UTC to 2020-09-04 14:40 UTC. Then an incorrect fix was made, resulting in the publication of incorrect DNSSEC signatures for the zone from 2020-09-04 14:40 UTC to 2020-09-04 19:37 UTC. Then the final, correct fix was implemented.

This episode, unoriginal albeit humbling, nevertheless yielded some interesting lessons.

All times in the rest of this document are UTC times.

The software setup at eu.org

The primary DNS server for EU.ORG runs ISC’s BIND. The zone is currently generated by Python and shell scripts from a PostgreSQL database. This does not include DNSSEC records for the zone (except DS records for delegations). DNSSEC records are generated and refreshed by dnssec-signzone, one of the tools provided with bind. Once the zone file has been updated, it is reloaded using rndc reload, another command-line tool provided with bind.

Zone key rotation is handled by custom scripts which periodically check for key age and schedule key generation, pre-publication, activation and de-activation as needed, calling dnssec-keygen to manage the key files.

Setup for the failure: blocked updates

2020-08-29 02:19: due to a race condition in the zone generation process (issue #1), the EU.ORG zone file disappeared.
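The exact race is not described here, so the following is only a sketch of one common way to keep a file from ever disappearing during regeneration: write the new version to a temporary file in the same directory, then rename it over the old one. The function name and file names are hypothetical and are not taken from the EU.ORG scripts.

```python
import os
import tempfile


def write_zone_atomically(path, content):
    """Replace `path` with `content` without any window where it is missing."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that os.replace()
    # stays on one filesystem, where the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".zone-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, remove the temporary file; the previous zone file
        # has not been touched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this pattern a concurrent reader sees either the old file or the new one, never an absent or half-written file.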

The last good and published version of the EU.ORG zone file, still loaded in the primary server, had serial number 2020082907, generated at 2020-08-29 01:12. In the case of a missing file, the reload obviously fails but bind behaves nicely and keeps serving its older in-memory version of the file.

However, the disappearance of the zone file caused all subsequent zone file generation processes to fail (issue #2), as they were accessing the current version of the file to fetch the currently published serial number.
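As an illustration of how issue #2 could be softened, the sketch below computes the next serial from whichever sources are available instead of failing when the file is missing. It assumes serials follow the YYYYMMDDNN pattern suggested by 2020082907, and that a second source (for instance the serial still served by the primary) can be supplied by the caller; both the function and that second source are hypothetical.

```python
import datetime


def next_serial(file_serial, served_serial, today):
    """Return the next zone serial.

    file_serial:   serial read from the zone file, or None if unreadable
    served_serial: serial from another source (assumption), or None
    today:         datetime.date used to build the YYYYMMDD00 floor
    """
    # Assumption: the first serial of a day is YYYYMMDD00.
    floor = int(today.strftime("%Y%m%d")) * 100
    known = [s for s in (file_serial, served_serial) if s is not None]
    if not known:
        return floor
    # Never go backwards: the result is above every known serial.
    return max(max(known) + 1, floor)
```

For example, `next_serial(None, 2020082907, datetime.date(2020, 8, 29))` returns 2020082908 even though the file serial is unavailable.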

The problem remained unnoticed (issue #3: incomplete monitoring) until 4 September 2020, when a user notified us that his new domain was still undelegated.
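A monitoring check for this kind of stall can be quite small. The sketch below assumes, as above, that the serial starts with the generation date (YYYYMMDDNN); the one-day threshold is an arbitrary illustrative value, not the one used at EU.ORG, and fetching the published serial is left to the caller.

```python
import datetime


def serial_date(serial):
    """Extract the date encoded in the first 8 digits of a YYYYMMDDNN serial."""
    return datetime.datetime.strptime(str(serial)[:8], "%Y%m%d").date()


def zone_is_stale(serial, today, max_age_days=1):
    """True if the published serial is older than max_age_days."""
    return (today - serial_date(serial)).days > max_age_days
```

Applied to this incident, `zone_is_stale(2020082907, datetime.date(2020, 9, 4))` is true: the serial was six days old when the problem was reported.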

The ugly

Around 2020-09-04 14:40, a first fix was attempted: a known good version of the zone file was reinstalled to allow the zone generation process to succeed, then a new zone was generated, freshly DNSSEC-signed, and loaded.

However, the above timeline conflicted with a scheduled key rotation of the zone-signing keys. The theoretical key rotation schedule was as follows:

Theoretical key rotation schedule

The new key (14716) was due to be published from 2020-08-29 05:37, a few hours after the zone update process failed. It should have been present in concerned resolver caches about 24 hours later, alongside the previous key (22810), ready to be used to check signatures (RRSIG records) of the zone which were supposed to be published from 2020-09-03 05:37.

However, due to the zone update suspension, this is what happened instead. The skipped steps are shown in gray.

Actual key rotation schedule (before fix)

The zone thus jumped directly from the key configuration of the 2020-08-14/2020-08-29 period to the configuration scheduled for 2020-09-04 14:40, skipping the pre-publication step.

A few minutes after 2020-09-04 14:40, it was apparent that something was amiss: the resolution of EU.ORG domains failed for people using resolvers with DNSSEC validation.

The cause was quickly identified: since pre-publication for DNSKEY 14716 was missed, most resolvers only had the unexpired DNSKEY 22810 in their cache, while the only RRSIG records available in the zone servers required key 14716.
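The underlying rule can be stated as a simple time comparison: signatures made only with a new key are safe once the new DNSKEY has been published for at least one DNSKEY TTL, so that every cached copy of the old DNSKEY set has expired. The sketch below is a minimal illustration of that rule; the function name is hypothetical, and the 86400-second TTL is the actual DNSKEY TTL mentioned later in this text.

```python
from datetime import datetime, timedelta


def safe_to_sign_with_new_key_only(first_published, dnskey_ttl_seconds, now):
    """True once any cached DNSKEY set lacking the new key must have expired."""
    return now >= first_published + timedelta(seconds=dnskey_ttl_seconds)


DNSKEY_TTL = 86400

# Theoretical schedule: key published 2020-08-29 05:37, used from 2020-09-03 05:37.
planned = safe_to_sign_with_new_key_only(
    datetime(2020, 8, 29, 5, 37), DNSKEY_TTL, datetime(2020, 9, 3, 5, 37))

# What happened: key first published and used at the same time, 2020-09-04 14:40.
actual = safe_to_sign_with_new_key_only(
    datetime(2020, 9, 4, 14, 40), DNSKEY_TTL, datetime(2020, 9, 4, 14, 40))
```

Here `planned` is true and `actual` is false, which matches the observed validation failures.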

The bad

The obvious fix was to reactivate key 22810 and regenerate the zone signatures (RRSIG records) with dnssec-signzone. This also leaves in place the signatures with key 14716 (keeping the latter was needed for resolvers which had begun to cache key 14716).

As a side note, it helped that the EU.ORG zone switched a few months ago to NSEC3 “opt-out” mode. This saves a lot of space (especially in nameserver memory) for zones with many delegations, which is especially useful if you temporarily need double signatures such as in this episode.

A first implementation attempt was made at 2020-09-04 14:52 by updating the dates in the public key file (.key) for key 22810, pushing the inactivation date to 2020-09-07 05:37:00 and the deletion date to 2020-09-09 05:37:00.

Before update:

; Created: 20200808100738 (Sat Aug  8 12:07:38 202)
; Publish: 20200809053700 (Sun Aug  9 07:37:00 202)
; Activate: 20200814053700 (Fri Aug 14 07:37:00 202)
; Inactive: 20200903053700 (Thu Sep  3 07:37:00 202)
; Delete: 20200905053700 (Sat Sep  5 07:37:00 202)
EU.ORG. 172800 IN DNSKEY 256 3 8 AwEAAcHAqfeFzQqo9vFq8ZziaQs2...

Side remarks:

  • the TTL value above is ignored by dnssec-signzone, which by default reuses the TTL in the zone file. The actual TTL is 86400.
  • note the weird year 202 instead of 2020

After update:

; Created: 20200808100738 (Sat Aug  8 12:07:38 202)
; Publish: 20200809053700 (Sun Aug  9 07:37:00 202)
; Activate: 20200814053700 (Fri Aug 14 07:37:00 202)
; Inactive: 20200907053700
; Delete: 20200909053700
EU.ORG. 172800 IN DNSKEY 256 3 8 AwEAAcHAqfeFzQqo9vFq8ZziaQs2...

However… (issue #4: when working in a hurry, expect stupid mistakes), this fix was wrong, albeit harmless. As should have been obvious from the “;” prefix, the above lines are purely informational comments. The change had no effect, but this initially went unnoticed for lack of adequate testing (issue #5: don’t reset resolver caches too early, it may hamper testing; if you are expecting specific RRSIG records, test this explicitly).

The good

The actual dates are in the adjoining .private file, which was finally updated as follows:

Private-key-format: v1.3
Algorithm: 8 (RSASHA256)
...
Successor: 14716
Created: 20200808100738
Publish: 20200809053700
Activate: 20200814053700
Inactive: 20200907053700
Delete: 20200909053700
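To double-check such an edit before re-signing, the timing fields can be read back and compared with the current time. The sketch below is a simplified reader written for this text, not part of the EU.ORG scripts: it only understands the five timing fields shown above and ignores every other line, including “;” comment lines such as those in the .key file.

```python
from datetime import datetime

TIMING_FIELDS = ("Created", "Publish", "Activate", "Inactive", "Delete")


def parse_timing(private_text):
    """Return a dict of timing fields found in the text of a .private file."""
    timing = {}
    for line in private_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name in TIMING_FIELDS:
            timing[name] = datetime.strptime(value.strip(), "%Y%m%d%H%M%S")
    return timing


def is_signing(timing, now):
    """True if `now` falls within [Activate, Inactive)."""
    activate = timing.get("Activate")
    inactive = timing.get("Inactive")
    if activate is None or now < activate:
        return False
    return inactive is None or now < inactive


fixed = parse_timing("Activate: 20200814053700\nInactive: 20200907053700\n")
print(is_signing(fixed, datetime(2020, 9, 4, 19, 37)))
```

With the corrected dates the check reports the key as active at 2020-09-04 19:37, whereas with the original Inactive date of 20200903053700 it would not. BIND also ships a dnssec-settime tool for changing these dates, which may be less error-prone than editing the files by hand.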

This resulted in the following key rotation schedule, implemented from 2020-09-04 19:37, which finally fixed the issue and probably reduced the zone downtime by almost 19 hours.

The fix was tested on an untouched resolver which had been failing EU.ORG requests and which recovered after the update (hypothesis: is this because of heuristics on RRSIG records when no cached DNSKEY matches the cached RRSIG records?).

Fixed key rotation schedule

Lessons learned

The above incident will result in several procedural changes on the EU.ORG servers. Some of these are marked as issue #n; others are being considered, like using bind’s automated signature mode, coupled with dynamic zone updates, which would have made the whole episode moot (but would introduce a strong dependency on bind). Writing this post-mortem text helped make the most of the incident.

Thanks to Stéphane Bortzmeyer, always vigilant when it comes to DNS and DNSSEC bugs, who noticed and notified us that the zone was still broken after the initial incorrect fix, and who read and commented on an initial version of this text.
