{"id":1941,"date":"2020-09-09T13:12:43","date_gmt":"2020-09-09T11:12:43","guid":{"rendered":"https:\/\/signal.eu.org\/blog\/?p=1941"},"modified":"2020-09-09T13:37:22","modified_gmt":"2020-09-09T11:37:22","slug":"post-mortem-of-a-dnssec-incident-at-eu-org","status":"publish","type":"post","link":"https:\/\/signal.eu.org\/blog\/2020\/09\/09\/post-mortem-of-a-dnssec-incident-at-eu-org\/","title":{"rendered":"Post-mortem of a DNSSEC incident at eu.org"},"content":{"rendered":"\n<h4 class=\"wp-block-heading\">(or: the good, the bad and the ugly)<\/h4>\n\n\n\n<h4 class=\"wp-block-heading\">Abstract<\/h4>\n\n\n\n<p>Due to a bug in zone generation, all updates for the EU.ORG zone were stuck from 2020-08-29 02:19 UTC to 2020-09-04 14:40 UTC. Then an incorrect fix was made, resulting in the publication of incorrect DNSSEC signatures for the zone from 2020-09-04 14:40 UTC to 2020-09-04 19:37:00 UTC. Then the final, correct fix was implemented.<\/p>\n\n\n\n<p>This episode, unoriginal albeit humbling, nevertheless yielded some useful lessons.<\/p>\n\n\n\n<p>All times in the rest of this document are UTC times.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">The software setup at eu.org<\/h4>\n\n\n\n<p>The primary DNS server for EU.ORG runs <a href=\"http:\/\/www.isc.org\/\">ISC<\/a>&#8217;s <a href=\"https:\/\/www.isc.org\/bind\/\">BIND<\/a>. The zone is currently generated by Python and shell scripts from a <a href=\"https:\/\/www.postgresql.org\/\">PostgreSQL<\/a> database. This does not include DNSSEC records for the zone (except <code>DS<\/code> records for delegations). DNSSEC records are generated and refreshed by <code>dnssec-signzone<\/code>, one of the tools provided with <code>bind<\/code>. 
Once the zone file has been updated, it is reloaded using <code>rndc reload<\/code>, another command-line tool provided with <code>bind<\/code>.<\/p>\n\n\n\n<p>Zone key rotation is handled by custom scripts which periodically check for key age and schedule key generation, pre-publication, activation and de-activation as needed, calling <code>dnssec-keygen<\/code> to manage the key files.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Setup for the failure: blocked updates<\/h4>\n\n\n\n<p><strong>2020-08-29 02:19<\/strong>: due to a race condition in the zone generation process (<em>issue #1<\/em>), the EU.ORG zone file disappeared.<\/p>\n\n\n\n<p>The last good and published version of the EU.ORG zone file, still loaded in the primary server, had serial number 2020082907, generated at 2020-08-29 01:12. In the case of a missing file, the reload obviously fails but <code>bind<\/code> behaves nicely and keeps serving its older in-memory version of the file.<\/p>\n\n\n\n<p>However, the disappearance of the zone file caused all subsequent zone file generation processes to fail (<em>issue #2<\/em>), as they were accessing the current version of the file to fetch the currently published serial number.<\/p>\n\n\n\n<p>The problem remained unnoticed (<em>issue #3: incomplete monitoring<\/em>) until 4 September 2020, when a user notified us that his new domain was still undelegated.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">The ugly<\/h4>\n\n\n\n<p>Around 2020-09-04 14:40, a first fix was attempted: a known good version of the zone file was reinstalled to allow the zone generation process to succeed, then a new zone was generated, freshly DNSSEC-signed, and loaded.<\/p>\n\n\n\n<p>However, the above timeline conflicted with a scheduled key rotation of the zone-signing keys. 
The theoretical key rotation schedule was as follows:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"175\" src=\"https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-1-1024x175.png\" alt=\"\" class=\"wp-image-1999\" srcset=\"https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-1-1024x175.png 1024w, https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-1-300x51.png 300w, https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-1-768x131.png 768w, https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-1.png 1055w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>Theoretical key rotation schedule<\/figcaption><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>The new key (14716) was due to be published from 2020-08-29 05:37, a few hours after the zone update process failed. It should have been present in the caches of the resolvers concerned about 24 hours later, alongside the previous key (22810), ready to be used to check the signatures (<code>RRSIG<\/code> records) of the zone, which were supposed to be published from 2020-09-03 05:37.<\/p>\n\n\n\n<p>However, due to the zone update suspension, this happened instead. The skipped steps are shown in gray. 
<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"177\" src=\"https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-2-1024x177.png\" alt=\"\" class=\"wp-image-2005\" srcset=\"https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-2-1024x177.png 1024w, https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-2-300x52.png 300w, https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-2-768x132.png 768w, https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-2.png 1055w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>Actual key rotation schedule (before fix)<\/figcaption><\/figure>\n\n\n\n<p>The zone jumped directly from the 2020-08-14\/2020-08-29 key configuration to the configuration in effect at 2020-09-04 14:40.<\/p>\n\n\n\n<p>A few minutes after 2020-09-04 14:40, it was apparent that something was amiss: the resolution of EU.ORG domains failed for people using resolvers with DNSSEC validation.<\/p>\n\n\n\n<p>The cause was quickly identified: since pre-publication for <code>DNSKEY<\/code> 14716 was missed, most resolvers only had the unexpired <code>DNSKEY<\/code> 22810 in their cache, while the only <code>RRSIG<\/code> records available in the zone servers required key 14716.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">The bad<\/h4>\n\n\n\n<p>The obvious fix was to reactivate key 22810 and regenerate the zone signatures (<code>RRSIG<\/code> records) with <code>dnssec-signzone<\/code>. This also leaves in place the signatures with key 14716 (keeping the latter was needed for resolvers which had begun to cache key 14716).<\/p>\n\n\n\n<p>As a side note, it helped that EU.ORG switched a few months ago to <code>NSEC3<\/code> &#8220;opt-out&#8221; mode. 
This saves a lot of space (especially in nameserver memory) for zones with many delegations, which is especially useful if you temporarily need double signatures such as in this episode.<\/p>\n\n\n\n<p>A first implementation attempt was made at 2020-09-04 14:52 by updating the dates in the public key file (<code>.key<\/code>) for key 22810, pushing the inactivation date to 2020-09-07 05:37:00 and the deletion date to 2020-09-09 05:37:00.<\/p>\n\n\n\n<p>Before update:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">; Created: 20200808100738 (Sat Aug  8 12:07:38 202)\n; Publish: 20200809053700 (Sun Aug  9 07:37:00 202)\n; Activate: 20200814053700 (Fri Aug 14 07:37:00 202)\n; Inactive: 20200903053700 (Thu Sep  3 07:37:00 202)\n; Delete: 20200905053700 (Sat Sep  5 07:37:00 202)\nEU.ORG. 172800 IN DNSKEY 256 3 8 AwEAAcHAqfeFzQqo9vFq8ZziaQs2...<\/pre>\n\n\n\n<p>Side remarks:<\/p>\n\n\n\n<ul><li>the TTL value above is ignored by <code>dnssec-signzone<\/code>, which by default reuses the TTL in the zone file. The actual TTL is 86400.<\/li><li>note the weird year 202 instead of 2020<\/li><\/ul>\n\n\n\n<p>After update:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">; Created: 20200808100738 (Sat Aug  8 12:07:38 202)\n; Publish: 20200809053700 (Sun Aug  9 07:37:00 202)\n; Activate: 20200814053700 (Fri Aug 14 07:37:00 202)\n; Inactive: 202009<strong>07<\/strong>053700\n; Delete: 202009<strong>09<\/strong>053700\nEU.ORG. 172800 IN DNSKEY 256 3 8 AwEAAcHAqfeFzQqo9vFq8ZziaQs2...<\/pre>\n\n\n\n<p>However&#8230; (<em>issue #4: when working in a hurry, expect stupid mistakes<\/em>), this fix was wrong, albeit harmless. As should have been obvious from the &#8220;;&#8221; prefix, the above lines are informational. The change above was without any effect, but this was initially unnoticed for lack of adequate testing. 
(<em>issue #5: don&#8217;t reset resolver caches too early: it may hamper testing<\/em>; <em>if you are expecting specific RRSIG records, test this explicitly<\/em>).<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">The good<\/h4>\n\n\n\n<p>The actual dates are in the adjoining <code>.private<\/code> file, which was finally updated as follows:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">Private-key-format: v1.3\nAlgorithm: 8 (RSASHA256)\n...\nSuccessor: 14716\nCreated: 20200808100738\nPublish: 20200809053700\nActivate: 20200814053700\nInactive: 202009<strong>07<\/strong>053700\nDelete: 202009<strong>09<\/strong>053700<\/pre>\n\n\n\n<p>This resulted in the following key rotation schedule, implemented from 2020-09-04 19:37, which finally fixed the issue and probably reduced the zone downtime by almost 19 hours.<\/p>\n\n\n\n<p>It was tested on an untouched resolver which had been failing EU.ORG requests and recovered after the update (hypothesis: is this because of heuristics on <code>RRSIG<\/code> records when no cached <code>DNSKEY<\/code> matches the cached <code>RRSIG<\/code> records?).<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"194\" src=\"https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-3-1024x194.png\" alt=\"\" class=\"wp-image-2016\" srcset=\"https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-3-1024x194.png 1024w, https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-3-300x57.png 300w, https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-3-768x145.png 768w, https:\/\/signal.eu.org\/blog\/wp-content\/uploads\/2020\/09\/image-3.png 1056w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>Fixed key rotation schedule<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Lessons learned<\/h4>\n\n\n\n<p>The above incident will result in several procedural changes on the EU.ORG servers. 
Some of these are marked as <em>issue #n<\/em>; others are being considered, like using <code>bind<\/code>&#8217;s automated signature mode, coupled with dynamic zone updates, which would have made the whole episode moot (but would introduce a strong dependency on <code>bind<\/code>). Writing this post-mortem text helped make the most of the incident.<\/p>\n\n\n\n<p>Thanks to St\u00e9phane Bortzmeyer, always vigilant when it comes to DNS and DNSSEC bugs, who noticed and notified us that the zone was still broken after the initial incorrect fix, and who read and commented on an initial version of this text.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>(or: the good, the bad and the ugly) Abstract Due to a bug in zone generation, all updates for the EU.ORG zone were stuck from 2020-08-29 02:19 UTC to 2020-09-04 14:40 UTC. Then an incorrect fix was made, resulting in the publication of incorrect DNSSEC signatures for the zone from 2020-09-04 14:40 UTC to 2020-09-04 19:37:00 &hellip; <a href=\"https:\/\/signal.eu.org\/blog\/2020\/09\/09\/post-mortem-of-a-dnssec-incident-at-eu-org\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Post-mortem of a DNSSEC incident at eu.org<\/span> <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[14,7],"tags":[],"_links":{"self":[{"href":"https:\/\/signal.eu.org\/blog\/wp-json\/wp\/v2\/posts\/1941"}],"collection":[{"href":"https:\/\/signal.eu.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/signal.eu.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/signal.eu.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/signal.eu.org\/blog\/wp-json\/wp\/v2\/comments?post=1941"}],"version-history":[{"count":206,"href":"https:\/\/signal.eu.org\/blog\/wp-json\/wp\/v2\/posts\/1941\/revisions"}],"predecessor-version":[{"id":2151,"href":"https:\/\/signal.eu.org\/blog\/wp-json\/wp\/v2\/posts\/1941\/revisions\/2151"}],"wp:attachment":[{"href":"https:\/\/signal.eu.org\/blog\/wp-json\/wp\/v2\/media?parent=1941"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/signal.eu.org\/blog\/wp-json\/wp\/v2\/categories?post=1941"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/signal.eu.org\/blog\/wp-json\/wp\/v2\/tags?post=1941"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}