# Postmortem: 2024-03-11 infra downtime incident

## Approximate timeline (all times in CET)

- ~14:20 - power outage at the HSWAW building, which I initially interpreted as an [EPIX outage](https://t.me/uwuenterprises/118) due to widely broken routes
- 16:28 - I confirm that the upstream came back fully, but my servers stayed unresponsive. I escalate the issue to bgp.wtf
- 16:29 - response from bgp.wtf: everything on their side is green and the link to my site is up, but no other info
- 16:35 - I start arranging for "remote hands", as I'm (still) not in Warsaw
- ~19:00 - my remote hands arrive at HSWAW and start the maintenance; due to unrelated reasons, this takes a while
- 20:31 - machines are back up; I start bringing up the services
- 20:53 - VMs, mail server, Linux mirrors and a bunch of other services are up. Stuff depending on the PostgreSQL DB on sakamoto is still down; I'm trying to recover the data
- ~21:20 - I find the root cause (a corrupted pg_control file); I investigate further
- 21:41 - [TG announcement](https://t.me/uwuenterprises/123) where I state that I will attempt to repair the file by hand
- ~00:00 - defeated by a CRC calculation that wouldn't match, I decide to restore the DB from the 03:25am backup
- ~00:30 - while the DB is being restored, [mei](https://donotsta.re/mei) asks if it can attempt to help with the recovery of the old DB. We spend the following 3 hours reading PostgreSQL source code and writing a utility to recreate the meta file
- ~03:50 - I bring up the *fixed* DB and attempt to make a dump to triple-check data consistency
- 04:50 - everything "green", with **no data loss** (using the backup would have meant losing a few morning hours of data)

## What went wrong?

- Battery backup wasn't thoroughly tested

  Since October 2021, we've had UPSes installed to keep the servers up in the case of a power failure. Those units were initially tested for load and seemed to work just fine. Unfortunately, during the past 3 years, all power outages lasted less than the battery capacity, which meant that we never had a chance to see what happens *after* the battery depletes. It turns out that in some conditions, after the battery dies, that UPS will go into an "overload" mode, which requires manual intervention.

- PostgreSQL was in autostart when it shouldn't have been

  [As outlined in my previous full blogpost](https://sdomi.pl/weblog/18-fixing-ext4-under-pressure/), our postgres setup is (currently) a DB on ext4 on LUKS (image file) on (another) ext4. I need to decrypt and mount the DB before postgres can safely start. Postgres was in autostart for legacy reasons, and my assumption was that it wouldn't start correctly anyway, because there wasn't any data in `/var/lib/postgresql`; unfortunately, Alpine's init script checks for that and initializes a DB if there isn't one present. This led to a catastrophic chain of events: I mounted the partition and executed `service postgresql restart` without checking if postgres was already running, which caused the main process to overwrite the `/var/lib/postgresql/14/data/global/pg_control` file with an equivalent for an empty DB. Without some values in that file, there's no easy way to replay the write-ahead log and regain a consistent state in the database.

  We have already removed PostgreSQL from the autostart; a sketch of the kind of start-up guard that would have caught this is shown below. I will be writing a full blogpost on how the internals of our fix worked, look out for that sometime this month.
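  To make the "check before you (re)start" point concrete, here's a minimal sketch of such a bring-up guard, assuming a stock Alpine/OpenRC setup with the decrypted image mounted at `/var/lib/postgresql`; the paths and the individual checks are illustrative, not our actual tooling:

  ```sh
  #!/bin/sh
  # Hypothetical bring-up guard - a sketch, not our real scripts.
  MNT=/var/lib/postgresql
  PGDATA=$MNT/14/data

  # Refuse to start unless the decrypted image is actually mounted there;
  # otherwise we'd be looking at the bare ext4 underneath, where Alpine's
  # init script would happily initdb a brand-new, empty cluster.
  if ! mountpoint -q "$MNT"; then
      echo "refusing to start: $MNT is not a mountpoint" >&2
      exit 1
  fi

  # If a postmaster is already running (e.g. autostarted against an empty
  # cluster), bail out - blindly (re)starting in that state is exactly what
  # clobbered pg_control for us.
  if pidof postgres >/dev/null 2>&1; then
      echo "refusing to start: a postgres process is already running" >&2
      exit 1
  fi

  # Eyeball pg_control before starting: pg_controldata recomputes its CRC
  # and prints a loud warning if it doesn't match the stored value.
  pg_controldata "$PGDATA"

  rc-service postgresql start
  ```

  On OpenRC, taking the service out of autostart is a matter of `rc-update del postgresql default`.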
- Bringing things up took longer than expected

  When recovering from a crash, you can often optimize either for time or for data consistency. This was no different: if we had decided to immediately use the backups, we could have shaved a few hours off the downtime. The reason I decided to favor data integrity was part over-confidence, part care. We couldn't have finished restoring the data earlier than ~23:30-00:00 anyway (IOPS limitations when restoring huge postgres dumps), and IMO extending the downtime into the late night hours was a good tradeoff. Still, I'm sorry to any and all people affected.

- Servers were quite far away

  Like in the previous postmortem, this is still true. If I were closer, the outage could have been much shorter. I'm exploring options to upgrade the server to a board with embedded remote management, which should help with this in the future.

- No failover

  Again, as outlined in our previous postmortem, our infra doesn't have redundancy. We will explore possible remedies to this later this year.

## What went right(-ish)?

- We had working backups! At no point was there any risk of total data loss.
- We performed some bonus outstanding tasks during the downtime, so it's not all a loss.

## Conclusion

I'm very sorry for the downtime. We're currently in the process of restructuring a significant chunk of our infra, which should lower the chance of a long downtime in the future. If you have any questions, shoot them to ja (at) sdomi.pl, or reach out on one of the other platforms [from my contact section](https://sdomi.pl/).

Thank you for using UwU Enterprises! :3c