Issue gold4.eq7.fra.datalix.de Monday 4th November 2024 14:07:09


The local file system of gold4.eq7.fra.datalix.de seems to have failed (host system disks, not VM storage). We are currently investigating.

Postmortem:

At 04:23:33, the disk /dev/sdb in gold4.eq7.fra.datalix.de failed and a monitoring alert was triggered.

As there is no customer data on these disks, we normally schedule maintenance within the following 1-2 weeks to replace a failed disk. The system itself was still up and running, so this was not an immediate problem.

At around 14:00, the /dev/sda disk failed as well, which took the RAID down completely and set the file system to read-only.

After an attempted reboot of the system, the hard drives were no longer bootable, confirming that both were defective. We replaced both disks and, after working through a number of obstacles, got everything back up and running.

At around 18:20, the system was back up and running and the VMs were started again.

The main problem was that both disks failed within one day, which we had not expected.

Going forward, we plan to stop using system drives: these disks see a lot of reads and writes that do not come from guest load, and they add another component that can fail. Our Ryzen and Xeon Gold hosts in TornadoDC already have no additional system drives. In the future, we will carry out emergency maintenance on the host if a system disk fails. We also plan to move all current Xeon Gold systems in eq7 to a setup without system drives. We will do this without any downtime, so that customers are not affected.

All KVM servers are up again.

We are sorry this took a bit longer than expected; recovering from a full loss of the host's files took more time than we had anticipated.

All customer data is fine; only the host disks were affected, not the VM storage disks.

All customers have been credited with 4 days of runtime.

Postmortem will follow today.

As expected, the server will no longer boot from the OS drives.

A technician from EQ will swap the drives for new ones. We will post here again once that is done.

The server is going to be rebooted now. We will update here once it is running again or once we know more.