Postmortem:
At 4:23:33 the disk /dev/sdb in gold4.eq7.fra.datalix.de failed and a monitoring alert was triggered.
As there is no customer data on these disks, we normally schedule maintenance within the following 1-2 weeks to replace a failed disk. The system itself was still up and running, so the failure was not an immediate problem.
At around 14:00, the /dev/sda disk also failed, which completely disabled the RAID and set the file system to read-only.
After an attempted reboot of the system, the hard drives were no longer bootable, confirming that both were defective.
We replaced both disks and worked through a number of issues and obstacles to bring the system back online.
At around 18:20 the system was back up and running and the VMs were started again.
The main problem was that both disks failed within one day, which we had not expected.
Going forward, we plan to stop using separate system drives: these disks get a lot of reads and writes that are not guest load, and they add another component that can fail. Our Ryzen and Xeon Gold hosts in TornadoDC already run without additional system drives.
In future, we will carry out emergency maintenance on a host as soon as one of its system disks fails. We also plan to move all current Xeon Gold systems in eq7 to a setup without system drives. We will do this without any downtime, so that customers do not notice any impact.