No backup, a lack of internal communication, and a service agreement that had expired. These are some of the explanations for the extensive e-mail crash at the University of Gothenburg last autumn, when thousands of e-mail accounts were affected.
At the beginning of the autumn, 5,000 e-mail accounts at the university were knocked out by a major IT crash, and half of the staff were left without e-mail. Even today, hundreds of e-mail accounts are still not working.
The university's internal audit has now produced a report on the IT events that shook the academic world in western Sweden. The three main factors that led to the e-mail crash are:
- The hard disks used by the University of Gothenburg to store e-mail had a manufacturing error that caused them to stop working after 40,000 hours of use. The error is known as the “40k bug” and could have been fixed by a software update from the manufacturer, Dell.
- Neither the IT unit at the University of Gothenburg nor the subcontractor Atea noticed the defect in the hard drives from Dell, which meant that the software update addressing it was never installed.
- After the hard disks in the first data center stopped working, the manufacturing error was known and the software update to fix it was available. However, the update was not carried out successfully, so the error persisted and caused the second data center to stop working before the first had been fully restored.
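The 40,000-hour limit described above lends itself to a simple check. The following is only a minimal sketch, not anything taken from the audit report: the `Drive` record, its field names, the sample readings and the 90-day warning margin are all hypothetical, and how usage hours would actually be read from a drive is left out. It flags drives that have not received the update and are close to the limit.

```python
from dataclasses import dataclass

# From the article: affected drives stop working after 40,000 hours of use.
FAILURE_THRESHOLD_HOURS = 40_000
HOURS_PER_YEAR = 24 * 365


@dataclass
class Drive:
    """Hypothetical inventory record; the field names are illustrative only."""
    name: str
    power_on_hours: int     # assumed to come from the drive's usage counter
    firmware_patched: bool  # True once the manufacturer's update is applied


def hours_remaining(drive: Drive) -> int:
    """Hours left before the drive reaches the 40,000-hour limit (never negative)."""
    return max(0, FAILURE_THRESHOLD_HOURS - drive.power_on_hours)


def at_risk(drives, warning_margin_hours=24 * 90):
    """Return unpatched drives within the warning margin of the limit.

    The 90-day default margin is an arbitrary assumption for this sketch.
    """
    return [
        d for d in drives
        if not d.firmware_patched and hours_remaining(d) <= warning_margin_hours
    ]


if __name__ == "__main__":
    # Made-up readings, purely for illustration.
    inventory = [
        Drive("disk-a", 39_000, firmware_patched=False),
        Drive("disk-b", 39_000, firmware_patched=True),
        Drive("disk-c", 10_000, firmware_patched=False),
    ]
    for d in at_risk(inventory):
        print(f"{d.name}: {hours_remaining(d)} hours until the limit")
    print(f"Limit in years of continuous use: {FAILURE_THRESHOLD_HOURS / HOURS_PER_YEAR:.1f}")
```

Because mirrored drives that are installed together accumulate hours at the same rate, a check of this kind would tend to flag both sides of a mirror at roughly the same time.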
In its investigation, the internal audit enlisted the help of Transcendent Group. It states that a backup system would probably have reduced data loss and downtime to a marginal level. For the future, it proposes that the university map the need for backup in the IT environment, clarify how information security incidents are handled, and analyze the need for a strengthened management function for licenses and service agreements.
The report also contains a timeline of the events that, according to the internal audit, ultimately led to the incident:
2014-2015: A technical solution for handling e-mail is procured and installed, with assistance from Atea, Microsoft and Truesec. The solution, which according to Microsoft followed “best practice”, relies on redundancy through two geographically separate clusters of hard drives that are mirrored against each other. With this solution, according to Microsoft, there would be no need for backup in the traditional sense.
2017: An agreement is signed with Atea regarding operation of the e-mail function.
2019: An inventory and update of the service agreements for, among other things, the e-mail function is carried out by Atea on behalf of the IT unit. The agreements are renewed in accordance with the submitted proposal, but not those for the hard drives and cabinets used for storing e-mail. This is probably because the agreements in question were still valid at the time of the inventory but expired the following month, i.e. in January 2020.
February 2020: Hardware vendor Dell discovers a software bug in some of its hard drives, the so-called 40k bug, and releases an update. However, the university is not informed, probably because there is no service agreement. Nor does the information appear to arrive through other channels. The IT unit and Atea therefore remain unaware of the problem.
August 11: A large number of hard drives in the Vasaparken server hall stop working. Four of a total of eight nodes lose their functionality. The hard disks in the Medicinareberget 7 server hall take over, because the information is also stored there, but there is now no redundancy. An error report is sent to Dell, and the fault is found to be linked to the “40k bug”. Atea announces that it has taken action.
August 21: The university is now notified of the “40k bug” via the case management system. The hard drives in the Vasaparken server hall are to be replaced. The software in the Medicinareberget 7 server hall is also said to have been upgraded, and the hard drives there are now said to be “out of danger”. A transfer of data from the Medicinareberget 7 server hall to the Vasaparken server hall is initiated to restore redundancy.
September 18: The hard disks in the Medicinareberget 7 server hall are hit by the 40k bug before the transfer of data to the Vasaparken server hall is completed. The e-mail solution stops working. Work begins to recover as much data as possible.
October-November: The hard disks from the Medicinareberget 7 server hall are sent to specialists abroad in an attempt to recover data. Word is received that the software in the hard drives had not been updated, as previously claimed.
Source: Audit report from the internal audit at the University of Gothenburg.
Footnote: Ny Teknik has sought comment from the subcontractor Atea on why the University of Gothenburg was not offered renewed service agreements for the hard drives when the other agreements were renewed.