Delta's Virtual Crash and How to Avoid Your Own
Delta’s well-publicized computer crash this Monday caused about 1,000 of its flights to be delayed, followed by another 300 flights on Tuesday. The dust is still settling, with many people trying to pin down exactly what caused the worldwide carrier’s computer network to fail. But if it could happen to them, it could happen to you if you’re not willing to adapt.
Initial statements by Delta indicate that the failure of an electrical component at its Atlanta hub started the ball rolling. Of course, power failures are not uncommon, and world-class operations like Delta plan for these incidents with backup power supplies that (should) kick in when power fails. This time they didn’t.
Delta explained that “some critical systems and network equipment didn’t switch over to Delta’s backup systems.” But the real issue was resetting all the network-attached devices that had lost their connections. This outage and the one that affected Southwest Airlines just a few weeks earlier share a common issue: resetting all the downstream devices and reconnecting them.
Delta’s computer network is large by any standard and built on a highly complex combination of aging technology. Newer technology, however, may have been better able to handle the reset process. Updated equipment and systems designed with disaster recovery in mind may not prevent every outage, but they can shorten recovery windows.
“There are multiple levels of protection mechanisms that [need] to be placed in a layered approach,” explains Shalabh Goyal, product manager at Datos IO, “and require appropriate organizational processes to ensure fast recovery.”
Goyal recommends addressing these five areas to achieve a high level of network availability and short recovery times:
- Physical infrastructure layer: Redundancy, checksums, clustering
- Storage layer: Geo-replication (synchronous, asynchronous), snapshots
- Database layer: Clustering (availability), backup & recovery, archival, logs
- Data center layer: Active/active or active/passive sites, power supply backups
- Processes: Recovery plans, periodic validation and fine-tuning of recovery plans
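
To make the data center and process layers above a little more concrete, here is a minimal Python sketch of an active/passive site selection and a periodic failover drill. It is an illustration under stated assumptions, not a description of how Delta or any vendor implements failover: the `Site` class, the `choose_site` and `failover_drill` functions, and the `is_healthy` callback are all hypothetical names, and a real health check would probe power, network, and database state.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Site:
    """A hypothetical data center site used only for this illustration."""
    name: str  # illustrative label, e.g. "primary" or "standby"
    role: str  # "active" or "passive"


def choose_site(
    sites: Sequence[Site],
    is_healthy: Callable[[Site], bool],
) -> Optional[Site]:
    """Return the first healthy site, preferring active sites over passive ones.

    Returns None when no site passes the health check.
    """
    # sorted() is stable, so sites keep their original order within each role.
    ordered = sorted(sites, key=lambda s: 0 if s.role == "active" else 1)
    for site in ordered:
        if is_healthy(site):
            return site
    return None


def failover_drill(
    sites: Sequence[Site],
    is_healthy: Callable[[Site], bool],
) -> List[str]:
    """Return the names of passive sites that would fail to take over.

    Running this on a schedule is one simple form of "periodic validation"
    of a recovery plan: it flags a broken standby before it is needed.
    """
    return [s.name for s in sites if s.role == "passive" and not is_healthy(s)]
```

The drill is the part that matters for the process layer: a standby site that is never exercised can fail silently, so checking it on a schedule surfaces the problem before an outage does.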
Overall, the severity of the outage may have been reduced by keeping up with best practices for system updates. In Delta’s case, the combination of aging software and the hardware it runs on makes upgrading to modern systems difficult.
Vadim Vladimirskiy, CEO of Nerdio, explains, “Most modern computers are designed to run for about three to five years, before the hardware starts to fail and the system becomes obsolete. Ultimately, newer, more stable, and more secure software often won’t run on antiquated equipment.”
The cost of updating and upgrading may be high, but the costs of downtime and customer reaction may be higher as incidents pile up.