When cybersecurity vendor CrowdStrike released an automatic update to its Falcon Sensor endpoint protection agent on July 19, 2024, the last thing it expected was to bring major industries worldwide to a grinding halt. Unfortunately, a logic error in the update crashed Windows on millions of devices, giving users the “blue screen of death” and trapping machines in a boot loop that prevented them from completing the startup process.
The effects of the faulty code created chaos that's still reverberating weeks later, from grounded planes and canceled medical procedures to inoperable cloud-based services like Office 365. Microsoft estimates that about 8.5 million Windows-based devices, less than 1% of the total, went down due to the incident, but some speculate that the number was higher. The official estimate only accounts for confirmed crash reports that users shared, and because the effects were so widespread, it’s possible that many additional customers never reported having issues.
So, what caused the CrowdStrike outage, and how can we prevent something similar from happening again?
Why the Update Caused Server Failures Around the World
The trouble began with CrowdStrike’s Falcon software, which provides enterprise endpoint protection. This software works at the operating system’s kernel level, the core of the system, where code has unrestricted access to system memory and hardware. Running at this level gives the software greater visibility into threats, as it can detect concerns across the system; however, it also means that if something goes wrong, as it did with the recent update, it can take down the entire machine.
The update, a content configuration file known as Channel File 291, was meant to help the sensor gather additional data on new adversarial techniques. CrowdStrike says that an issue within its testing software allowed the fatal flaw to slip through, and that the flaw only affected machines under specific circumstances. According to reports, the bug only affected machines running sensor version 7.11 or higher that were online when the update was deployed.
How To Protect Your Network From Future Disruption
CrowdStrike estimates that at least 97% of the affected Windows sensors are back online. However, despite being relatively short-lived, the episode revealed several important considerations for developing an enterprise cybersecurity plan.
First, while it’s important to work with trusted security vendors for endpoint security, antivirus protection, and more, that trust shouldn’t mean unfettered access to systems to make updates or other changes. Limiting access and implementing protections help ensure mistakes don’t disrupt operations.
Second, the CrowdStrike breakdown highlights the risks inherent in relying on automatic updates. While emergencies may require automatic updates to eliminate emergent threats, allowing vendors to issue automatic updates at any time can be dangerous. A policy requiring manual approval of updates helps ensure that only fully tested versions of critical software reach your systems.
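As a rough illustration of what such an update policy might look like in practice, the sketch below holds a vendor update until it has run cleanly on a small test group, been signed off by a person, and waited out a soak period. Everything here is a hypothetical example, including the VendorUpdate record, the may_deploy function, and the 24-hour default; it is not a CrowdStrike or Microsoft feature, and a real deployment would rely on whatever staging controls your management tools provide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class VendorUpdate:
    """A pending update from a security vendor (hypothetical record)."""
    name: str
    received_at: datetime
    passed_test_ring: bool       # ran cleanly on a small group of non-critical machines
    approved_by: Optional[str]   # person who signed off, or None if nobody has


def may_deploy(update: VendorUpdate, now: datetime,
               soak_period: timedelta = timedelta(hours=24)) -> bool:
    """Return True only if the update was tested, approved, and has soaked long enough."""
    if not update.passed_test_ring:
        return False
    if update.approved_by is None:
        return False
    return now - update.received_at >= soak_period
```

For a genuine emergency, an administrator could pass a shorter soak_period (even timedelta(0)) while still requiring the test run and the manual sign-off, which keeps urgent fixes possible without handing the vendor an open door.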
Microsoft is now exploring changes that would limit infosec vendors’ kernel access, similar to the access policies that macOS and Linux maintain on their operating systems.