How CrowdStrike Brought Down 8.5M Computers & What We Can Learn

A bug in CrowdStrike's software crashed 8.5 million Windows computers, crippling industries from emergency services to airlines and hospitals, and causing an estimated $10 billion in damage. The incident exposes serious flaws in how security software integrates with Windows and how automatic updates are delivered, and it raises questions about the reliability of the very systems designed to protect us from cyber threats.

On July 19, 2024, a bug in CrowdStrike's software crashed around 8.5 million Windows computers. This wasn't a cyber attack; it was an error CrowdStrike made in releasing an update to its commercial off-the-shelf security software. Ironically, CrowdStrike's whole purpose is to prevent cyber threats such as malware and phishing attacks. The software designed to keep us safe ended up causing more damage than any cyber attack I know of.

Emergency services had to work without computers, drastically limiting the number of calls they could handle and coordinate. Major airlines like Delta, United, and British Airways experienced disruptions, leading to grounded flights and delays. Hospitals had to cancel surgeries and appointments because they couldn't access patient records. No industry was spared. The damage is estimated to exceed US$10 billion. That's over half of what we spend on Medicare in a year and two-thirds of the government's annual education spending.

This incident highlighted major flaws in how this type of software operates within Microsoft Windows. CrowdStrike and similar security products run at the core (kernel) level of Windows, which means an error in CrowdStrike can crash Windows itself and leave the computer unusable. In the industry, we call this "brittle": it's tough until it breaks, and then it shatters. It's the same pattern we've seen with heavy customisations to products like Salesforce and Dynamics. Those customisations sit at the core level of the product, so an upgrade that alters the core is far more likely to break them, making the whole system brittle.
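For readers who like to see the idea in code, here is a minimal, purely illustrative Python sketch (not CrowdStrike's or Windows' actual code; the component names are invented) of why the same fault is survivable in an ordinary application but fatal in a kernel-level component.

```python
# Illustrative sketch only: a fault in "kernel-level" code has nothing to
# contain it, so it takes the whole system down; the same fault in an
# ordinary application is contained by the operating system.

class SystemCrash(Exception):
    """Raised when a kernel-level component fails: the whole OS goes down."""


def run_component(name: str, kernel_level: bool, faulty: bool) -> str:
    """Run one component and report what happens when it faults."""
    if not faulty:
        return f"{name}: ran normally"
    if kernel_level:
        # Nothing stands between kernel-mode code and the rest of the OS.
        raise SystemCrash(f"{name} faulted in kernel mode -> blue screen")
    # A fault in a normal application only affects that application.
    return f"{name}: crashed, but only that application was affected"


if __name__ == "__main__":
    components = [
        ("Spreadsheet app", False, True),        # user-mode fault: contained
        ("Security sensor driver", True, True),  # kernel-mode fault: fatal
    ]
    for name, kernel_level, faulty in components:
        try:
            print(run_component(name, kernel_level, faulty))
        except SystemCrash as crash:
            print(f"SYSTEM DOWN: {crash}")
```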

The second flaw comes from the desire to fix newly discovered or "zero-day" vulnerabilities on customers' systems as quickly as possible, using automatic "push" updates that require no human intervention. That means code changes to millions, potentially billions, of computers flow from a single source (CrowdStrike in this case). The risks in that architecture are obvious, and this incident demonstrated some of them. Ironically, because the bug crashed the computers, the push mechanism couldn't deliver a fix: each of those 8,500,000 machines needed hands-on technical intervention to become operational again.
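A toy simulation makes the failure mode concrete. This is a hypothetical sketch, not CrowdStrike's actual update system: one bad push crashes the fleet, and the same push channel then can't deliver the fix because the crashed machines are no longer listening.

```python
# Toy model of a push-update fleet: a faulty push takes every machine
# offline, so the corrected push has no one left to receive it, and
# recovery requires touching each machine by hand.

from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    online: bool = True

    def apply_update(self, update_is_faulty: bool) -> None:
        if not self.online:
            return  # a crashed machine never receives the push
        if update_is_faulty:
            self.online = False  # the bad update crashes the machine


def push_to_fleet(fleet: list[Machine], update_is_faulty: bool) -> int:
    """Push an update to every machine; return how many are still online."""
    for machine in fleet:
        machine.apply_update(update_is_faulty)
    return sum(m.online for m in fleet)


if __name__ == "__main__":
    fleet = [Machine(f"pc-{i}") for i in range(10)]
    print("After faulty push:", push_to_fleet(fleet, True), "machines online")
    # The corrected update goes out, but nobody is left to receive it.
    print("After pushing the fix:", push_to_fleet(fleet, False), "machines online")
    # Recovery requires a human touching each machine individually.
    for machine in fleet:
        machine.online = True
    print("After manual intervention:", sum(m.online for m in fleet), "machines online")
```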

Because this wasn't a malicious attack, data generally wasn't lost permanently. But imagine a malicious actor gaining temporary control of that push mechanism and sending out an update designed to bring down computers and destroy or encrypt data, demanding a ransom for its recovery. The impact of such an event would be felt for weeks and months, not just a few hours.

The elephant in the room is, "How did the bug get through quality assurance testing?" Ironically, there was also a bug in CrowdStrike's automated testing process, which meant the faulty update passed its checks and was released without really being tested. So it took two bugs to cause this outage: one in the software itself and one in the quality assurance software. For software with this kind of reach, the quality assurance process itself needs quality assurance, because the impact of a mistake can be catastrophic.
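To make "a bug in the QA software" concrete, here is a simplified, hypothetical sketch. The field names and checks are invented for illustration; the point is that a validator with its own bug can wave a corrupt update through, and that a one-line test of the validator itself would have caught it.

```python
# Hypothetical content validator: the buggy version only checks that a
# field exists, not that it contains anything, so an empty/corrupt update
# file still passes. The fixed version and its test show the missing check.

def validate_update(update: dict) -> bool:
    """Return True if the update file looks safe to ship.

    BUG: only checks that the 'rules' field exists, not that it holds
    any rules, so an empty or zeroed-out file still passes.
    """
    return "rules" in update


def validate_update_fixed(update: dict) -> bool:
    """Corrected check: the rules must exist AND be non-empty."""
    return bool(update.get("rules"))


def test_validator_rejects_corrupt_file() -> None:
    """QA for the QA tool: a corrupt file must never pass validation."""
    corrupt_update = {"rules": []}  # present but empty, like a corrupt file
    assert not validate_update_fixed(corrupt_update)
    # The buggy validator would have waved the same file through:
    assert validate_update(corrupt_update)


if __name__ == "__main__":
    test_validator_rejects_corrupt_file()
    print("The corrupt update passes the buggy validator but fails the fixed one.")
```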

Andrew Walker
Technology consulting for charities
https://www.linkedin.com/in/andrew-walker-the-impatient-futurist/

Did someone forward this email to you? Want your own subscription? Head over here and sign yourself right up!

Back issues available here.
