Remember when the world was brought to its computing knees by a flawed release from CrowdStrike? As a QA manager, it hits very close to home, and I've been reading up on how it happened, out of morbid curiosity as much as anything else. Here's my take:
CrowdStrike's Preliminary Post Incident Review does not fill me with warm fuzzies. (links in the replies) They begin by saying 'hey we have two types of releases, one is Sensor Content ( . . . ) "All Sensor Content, including Template Types, go through an extensive QA process, which includes automated testing, manual testing, validation and rollout steps."' IOW 'we test this thoroughly.'
They go on to say that the outage WAS NOT THIS KIND OF RELEASE. So why describe the thorough steps you take for it, when that process wasn't the one followed for the release that caused the outage?
The other type of release is "Rapid Response Content," which is the kind that caused the outage. There's a lot of verbiage, but it seems these releases are tested only by a suite of automated tools. This means we have code checking code: "the Content Validator that performs validation checks on the content before it is published."
"What Happened on July 19, 2024?
On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data."
If I read this correctly -- which is hard given the turgid style in which their release is written -- there was no human testing. The program that checks the release had a flaw. Boom, out the door. Why the story of the other type of release that is tested more carefully? You be the judge.
Current wisdom is that humans are expensive, automation is cheaper, and its repetitive nature rules out human error. But if you automate to the point where nobody knows whether the checking code itself is working properly, you get big problems like the CrowdStrike outage.
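For the testers in the audience, here is roughly what I mean by checking the checker. This is only a minimal sketch under my own assumptions: the field names, the rules, and every function below are hypothetical and have nothing to do with CrowdStrike's actual tooling. The idea is to keep a set of known-bad samples that the validator must reject, and to treat any accepted sample as a failure of the validator itself.

```python
# Hypothetical sketch: field names and rules are invented for illustration
# and are not taken from CrowdStrike's tooling.

REQUIRED_FIELDS = ("name", "pattern", "action")


def validate(instance: dict) -> bool:
    """Return True only if every required field is present and non-empty."""
    return all(instance.get(field) for field in REQUIRED_FIELDS)


def buggy_validate(instance: dict) -> bool:
    """Deliberately flawed: checks that the keys exist, not that they hold values."""
    return all(field in instance for field in REQUIRED_FIELDS)


# Known-bad fixtures: content the validator must always reject.
KNOWN_BAD = [
    {},                                                # nothing at all
    {"name": "x", "pattern": "", "action": "block"},   # empty field
    {"name": "x", "action": "block"},                  # missing field
]

# Known-good fixtures: content the validator must always accept.
KNOWN_GOOD = [
    {"name": "x", "pattern": "a*", "action": "block"},
]


def validator_self_test(validator) -> list[str]:
    """Run a validator against the fixtures and report every wrong verdict."""
    failures = []
    for bad in KNOWN_BAD:
        if validator(bad):
            failures.append(f"accepted bad content: {bad!r}")
    for good in KNOWN_GOOD:
        if not validator(good):
            failures.append(f"rejected good content: {good!r}")
    return failures


if __name__ == "__main__":
    print("validate:", validator_self_test(validate))
    print("buggy_validate:", validator_self_test(buggy_validate))
```

The flawed validator lets the empty-field sample through, and the self-test flags it. A check like this does not replace a human looking at a release, but it at least tells you when the gatekeeper has stopped gatekeeping.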
As the saying goes, "To err is human; to really foul things up requires a computer." — BILL VAUGHAN (1969)
If I have oversimplified here, please correct me.
#Automation #QA #Softwaretesting #Crowdstrike #BSOD