The Domino Effect

It was a nice and peaceful project. A small #kubernetescluster deployed on #baremetal infrastructure. A standard mid-size management cluster with a GitLab repository and registry, a couple of project management systems, documentation, and tracking services. All the data lived on persistent storage managed by the Longhorn controller. It was a nice and peaceful project.

One day a #devopsengineer was asked to address a minor issue - some users had hit an upload limit. They were unable to upload more than 50 MB of data to the media system, a self-hosted application responsible for file sharing, among other things.

“A piece of cake,” thought our seasoned engineer. “All I need to do is increase the post_max_size setting in the php.ini file. Anybody can do that.” But he had solid experience and knew that before making any change, it’s vital to reproduce the issue first. Maybe there’s no issue at all.

“That’s a very valid point,” a fellow reader might think, and they would be right. Yet the first domino piece, very small but no less critical, had already been nudged. Not hard enough to change anything on its own, but enough to set the whole system in motion.

Let’s get back to our hero. He decided to upload some large files to probe the actual upload limit and picked the #doctorwho episode about the Weeping Angels. Why? Because the file was about 3 GB in size, and because the episode is just brilliant.

The upload passed the 1.5 GB mark, and the #engineer reported that he couldn’t reproduce the issue. He made himself a coffee and came back to check whether the upload had succeeded. That was when the first domino piece fell. The media application was down. “A mere coincidence,” he thought, “or no more than an OOM issue.” Oh, he couldn’t have been more wrong.

The application’s collapse triggered an improper detachment of the mounted volume. A new replica tried to take its place but couldn’t start because of the multi-attach limitation: the volume was ReadWriteOnce. The engineer force-terminated the pod to speed things up, but Longhorn had already detected the problem and switched the volume to read-only to protect the data until it could be repaired. #Kubernetes kept respawning containers, which inevitably failed because the filesystem was in read-only mode. “Weird,” he thought, “at least it’s just the media.” And this is where he was wrong again.
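(A side note for the curious, not part of the original setup: the access mode at the heart of this paragraph is declared on the PersistentVolumeClaim. A rough sketch with the official Kubernetes Python client, using hypothetical names and sizes, might look like this.)

    # Illustrative sketch only: a Longhorn-backed PVC requested as ReadWriteOnce,
    # the access mode that stops a second pod from attaching the volume while the
    # old attachment is still being torn down. Names and sizes are assumptions.
    from kubernetes import client, config

    def build_media_pvc() -> client.V1PersistentVolumeClaim:
        return client.V1PersistentVolumeClaim(
            metadata=client.V1ObjectMeta(name="media-data"),
            spec=client.V1PersistentVolumeClaimSpec(
                access_modes=["ReadWriteOnce"],   # only one node may mount it read-write
                storage_class_name="longhorn",    # Longhorn provisions and replicates the volume
                resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
            ),
        )

    if __name__ == "__main__":
        config.load_kube_config()  # assumes a local kubeconfig pointing at the cluster
        client.CoreV1Api().create_namespaced_persistent_volume_claim(
            namespace="media", body=build_media_pvc()
        )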

- “ImagePullBackOff,” Kubernetes replied. 

- “What the heck?!” shouted our hero. 

It was the GitLab registry going down due to the same problem - Longhorn had made its volume read-only. Next, Postgres and Redis went down, which could mean only one thing - there was no way to download any Docker image anymore.

The media service was followed by others. First, the ticketing system became unavailable, desperately trying to pull at least something from the registry. Then came the documentation server, the tracking service, and so on, until the whole cluster collapsed into nothingness.

What was our engineer doing all this time? Rapidly turning grey and asking himself why he hadn’t become a gardener - nothing falls apart this quickly when it comes to flowers. He had no clue what to do at all. A moment before he started recovery from backups, he noticed a tiny but hopeful light: GitLab’s Postgres had managed to start. He was about to cry. What is that if not a miracle? Gandalf and the Rohirrim arriving at dawn on the third day, saving Helm’s Deep a moment before its fall - that’s what it was for him. Services started to come up, which could mean only one thing: Longhorn had managed to repair the volumes. And in a matter of two minutes, the whole cluster was back online.

The whole story unfolded within 20-25 minutes. No one knew how close to disaster they had been; only the #engineer did. He never figured out whether it was a misconfiguration, bad luck, or destiny, and he never told anyone about it. A long time has passed since then, and our hero has learned what disaster recovery training and chaos #engineering are, and what a no-blame culture means. But any time he uploads something bigger than 1.5 GB, our engineer gets goosebumps and a cold sweat, followed by a quiet inner voice: “Are you ready to face the consequences this time?”
