Engineer 1: “The (Java) application is leaking memory and crashing every few days, we need to fix that”
Engineer 2: “Don’t worry about memory, it’s cheap, we can always buy more”
This was a real exchange between two developers at a place I worked. The app in question always ballooned in memory usage and got killed by the OOM killer every few days, causing downtime. It never got fixed; engineers and management kept pushing features instead. This is one reason why it’s important to set Service Level Objectives (SLOs): if you breach the limits, you have to fix what’s causing the breach, and you can’t just push for new features. In other words, SLOs can help give you time to address technical debt.
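To make the SLO argument concrete, here is a minimal sketch of an error-budget check in Java. The 99.9% availability target, the 30-day window, and the downtime figures are illustrative assumptions, not numbers from the incident described above, and the class and method names are hypothetical.

```java
import java.time.Duration;

// Hypothetical helper: turns an availability SLO into a downtime budget.
class ErrorBudget {

    // Downtime allowed by an availability target (e.g. 0.999) over a window.
    static Duration allowedDowntime(double sloTarget, Duration window) {
        if (sloTarget < 0.0 || sloTarget > 1.0) {
            throw new IllegalArgumentException("sloTarget must be between 0 and 1");
        }
        long budgetMillis = Math.round(window.toMillis() * (1.0 - sloTarget));
        return Duration.ofMillis(budgetMillis);
    }

    // True when the observed downtime is larger than the budget,
    // i.e. the point where reliability work should take priority over features.
    static boolean budgetExhausted(double sloTarget, Duration window, Duration observedDowntime) {
        return observedDowntime.compareTo(allowedDowntime(sloTarget, window)) > 0;
    }

    public static void main(String[] args) {
        // Assumed values for illustration only.
        double sloTarget = 0.999;
        Duration window = Duration.ofDays(30);
        // Assumption: three OOM crashes in the window, about 20 minutes of downtime each.
        Duration observed = Duration.ofMinutes(3 * 20);

        System.out.println("Budget (ms): " + allowedDowntime(sloTarget, window).toMillis());
        System.out.println("Budget exhausted: " + budgetExhausted(sloTarget, window, observed));
    }
}
```

Under these assumed numbers, a crash every few days would use up the budget quickly, which is the kind of signal that gives an engineering team grounds to pause feature work.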
cron up a restart 😂
Engineer 1: “The (Java) application is leaking memory and crashing every few days, we need to fix that”
Engineer 2: “Let’s rewrite it in Rust instead” 😁
If the engineer is not the one who responds to alerts when the app goes down and fixes it, even with a manual restart, then it's more of a cultural issue: in this kind of scenario, even if you have SLOs defined and tracked, they will be ignored. Besides, any smart operator in this case will just set up a job to restart the application once a day or something like that. 😁
As always, it depends. How much engineering time would it take to resolve the problem? A day? A week? A month? How much RAM could you buy for that cost? How many features would get pushed back? How much revenue would those features drive? Engineering solutions rarely exist in isolation; they drive value or they don't matter.
SLOs really help with this type of thing. To an engineer, a memory leak seems like a "stop what you are doing and solve this first" type of problem, but to a non-engineering business owner it might be indistinguishable from just an "opinion" on code design. SLOs help align that language around the common need to serve our customers. If x amount of data loss (i.e. everything in memory when the app crashed) and y amount of downtime are perfectly fine, then from a non-technical perspective why would you want to fix it? But once SLOs are established, in 2024 when everything is online, I can't imagine someone not prioritizing this (in 2004, even with SLOs, I can see a lot of companies accepting that type of downtime).
NFRs are a goodness. (this Wikipedia page is a good source for ideas...) https://en.wikipedia.org/wiki/Non-functional_requirement
In situations like that I am always thankful that I learned programming (not software development, mind you) in a time and on a platform where I was happy for every byte I was able to shave off.
#NFRs expose engineering skills. Ask for funds & time to build an AI-based restart that optimally resets memory... maybe everyone will see that as a feature to be built.
DevOps 1: "Let's add a memory check to the liveness probe that will gracefully close the app before it can crash. This will resolve the immediate issue, and the OOM will become tech debt."
DevOps 2: "Let them suffer, or they won't fix the leak until the app crashes every other minute."
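The memory check suggested for the liveness probe could look roughly like the sketch below. It only covers the check itself; how it is exposed to the orchestrator (for example through a health endpoint) is left out. The class name and the 0.9 threshold are hypothetical. Note that used heap as reported by `Runtime` also counts garbage that has not been collected yet, so this is a crude signal rather than proof of a leak.

```java
// Hypothetical liveness helper: reports "not live" once heap usage crosses a threshold,
// so the app can be restarted in a controlled way before the OOM killer steps in.
class MemoryLivenessCheck {

    // Fraction of the maximum heap that is currently in use.
    static double heapUsageRatio(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        return (double) usedBytes / maxBytes;
    }

    // Pure check, easy to unit test with fixed numbers.
    static boolean isLive(long usedBytes, long maxBytes, double threshold) {
        return heapUsageRatio(usedBytes, maxBytes) < threshold;
    }

    // Same check against the running JVM's heap.
    static boolean isLive(double threshold) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return isLive(used, rt.maxMemory(), threshold);
    }

    public static void main(String[] args) {
        // Assumed threshold: fail the probe at 90% heap usage.
        double threshold = 0.9;
        System.out.println("Live: " + isLive(threshold));
    }
}
```

As the reply in the thread points out, this treats the symptom: the leak is still there, and the restarts simply become scheduled rather than surprising.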