We've just shared our annual letter, including how we engineer for reliability as we deploy new code in our core API services ~400 times daily at trillion-dollar scale:

- Once a change is code-complete, it's evaluated by ~1.4 million tests. (Stripe uses half a million CPU cores to execute <6 billion test runs daily!)
- Changes first go to pre-production (a mock production environment with synthetic API traffic designed to mimic realistic integration patterns).
- The change then rolls out to a single production machine with a small sliver of traffic, before gradually advancing to small percentages of actual production traffic.

Each deploy is inspected against 55,000 different metrics. If at any moment the system detects something is wrong, traffic is redirected to an older, known-good version of Stripe.

Read more in the letter: https://lnkd.in/g-Zn9GtZ.
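For readers who want to picture the rollout steps described above, here is a minimal sketch in Python. It is not Stripe's system: the names `staged_rollout`, `RolloutResult`, and `is_healthy`, along with the stage percentages, are hypothetical and chosen only for illustration. The health check stands in for whatever metric inspection a real deploy system would perform.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool          # True if every stage passed its health check
    reached_percent: float   # highest traffic share that passed
    serving_version: str     # version receiving traffic when the rollout ends


def staged_rollout(
    new_version: str,
    known_good_version: str,
    stages: Sequence[float],
    is_healthy: Callable[[str, float], bool],
) -> RolloutResult:
    """Advance new_version through increasing traffic shares.

    `stages` lists traffic percentages in ascending order (hypothetical
    values). `is_healthy` is a caller-supplied check, assumed to compare
    the deploy's metrics against expectations at the given traffic share.
    On the first failed check, traffic goes back to known_good_version.
    """
    reached = 0.0
    for percent in stages:
        if not is_healthy(new_version, percent):
            # Something looks wrong: redirect traffic to the older version.
            return RolloutResult(False, reached, known_good_version)
        reached = percent
    return RolloutResult(True, reached, new_version)


# Example: a change that starts misbehaving once it sees 10% of traffic.
result = staged_rollout(
    new_version="v2",
    known_good_version="v1",
    stages=[0.1, 1.0, 10.0, 50.0, 100.0],
    is_healthy=lambda version, percent: percent < 10.0,
)
```

In this example the rollout stops at the 10% stage, so `result.serving_version` is `"v1"` and `result.reached_percent` is `1.0`.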
Our 2023 annual letter: https://lnkd.in/exsKcKQi.
"Changes first go to pre-production (a mock production environment with synthetic API traffic" We implemented this at one company I worked for, and it created a massive boost in developer productivity and production stability. It caught many difficult-to-test issues that were headed to production. It can be an expensive investment, but in a situation where every transaction is critical and the stack has become complex, it is a worthwhile investment.
As a software engineer, this is inspiring, and as a user I feel much safer using Stripe. Keep it up 👍
It'll take me a team of expert cloud consultants, DevOps professionals specialising in orchestration, and a few test infra experts just to digest these massive numbers:

- 500,000 CPU cores
- 1,400,000 tests (no one's gonna ask about code coverage)
- 6,000,000,000 test runs daily

These are mind-blowing numbers, and it is difficult to fathom the sheer scale of this testing and release infrastructure.
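One way to get a feel for these figures is some back-of-the-envelope arithmetic. The sketch below uses only the three round numbers quoted in this comment. The core-seconds figure assumes every core runs tests around the clock, which the post does not state, so treat it as a loose upper bound.

```python
# Round figures as quoted in the comment above.
CPU_CORES = 500_000
TESTS = 1_400_000
TEST_RUNS_PER_DAY = 6_000_000_000

SECONDS_PER_DAY = 24 * 60 * 60

# Average number of test runs each core would handle per day.
runs_per_core_per_day = TEST_RUNS_PER_DAY / CPU_CORES

# Average number of times each individual test would run per day.
runs_per_test_per_day = TEST_RUNS_PER_DAY / TESTS

# Assumption: all cores are busy with tests 24 hours a day.
# Under that assumption, this is the average core time available per run.
core_seconds_per_run = CPU_CORES * SECONDS_PER_DAY / TEST_RUNS_PER_DAY

print(f"{runs_per_core_per_day:,.0f} runs per core per day")
print(f"{runs_per_test_per_day:,.0f} runs per test per day")
print(f"{core_seconds_per_run:.1f} core-seconds per run (upper bound)")
```

That works out to roughly 12,000 runs per core per day, about 4,300 runs of each test per day, and at most around 7 core-seconds per run.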
Apart from testing being best practice, how do you calculate the ROI of reliability versus infrastructure cost? Assuming infrastructure at this scale is not cheap, do you have a metric for that?
Hi I would like to introduce three exceptional candidates whom I believe would be valuable assets to your team and contribute significantly to the success of your company. While I have refrained from attaching their resumes to keep this initial communication short and concise, I would be more than happy to provide detailed CVs upon your request.
How many testers (what headcount) are required to sustain this level of activity?
No wonder Stripe just works. Every. Single. Time.
Thank you for sharing, David Singleton
"Once a change is code-complete, " I'd be dead interested in what processes come before changes hit the evaluation pipelines at your scale. Do your teams practise continuous delivery, CI, TBD, etc. (assuming something close, with 400 daily changes), and how has that impacted your reliability measures?