At AWS for Financial Services, operational resilience is a core focus, says Chief Technologist Laurent Domb. AWS uses chaos engineering, AI-powered threat detection, and a culture of continuous learning to ensure its systems, and those of its clients, can withstand disruption: https://lnkd.in/e-K_rTRH
PYMNTS’ Post
-
Earlier this month, one of our engineers led a Community of Practice (CoP) session on using AWS Fault Injection Simulator (FIS) for testing Disaster Recovery (DR) plans. The session was inspired by the Disaster Recovery work we had done for one of our pillar clients. One of the key tests involved simulating an Availability Zone failure, where we used AWS FIS to mimic the event and observe the system's response. We felt it would be beneficial to share our learnings on:
- DR planning, and
- Chaos Engineering (AWS FIS being Amazon's Chaos Engineering tool).
Khadeja gave a brief overview of:
- what a DR plan involves,
- why it's critical to test, and
- how AWS FIS can help ensure the plan works effectively.
After all, a plan is only as good as its ability to perform under pressure! AWS FIS has a variety of other use cases, and my hope is that this introduction will encourage others to explore its potential for future projects. #opportunityseekers #wewintogether
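To make the AZ-failure test concrete, here is a minimal sketch of what a FIS experiment template for that scenario might look like, built as a plain dict before handing it to boto3's `fis.create_experiment_template`. The role ARN, alarm ARN, subnet tag, and filter path are illustrative assumptions, not values from the session.

```python
# Sketch of an AWS FIS experiment template simulating an Availability Zone
# outage. All ARNs, tags, and the "AvailabilityZone" filter path are
# placeholder assumptions; verify them against the FIS documentation.

def az_outage_template(role_arn: str, az: str) -> dict:
    """Build a FIS experiment template that disrupts network connectivity
    for tagged subnets in one Availability Zone for 10 minutes."""
    return {
        "description": f"Simulate loss of AZ {az}",
        "roleArn": role_arn,
        "stopConditions": [
            # Halt the experiment automatically if a CloudWatch alarm fires.
            {"source": "aws:cloudwatch:alarm",
             "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:HighErrorRate"},
        ],
        "targets": {
            "az-subnets": {
                "resourceType": "aws:ec2:subnet",
                "resourceTags": {"chaos-ready": "true"},
                "filters": [{"path": "AvailabilityZone", "values": [az]}],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "disrupt-az": {
                "actionId": "aws:network:disrupt-connectivity",
                "parameters": {"scope": "all", "duration": "PT10M"},
                "targets": {"Subnets": "az-subnets"},
            }
        },
    }

template = az_outage_template(
    "arn:aws:iam::111122223333:role/FisExperimentRole", "us-east-1a")
# In a real run: boto3.client("fis").create_experiment_template(**template)
print(template["actions"]["disrupt-az"]["actionId"])
```

Building the template as data first makes it easy to review and version-control the blast radius before anything is actually disrupted.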
-
Have you spent any time thinking about how to make your applications resilient in #AWS?
❓ Why is resilience important? Failures are inevitable, and downtime costs your business money, hurts your reputation with your customers, and reduces your team's productivity.
💡 We can't stop all failures, but the good news is we can architect our systems to be as resilient as possible. From the Reliability pillar of the AWS Well-Architected Framework, there are concepts we can follow to allow a system to mitigate or withstand failures:
🔹 Implement graceful degradation to transform hard dependencies into soft dependencies
🔹 Throttle requests
🔹 Control and limit retry calls
🔹 Fail fast and limit queues
🔹 Set client timeouts
🔹 Make services stateless where possible
🔹 Implement emergency levers
🛠 After building resilient systems, it's also important to test your application's resiliency! In addition to load testing or unit tests, you should run chaos experiments using AWS Fault Injection Simulator. Chaos experiments are a great way to inject failures into your application to make sure it withstands those failures as expected. This is a critical step when designing resilient applications.
Check out these resources to learn more:
AWS Well-Architected Reliability Pillar: https://lnkd.in/gdDfJnpb
Chaos experiments using AWS Step Functions and AWS Fault Injection Simulator: https://lnkd.in/gF4myhU6
Standing on the shoulders of giants: Colm on constant work: https://lnkd.in/gZ89vDnA
Find this helpful? ♻️ Share this to your network!
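Two of the patterns above, controlled retries and failing fast, can be sketched in a few lines. This is a minimal illustration, not a production client; the attempt count and backoff base are arbitrary assumptions.

```python
import random
import time

# Minimal sketch of two Reliability-pillar patterns: bounded retries with
# exponential backoff plus jitter, and failing fast once the retry budget
# is exhausted instead of letting work pile up in queues.

def call_with_retries(operation, max_attempts=3, base_delay=0.05):
    """Invoke `operation`, retrying transient failures a limited number
    of times. Re-raises the last error once the budget is spent."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: fail fast
            # Full jitter spreads retries out so a struggling dependency
            # isn't hammered by synchronized clients.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))

# Simulated flaky dependency: times out twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("dependency timed out")
    return "ok"

print(call_with_retries(flaky))  # succeeds on the third attempt
```

Capping attempts and backing off turns an unbounded retry storm into a predictable, limited load on the downstream service.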
-
Purposeful fault injection experiments, also known as #ChaosEngineering, help teams create the real-world conditions needed to uncover hidden bugs, monitor vulnerabilities, and manage bottlenecks that are difficult to find in distributed systems. Working with teams at LSEG (London Stock Exchange Group), my colleagues Sudha Arumugam, Elias Bedmar, and I used the Amazon Web Services (AWS) methodology #ExperienceBasedAcceleration to run chaos engineering experiments with #AWSFaultInjectionService. Read more in our blog: https://lnkd.in/e_PzgTcT
-
And finally, this is my third session at re:Invent this year. I'll be co-presenting this breakout session with Joshua Kahn, our second talk together.
SVS324 | Implementing security best practices for serverless applications
Building with serverless enables organizations to build and deploy applications without managing underlying infrastructure, and it strengthens your overall security posture by reducing the attack surface and shifting security operations to AWS. In this session, we explore how to implement security best practices across the software delivery lifecycle and into production deployments. We'll share lessons from working with numerous enterprise customers on enabling builders to be productive and innovative within security guardrails, along with practical defense-in-depth strategies for building and deploying serverless applications. You'll see core security practices with serverless services like AWS Lambda and Amazon API Gateway, and will learn about validation of untrusted payloads, authorization approaches, tagging strategies, and application of permission boundaries for developers. You will leave with concrete practices that you can implement.
In the meantime, you can visit our governance guide on serverlessland.com for additional content: https://lnkd.in/eFszbSPf.
#aws #reinvent #serverless #security
-
DevSecOps Engineer | Linux | Cloud | Automation | Jenkins | Ansible | Terraform | CI/CD | Cybersecurity | SecOps | CompTIA Pentest+ | ISO 27001
Dear Network,
In today's world of microservices and distributed architectures, disaster recovery (DR) plans must evolve to address more than just natural or man-made calamities. While traditional DR strategies focus on recovering data after major incidents, modern applications face more subtle and complex challenges. What happens if a microservice glitches or unexpectedly impacts other services? Shutting down the entire application and relocating it, along with the downtime and data loss, isn't an efficient solution.
But what if we could introduce and test potential failures in a controlled environment before they occur? This is where Chaos Engineering comes into play. Imagine injecting faults into microservices, observing how related services react, and proactively strengthening your system's resilience. By doing so, we prepare our applications to withstand real-world disruptions with minimal impact.
Chaos Engineering involves thoughtful, planned experiments to understand how our systems behave under failure. The process typically follows five steps:
1. Define the steady state – Capture metrics when all components work as expected.
2. Form a hypothesis – Predict how the system should behave when something goes wrong.
3. Design a small experiment – Test your hypothesis with minimal disruption.
4. Run the experiment – Measure the impact and analyze the results.
5. Learn and iterate – Use the insights to strengthen your system.
I'm excited to announce that I've completed my Chaos Engineering course with KodeKloud, where I practiced using AWS Fault Injection Service to test the resilience of microservices and cloud resources.
#ChaosEngineering #Resilience #Microservices #Cloud #AWS #KodeKloud #DevOps #AWS_Fault_Injection
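The five steps above can be sketched as a tiny experiment harness. The "system" here is a toy metric function with an injectable fault; in practice the injection step would be delegated to a tool such as AWS Fault Injection Service, and the 90% threshold is an assumed hypothesis, not a standard.

```python
# Minimal harness mirroring the five chaos engineering steps.

def run_chaos_experiment(measure, hypothesis):
    # 1. Define the steady state: capture metrics while all is healthy.
    steady = measure(faulty=False)
    # 2. Form a hypothesis about behavior under failure (passed in).
    # 3 & 4. Design a small experiment and run it, measuring the impact.
    degraded = measure(faulty=True)
    # 5. Learn and iterate: report whether the hypothesis held.
    return {"steady": steady, "degraded": degraded,
            "hypothesis_held": hypothesis(steady, degraded)}

# Toy metric: request success rate. With the fault injected, a fallback
# keeps the rate at 92% instead of dropping to zero.
def success_rate(faulty: bool) -> float:
    return 0.92 if faulty else 1.0

result = run_chaos_experiment(
    measure=success_rate,
    hypothesis=lambda steady, degraded: degraded >= 0.9,
)
print(result["hypothesis_held"])  # True: the system degraded gracefully
```

The value of the harness is the structure: every experiment produces a before/after measurement and an explicit verdict on the hypothesis, which is what step 5 iterates on.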
-
SDE @Amazon | Google women Techmaker Ambassador | Runner-up SBI Innovate for Bank | Microsoft Fixathon finalist | Ex-DS Intern @Poshmark | Ex-Intern @IIT Bombay | Avocation IG - Art by Abha(@flavours_of_art_)
Scalability is at the heart of building robust systems. And the more I work on scalable architectures, the more I realize how important it is to keep this one thing in mind: optimize for future growth, not just the present problem. Here's what I've learned about scaling systems effectively:
* Design for Failure, Not Perfection: No system is perfect, and failures will happen. The key is to design with failure in mind—redundancy, failover strategies, and circuit breakers are essential. If you can recover from failure seamlessly, your system can scale without breaking under pressure.
* Leverage Asynchronous Processing: Synchronous operations can be bottlenecks. By embracing asynchronous processing (think queues, workers, and event-driven architecture), you can keep your system responsive even when dealing with high loads.
* Monitoring and Metrics Matter: You can't improve what you don't measure. Scalability isn't just about performance under load, it's about knowing when your system is struggling. Setting up effective monitoring and alerting tools is crucial to maintain system health and anticipate bottlenecks before they become critical.
* Horizontal vs Vertical Scaling: While vertical scaling (adding more power to a single machine) is tempting, horizontal scaling (adding more machines to share the load) is often more sustainable and cost-effective long term. Balancing between these two strategies depends on your application, but embracing horizontal scaling gives more flexibility.
When you're building systems at scale, small decisions can have massive downstream effects. Focus on designing systems that not only work well today but will continue to work as they grow.
What's been your biggest lesson when dealing with scalable systems? Drop a comment below—let's exchange knowledge! 💬
#Scalability #SystemDesign #SoftwareEngineering #TechTips #AWS #SDE #TechCommunity #CloudComputing #Amazon #HighAvailability #Architecture #TechKnowledge #DistributedSystems
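The asynchronous-processing point can be illustrated with the standard-library queue/worker pattern: producers enqueue work and return immediately while a small worker pool drains the queue in the background. This is a single-process sketch; in a real system the queue would be an external broker (e.g. SQS or RabbitMQ) and the workers separate processes.

```python
import queue
import threading

# Producer/worker sketch: the producer stays responsive because it only
# enqueues tasks; a pool of background workers does the actual processing.

tasks: "queue.Queue" = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        item = tasks.get()
        if item is None:              # sentinel: shut this worker down
            tasks.task_done()
            return
        with lock:
            results.append(item * item)   # stand-in for real work
        tasks.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for i in range(10):                   # producer enqueues and moves on
    tasks.put(i)
for _ in workers:                     # one shutdown sentinel per worker
    tasks.put(None)

tasks.join()                          # wait for the queue to drain
for w in workers:
    w.join()

print(sorted(results))
```

Because the queue decouples producers from workers, scaling out is just a matter of adding workers, which is exactly the horizontal-scaling flexibility the post describes.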
-
Everything breaks. Faults are unavoidable. Don’t pretend you can eliminate every possible source of them, because either nature or nurture will create bigger disasters to wreck your systems. Assume the worst. We need to examine what happens after the fault creeps in. #devops #sre #observability #kubernetes #aws
How does the term "unknown unknowns" apply to Kubernetes? Imagine your system facing problems it doesn't even know exist, making it unable to tackle them proactively. These issues often stem from events outside the scope of regular applications and APIs, like natural disasters or human errors. While systems like Kubernetes can't control everything, understanding your environment's limitations allows you to design clusters that work within those bounds.
In AWS, this means factoring in failure domains and service boundaries, especially when spanning multiple Regions or Availability Zones. AWS treats Regions and Availability Zones as potential failure domains, emphasizing the need for high availability in infrastructure. For instance, if you want your website to remain operational even if a Region goes down, you must deploy it across multiple Regions. Within each Region, your application should run in multiple Availability Zones to ensure continued operation across different failure scenarios. Although running Kubernetes in a single Availability Zone can handle EC2 instance failures, it won't shield you from an entire Zone outage. Best practice dictates creating highly available clusters across multiple AZs to maintain application availability even during Zone failures.
The STELLA report by David D. Woods of Ohio State University studied large IT outages:
- Each anomaly arose from unanticipated, unappreciated interactions between system components.
- There was no 'root cause.' Instead, the anomalies arose from multiple latent factors that combined to generate a vulnerability.
- The vulnerabilities existed for weeks or months before they contributed to the evolution of an anomaly.
- The events involved external software/hardware (e.g., a server or application from a vendor) and locally developed, maintained, and configured software (e.g., programs developed 'in-house,' automation scripts, and configuration files).
- Specific events, conditions, or situations activated the vulnerabilities. The activators were minor events, near-nominal operating conditions, or slightly off-normal situations.
It's no coincidence that outages happen in areas where we don't have a good mental model. They happen there because we don't have a good mental model. What kind of outages have you seen in a system you worked with?
#kubernetes #aws #devops #vulnerability #observability
-
The golden (and surprising) rule of prompt engineering: "Show your prompt to a friend and ask them if they can follow the instructions and produce the results you are looking for." Learn this and other useful prompting techniques next week during Elina Lesyk's and my session "Prompt engineering best practices for LLMs on Amazon Bedrock" (AIM302) at AWS Summit Berlin, May 15 + 16! https://lnkd.in/dXZEncU4
-
Our latest blog dives into how Amazon Rekognition transforms safety measures within educational networks. Learn how the START Foundation leveraged this technology to filter out harmful content, ensuring positive communication and secure interactions. Read more: https://buff.ly/48vqJvp
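For readers curious how such filtering typically works: Rekognition's `detect_moderation_labels` API returns a list of moderation labels with confidence scores, and the application decides what to block. The sketch below shows only the filtering logic over that documented response shape; the boto3 call is left as a comment, and the 80% threshold and sample labels are illustrative assumptions, not the Foundation's actual configuration.

```python
# Hedged sketch of content filtering over an Amazon Rekognition
# DetectModerationLabels response. The real call would be:
#   rekognition = boto3.client("rekognition")
#   response = rekognition.detect_moderation_labels(Image={"Bytes": image_bytes})
# Here we only process a response in that documented shape.

def unsafe_labels(response: dict, min_confidence: float = 80.0) -> list:
    """Return moderation label names detected above a confidence threshold."""
    return sorted(
        label["Name"]
        for label in response.get("ModerationLabels", [])
        if label["Confidence"] >= min_confidence
    )

# Example response (fabricated values, documented field names).
sample = {
    "ModerationLabels": [
        {"Name": "Violence", "Confidence": 97.3, "ParentName": ""},
        {"Name": "Suggestive", "Confidence": 42.1, "ParentName": ""},
    ]
}

flagged = unsafe_labels(sample)
print(flagged)  # only labels above the 80% threshold survive
```

Keeping the threshold in application code rather than hard-coding a block list makes it easy to tune how strict the filter is per audience.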
-
Don't miss these expert insights on achieving maximum resiliency for your AWS network deployment. Learn how to solve the challenges of a single connection, ensure data residency compliance, and take advantage of geo-resilient options. Watch this 20-minute Tech Talk today. https://eqix.it/4cHDC7i