Definition

What is software resilience testing?

By Sarah Lewis

Software resilience testing is a method of software testing that checks whether applications continue to perform well in real-life or chaotic conditions. In other words, it tests an application's resiliency, or ability to withstand stressful or challenging factors. Resilience testing is one type of nonfunctional software testing, alongside compliance, endurance, load and recovery testing. This form of testing is sometimes also referred to as software resilience engineering, application resilience testing or chaos engineering.

Because failures cannot be avoided entirely, resilience testing helps ensure that software can continue performing core functions and avoid data loss even when under stress. Downtime can be detrimental to the success of an organization, so it is crucial to minimize disruptions and prepare for unwanted scenarios. Resilience testing can be considered one part of an organization's business continuity plan.
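As a small illustration of "avoid data loss even when under stress," the sketch below buffers writes while a data store is unavailable and replays them once it returns. The names FlakyStore, BufferedWriter and StoreUnavailable are hypothetical and exist only for this example; a real system would also need durable buffering and limits on buffer size.

```python
from collections import deque


class StoreUnavailable(Exception):
    """Raised by the simulated store while an outage is injected."""


class FlakyStore:
    """Hypothetical data store whose availability a test can switch off."""

    def __init__(self):
        self.records = []
        self.available = True

    def write(self, record):
        if not self.available:
            raise StoreUnavailable("injected outage")
        self.records.append(record)


class BufferedWriter:
    """Keeps records in memory until the store has accepted them."""

    def __init__(self, store):
        self.store = store
        self.pending = deque()

    def save(self, record):
        # Queue first, so the record is retained even if the write fails.
        self.pending.append(record)
        self.flush()

    def flush(self):
        # Replay pending records in order; stop at the first failure.
        while self.pending:
            try:
                self.store.write(self.pending[0])
            except StoreUnavailable:
                return False
            self.pending.popleft()
        return True


store = FlakyStore()
writer = BufferedWriter(store)

writer.save("order-1")        # normal operation
store.available = False       # inject the disruption
writer.save("order-2")
writer.save("order-3")
pending_during_outage = list(writer.pending)
store.available = True        # the store recovers
writer.flush()
```

A resilience test for this behavior would assert that every record saved during the outage reaches the store, in order, after recovery.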

How resilience testing works

Resilience testing is part of the software development lifecycle and starts with an application that needs to be tested. Once an application is selected, organizations set up a test environment in which to conduct resilience testing. In general, conducting a resilience test involves the following steps:

Determining metrics. Developers must choose which metrics to measure in order to reflect the performance of the software. These could include input and output rates, throughput, time to recovery, latency and the relationships between metrics.
Identifying the performance baseline. Next, teams establish a baseline for the maximum load the software can handle without experiencing performance issues. The baseline shows the regular variance in performance and serves as a reference point for comparing metrics during testing.
Introducing and measuring disruptions. This is the step where challenges are introduced to try to break the system. Testers can break the system in a variety of ways, such as disrupting communication with external dependencies, injecting malicious input, manipulating traffic control, constraining bandwidth, shutting down interfacing systems, deleting data sources and consuming system resources. After each scenario is complete, the chosen metrics should be measured and plotted to show how that disruption affected performance.
Drawing conclusions and responding to results. Finally, teams should analyze the results and use them to determine how to fix the software and to assess developer team practices. Teams should also use these findings to improve later testing scenarios.
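The four steps above can be sketched as a tiny, fully simulated harness. This is a minimal sketch, not a real testing tool: FakeDependency, handle_request and the fixed latency values are assumptions made up for illustration, and a real test would inject faults into an actual test environment and measure real traffic.

```python
import statistics
from dataclasses import dataclass


class DependencyDown(Exception):
    """Raised by the simulated dependency while a fault is injected."""


@dataclass
class FakeDependency:
    """Hypothetical external dependency with a switchable fault."""
    latency_ms: float = 20.0
    down: bool = False

    def call(self) -> float:
        if self.down:
            raise DependencyDown("injected outage")
        return self.latency_ms


def handle_request(dep: FakeDependency, cache_latency_ms: float = 5.0):
    """System under test: use the dependency, fall back to a cache on failure."""
    try:
        return True, dep.call()
    except DependencyDown:
        return False, cache_latency_ms  # degraded, but still answering


def run_phase(dep: FakeDependency, requests: int) -> dict:
    """Step 1: the metrics chosen here are full-service ratio and mean latency."""
    results = [handle_request(dep) for _ in range(requests)]
    return {
        "full_service_ratio": sum(ok for ok, _ in results) / requests,
        "mean_latency_ms": statistics.mean(lat for _, lat in results),
    }


def resilience_test(requests_per_phase: int = 100) -> dict:
    dep = FakeDependency()
    baseline = run_phase(dep, requests_per_phase)    # step 2: baseline
    dep.down = True                                  # step 3: inject disruption
    disrupted = run_phase(dep, requests_per_phase)   # step 3: measure under fault
    dep.down = False                                 # remove the fault
    recovered = run_phase(dep, requests_per_phase)   # measure after recovery
    return {"baseline": baseline, "disrupted": disrupted, "recovered": recovered}


# Step 4: compare the phases to draw conclusions.
report = resilience_test()
for phase, metrics in report.items():
    print(phase, metrics)
```

Comparing the "disrupted" and "recovered" phases against the baseline shows whether the system kept answering during the fault and whether it returned to normal performance afterward.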

Importance of resilience testing

Resilience tests help minimize failure and security issues in the presence of a challenge. Examples of challenges that resilience testing helps defend against include power outages, system crashes, downtime and natural disasters. Additionally, resilience testing can help assess conformance to standards and best practices, privacy issues and scalability.

Resilience testing is especially important in multi-tier, multi-environment infrastructures. One way to improve resilience is to migrate software to the cloud in order to minimize the chance of internal system failure. While disruptions can occur in the cloud, providers tend to have advanced recovery systems in place.

Reliability vs. resilience

Two terms that often get confused when applied to software are reliability and resilience. Resilience, also known as recoverability, is the ability to regain an ideal state or recover rapidly after undergoing a challenge. Reliability is the target that developers aim for: a system that operates perfectly, with no downtime. When testing for resilience, reliability is the planned outcome.
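One way to make the distinction concrete is to compute two numbers from the same outage log. The sketch below uses availability (the share of time the system was up) as a rough proxy for reliability, and mean time to recovery as a rough proxy for resilience. Both proxies, the Outage type and the sample outage times are assumptions chosen for illustration, not definitions from this article.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One observed outage, in seconds from the start of the window."""
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def availability(outages, window_s: float) -> float:
    """Fraction of the observation window with no outage (reliability proxy)."""
    downtime = sum(o.duration_s for o in outages)
    return 1.0 - downtime / window_s


def mean_time_to_recovery(outages) -> float:
    """Average outage duration in seconds (resilience proxy)."""
    if not outages:
        return 0.0
    return sum(o.duration_s for o in outages) / len(outages)


# Hypothetical one-hour window with a 30-second and a 90-second outage.
outages = [Outage(100, 130), Outage(2000, 2090)]
window_s = 3600

uptime_fraction = availability(outages, window_s)
mttr_s = mean_time_to_recovery(outages)
print(f"availability: {uptime_fraction:.4f}, MTTR: {mttr_s:.0f} s")
```

A system can score well on one measure and poorly on the other: frequent outages that are repaired quickly give a low mean time to recovery but still reduce availability.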

This was last updated in August 2024
