Devops

Software Resilience

Software resilience testing is a method of software testing that focuses on ensuring that applications will perform well in real-life or chaotic conditions. In other words, it tests an application’s resiliency, or ability to withstand stressful or challenging factors.

Resilience testing is one part of non-functional software testing that also includes compliance, endurance, load, and recovery testing.

Since failures can never be avoided, resilience testing ensures that software can continue performing core functions and avoid data loss even when under stress.

In today’s world, system downtime is not an option. If a user can’t access an application once, chances are that they will never use it again. Resiliency, which in simple terms is the ability of a system to gracefully handle and recover from failures, thus becomes critical.

Testing resiliency ensures the system’s ability to absorb the impact of a problem while continuing to provide an acceptable level of service to the business.

This concept was originally introduced by Netflix in the Principles of Chaos Engineering.

To build your test strategies for resilient systems, you should:

1)Conduct a failure mode analysis by reviewing the design of the system. In simple terms, this means identifying all the components, internal and external interfaces, and identifying potential failures at every point. Once failure points are identified, validate that there are alternatives to failure.

2)Validate data resiliency, i.e. that there is a mechanism for data to be available to applications if the system that originally hosted the data fails. Verify that the data backup process is either documented or automated.

If automated, validate that the automated script backs up data correctly, maintaining integrity and schema.

3)From an infrastructure standpoint, configure and test health probes for load balancing and traffic management. These ensure that the system is not limited to a single region for deployment in case of latency issues.

4)From an application standpoint, conduct fault injection tests for every application in your system. Scenarios include shutting down interfacing systems, deleting certificates, consuming system resources, and deleting data sources.

5)Conduct critical tests in production with well-planned canary deployments.

Validate that there is an automated rollback mechanism for code in production in case of failure.

Thanks for reading this article. Please share your comments, questions and feedback. Would be more than happy to help.

vasu34k