Software Resilience

Software resilience testing is a method of software testing that focuses on ensuring that applications will perform well in real-life or chaotic conditions. In other words, it tests an application’s resiliency, or ability to withstand stressful or challenging factors.

Resilience testing is one type of non-functional software testing, alongside compliance, endurance, load, and recovery testing.

Because failures cannot be avoided entirely, resilience testing checks that software can continue performing its core functions and avoid data loss even under stress.

Users have little tolerance for downtime. A user who cannot reach an application may not come back to it. Resiliency, which in simple terms is the ability of a system to handle failures gracefully and recover from them, is therefore critical.

Testing resiliency verifies that the system can absorb the impact of a problem while continuing to provide an acceptable level of service to the business.
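
To make "absorbing the impact of a problem" concrete, here is a minimal Python sketch, not a definitive implementation. It assumes a hypothetical price lookup whose primary data source can fail. The names PriceService, FlakySource, and fetch_live are invented for illustration. The last lines simulate an outage and show the service still returning a degraded but usable answer from its cache.

```python
from typing import Callable, Dict, Tuple


class PriceService:
    """Hypothetical service that degrades to cached data when its source fails."""

    def __init__(self, fetch_live: Callable[[str], float]) -> None:
        self._fetch_live = fetch_live        # primary dependency (may raise)
        self._cache: Dict[str, float] = {}   # last known good values

    def get_price(self, item: str) -> Tuple[float, str]:
        """Return (price, source), where source is 'live' or 'cache'."""
        try:
            price = self._fetch_live(item)
            self._cache[item] = price        # remember the last good value
            return price, "live"
        except ConnectionError:
            if item in self._cache:
                return self._cache[item], "cache"  # degraded but usable
            raise                            # nothing to fall back on


class FlakySource:
    """Test double whose failure can be switched on to simulate an outage."""

    def __init__(self) -> None:
        self.down = False

    def __call__(self, item: str) -> float:
        if self.down:
            raise ConnectionError("simulated outage")
        return 9.99


source = FlakySource()
service = PriceService(source)

healthy = service.get_price("book")    # normal operation, fills the cache
source.down = True                     # inject the failure
degraded = service.get_price("book")   # answered from the cache
```

A resilience test then asserts on two things: that the degraded response is still acceptable, and that a request with no fallback fails loudly instead of returning wrong data.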

The practice of deliberately introducing failures to test resilience was popularized by Netflix and is described in the Principles of Chaos Engineering.

To build your test strategies for resilient systems, you should:

1) Conduct a failure mode analysis by reviewing the design of the system. In simple terms, this means listing all the components and the internal and external interfaces, then identifying potential failures at every point. Once the failure points are identified, validate that each one has a fallback or mitigation.

2) Validate data resiliency, i.e. that there is a mechanism for data to remain available to applications if the system that originally hosted the data fails. Verify that the data backup process is either documented or automated. If it is automated, validate that the script backs up data correctly, preserving data integrity and schema.

3) From an infrastructure standpoint, configure and test health probes for load balancing and traffic management. These let traffic be routed away from unhealthy or slow instances, so that the system does not depend on a single deployment region when that region has latency issues.

4) From an application standpoint, conduct fault injection tests for every application in your system. Scenarios include shutting down interfacing systems, deleting certificates, consuming system resources, and deleting data sources.

5) Conduct critical tests in production with well-planned canary deployments. Validate that there is an automated rollback mechanism for code in production in case of failure.
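
The automated rollback check in step 5 can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not a production rollout controller. The CanaryStats type, the should_roll_back function, and the thresholds (max_error_ratio, min_requests, and the 1% floor) are all hypothetical. A real pipeline would read these numbers from its monitoring system and trigger the rollback through its own deployment tool.

```python
from dataclasses import dataclass


@dataclass
class CanaryStats:
    """Request counts observed for one deployment during the canary window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when no traffic has arrived yet.
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(baseline: CanaryStats, canary: CanaryStats,
                     max_error_ratio: float = 2.0,
                     min_requests: int = 100) -> bool:
    """Decide whether the canary should be rolled back automatically.

    Assumed policy: wait until the canary has served min_requests requests,
    then roll back if its error rate exceeds max_error_ratio times the
    baseline error rate. A 1% floor keeps a near-zero baseline from making
    the check overly strict.
    """
    if canary.requests < min_requests:
        return False                      # not enough traffic to judge yet
    allowed = max(baseline.error_rate * max_error_ratio, 0.01)
    return canary.error_rate > allowed


baseline = CanaryStats(requests=10_000, errors=50)   # 0.5% errors
bad_canary = CanaryStats(requests=500, errors=25)    # 5% errors
good_canary = CanaryStats(requests=500, errors=3)    # 0.6% errors
```

With these sample numbers, should_roll_back(baseline, bad_canary) returns True and should_roll_back(baseline, good_canary) returns False. Keeping the decision in a small pure function makes the rollback rule itself easy to test before it is trusted in production.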

Thanks for reading this article. Please share your comments, questions, and feedback. I would be more than happy to help.

vasu34k
