The Future of Resilient Architecture

by Justin Cook

Many of my customers commit to a DevOps mindset and embrace infrastructure as code (IaC) and CI/CD, but they often move so quickly that they overlook long-term resiliency. A fully managed continuous delivery service such as AWS CodePipeline lets teams model and automate their software release processes for fast, reliable application and infrastructure updates. Automating your build, test, and release process allows you to quickly test each code change, but when those pipelines go down, broader resiliency practices are what keep the platform live.

Lately, many customers want to know about AWS Fault Injection Simulator (FIS). AWS Resilience Hub generates recommended AWS FIS experiments that you can deploy and use to test the resilience of your application. In addition to assessing resilience, we recommend running these tests as part of your pipeline.

AWS Fault Injection Simulator (FIS) is a managed service that enables you to perform fault injection experiments on your AWS workloads.

Remember that fault injection at AWS is based on the principles of chaos engineering. These experiments stress an application by creating disruptive events so that you can observe how your application responds. You can then use this information to improve the performance and resiliency of your applications. With AWS FIS, you set up and run experiments that help you create the real-world conditions needed to uncover application issues.
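To make the idea of running an experiment from a pipeline concrete, here is a minimal Python sketch. It assumes a client that behaves like boto3.client("fis") and an experiment template that already exists; the function name run_fis_experiment and the template ID are hypothetical, not something from AWS or from my own setup.

```python
import time
import uuid

# Statuses that mean the experiment is still in flight; anything else is
# treated as terminal by this sketch.
IN_FLIGHT = {"pending", "initiating", "running", "stopping"}


def run_fis_experiment(fis_client, template_id, poll_seconds=15, sleep=time.sleep):
    """Start an AWS FIS experiment and block until it leaves the in-flight states.

    fis_client is expected to behave like boto3.client("fis").
    template_id identifies an existing experiment template (hypothetical here).
    Returns the final status string reported by the service.
    """
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token for the request
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]

    while True:
        state = fis_client.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] not in IN_FLIGHT:
            return state["status"]
        sleep(poll_seconds)
```

In a real pipeline you would pass boto3.client("fis") as fis_client and fail the build when the returned status is anything other than "completed". Injecting the client and the sleep function keeps the sketch easy to exercise without touching an AWS account.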

This is one of my favorite graphics on chaos engineering, from an AWS white paper:

A key process is to GUARANTEE the quality of your application or infrastructure code by running each change through your staging and release process.

By automatically running a resiliency assessment within CI/CD pipelines, development teams can fail fast and quickly understand whether a change negatively impacts the resilience of an application.

One of the most used features, and my favorite, is the pipeline's ability to stop a deployment that contains resilience issues from moving into further environments. This can stop bad code in its tracks before it reaches users.

AWS Resilience Hub is a managed service that gives you a central place to define, validate, and track the resiliency of your AWS applications. It is integrated with AWS Fault Injection Simulator (FIS), a chaos engineering service, to provide fault-injection simulations of real-world failures. Using AWS Resilience Hub, you can assess your applications to uncover potential resilience enhancements. This allows you to validate your application's recovery time objective (RTO) and recovery point objective (RPO) and to optimize business continuity while reducing recovery costs. Resilience Hub also provides APIs so you can integrate its assessment and testing into your CI/CD pipelines for ongoing resilience validation.

AWS CodePipeline is a fully managed continuous delivery service for fast and reliable application and infrastructure updates. You can use AWS CodePipeline to model and automate your software release processes. This enables you to increase the speed and quality of your software updates by running all new changes through a consistent set of quality checks.

KEEP QUOTAS IN MIND: I hit a wall with this. AWS Resilience Hub allows you to run 20 assessments per month per application. If you need to increase this quota, please raise a ticket with support, and they can help.

Continuous resilience assessments

Figure 1 shows the resilience assessment automation architecture in a multi-account setup. AWS CodePipeline, AWS Step Functions, and AWS Resilience Hub are defined in your deployment account, while the application's AWS CloudFormation stacks are imported from your workload account. This pattern relies on the ability of AWS Resilience Hub to import CloudFormation stacks from a different account, Region, or both when discovering an application's structure.

Add application to AWS Resilience Hub

Begin by adding your application to AWS Resilience Hub and assigning a resilience policy. This can be done via the AWS Management Console or using CloudFormation. In this instance, the application has been created through the AWS Management Console. Sebastien Stormacq's post, Measure and Improve Your Application Resilience with AWS Resilience Hub, walks you through how to add your application to AWS Resilience Hub.
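For readers who prefer to script this step, the following is a hedged Python sketch of what a resilience policy request might look like through the SDK rather than the console or CloudFormation. The helper name build_policy_request and the example targets are my own illustration, and applying one RTO/RPO pair to every disruption type is a simplifying assumption.

```python
def build_policy_request(name, rto_secs, rpo_secs, tier="Critical"):
    """Build keyword arguments for the Resilience Hub create_resiliency_policy call.

    The same RTO/RPO target is applied to each disruption type for brevity
    (an assumption of this sketch); real policies usually differ per type.
    """
    targets = {"rtoInSecs": rto_secs, "rpoInSecs": rpo_secs}
    return {
        "policyName": name,
        "tier": tier,
        # One independent copy of the targets per disruption type.
        "policy": {kind: dict(targets) for kind in ("Software", "Hardware", "AZ")},
    }


# Example: a hypothetical policy with a 1-hour RTO and a 15-minute RPO.
request = build_policy_request("web-app-policy", rto_secs=3600, rpo_secs=900)
```

The resulting dictionary is intended to be passed as boto3.client("resiliencehub").create_resiliency_policy(**request); check the current SDK documentation for the accepted tiers and disruption types before relying on it.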

In a multi-account environment, customers typically have a dedicated AWS workload account per environment, and we recommend you separate CI/CD capabilities into another account. In this post, the AWS Resilience Hub application has been created in the deployment account and the resources have been discovered using a CloudFormation stack from the workload account. Proper permissions are required to use AWS Resilience Hub to manage applications across multiple accounts.

Create AWS Step Function to run resilience assessment

Here is my current dashboard:

Many users rely on AWS Step Functions to build their resiliency architecture in AWS. Step Functions is a workflow service that developers use to build distributed applications, automate IT and business processes, and build data and machine learning pipelines using AWS services. Workflows manage failures, retries, parallelization, service integrations, and observability so developers can focus on higher-value business logic.


1.    The first step in the workflow is to update the resources associated with the application defined in AWS Resilience Hub by calling ImportResourcesToDraftAppVersion.

2.    Check for the import process to complete using a wait state, a call to DescribeDraftAppVersionResourcesImportStatus, and then a choice state to decide whether to progress or continue waiting.

3.    Once complete, publish the draft application by calling PublishAppVersion to ensure we are assessing the latest version.

4.    Once published, call StartAppAssessment to kick off a resilience assessment.

5.    Check for the assessment to complete using a wait state, a call to DescribeAppAssessment, and then a choice state to decide whether to progress or continue waiting.

6.    In the choice state, use assessment status from the response to determine if the assessment is pending, in progress, or successful.

7.    If successful, use the compliance status from the response to determine whether to progress to success or fail. 

o   Compliance status will be either "PolicyMet" or "PolicyBreached".

8.    If policy is breached, publish onto SNS to alert the development team before moving to fail.
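The eight steps above can be sketched as plain Python, which may help when reasoning about the state machine before writing it. This is an illustration under assumptions, not the Step Functions definition itself: hub and sns are expected to behave like boto3.client("resiliencehub") and boto3.client("sns"), the function names are hypothetical, and "release" is assumed to be the label of the published app version.

```python
import time

IN_FLIGHT = {"Pending", "InProgress"}


def wait_for(fetch, status_of, poll_seconds=30, sleep=time.sleep):
    """Wait state + choice state: poll until the status leaves Pending/InProgress."""
    while True:
        result = fetch()
        if status_of(result) not in IN_FLIGHT:
            return result
        sleep(poll_seconds)


def assess_resilience(hub, sns, app_arn, stack_arn, topic_arn, sleep=time.sleep):
    """Return True if the resilience policy is met, False if it is breached."""
    # Step 1: refresh the draft application's resources from the deployed stack.
    hub.import_resources_to_draft_app_version(appArn=app_arn, sourceArns=[stack_arn])

    # Step 2: wait for the import to finish.
    imported = wait_for(
        lambda: hub.describe_draft_app_version_resources_import_status(appArn=app_arn),
        lambda r: r["status"],
        sleep=sleep,
    )
    if imported["status"] != "Success":
        raise RuntimeError(f"resource import ended with status {imported['status']}")

    # Steps 3-4: publish the draft, then assess the published version.
    hub.publish_app_version(appArn=app_arn)
    started = hub.start_app_assessment(
        appArn=app_arn,
        appVersion="release",  # assumed label of the published version
        assessmentName=f"pipeline-{int(time.time())}",
    )
    assessment_arn = started["assessment"]["assessmentArn"]

    # Steps 5-6: wait for the assessment to finish.
    assessment = wait_for(
        lambda: hub.describe_app_assessment(assessmentArn=assessment_arn)["assessment"],
        lambda a: a["assessmentStatus"],
        sleep=sleep,
    )
    if assessment["assessmentStatus"] != "Success":
        raise RuntimeError("assessment did not complete successfully")

    # Steps 7-8: gate on compliance and alert the team on a breach.
    if assessment["complianceStatus"] == "PolicyMet":
        return True
    sns.publish(
        TopicArn=topic_arn,
        Subject="Resilience policy breached",
        Message=f"Assessment {assessment_arn} reported PolicyBreached for {app_arn}",
    )
    return False
```

In Step Functions each wait_for call becomes a wait state, an SDK task, and a choice state, while the two return paths map to the success and fail states. Passing the clients and the sleep function in as arguments keeps the sketch testable without an AWS account.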

Create stage within CodePipeline

When adding the stage, it is critical that you pass the ARN of the stack that was deployed in the previous stage as well as the ARN of the application in AWS Resilience Hub. Both are required by the AWS SDK calls, and you can pass them in as a literal input.
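As a rough illustration of passing both ARNs as a literal, the Python sketch below builds a stage declaration that invokes a Step Functions state machine. The stage and action names, the function resilience_stage, and the input keys AppArn and StackArn are assumptions for this example; your state machine decides which keys it actually reads.

```python
import json


def resilience_stage(state_machine_arn, app_arn, stack_arn):
    """Build a CodePipeline stage that invokes the assessment state machine."""
    return {
        "name": "ResilienceAssessment",  # hypothetical stage name
        "actions": [
            {
                "name": "AssessResilience",  # hypothetical action name
                "actionTypeId": {
                    "category": "Invoke",
                    "owner": "AWS",
                    "provider": "StepFunctions",
                    "version": "1",
                },
                "configuration": {
                    "StateMachineArn": state_machine_arn,
                    # Literal input: the JSON is embedded in the stage itself.
                    "InputType": "Literal",
                    "Input": json.dumps({"AppArn": app_arn, "StackArn": stack_arn}),
                },
                "runOrder": 1,
            }
        ],
    }
```

The returned dictionary is meant to be appended to the stages list of a pipeline declaration, after the test-environment deployment stage and before production.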

Customers often run their workloads in lower environments in a less resilient way to save on cost. It's important to add the assessment stage at the appropriate point of your pipeline. We recommend adding this to your pipeline after the deployment to a test environment which mirrors production but before deploying to production. By doing this you can fail fast and halt changes which will lower resilience in production.

The key is not to shift your architecture so quickly that resiliency suffers, but to use the tools above to build a methodology that lets users feel confident in their ecosystems.

Thanks and Follow Us For More Information!