AWS automated disaster recovery for insurance: PwC

Understanding the challenge

PwC worked with a leading insurance provider that faced a significant challenge when a costly outage impacted its AWS US-East-1 services. The outage disrupted the client's call center operations, leaving 4,500 agents unable to provide critical services for over 10 hours. To address this issue and improve its resiliency posture, the client worked with PwC to leverage AWS capabilities and implement a multi-region cloud resilience architecture for its critical single-region AWS applications.

A multi-region cloud resilience architecture on AWS distributes critical applications and data across multiple geographic regions, giving organizations added flexibility and scalability. If one region experiences an outage or disruption, the workload can fail over to another region, reducing downtime and maintaining business continuity.

The benefit of implementing a multi-region resilience architecture is the increased level of resiliency and availability it provides. By distributing applications and data across multiple regions, organizations can mitigate the impact of localized outages or disasters. If one region becomes unavailable, the workload can automatically fail over to another region, allowing operations to continue with minimal interruption. This approach significantly reduces the risk of prolonged downtime and helps keep critical services available to customers.

By leveraging AWS capabilities and working with PwC, the client was able to design and implement a system that could fail over to another region in the event of a disruption, enabling uninterrupted service delivery to its customers. This approach enhanced the client's resiliency posture and reduced the impact of future outages, ultimately improving overall business continuity.

The outage experienced by our client highlighted the need for a strong disaster recovery solution to provide uninterrupted business operations. The company recognized the importance of leveraging AWS services to plan, design, and implement a resilient architecture that could withstand similar incidents in the future. Their primary challenge was to establish a multi-region cloud resilience architecture for critical single-region AWS applications, enabling them to continue operations seamlessly in the event of a disaster.

Building the PwC and AWS solution

PwC worked across teams to plan, design, and implement a multi-region cloud resilience architecture following our cloud resiliency journey framework. The solution involved several key technical elements to ensure the successful implementation of automated disaster recovery (DR).

Cloud resiliency journey framework

Achieving a sustainable AWS resilience organization is an ongoing and long-term journey. The first step is evaluating your resilience capabilities, establishing a baseline and defining a target state. Identifying what processes and services are critical and what it takes to deliver them can help define resilience needs. Strategic and tactical plans should be developed to achieve your target state, taking into account the evolving risk landscape applicable to your organization.

Identify resilient capabilities in the cloud and the path to mature cloud resiliency.

PwC conducted discovery workshops and assessments to identify mission-critical applications and group workloads into application tiers based on their criticality and recovery point objective/recovery time objective (RPO/RTO) targets.

The workloads were grouped into two categories:

The foundational components that need to be always available in the secondary region in an active-active configuration such as network services (e.g., AWS Transit Gateway), identity and access management (e.g., SSO set up for human identities) and DevOps services (e.g., Amazon CodeCommit repositories, Amazon ECR repositories, GitHub repositories)
The mission critical workloads that need to meet an RTO of less than 4 hours and could utilize a warm standby or pilot light DR approach
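
The grouping above can be sketched as a small classification rule. This is only an illustration, not the tooling used in the engagement: the `Workload` type, the example workload names and the one-hour cut-off between warm standby and pilot light are assumptions introduced here. The case study itself states only that foundational components run active-active and that mission-critical workloads target an RTO of less than 4 hours.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    ACTIVE_ACTIVE = "active-active"
    WARM_STANDBY = "warm standby"
    PILOT_LIGHT = "pilot light"


@dataclass(frozen=True)
class Workload:
    name: str
    foundational: bool  # network, identity or DevOps services
    rto_hours: float    # recovery time objective agreed with the business


# From the case study: mission-critical workloads need an RTO under 4 hours.
MISSION_CRITICAL_MAX_RTO_HOURS = 4.0
# Assumed cut-off (not in the source): tighter RTOs get warm standby.
WARM_STANDBY_MAX_RTO_HOURS = 1.0


def choose_strategy(workload: Workload) -> Strategy:
    """Map a workload to a DR strategy based on its tier and RTO."""
    if workload.foundational:
        return Strategy.ACTIVE_ACTIVE
    if workload.rto_hours >= MISSION_CRITICAL_MAX_RTO_HOURS:
        raise ValueError(
            f"{workload.name}: RTO of {workload.rto_hours}h is outside the mission-critical tier"
        )
    if workload.rto_hours <= WARM_STANDBY_MAX_RTO_HOURS:
        return Strategy.WARM_STANDBY
    return Strategy.PILOT_LIGHT


# Hypothetical workloads for illustration.
for item in (
    Workload("transit-gateway", foundational=True, rto_hours=0.0),
    Workload("claims-intake", foundational=False, rto_hours=1.0),
    Workload("claims-reporting", foundational=False, rto_hours=3.5),
):
    print(item.name, "->", choose_strategy(item).value)
```

In practice the cut-off would come out of the discovery workshops rather than a fixed constant.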

Develop cloud design to enable DR capabilities and establish a repeatable foundation of resilient cloud architecture patterns.

PwC approached the design across multiple DR strategies such as:

Active-active design for foundational components
Warm standby and pilot light designs for mission critical workloads
As part of the target state designs, PwC also documented the associated cost of each DR strategy. Each strategy was outlined in terms of design and cost, allowing the business to pick the one best suited to its RTO target and budget.
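
One way to picture the design-versus-cost trade-off is to select the lowest-cost strategy whose expected recovery time still meets the target. The sketch below assumes that rule; the `DrOption` type and every figure in `OPTIONS` are illustrative placeholders, not numbers from the engagement.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DrOption:
    strategy: str
    expected_rto_hours: float  # estimated time to recover with this design
    monthly_cost_usd: float    # estimated standby cost (illustrative)


def cheapest_option_meeting_rto(
    options: Sequence[DrOption], target_rto_hours: float
) -> Optional[DrOption]:
    """Return the lowest-cost option that recovers faster than the target, if any."""
    eligible = [o for o in options if o.expected_rto_hours < target_rto_hours]
    return min(eligible, key=lambda o: o.monthly_cost_usd, default=None)


# Placeholder estimates; real values would come from the target state designs.
OPTIONS = [
    DrOption("warm standby", expected_rto_hours=0.5, monthly_cost_usd=9000.0),
    DrOption("pilot light", expected_rto_hours=3.0, monthly_cost_usd=3000.0),
]

choice = cheapest_option_meeting_rto(OPTIONS, target_rto_hours=4.0)
print(choice.strategy if choice else "no option meets the target")
```

With a 4-hour target both placeholder options qualify and the cheaper one is chosen; tightening the target to 1 hour leaves only warm standby.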

Build and deploy core baseline infrastructure for the DR Environment.

PwC parameterized the existing CI/CD deployment pipelines in Jenkins to support multi-region deployments for the AWS services used in the foundational components: network services (e.g., AWS Transit Gateway), identity and access management (e.g., SSO set up for human identities) and DevOps services (e.g., Amazon CodeCommit repositories, Amazon ECR repositories).
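
Parameterizing a pipeline for multiple regions usually means turning the region from a hard-coded value into an input, so that one job definition can be run once per region. The sketch below shows that idea only. The parameter names (`AWS_REGION`, `STACK_NAME`, `COMPONENT`, `ENVIRONMENT`) and the secondary region are assumptions; the case study does not describe the actual Jenkins parameters or name the secondary region.

```python
from typing import Dict, List

PRIMARY_REGION = "us-east-1"    # primary region named in the case study
SECONDARY_REGION = "us-west-2"  # assumed; the source does not name the secondary region


def build_deploy_jobs(
    component: str, base_params: Dict[str, str], regions: List[str]
) -> List[Dict[str, str]]:
    """Expand one parameterized deployment into one parameter set per region."""
    jobs = []
    for region in regions:
        params = dict(base_params)  # copy so each region's parameters stay independent
        params["COMPONENT"] = component
        params["AWS_REGION"] = region
        params["STACK_NAME"] = f"{component}-{region}"
        jobs.append(params)
    return jobs


jobs = build_deploy_jobs(
    "transit-gateway", {"ENVIRONMENT": "prod"}, [PRIMARY_REGION, SECONDARY_REGION]
)
for job in jobs:
    print(job)
```

Each resulting dictionary would be passed to the same pipeline as its build parameters, so adding a region becomes a data change rather than a new pipeline.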

Incorporate end to end resilient designs into business-critical applications.

For mission-critical applications supporting the claims app suite, PwC edited the existing CI/CD deployment pipelines to support multi-region warm standby deployments for the AWS services.

e.g., Edited the existing infrastructure as code scripts to deploy Amazon ECS clusters in the secondary region with scaled-down instances
e.g., Edited the existing infrastructure as code scripts to deploy an Amazon RDS cluster as a global database across two regions
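
In a warm standby design, the secondary region runs a reduced copy of the workload that is scaled to full size at failover. The sketch below shows one way to derive the scaled-down size. The 25% standby fraction, the minimum of one task and the function names are assumptions made for illustration, not values from the engagement.

```python
import math


def standby_task_count(
    primary_tasks: int, standby_fraction: float = 0.25, minimum: int = 1
) -> int:
    """Size the scaled-down copy in the secondary region (fraction is an assumption)."""
    if primary_tasks < 0:
        raise ValueError("primary_tasks must be non-negative")
    if not 0 < standby_fraction <= 1:
        raise ValueError("standby_fraction must be in (0, 1]")
    return max(minimum, math.ceil(primary_tasks * standby_fraction))


def desired_task_count(primary_tasks: int, failed_over: bool) -> int:
    """Secondary region runs at full size after failover, scaled down otherwise."""
    return primary_tasks if failed_over else standby_task_count(primary_tasks)


print(desired_task_count(12, failed_over=False))  # scaled-down standby
print(desired_task_count(12, failed_over=True))   # full capacity after failover
```

Keeping a floor of at least one running task is what distinguishes warm standby from pilot light, where the compute layer can be switched off until it is needed.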

Continuously validate and test your application’s resilience to know you are prepared for planned and unplanned downtime.

PwC leveraged automation to orchestrate disaster recovery (DR) activities including the failover of critical services to a healthy AWS region in the event of a major incident and the failback of AWS services to their original state. This enables the organization to quickly and efficiently recover their critical workloads in the event of an incident or for regular DR testing with minimum developer impact.

The automation was developed using AWS Step Functions, AWS Lambda functions and Amazon DynamoDB. It was defined as infrastructure as code and deployed in the client’s AWS accounts as a control plane that integrates with multiple mission-critical applications. Operations modules were created to conduct failover/failback for each AWS service (Amazon ECS, Amazon Aurora, Amazon Route 53) enabled for multi-region in the application resilience enablement phase.
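
To make the orchestration pattern concrete, the sketch below builds an Amazon States Language definition in which each operations module is a Lambda task and the tasks run in sequence. It is a hedged illustration rather than the client's state machine: the Lambda function names, the step order (database, then compute, then DNS) and the retry settings are all assumptions, and the role DynamoDB plays in the real control plane is not modeled.

```python
import json

# Hypothetical Lambda function names, one per operations module.
FAILOVER_STEPS = [
    ("PromoteAuroraSecondary", "dr-aurora-failover"),
    ("ScaleUpEcsServices", "dr-ecs-scale-up"),
    ("SwitchRoute53Records", "dr-route53-switch"),
]


def build_failover_definition(steps):
    """Chain one Lambda-backed Task state per step into a sequential state machine."""
    states = {}
    for index, (state_name, function_name) in enumerate(steps):
        state = {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
            "ResultPath": None,  # discard the task result, pass the input through
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 10,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "ResultPath": "$.error",
                    "Next": "FailoverFailed",
                }
            ],
        }
        if index + 1 < len(steps):
            state["Next"] = steps[index + 1][0]
        else:
            state["End"] = True
        states[state_name] = state
    states["FailoverFailed"] = {
        "Type": "Fail",
        "Error": "FailoverFailed",
        "Cause": "A failover step did not complete",
    }
    return {
        "Comment": "Illustrative failover sequence",
        "StartAt": steps[0][0],
        "States": states,
    }


definition = build_failover_definition(FAILOVER_STEPS)
print(json.dumps(definition, indent=2))
```

A failback definition could reuse the same builder with the steps reversed and pointed at the corresponding failback functions.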

Leveraging the client's existing CI/CD pipelines in Jenkins, GitOps Modules and AWS API calls, this automation enabled the scale up/scale down of the infrastructure deployed in the secondary region.

The disaster recovery orchestration automation greatly reduced the need for manual, error-prone and complex failover work, putting automated failover and failover testing within reach of even the busiest teams.

What’s next? Guiding principles for designing cloud resiliency

Identify problems you’re trying to solve with high availability and resiliency.
Decide what requirements are needed in the cloud to protect data and systems from disaster and system crashes.
Challenge traditional mindsets and paradigms and identify opportunities to orchestrate and automate resiliency as part of the cloud delivery model.

Automated disaster recovery becomes reality

Challenge

Performing the failover and failback of workloads manually requires a dedicated team and thorough preparation to make sure resources fail over and fail back correctly. The procedure is time-consuming, which prolongs the overall recovery time and puts the recovery time objective (RTO) at risk, and its reliance on human intervention introduces many potential failure points.

What are the benefits of automated disaster recovery?

Data driven outputs such as execution times to measure RPO/RTO

Single-click failover/failback modules executed using AWS Step Functions and Lambda

Reduced manual overhead by automating failover/failback steps

Automated health checks to validate data replication across Availability Zones (AZs) and regions
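
The first and last benefits lend themselves to a short worked example: step execution times give the achieved recovery time, and replication lag gives the data-loss exposure, each of which can be compared with its objective. In the sketch below the 4-hour objective reflects the upper bound of the 0 - 4 hour target cited in this case study, while the `StepRecord` type, the step names, the timestamps and the replication lag are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class StepRecord:
    name: str
    started_at: datetime
    finished_at: datetime


def achieved_recovery_time(records: List[StepRecord]) -> timedelta:
    """Elapsed time from the first step starting to the last step finishing."""
    if not records:
        raise ValueError("at least one step record is required")
    return max(r.finished_at for r in records) - min(r.started_at for r in records)


def within_objective(measured: timedelta, objective: timedelta) -> bool:
    """True when a measured duration or lag does not exceed its objective."""
    return measured <= objective


# Illustrative timings for a failover test; not results from the engagement.
RECORDS = [
    StepRecord("database failover", datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 20)),
    StepRecord("compute scale-up", datetime(2024, 1, 1, 9, 20), datetime(2024, 1, 1, 9, 45)),
    StepRecord("DNS switch", datetime(2024, 1, 1, 9, 45), datetime(2024, 1, 1, 9, 50)),
]
OBJECTIVE = timedelta(hours=4)             # upper bound of the stated RTO/RPO target
REPLICATION_LAG = timedelta(seconds=30)    # assumed reading from a replication health check

recovery_time = achieved_recovery_time(RECORDS)
print("recovery time:", recovery_time, "meets RTO:", within_objective(recovery_time, OBJECTIVE))
print("replication lag meets RPO:", within_objective(REPLICATION_LAG, OBJECTIVE))
```

Recording these figures on every DR test gives a trend line, which makes regressions in recovery time visible before a real incident.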

How does automated disaster recovery help?

Automated orchestration of disaster recovery (DR) activities, including the failover of critical services to a healthy AWS region and the failback of AWS services to their original state, enables organizations to quickly and efficiently recover their critical workloads in the event of an incident or for regular DR testing with minimum developer impact.

Delivering outcomes

The implementation of PwC and AWS's automated solution delivered significant outcomes for the client, enhancing its resiliency posture and supporting uninterrupted business operations.

1. Automated disaster recovery

We were able to perform an automated disaster recovery operation for critical applications across regions. This allowed the client to continue business operations in a secondary region within its target recovery time objective (RTO) and recovery point objective (RPO) of 0 - 4 hours.

2. Resilient architecture designs

The implementation of different disaster recovery approaches enabled the client to generate resilient architecture designs tailored to their specific application requirements and RTO/RPO needs. This flexibility enabled viable resiliency across their digital landscape.

3. Multi-region failover and failback

By implementing the disaster recovery orchestration automation, the client successfully performed an automated multi-region failover and failback of critical applications. This capability laid the foundation for replicating the solution across their organization.

4. Strategic DR initiatives

PwC's design and implementation efforts identified strategic disaster recovery initiatives for the client. These initiatives provided recommendations for adopting and implementing resilient designs across the organization's Tier 0 and Tier 1 services, further strengthening their resiliency posture.

In conclusion

By working with PwC and leveraging AWS capabilities, the client successfully improved their resiliency posture and automated their disaster recovery processes. The implementation of a multi-region cloud resilience architecture, along with the disaster recovery orchestration automation, provided uninterrupted business operations during a disaster scenario. Our client can now perform multi-region, automated failover and failback of critical applications within their desired RTO/RPO requirements. This case study demonstrates the value of resilient architecture and automated disaster recovery solutions in enhancing business continuity and mitigating the impact of outages.
