How to achieve 99.999% platform uptime

Need to build a highly available, ultra resilient financial platform? In a world of shifting customer loyalties and ever-new attack vectors, doing so is crucial to financial institutions’ long-term success. 

At the LEAP 2026 conference, Arunkumar Ganesan, a distinguished engineer at CapitalOne, shared his experience of building financial platforms for 99.999% (“five nines”) uptime. 

Prior to 2019, CapitalOne had multiple lines of business creating duplicate solutions and capabilities, fragmented tooling, and inconsistent standards for resilience and availability, resulting in frequent incidents and availability concerns. Its solution was to build foundational platforms with common capabilities to encourage reuse, provide standardization, and support innovation at scale. 

You can watch Arunkumar’s full explanation of how CapitalOne achieved five nines uptime, or read on for the top takeaways. 

Focus on resilience and reliability 

Building standardized platforms for 14,000+ users requires those platforms to be highly resilient, available, and user-friendly. That means focusing on both architectural patterns and non-architectural concerns. 

Architecture 

Architecture is the backbone of achieving high availability. With the right architecture, you can eliminate single points of failure in your deployment:

  • In the cloud, you can do so using two regions with multi-availability zones for auto-replication and auto-scaling. 
  • To support microservice availability, scope each microservice to a particular business domain or business capability, avoiding tightly coupled service dependencies. 
  • For platforms that require both batch and real-time users, separate them out with a cluster specifically for batch processing and another for real-time users, with different resiliency patterns (to avoid batch-related spiky loads impacting users). 
  • Bake redundancy in by separating databases from real-time user traffic for high reliability. 
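
To make the first bullet concrete, here is a minimal sketch of region failover, assuming a hypothetical two-region deployment in which each region can independently serve 100% of traffic. The region names and the `call_region` stub are illustrative, not CapitalOne’s actual setup:

```python
import random

# Hypothetical two-region setup: each region must serve 100% of traffic
# on its own, so failover is a simple switch with no capacity juggling.
REGIONS = ["us-east-1", "us-west-2"]  # illustrative region names

class RegionUnavailable(Exception):
    pass

def call_region(region: str, request: dict) -> dict:
    """Stand-in for a real service call; fails randomly for the demo."""
    if random.random() < 0.2:
        raise RegionUnavailable(region)
    return {"region": region, "status": "ok", "echo": request}

def resilient_call(request: dict) -> dict:
    """Try the primary region first, then fail over to the secondary."""
    last_error = None
    for region in REGIONS:
        try:
            return call_region(region, request)
        except RegionUnavailable as err:
            last_error = err  # fall through to the next region
    raise RuntimeError(f"all regions failed: {last_error}")
```

In a real deployment the failover decision would typically live in DNS or a load balancer rather than client code, but the principle is the same: no single region is a single point of failure.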

Mandatory resiliency standards

If high availability and resiliency are your targets, you’ll need to define some goals. CapitalOne did this by grouping goals into three areas:

  1. Deployment and scalability: 
  • Geographic distribution: Must deploy across at least two distinct geographical regions. 
  • Regional capacity: Each region must independently support 100% of the production workload. 
  2. Architecture and dependencies: 
  • Regional autonomy: Cross-region dependencies are strictly prohibited; maintain regional affinity. 
  • Critical dependency compliance: All critical system dependencies must meet these resiliency standards.
  • Data consistency: Data synchronization must be automated and consistently replicated across all regions. 
  3. Failure management and recovery: 
  • Service continuity: A single-region failure must not cause a service outage. 
  • Automated failover: Failover and failback processes must be fully automated. 
  • Time-based objectives: Strict recovery time objective (RTO) and recovery point objective (RPO) goals. For example, a degraded service must be restored within 15 minutes (RTO), and a data problem must lose no more than 15 minutes of data (RPO). 

All CapitalOne platforms must follow these standards. Writing them down clearly for your own organization means that all teams can see at a glance what they must comply with. 

CapitalOne also uses domain-driven design (DDD), which delivers multiple benefits for platform architecture, including modularity, bounded context services, and scalability. This empowers CapitalOne to set different service level agreements for different capabilities, as well as providing its customers with high flexibility in the way they integrate with specific subdomain services.

Achieving 99.999% availability 

Achieving 99.999% availability isn’t simple. It requires balancing what your platform actually needs against the temptation to over-engineer everything. Focus your effort per capability, based on what your users need, to identify where five nines availability is truly required. 
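
To see why that prioritization matters, it helps to translate availability targets into downtime budgets. This is plain arithmetic, nothing CapitalOne-specific:

```python
def downtime_budget(availability: float, period_hours: float = 365 * 24) -> float:
    """Minutes of allowed downtime per period at a given availability."""
    return period_hours * 60 * (1 - availability)

for label, a in [("three nines", 0.999),
                 ("four nines", 0.9999),
                 ("five nines", 0.99999)]:
    # five nines works out to roughly 5.26 minutes of downtime per year
    print(f"{label}: {downtime_budget(a):.2f} minutes/year")
```

At five nines you have barely five minutes of downtime per year, which is why it only makes sense to pay that engineering cost for the services that genuinely need it.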

At CapitalOne, for instance, five nines availability is more critical for transaction processing than for reporting. The same holds for every organization: Not every single service is critical for your domain. 

Implementing the standards above (the two-region deployment model with multiple availability zones, and so on) gives you your minimum requirement for deployments. The diagram below outlines CapitalOne’s minimum requirement model for a bounded context service with top-tier resilience: 

On top of this, you’ll need to account for poison pill requests: requests that kill your services when you process them, such as a customer query that over-fetches and hogs database resources. Poison pill requests can be entirely legitimate but still cause high CPU, high memory, and similar resource exhaustion. 

To stop such requests bringing your services down one by one, you can build customer-traffic-based circuit breakers. These identify poison pill requests and stop the system accepting further requests from the offending customer, reducing the blast radius to that customer alone. 
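
A minimal sketch of such a customer-level circuit breaker, assuming an illustrative consecutive-failure trip condition and cooldown (not CapitalOne’s actual implementation):

```python
import time
from collections import defaultdict

class CustomerCircuitBreaker:
    """Per-customer circuit breaker (a sketch; thresholds are illustrative).

    Tripping the breaker for one noisy customer leaves every other
    customer's traffic untouched, containing the blast radius.
    """

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._failures = defaultdict(int)   # consecutive failures per customer
        self._opened_at = {}                # customer -> time the breaker opened

    def allow(self, customer_id, now=None):
        """Return True if this customer's requests may proceed."""
        now = time.monotonic() if now is None else now
        opened = self._opened_at.get(customer_id)
        if opened is None:
            return True
        if now - opened >= self.cooldown_seconds:
            # Half-open: clear state and let a request probe for recovery.
            del self._opened_at[customer_id]
            self._failures[customer_id] = 0
            return True
        return False

    def record_failure(self, customer_id, now=None):
        now = time.monotonic() if now is None else now
        self._failures[customer_id] += 1
        if self._failures[customer_id] >= self.failure_threshold:
            self._opened_at[customer_id] = now  # open the breaker

    def record_success(self, customer_id):
        self._failures[customer_id] = 0  # reset the streak on any success
```

The key design choice is keying the breaker by customer rather than by service, so one customer’s poison pill requests never trip the breaker for anyone else.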

Rate limiters and sharding patterns, such as a shuffle sharding pattern, can also help reduce blast radius by containing each customer request in a particular set of resources.
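Shuffle sharding can be sketched by deterministically seeding a PRNG with the customer ID, so each customer always lands on the same small subset of workers, and any two customers rarely share their whole subset. Worker names and the shard size here are hypothetical:

```python
import hashlib
import random

def shuffle_shard(customer_id: str, workers: list, shard_size: int = 2) -> list:
    """Deterministically pick a per-customer subset of workers.

    Seeding a PRNG with a hash of the customer ID keeps the assignment
    stable across calls and across processes.
    """
    seed = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return rng.sample(workers, shard_size)

workers = [f"worker-{i}" for i in range(8)]
# Two customers rarely share their entire shard, so one customer's
# poison pill requests can only take down their own small subset.
print(shuffle_shard("customer-a", workers))
print(shuffle_shard("customer-b", workers))
```

With 8 workers and shards of 2 there are 28 possible shards, so the chance that a second customer is fully caught in the blast radius of the first is small, and it shrinks rapidly as the worker pool grows.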

Non-architectural considerations

So many factors can cause outages: cloud providers, external vendors, internal dependent platforms, platform bugs, and untrusted code, as well as customers’ poison pill requests. It’s essential to constantly review these failure modes and convert them into “what if” scenarios, then codify those scenarios in an SDK that runs them as an automated resiliency testing framework. 
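One way such a “what if” framework might look, sketched as a decorator that registers scenarios as runnable tests. The service logic and the scenario here are invented for illustration:

```python
# A sketch of turning "what if" failure scenarios into automated tests.
SCENARIOS = {}

def scenario(name):
    """Decorator: register a 'what if' scenario as a runnable test."""
    def register(fn):
        SCENARIOS[name] = fn
        return fn
    return register

def call_with_fallback(dependency):
    """Stand-in service logic: fall back to a cached answer on timeout."""
    try:
        return dependency()
    except TimeoutError:
        return "cached-response"

@scenario("what if a critical dependency times out?")
def dependency_timeout():
    def flaky_dependency():
        raise TimeoutError  # inject the failure mode
    # Invariant: the service degrades gracefully instead of crashing.
    assert call_with_fallback(flaky_dependency) == "cached-response"

def run_all():
    """Run every registered scenario; raises on the first broken invariant."""
    for name, fn in SCENARIOS.items():
        fn()
        print(f"PASS: {name}")
```

Wiring a runner like this into CI turns the failure-mode review into an executable, repeatable check rather than a document that goes stale.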

A sandbox approach for running untrusted code is also highly recommended, as is an infrastructure-as-code approach with shift-left testing capabilities and policy-as-code enforcement. This means you catch problems in the CI/CD pipeline instead of waiting for production to fail. 

Release techniques are also important. With the sharding mentioned above, you can put a new release into a particular customer segment and monitor, only rolling it out more widely once you’re confident in it. This reduces failures caused by program, product, code, and configuration changes. 
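A common way to implement such segment-based rollouts is stable hash bucketing, so a customer’s canary membership never flips as the rollout fraction grows; this sketch is an assumed mechanism, not a description of CapitalOne’s release tooling:

```python
import hashlib

def in_canary(customer_id: str, rollout_fraction: float) -> bool:
    """Stable canary assignment: the same customer stays in (or out of)
    the canary as the rollout fraction grows, so exposure only widens."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash into [0, 1]
    return bucket < rollout_fraction

# Route 5% of customers to the new release, the rest to stable.
version = "v2-canary" if in_canary("customer-123", 0.05) else "v1-stable"
```

Because the bucket is derived from the customer ID rather than drawn at random per request, monitoring a bad release only ever implicates the same bounded customer segment.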

Deployment hooks are helpful too, with synthetic data helping to identify any configuration drift between production and non-production. Readiness checks, automated failovers, and resilience testing as code also help. 

Another key factor is observability standardization, including:

  • Standardized logging across all services
  • Metrics (latency, error and success rates, saturation)
  • Thresholds with dynamic alerting
  • Standardized error codes
  • Distributed tracing from customer to platform
  • KPIs

CapitalOne uses sidecar and bulkhead patterns for observability, with sidecars for logging, observability, retries, and so on. 

The result of all of this is 99.999% uptime, supporting customer trust, a competitive edge, compliance adherence, and a sound company reputation. 

Speak to the Tyk team to find out more. 

 
