AWS re:Invent 2021 Highlights – Part 1
JPMorgan Chase Risk Platform Modernization, by Ken Taylor, PTS
Introduction
AWS re:Invent 2021 was an extremely well-run virtual technology trade show put on by Amazon Web Services during Q4. PTS technologists were fortunate to attend many of the sessions and to review the materials from even more. Over a short series of posts, we would like to share some of our observations and point out areas of interest.
Although I would normally be inclined to start by opining about new AWS features and products, I found the sessions on real-world production implementations by enterprise customers fascinating and edifying. Let’s start with two examples:
How JPMorgan Chase (JPMC) modernized its core risk management platform
PTS does a great deal of work with large banking clients, so this session was of particular interest. It dealt with JPMC’s Athena risk management platform and the AWS architecture that was developed to deploy it. Key features include:
- An on-premises risk scheduler connected to AWS via Direct Connect and Transit Gateway
- A non-routable risk control plane in AWS that runs EKS (Elastic Kubernetes Service) with auto-scaling, RDS and ElastiCache for Redis
- A routable subnet in the AWS VPC that connects to the Transit Gateway and the risk control plane
- Other components such as S3 buckets and CloudWatch
- A read-only NoSQL DB running on a separate EC2 instance that gets its data feed from a NoSQL database running on-prem. This serves as a core account region rather than the separate line-of-business (LOB) setups.
In effect, they are building a compute grid for scheduling risk analysis tasks. This is a very sophisticated architecture, and many of its major design decisions and changes were driven by performance requirements. One of the most interesting is the grid scheduling algorithm, which is based on affinity. Their analysis showed that a “pull” model, in which workers pull risk workloads, performed much better than a “push” model, in which a scheduler pushes workloads onto workers. For example, if a worker is dedicated to risk analysis for a certain company, most of the information that a newly scheduled task for that company needs is already resident on that worker. So, tasks for that company are pulled to that worker (or workers) based on affinity.
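The pull model with affinity can be illustrated with a small sketch. This is not JPMC's scheduler: the `Task`, `Worker` and `demo` names, the company names and the "first match, else oldest task" rule are all assumptions made for illustration. Each worker asks the shared queue for work and prefers tasks whose data it already holds.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: int
    company: str  # hypothetical affinity key: whose data the task needs


@dataclass
class Worker:
    name: str
    # Companies whose data is assumed to be resident on this worker already.
    cached: set = field(default_factory=set)

    def pull(self, queue):
        """Pull the oldest task we have affinity for, else the oldest task."""
        match = next((t for t in queue if t.company in self.cached), None)
        if match is not None:
            queue.remove(match)
            return match
        if queue:
            task = queue.popleft()
            self.cached.add(task.company)  # simulate loading that company's data
            return task
        return None


def demo():
    """Two workers drain a shared queue; returns (worker, task_id, company)."""
    companies = ["ACME", "GLOBEX", "ACME", "INITECH", "GLOBEX"]
    queue = deque(Task(i, c) for i, c in enumerate(companies))
    workers = [Worker("w1", {"ACME"}), Worker("w2", {"GLOBEX"})]
    assignments = []
    while queue:
        for worker in workers:
            task = worker.pull(queue)
            if task is not None:
                assignments.append((worker.name, task.task_id, task.company))
    return assignments


if __name__ == "__main__":
    for row in demo():
        print(row)
```

In this toy run, every ACME task lands on `w1` and every GLOBEX task on `w2`. The INITECH task, which has no affinity, goes to whichever worker next comes up empty. A production scheduler would also need to deal with starvation, load imbalance and cache eviction, which this sketch ignores.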
I remember needing to do this same sort of node affinity scheduling for compute grids that pre-processed data for a supercomputer.
This is exceptional work. One lesson from JPMC’s experience is that architecting your application to be cloud-native and choosing the right mix of cloud vendor products and features will not, by itself, make your application “just work.” You may still need to do some very sophisticated development of your own infrastructure algorithms to hit your performance and cost targets.
MemVerge | Enabling running stateful applications on Spot Instances
I cannot let this one go, both because it is related to the JPMC session and because I am a huge fan of MemVerge's technology and vision. The top “to do” item on JPMC’s future list was to start using AWS Spot Instances. Spot Instances allow you to run workloads on preemptible instances at a greatly reduced cost. The problem is that they can be taken from you at any time if AWS needs the capacity. They are not recommended for any application that is stateful or that would lose significant amounts of processing time if interrupted.
Charles Fan, CEO of MemVerge, gave a partner session on a new MemVerge feature that can help with this problem. MemVerge provides an enterprise-class memory virtualization platform that allows users to combine DRAM and persistent memory, create memory tiers, build very large memory pools and more.
The new Spot Instance feature lets customers use AWS Spot Instances without losing everything if AWS takes an instance back. MemVerge periodically captures capsules of the entire instance state, which it can then resume for you on a new Spot Instance (or elsewhere).
Think of it this way: if you are 20 hours into a fluid dynamics run and you lose your Spot Instance, instead of losing 20 hours of effort, you may lose only 2 minutes before you are back up and running. That is the promise, anyway. At PTS, we have not tried this out in our labs yet, but it seems promising for a number of use cases.
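To make the idea concrete, here is a minimal sketch of periodic checkpoint and resume. It is not MemVerge's mechanism, which captures the state of the whole instance transparently. It shows the simpler application-level technique, where the job saves its own state. The function names, the toy workload and the `stop_after` parameter (which simulates an interruption) are assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename, so a crash mid-write
    cannot leave a corrupt checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Run a toy long job, resuming from the checkpoint at `path` if present.

    `stop_after` simulates losing the instance after that many steps in
    this invocation; progress since the last checkpoint is then lost.
    Returns the final state, or None if the run was interrupted.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return None  # simulated interruption
        state["acc"] += state["step"]  # stand-in for real computation
        state["step"] += 1
        steps_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

The checkpoint interval sets the trade-off. A run interrupted at step 37 with a checkpoint every 10 steps resumes from step 30, so at most one interval of work is repeated, at the cost of writing state more often.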
Author: Ken Taylor
Role: Principal Consultant
Location: Boston