Friday, May 6, 2016

AWS High Availability Gateway – Part 1 – Basic HA Model



Why this HA model
A gateway is used in an AWS VPC to control egress traffic. Beyond the NAT gateway, a more feature-rich gateway, such as a gateway in transparent mode, provides more sophisticated controls.

What is an optimal HA model for a critical infrastructure component such as a gateway? The AWS reference HA model for NAT and gateway instances is rather dated. It relies on a script running on each instance to ping the other for health, which has these potential shortcomings:
  • It depends on a continuously running shell script to monitor availability and perform failover. If that process is terminated, no failover occurs.
  • Ping provides only a limited indication of health.
  • "Split brain" scenario: when connectivity between the NAT instances fails (possibly due to Security Group configuration) but each of them can still reach the Internet, each NAT instance may shut the other one down.
  • It does not account for instance termination: a terminated instance is not recreated.
Since ELB is not currently supported as a target for route tables, a gateway instance must be defined as the route target. When a gateway fails, the route target becomes a black hole. The first step towards HA is running a gateway per AZ. The second step is detecting a gateway failure in an AZ and recovering from it. The third step is providing dynamic failover during gateway recovery to minimize downtime. The proposed HA model has two parts:
1. Basic HA model, with auto recovery of the gateway
2. Enhanced HA model, with dynamic route table failover during gateway recovery

The design and implementation of the basic HA model are covered here. See Part 2 for the enhanced HA model.

Design Overview

Health Monitoring and Auto Scaling
In a cloud architecture, all instances should be placed in an Auto Scaling group (ASG) for resiliency. Here we leverage the ASG to monitor gateway instance health. Auto Scaling health checks use the results of the EC2 status checks to determine the health status of an instance. Auto Scaling marks an instance as unhealthy if its instance state is any value other than running or its system status is impaired.
The gateway is therefore monitored through the standard AWS health checks for Auto Scaling instances. Custom health checks are also supported.
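
The health rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not how Auto Scaling is implemented internally. The function names are hypothetical, and the record layout is assumed to follow the `InstanceStatuses` entries returned by the EC2 `describe_instance_status` call.

```python
def is_unhealthy(instance_state: str, system_status: str) -> bool:
    """Apply the rule from the text: unhealthy when the instance is not
    running or when the system status is impaired."""
    return instance_state != "running" or system_status == "impaired"


def unhealthy_instance_ids(statuses: list) -> list:
    """Return the IDs of unhealthy instances.

    `statuses` is assumed to look like the "InstanceStatuses" list from
    describe_instance_status (hypothetical input for this sketch).
    """
    return [
        s["InstanceId"]
        for s in statuses
        if is_unhealthy(s["InstanceState"]["Name"], s["SystemStatus"]["Status"])
    ]
```

A custom health check would replace `is_unhealthy` with an appliance-specific test and report the result back to Auto Scaling.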

Route Target and ENI
In this non-proxy design, Internet access goes through the default route, which is defined in a private route table per AZ. In an HA scenario, instances may get replaced, so the route table entry must either 1) point to a target that persists independently of the instance, or 2) be updated to point to the new instance.
For the first option, what could serve as a persistent target for the default route? ELB would be an option, but it is not supported as a route target. An ENI is a network interface that can be detached from one instance and attached to another, so it can serve as the persistent target. Although it has some feature limitations and requires some workarounds, this approach is proven to work.
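
As a rough sketch of the first option, the default route of a private route table can be pointed at an ENI rather than at an instance. The function name is hypothetical. The `ec2` argument is assumed to be a boto3 EC2 client (for example `boto3.client("ec2")`), and the IDs passed in are placeholders.

```python
DEFAULT_ROUTE = "0.0.0.0/0"


def point_default_route_at_eni(ec2, route_table_id: str, eni_id: str) -> str:
    """Make the default route of a route table target the given ENI.

    `ec2` is assumed to be a boto3 EC2 client. Returns "created" or
    "replaced" depending on whether a default route already existed.
    """
    params = {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": DEFAULT_ROUTE,
        "NetworkInterfaceId": eni_id,
    }
    tables = ec2.describe_route_tables(RouteTableIds=[route_table_id])["RouteTables"]
    has_default = any(
        route.get("DestinationCidrBlock") == DEFAULT_ROUTE
        for route in tables[0]["Routes"]
    )
    if has_default:
        # A default route already exists, so swap its target for the ENI.
        ec2.replace_route(**params)
        return "replaced"
    ec2.create_route(**params)
    return "created"
```

Because the route references the ENI, it does not need to change when the instance behind the ENI is replaced.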

Instance Recovery and Bootstrapping
Another feature that comes with ASG is automated recovery of instances. However, ASG has some limitations: for example, it cannot set instance attributes and it cannot attach an ENI. Those steps are implemented via instance bootstrapping.
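
The bootstrapping steps might look roughly like the sketch below. It is not taken from the sample code. The function name is hypothetical, and `ec2` is assumed to be a boto3 EC2 client. The choice of the source/destination check as the instance attribute to set is an assumption based on common practice for instances that forward traffic. The ENI is assumed to be detached and available, which is the case once the failed instance is gone.

```python
def bootstrap_gateway(ec2, instance_id: str, eni_id: str, device_index: int = 1) -> str:
    """Set the attributes and attach the ENI that ASG cannot handle itself.

    `ec2` is assumed to be a boto3 EC2 client. Returns the attachment ID.
    """
    # Assumption: a forwarding gateway needs the source/destination check
    # disabled. This is done before the second interface is attached, while
    # the instance still has only its primary interface.
    ec2.modify_instance_attribute(
        InstanceId=instance_id, SourceDestCheck={"Value": False}
    )
    # Disable the check on the persistent ENI as well.
    ec2.modify_network_interface_attribute(
        NetworkInterfaceId=eni_id, SourceDestCheck={"Value": False}
    )
    # Attach the persistent ENI as the second interface of the new instance.
    response = ec2.attach_network_interface(
        NetworkInterfaceId=eni_id, InstanceId=instance_id, DeviceIndex=device_index
    )
    return response["AttachmentId"]
```

In a real bootstrap script the instance ID would typically be read from the instance metadata service rather than passed in.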

Implementation
The diagram shows an architectural view of the new HA model. The gateway is placed in a single-instance ASG and has two interfaces. An ENI is attached to the gateway instance and serves as the route target, so the route table entry persists even when a gateway instance fails (the ENI is reattached to the recovered instance).
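
A single-instance ASG of this kind could be described with parameters along the following lines. This is a sketch under assumptions: the helper name and all resource names are hypothetical. The resulting dictionary is meant to be passed to the boto3 Auto Scaling client call `create_auto_scaling_group`.

```python
def single_instance_asg_params(name: str, launch_config_name: str, subnet_id: str) -> dict:
    """Build create_auto_scaling_group parameters for a one-gateway ASG.

    Pinning min, max, and desired capacity to 1 means a failed gateway is
    replaced, but the group never runs more than one gateway in the AZ.
    """
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_config_name,
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        # A single subnet keeps the gateway in its own AZ.
        "VPCZoneIdentifier": subnet_id,
        # Use the EC2 status checks as the health signal.
        "HealthCheckType": "EC2",
    }


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   params = single_instance_asg_params("gw-asg-az1", "gw-launch-config", "subnet-0123")
#   boto3.client("autoscaling").create_auto_scaling_group(**params)
```

One such group per AZ matches the first HA step of running a gateway in every AZ.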

For sample code, please refer to the GitHub repo:
This HA model has a limitation: when a gateway instance fails, recovery may take up to 10-15 minutes (to build a new gateway and to install and configure the appliance). While the gateway is being rebuilt for that zone, traffic is black-holed in the route table until the ENI can be attached to the new gateway instance. See Part 2 for the enhanced HA model.
