Saturday, May 14, 2016

AWS High Availability Gateway – Part 2 – Enhanced HA Model



Basic HA model limitation
In part 1, the basic HA model uses an ASG to recover a gateway instance when one fails. However, the gateway service is unavailable in the affected zone during instance recovery. The length of the downtime depends on how long it takes to build the gateway from an AMI and to install and configure all software and services. The enhanced HA model closes this availability gap by dynamically updating the route table to use the gateway in another zone during recovery.

Enhanced HA Model
When a gateway instance fails, the ASG terminates the old instance and builds a new one. The basic concept of enhanced HA is to detect a recovery event in Zone A, change the route table to use the gateway in Zone B during recovery, and switch back to the gateway in Zone A once recovery completes.
The design mainly consists of two Lambda functions:
  1. GatewayHA_Failover_duringRecovery
    1. triggered by CloudWatch event (Gateway instance terminated, indicating ASG initiating recovery)
    2. identify vpc, current and alternate zone, associated route tables and Gateway ENIs
    3. update current zone route table to use Gateway in alternate zone
  2. GatewayHA_Restore_afterRecovery
    1. triggered by API call (Gateway ENI reattached to instance, indicating recovery completed)
    2. identify vpc, current and alternate zone, associated route tables and Gateway ENIs
    3. update current zone route table to use Gateway in own zone
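The core of both functions is the same route update, only with a different target ENI. A minimal sketch of that logic follows, assuming a boto3 EC2 client is passed in as `ec2`; the helper names and the `enis_by_zone` and `route_tables_by_zone` mappings are hypothetical and are not taken from the sample code.

```python
DEFAULT_ROUTE = "0.0.0.0/0"


def pick_gateway_eni(enis_by_zone, current_zone, failover):
    """Return the ENI that the zone's default route should target.

    enis_by_zone is a hypothetical mapping such as {"zone-a": "eni-aaa"}.
    failover=True  -> ENI of a gateway in another zone (recovery in progress)
    failover=False -> ENI of the gateway in the zone itself (recovery done)
    """
    if not failover:
        return enis_by_zone[current_zone]
    alternates = sorted(z for z in enis_by_zone if z != current_zone)
    if not alternates:
        raise ValueError("no alternate zone with a gateway ENI")
    return enis_by_zone[alternates[0]]


def update_default_route(ec2, route_table_id, eni_id):
    """Repoint the default route of one private route table at an ENI."""
    ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock=DEFAULT_ROUTE,
        NetworkInterfaceId=eni_id,
    )


def switch_gateway(ec2, enis_by_zone, route_tables_by_zone, current_zone, failover):
    """Shared body of the failover and restore handlers (assumed structure)."""
    eni_id = pick_gateway_eni(enis_by_zone, current_zone, failover)
    update_default_route(ec2, route_tables_by_zone[current_zone], eni_id)
    return eni_id
```

In a real handler the two mappings would be discovered from the VPC (step 2 of each function), for example by tag filters, rather than passed in.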

Note the selection of triggering events. The failover Lambda function is triggered by instance termination, which indicates that the ASG is starting to rebuild the gateway. The restore Lambda function updates the route table as soon as recovery completes, as indicated by the attachment of the ENI. Restoring the route table to use the gateway in its own zone is necessary in order to load balance outbound traffic across gateways.
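The two triggers could be expressed as CloudWatch Events patterns along the following lines. This is a sketch under assumptions: the patterns are one plausible way to match "instance terminated" and "ENI attached" (the latter via the CloudTrail `AttachNetworkInterface` call), and the `matches` helper is a simplified, hypothetical matcher used only to illustrate how a pattern selects an event.

```python
# Assumed pattern for the failover trigger: an EC2 instance reached "terminated".
FAILOVER_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated"]},
}

# Assumed pattern for the restore trigger: an ENI attach API call seen by CloudTrail.
RESTORE_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ec2.amazonaws.com"],
        "eventName": ["AttachNetworkInterface"],
    },
}


def matches(pattern, event):
    """Simplified matcher: each pattern leaf lists the accepted values."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True
```

Patterns like these fire for any instance or ENI in the account, so each function still has to confirm that the event concerns a gateway before touching a route table.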

Test Results
When the HA model was applied to a Squid transparent-mode gateway for internet access, with a test instance performing continuous HTTP access tests, the enhanced HA method showed dramatically decreased downtime (from 15 minutes to almost unnoticeable). As observed in the route table, when the gateway in Zone A fails, the default route in Zone A's private route table is updated to use the gateway in Zone B. As soon as the gateway is rebuilt in Zone A, its private route table is updated to use Zone A's gateway again. In this model, load balancing and dynamic failover are achieved with an event-driven, intelligent response and full automation.

Sample code for both the basic and enhanced HA models can be found on GitHub:

Friday, May 6, 2016

AWS High Availability Gateway – Part 1 – Basic HA Model



Why this HA model
A gateway is used in an AWS VPC to control egress traffic. Beyond a NAT gateway, a more feature-rich gateway, such as one running in transparent mode, provides more sophisticated controls.

What is an optimal HA model for critical infrastructure components such as a gateway? The AWS reference HA model for NAT and gateway instances is rather dated. It uses a script running on the instances to ping each other for health, and it has these potential shortcomings:
  • It depends on a continuously running shell script to monitor availability and perform failover. If the process were terminated, no failover would occur.
  • Ping provides only a limited indication of health.
  • "Split brain" scenario: when connectivity between the NAT instances fails (possibly due to a Security Group) but each of them is still capable of connecting to the Internet, it is possible that each NAT instance will shut the other one down.
  • It does not account for an instance being terminated: the instance will not be recreated.
Since ELB is not currently supported as a target for route tables, a gateway instance must be defined as the route target. When a gateway fails, the route target becomes a black hole. The first step towards HA is running a gateway per AZ. The second step is detecting gateway failure in an AZ and recovering from it. The third step is providing dynamic failover during gateway recovery to minimize downtime. The proposed HA model has two parts:
  1. Basic HA model, with auto recovery of the gateway
  2. Enhanced HA model, with dynamic route table failover during gateway recovery

The design and implementation of the basic HA model are covered here. See part 2 for enhanced HA.

Design Overview

Health Monitoring and Auto Scaling
In cloud architecture, all instances should be behind an auto scaling group for resiliency. Here we leverage ASG to monitor gateway instance health. Auto Scaling health checks use the results of the EC2 status checks to determine the health status of an instance. Auto Scaling marks an instance as unhealthy if its instance status is any value other than running or its system status is impaired.
The gateway is therefore monitored using the standard AWS health checks for Auto Scaling instances. Custom health checks are also supported.
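The health rule described above can be written down as a small predicate. This is only an illustration of the rule as stated in the text, not Auto Scaling's actual implementation; the function name and the string values are assumptions.

```python
def is_unhealthy(instance_state, system_status):
    """Mirror the stated rule: unhealthy if the instance is in any state
    other than "running", or if its system status is "impaired"."""
    return instance_state != "running" or system_status == "impaired"
```

For example, a running instance with an impaired system status is unhealthy, and so is a stopped instance whose system status is fine.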

Route Target and ENI
In this non-proxy design, internet access goes via the default route, which is defined in a private route table per AZ. In an HA scenario, instances may get replaced, so the route table entry must either 1) point to a target that remains "persistent" outside the instance, or 2) be updated to point to the new instance.
For the first option, what could be a persistent target for the default route to point to? ELB would be an option, but it is not supported as a routing target. An ENI is a network interface that can be detached from and attached to instances, so it can serve as the persistent target. Although there are some feature limitations and workarounds required, it is proven to work.

Instance Recovery and Bootstrapping
Another feature that comes with an ASG is automated recovery of instances. However, an ASG has some limitations: for example, it cannot set instance attributes and it cannot attach an ENI. Those steps are implemented via instance bootstrapping.
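A minimal sketch of the two bootstrapping steps follows, assuming a boto3 EC2 client is passed in as `ec2`. The function name is hypothetical, and treating the source/destination check as the instance attribute in question is an assumption (it is the attribute a forwarding gateway typically needs disabled); the sample code may do this differently.

```python
def bootstrap_gateway(ec2, instance_id, eni_id, device_index=1):
    """Run at instance boot: prepare the instance to act as the route target."""
    # Assumed attribute: a gateway forwards traffic that is not addressed
    # to itself, so the source/destination check is turned off.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        SourceDestCheck={"Value": False},
    )
    # Attach the persistent ENI (the route target) as a second interface.
    response = ec2.attach_network_interface(
        NetworkInterfaceId=eni_id,
        InstanceId=instance_id,
        DeviceIndex=device_index,
    )
    return response["AttachmentId"]
```

In practice the ENI can only be attached once it has been released by the failed instance, so a real bootstrap script would likely retry the attach call until it succeeds.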

Implementation
The diagram shows an architectural view of the new HA model. The gateway is placed in a single-instance ASG, with two interfaces. An ENI is attached to the gateway instance, which provides persistence in the route table even when a gateway instance fails (the ENI is reattached to the recovered instance).

For sample code, please refer to the GitHub repo:
There is a limitation with this HA model: when a gateway instance fails, recovery may take up to 10-15 minutes (to build a new gateway and to install and configure the appliance). While the gateway is being rebuilt for that zone, traffic is black-holed in the route table until the ENI can be attached to a new gateway instance. See part 2 for enhanced HA.