Saturday, May 14, 2016

AWS High Availability Gateway – Part 2 – Enhanced HA Model



Basic HA model limitation
In part 1, the basic HA model uses an Auto Scaling group (ASG) to recover a Gateway instance when it fails. However, the gateway service is unavailable in the affected zone while the instance recovers. The length of the downtime depends on how long it takes to build the gateway from the AMI and to install and configure all software and services. The enhanced HA model closes this availability gap by dynamically updating the route table to use the Gateway in another zone during recovery.

Enhanced HA Model
When a Gateway instance fails, the ASG terminates the old instance and builds a new one. The basic concept of enhanced HA is to detect a recovery event in Zone A, change the route table to use the Gateway in Zone B during recovery, and switch back to the Gateway in Zone A once recovery completes.
The design consists mainly of two Lambda functions:
  1. GatewayHA_Failover_duringRecovery
    1. triggered by a CloudWatch event (Gateway instance terminated, indicating that the ASG is initiating recovery)
    2. identifies the VPC, the current and alternate zones, and the associated route tables and Gateway ENIs
    3. updates the current zone's route table to use the Gateway in the alternate zone
  2. GatewayHA_Restore_afterRecovery
    1. triggered by an API call (Gateway ENI reattached to the instance, indicating that recovery has completed)
    2. identifies the VPC, the current and alternate zones, and the associated route tables and Gateway ENIs
    3. updates the current zone's route table to use the Gateway in its own zone
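
The route update that both functions perform can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: it assumes the Gateway ENIs carry a hypothetical tag Role=gateway, that each zone's private route table carries a hypothetical tag Zone=<availability zone>, and that `ec2` is an EC2 client such as `boto3.client("ec2")`. The helper names are made up for this example.

```python
DEFAULT_ROUTE = "0.0.0.0/0"


def gateway_eni(ec2, vpc_id, zone):
    """Return the ID of the Gateway ENI in the given VPC and zone."""
    resp = ec2.describe_network_interfaces(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "availability-zone", "Values": [zone]},
            # Hypothetical tag used here to mark Gateway ENIs.
            {"Name": "tag:Role", "Values": ["gateway"]},
        ]
    )
    enis = resp["NetworkInterfaces"]
    if not enis:
        raise LookupError("no Gateway ENI found in zone " + zone)
    return enis[0]["NetworkInterfaceId"]


def private_route_tables(ec2, vpc_id, zone):
    """Return the IDs of the private route tables serving the given zone."""
    resp = ec2.describe_route_tables(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            # Hypothetical tag used here to map a route table to its zone.
            {"Name": "tag:Zone", "Values": [zone]},
        ]
    )
    return [rt["RouteTableId"] for rt in resp["RouteTables"]]


def route_zone_via(ec2, vpc_id, zone, gateway_zone):
    """Point the default route of `zone` at the Gateway ENI in `gateway_zone`."""
    eni_id = gateway_eni(ec2, vpc_id, gateway_zone)
    tables = private_route_tables(ec2, vpc_id, zone)
    for rt_id in tables:
        ec2.replace_route(
            RouteTableId=rt_id,
            DestinationCidrBlock=DEFAULT_ROUTE,
            NetworkInterfaceId=eni_id,
        )
    return eni_id, tables


def failover(ec2, vpc_id, failed_zone, alternate_zone):
    """During recovery: send the failed zone's traffic to the other zone."""
    return route_zone_via(ec2, vpc_id, failed_zone, alternate_zone)


def restore(ec2, vpc_id, recovered_zone):
    """After recovery: send the zone's traffic to its own Gateway again."""
    return route_zone_via(ec2, vpc_id, recovered_zone, recovered_zone)
```

Failover and restore differ only in which zone's Gateway ENI becomes the route target, so both reduce to the same helper. Inside a Lambda handler, the VPC and zone would be derived from the triggering event before calling `failover` or `restore`.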

Note the selection of triggering events. The failover Lambda function is triggered by instance termination, which indicates that the ASG is starting to rebuild the gateway. The restore Lambda function updates the route table as soon as recovery completes, as indicated by the attachment of the ENI. Restoring the route table to use the gateway in its own zone is necessary in order to load balance outbound traffic across gateways.
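
To make the two triggers concrete, here is a sketch of event patterns that could select them, written as Python dictionaries. It assumes the ENI is reattached through the EC2 `AttachNetworkInterface` API and that CloudTrail is enabled so the call is delivered as an event; the actual rules used in the sample code may differ. The `matches` helper is only a simplified local approximation of pattern matching, included so the patterns can be tried against sample events.

```python
# Fires when any EC2 instance reaches the "terminated" state.
# The failover function still has to check that the instance is a Gateway.
TERMINATION_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated"]},
}

# Fires when an ENI is attached through the EC2 API (recorded by CloudTrail).
ENI_ATTACH_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ec2.amazonaws.com"],
        "eventName": ["AttachNetworkInterface"],
    },
}


def matches(pattern, event):
    """Simplified matcher: every pattern field must list the event's value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested part of the event.
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True
```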

Test Results
When the HA model was applied to build a Squid transparent-mode Gateway for internet access, continuous HTTP access tests from a test instance showed that the enhanced HA method dramatically decreased downtime (from 15 minutes to almost unnoticeable). As observed in the route table, when the gateway in Zone A fails, the default route in Zone A's private route table is updated to use the Gateway in Zone B. As soon as the gateway is rebuilt in Zone A, its private route table is updated to use Zone A's gateway again. In this model, load balancing and dynamic failover are achieved through an event-driven, fully automated response.
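
A continuous HTTP access test of this kind can be approximated with a small probe script run on the test instance. The sketch below is an assumption about how such a test might look, not the tool used for the results above. The `probe` function makes one HTTP request, and `longest_outage` computes the longest run of failed probes from recorded (timestamp, success) samples.

```python
import urllib.request


def probe(url, timeout=2.0):
    """Return True if an HTTP GET to `url` succeeds within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def longest_outage(samples):
    """Longest outage in seconds, given (timestamp, ok) samples in time order.

    An outage runs from the first failed probe to the next successful one,
    or to the last sample if the series ends while probes are still failing.
    """
    longest = 0.0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

Calling `probe` once per second in a loop and recording `(time.time(), result)` pairs yields the samples. The measured outage is only as precise as the probe interval and timeout.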

Sample code for both the basic and enhanced HA models can be found on GitHub:
