Saturday, March 18, 2017

Three Networking features AWS should support

AWS is continuously enhancing its services and adding new features. However, a number of fundamental networking features have been discussed for a while and, based on recent interactions with the AWS team, are still not on the roadmap.

Here are three of those features high on my list, and why.

1. Multi-Path Routing (ECMP)
Currently, an AWS route table does not allow multiple routes to the same destination. For example, I can only point the default route in a private route table at a single target, which can be a single point of failure.
If ECMP were supported, users would have many load-sharing and resiliency options. For example, I could define multiple default routes pointing to redundant, load-sharing gateways in multiple Availability Zones.

However, users still need to keep those routes up to date when the target instances change. This can be done by keeping the ENI persistent and reattaching it to new instances, or by triggering a Lambda function to update routes when an instance is replaced.
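The Lambda route-update approach could look like the sketch below, a minimal illustration only. The event shape and the names `RouteTableId`/`InstanceId` are assumptions for this example; the actual trigger (CloudWatch Events, lifecycle hook, etc.) would supply the real values.

```python
def build_route_update(route_table_id, dest_cidr, instance_id):
    """Kwargs for EC2 ReplaceRoute: repoint dest_cidr at a replacement instance."""
    return {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": dest_cidr,
        "InstanceId": instance_id,
    }

def lambda_handler(event, context):
    # Hypothetical event shape: the trigger passes the route table id and the
    # id of the replacement gateway instance.
    import boto3  # imported lazily so the helper above is testable offline
    params = build_route_update(event["RouteTableId"], "0.0.0.0/0", event["InstanceId"])
    boto3.client("ec2").replace_route(**params)
```

This keeps the route pointed at a live gateway, but it is exactly the kind of self-maintained plumbing that native ECMP support would make unnecessary.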

2. ELB as Route Table target
Supporting a load balancer as a routing target may not seem natural from a networking standpoint; AWS would need an internal implementation that forwards traffic to the resolved load balancer and the instances behind it.
This capability would let users fully benefit from the scalability and resiliency of the load balancer, and gain "native" high availability without a self-maintained layer of Lambda checks and actions.

Azure shows this can be done: a User Defined Route (UDR) can point to an Azure Load Balancer, which lets a route table send traffic to a cluster of gateway nodes behind the load balancer. The result is simple and elegant resiliency.

3. Native Transit VPC
In large-scale enterprise use of AWS, as the number of VPCs grows, a transit VPC can really help scale by consolidating connectivity. Currently there is a Cisco CSR based solution, but any third-party appliance adds maintenance overhead and introduces bottlenecks.

The ideal solution would be an AWS-native transit capability that users can define themselves, much like peering connections.

I hope these requirements are echoed by the user community.

Sunday, September 11, 2016

AWS VPC VGW Multipath Routing - difference between Direct Connect and VPN

VPC VGW multi-path scenario
To connect a VPC to enterprise networks or other VPCs, we use Direct Connect or VPN. It is common to have multiple connection paths from a VPC. Routing outbound from a VPC is controlled by the VGW. The question is: how does the VGW, an AWS-internal logical router, handle multi-path routing?

Multi-path is a requirement for high availability, and load sharing across paths is often desirable. How the VGW handles multi-path routing actually differs by connection type. Specifically, Direct Connect supports ECMP; VPN does not (for VGWs created after October 2015).

Direct Connect
Direct Connect supports redundant Active/Active paths (BGP multipath), and the VGW routes traffic over multiple equal-cost paths. As a result, we can leverage all the bandwidth provisioned for Direct Connect.

With VPN, VGW currently does not support BGP multipath. VPN chooses one BGP path only.

What if we use static routes instead of BGP? Can static routes load-share traffic across multiple paths?
In the scenario shown in the diagram, there are dual VPN connections going to two remote CGWs, each with redundant tunnels. If equal static routes are defined, does the VGW route traffic out multiple paths with ECMP?
  • VGW created prior to Oct 28 2015 supports static multipath.
  • VGW created after Oct 28 2015 selects one active path out of the multiple paths defined.

The scenario was tested with a new VGW in one VPC and a pair of customer VPN appliances in another VPC. With 4 tunnels/paths, all traffic went to one tunnel only. AWS support confirmed the behavior: the VGW selects only one path.

Why AWS should support VPN multipath
With VPN, it may be desirable to spread load across multiple customer gateways, because those gateways may be Cisco or Palo Alto appliances that have licensed throughput capacity. It is more efficient to spread load across multiple destinations than to send all traffic to one path while the others sit idle.

Hopefully AWS will bring consistent multipath routing to VPN, with BGP multipath and static ECMP.

Saturday, July 9, 2016

AWS Auto Scaling Lifecycle Hook with Lambda and CloudFormation

There are many advantages to placing instances in AWS Auto Scaling Groups; scaling is the obvious one. Even for a single-instance appliance, Auto Scaling provides resiliency, health monitoring and auto recovery. In many cases, the ASG high-availability model is superior to running active/standby appliances in terms of seamless automation and cost effectiveness.

However, Auto Scaling has limitations; not all instance actions and properties can be defined within an ASG. For example, an instance launched in an ASG can have only one network interface, since Auto Scaling currently does not support attaching multiple interfaces. AWS Lambda, on the other hand, is great for defining custom actions that execute efficiently and on demand. Putting the two together, an Auto Scaling lifecycle hook allows Lambda-defined custom actions to be inserted during ASG instance launch or termination, which is powerful and flexible.

See the reference links below for more details about Auto Scaling lifecycle hooks, as well as an excellent example with implementation steps using the AWS console, written by Vyom Nagrani.

To automate ASG and lifecycle hook actions, CloudFormation is used to define the ASG and lifecycle hook. In the following example, a lifecycle hook is defined to send a notification via SNS when an instance launches; a Lambda function is triggered through its subscription to the SNS topic.
"GatewayAutoscalingGroupHook" : {
    "Type" : "AWS::AutoScaling::LifecycleHook",
    "Properties" : {
        "AutoScalingGroupName" : { "Ref" : "GatewayAutoscalingGroup" },
        "HeartbeatTimeout" : 300,
        "LifecycleTransition" : "autoscaling:EC2_INSTANCE_LAUNCHING",
        "NotificationMetadata" : { "Fn::Join" : [ ",", [
            { "Ref" : "GatewayInstanceENI1" },
            { "Ref" : "GatewayInstanceENI2" }
        ]]},
        "NotificationTargetARN" : "arn:aws:sns:us-east-1:697686697680:gateway-asg-lifecycle-hook",
        "RoleARN" : "arn:aws:iam::697686697680:role/gateway-sns-hook-role"
    }
}

There is odd behavior when CloudFormation is used to define an ASG lifecycle hook. According to AWS, the lifecycle hook is created AFTER the first instance in the ASG launches. As a result, the first instance launches without the expected lifecycle hook action. Only when the first instance is terminated does the next instance kick off the lifecycle action and trigger the Lambda function as expected. AWS suggests several workarounds, including launching the ASG with zero instances and increasing the desired capacity to one later, or using a custom resource.
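The zero-instance workaround can be automated after stack creation. The sketch below is one possible approach, not an official recipe; the function name `bump_asg_after_hook` and the ASG name passed to it are hypothetical.

```python
def asg_capacity_request(asg_name, desired):
    """Kwargs for the Auto Scaling SetDesiredCapacity API call."""
    return {"AutoScalingGroupName": asg_name, "DesiredCapacity": desired}

def bump_asg_after_hook(asg_name):
    import boto3  # lazy import keeps the helper above testable offline
    # The stack created the ASG with DesiredCapacity 0, so the lifecycle hook
    # already exists before the first real instance launches; now scale to 1.
    boto3.client("autoscaling").set_desired_capacity(**asg_capacity_request(asg_name, 1))
```

With this ordering, even the very first instance launch passes through the lifecycle hook and triggers the Lambda function.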

Use Lambda's monitoring features to see if and when the function is triggered by the lifecycle hook; it is also helpful to log the received message. AWS sends out a TEST notification when the lifecycle hook is first created. The TEST notification does not contain the complete notification content, but it still triggers Lambda. Since it currently can't be turned off, the Lambda function needs to include some error handling for it.
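A minimal way to handle the TEST notification is to check the `Event` field of the SNS message, which is `autoscaling:TEST_NOTIFICATION` for the test message. The handler below is a sketch; real lifecycle processing (completing the hook, attaching ENIs, etc.) would go where the comment indicates.

```python
import json

def parse_lifecycle_message(sns_message):
    """Parse the lifecycle-hook notification carried in the SNS message body.

    Returns None for the TEST notification AWS sends when the hook is created.
    """
    msg = json.loads(sns_message)
    if msg.get("Event") == "autoscaling:TEST_NOTIFICATION":
        return None
    return msg

def lambda_handler(event, context):
    for record in event["Records"]:
        msg = parse_lifecycle_message(record["Sns"]["Message"])
        if msg is None:
            print("Ignoring TEST notification")
            continue
        # Real launch notification: act on the instance and the metadata
        # (the ENI ids passed via NotificationMetadata in the hook above).
        print("Instance launching:", msg.get("EC2InstanceId"))
        print("Metadata:", msg.get("NotificationMetadata"))
```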

Saturday, June 18, 2016

AWS Kaggle Machine – turnkey data science “Lab in a Box”

The advancement in data science and machine learning has not only brought breakthroughs like AlphaGo, but is also starting to have a broad impact on our everyday lives (Airbnb uses data to predict travelers' destinations). Gone are the days when data science was accessible only to those in an ivory tower with million-dollar proprietary software. New trends have emerged:
  • open source software and tools
  • compute capacity at cloud scale, with dramatic cost reduction
  • public data set, community based problem solving (Kaggle)

AWS provides both cost efficiency and scalable capabilities, so it makes sense for data scientists to tap into the power of the public cloud. An AWS image was developed here which:
  • Automates the installation and configuration of a comprehensive set of open source data science tools
  • Allows instance sizing based on needs
  • Controls cost (shut down or terminate when done, relaunch in a few minutes)

What it is
An AWS AMI that provides a "data science server in a box" with a current open source toolkit (RStudio, Jupyter Notebook, Anaconda, Xgboost…). It builds automatically and is fully configured and ready for use in less than five minutes.

How to build a Kaggle Machine
Using the community AMI named "kaggle machine", build your Kaggle Machine in AWS with one of the following methods. Note the AMIs are currently available in us-east-1 and us-west-2 only; for other regions, build your machine in one of those two regions and copy the AMI across regions.
Build Kaggle Machine from AWS console
Launch an EC2 instance, search for "Kaggle-machine" in Community AMIs, and specify a key pair. After instance creation, attach a Security Group that allows inbound ports 8787, 9999, and 22 (SSH).
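The Security Group step can also be scripted. The sketch below builds the three ingress rules and applies them with boto3; the function names and the default wide-open CIDR are illustrative assumptions, and you would normally restrict the source range to your own network.

```python
def kaggle_ingress_rules(cidr="0.0.0.0/0"):
    """IpPermissions for EC2 AuthorizeSecurityGroupIngress: ssh, RStudio, Jupyter."""
    return [
        {"IpProtocol": "tcp", "FromPort": port, "ToPort": port,
         "IpRanges": [{"CidrIp": cidr}]}
        for port in (22, 8787, 9999)
    ]

def open_kaggle_ports(group_id, cidr="0.0.0.0/0"):
    import boto3  # lazy import; the rule builder above needs no AWS access
    boto3.client("ec2").authorize_security_group_ingress(
        GroupId=group_id, IpPermissions=kaggle_ingress_rules(cidr))
```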

Build Kaggle Machine using CloudFormation Stack

A CloudFormation template can be used to build the Kaggle Machine and its Security Group automatically. Download the template and use it to build your stack in us-east-1 or us-west-2. The template can be found at

How to use it
After instance creation, note the public DNS name of the machine. From any client on the internet, access the services by pointing your browser to:
RStudio: http://<public-dns>:8787 (default ruser/ruser)
Jupyter: http://<public-dns>:9999 (default password jupyter)

Change the default passwords immediately. The EC2 instance runs Ubuntu, so you can also ssh to it.
The cost is based on AWS EC2 usage: you pay hourly only while the instance is running, so shut it down when done. When you have finished your project and no longer need the data saved on the server, terminate the instance.

The Kaggle Machine originated from the needs of data scientists participating in Kaggle challenges; I hope it will provide a useful toolset for you as well.