Sunday, March 18, 2018

SageMaker model deployment - 3 simple steps


AWS SageMaker is a platform designed to support the full lifecycle of the data science process, from data preparation, to model training, to deployment. A clean separation, yet easy pipelining, between model training and deployment is one of its greatest strengths. A model can be developed on training instances and saved as artifact files. The deployment process retrieves the model artifacts from S3 and deploys a runtime environment exposed as HTTPS endpoints. Finally, any application can send prediction requests and get results back from the deployed endpoints.

While simple in concept, practical information on SageMaker model deployment and prediction queries is currently scarce and scattered. It is easier to grasp as a simple 3 step process contained in a notebook.

1. create deployment model

We assume a model has already been built (trained), with its artifacts saved in S3. A deployment model is defined by both the model artifacts and the algorithm container.
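As a minimal sketch using boto3 (the role ARN, container image, artifact path, and model name below are placeholders; substitute the values from your own training job):

import boto3

sm = boto3.client('sagemaker')

# Placeholder values - replace with your execution role, algorithm container, and S3 artifact path
role_arn = 'arn:aws:iam::<account-id>:role/<SageMakerExecutionRole>'
container_image = '<algorithm-account>.dkr.ecr.<region>.amazonaws.com/<algorithm>:latest'
model_artifacts = 's3://<bucket>/<training-job>/output/model.tar.gz'

sm.create_model(
    ModelName='demo-model',
    ExecutionRoleArn=role_arn,
    PrimaryContainer={
        'Image': container_image,        # algorithm container used for inference
        'ModelDataUrl': model_artifacts  # model artifacts saved by the training job
    }
)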



2. configure deployment instances

Next, define the type and number of deployment instances, which will host the runtime for the deployment model's service endpoint.
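A minimal sketch of the endpoint configuration, reusing the model name from step 1 (the instance type and count shown are illustrative only):

import boto3

sm = boto3.client('sagemaker')

sm.create_endpoint_config(
    EndpointConfigName='demo-endpoint-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'demo-model',        # deployment model created in step 1
        'InstanceType': 'ml.m4.xlarge',   # size of each deployment instance
        'InitialInstanceCount': 1         # number of deployment instances
    }]
)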



3. deploy to service endpoints

Finally, create the service endpoint and wait for completion. Model deployment is then finished and ready to serve prediction requests.
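A minimal sketch of endpoint creation, reusing the names above; the boto3 waiter blocks until the endpoint is in service:

import boto3

sm = boto3.client('sagemaker')
runtime = boto3.client('sagemaker-runtime')

sm.create_endpoint(
    EndpointName='demo-endpoint',
    EndpointConfigName='demo-endpoint-config'   # configuration from step 2
)

# Block until the endpoint status becomes InService
sm.get_waiter('endpoint_in_service').wait(EndpointName='demo-endpoint')

Once the endpoint is InService, a prediction request can be sent through the SageMaker runtime; this assumes a container that accepts CSV input, with made-up feature values:

response = runtime.invoke_endpoint(
    EndpointName='demo-endpoint',
    ContentType='text/csv',          # depends on what the algorithm container accepts
    Body='1.5,2.3,0.7'               # illustrative feature values
)
print(response['Body'].read())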


The complete deployment process can be visualized as follows:

The complete sample notebook can be seen here:

Sunday, February 4, 2018

Auto starting R Studio on AWS Deep Learning server

As an enhancement to machine learning servers built on AWS or Azure, it is often necessary to set up an R development environment to meet the needs of the data science community.

Adapt this for your specific environment. Here we assume the AWS Deep Learning conda image (Ubuntu); specifically, we use the "python3" virtual environment (source activate python3). One reason to use this environment is that it is already set up to run Jupyter Notebook (see auto starting Jupyter), so we can add an R kernel to it. We then have a consolidated image that can be offered to both Python and R users.

The easiest method to install R is using conda:
conda install r r-essentials

RStudio is a popular development environment. Follow the instructions to install RStudio Server, for example:
sudo apt-get install gdebi-core
wget https://download2.rstudio.org/rstudio-server-1.1.419-i386.deb
sudo gdebi rstudio-server-1.1.419-i386.deb


The above procedure also sets up auto start of RStudio Server by adding /etc/systemd/system/rstudio-server.service. However, because the installer runs with "sudo" against the default system environment, it cannot find R, which has been installed into a different (conda) environment. As a result, RStudio Server fails to start with an error indicating it cannot find R:
rstudio-server verify-installation
Unable to find an installation of R on the system (which R didn't return valid output); Unable to locate R binary by scanning standard locations

This can easily be fixed by specifying the exact path to R for RStudio; replace the path with your installation of R:
sudo sh -c 'echo "rsession-which-r=/home/ubuntu/anaconda3/envs/python3/bin/R" >> /etc/rstudio/rserver.conf'

Restart the instance, and RStudio Server now starts successfully. Log in with your Linux credentials at:
http://<server IP>:8787

Wednesday, January 31, 2018

Auto starting Jupyter Notebook on AWS Deep Learning server

Cloud and on-demand computing are an increasingly powerful and cost-effective combination of enabling technologies for data scientists. Furthermore, machine learning servers such as those based on AWS Deep Learning AMIs can make a full suite of machine learning tools available in a matter of minutes.

Jupyter Notebook is a popular development interface for data analysis and model training. Currently, AWS has a published procedure for configuring, starting, and connecting to the notebook server:
https://docs.aws.amazon.com/dlami/latest/devguide/setup-jupyter.html

However, the setup can be challenging, and repeating the above steps each time an instance restarts is not ideal, especially when the server is offered to the broader data science community.

Here is an alternative and an enhancement: auto starting the notebook server.

Adapt this for your specific environment. Here we assume the AWS Deep Learning conda image (Ubuntu); specifically, we install into the "python3" environment (source activate python3).

Configure Jupyter Notebook

Similar to the steps outlined in the AWS documentation above, configure Jupyter Notebook, which consists of:

Create a key and certificate. For example, in the ~/.jupyter/ directory:
openssl req -x509 -nodes -days 11499 -newkey rsa:1024 -keyout "jupytercert.key" -out "jupytercert.pem" -batch

Create a notebook password, then copy the generated hash string from the .json file:
jupyter notebook password

Update ~/.jupyter/jupyter_notebook_config.py:
c.NotebookApp.open_browser = False
c.NotebookApp.ip = '*'
c.NotebookApp.port = 8888
c.NotebookApp.password = 'sha1:xxx'
c.NotebookApp.certfile = '/home/ubuntu/.jupyter/jupytercert.pem'
c.NotebookApp.keyfile = '/home/ubuntu/.jupyter/jupytercert.key'

Set up Auto Start Jupyter Notebook (virtualenv)

Setting up auto start is usually straightforward (for example, using /etc/rc.local). In this case, however, the target environment is a virtualenv: we don't want to auto start in the default Python environment, or as the root user, but we still want to use rc.local. Use the following two-step process.

Create a script /home/ubuntu/.jupyter/start_notebook.sh (note the use of absolute paths to invoke the executables):
#!/bin/bash
source /home/ubuntu/anaconda3/bin/activate python3
/home/ubuntu/anaconda3/envs/python3/bin/jupyter notebook &


Edit /etc/rc.local and add the following; note that we switch to the ubuntu user and invoke the startup script:
cd /home/ubuntu
su ubuntu -c "nohup /home/ubuntu/.jupyter/start_notebook.sh >/dev/null 2>&1 &"

The reason for this two-step process is to be able to execute multiple commands (I didn't find an effective way to do that easily in rc.local directly).

User Access to Jupyter Notebook

Jupyter Notebook will now always start automatically with the instance. Without any additional setup, users can conveniently access the Jupyter server at:
https://<server IP>:8888

Sunday, January 14, 2018

Azure automation with Logic App - passing variable in workflow

Similar to AWS Lambda, Azure Logic Apps can be used for automated workflows. However, clear documentation is harder to come by, working examples are fewer, and effective technical support is often lacking.

In a workflow, passing the output of one step to another should be a common requirement. The motivation for posting this working solution is that there is no clear example that illustrates exactly how that is done. It should be learnable in a few minutes, rather than through hours of trial and error.

output from step 1

Using a simple two-step workflow to illustrate: in step 1, we use an Azure Function App with a PowerShell script. We can obtain a user email dynamically from an Azure VM's user-defined tag field.
$user_email = (Get-AzureRmVM -ResourceGroupName $resourceGroupName -Name $resourceName -ErrorAction $ErrorActionPreference -WarningAction $WarningPreference).Tags["user_email"]

More importantly, the obtained result needs to be written out through this rather odd "Out-File" construct. This is how a variable is passed along in the workflow:
$result = $user_email | ConvertTo-Json
Out-File -Encoding Ascii -FilePath $res -inputObject $result


input to step 2

In a subsequent step, we can use the output of the previous step, in this case to send an email to the VM's user per the tag. This is best illustrated using the graphical interface of the Logic App Designer:

Azure recognizes that a step generates an output and makes it available for use in subsequent steps. The particular handle is shown as the "Body" of the Step 1 Function App, again a rather odd representation.

But it does work, and this simple mechanism is a much needed building block for constructing more complex features in a workflow.

Saturday, March 18, 2017

Three Networking features AWS should support

AWS is continuously enhancing existing features and adding new ones. However, a number of fundamental networking features have been discussed for a while and, based on recent interactions with the AWS team, are still not on the roadmap.

Here are three of those features high on my list, and why.

1. Multi-Path Routing (ECMP)
Currently, an AWS route table does not allow multiple routes to the same destination. For example, I can only point the default route in a private route table to a single target (which can be a single point of failure).
If ECMP were supported, users would have many more load sharing and resiliency options. For example, I could define multiple default routes pointing to redundant, load sharing gateways in multiple availability zones.

However, users still need to keep those routes up to date if the target instances change. This can be done by keeping the ENI persistent and reattaching it to new instances, or by triggering a Lambda function to update routes when an instance is replaced, as sketched below.
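For illustration, a minimal sketch of such a Lambda function using boto3; the route table ID and instance ID are hypothetical, and deriving the replacement instance from the triggering event is left out:

import boto3

ec2 = boto3.client('ec2')

def handler(event, context):
    # Hypothetical IDs - in practice, derive the replacement instance from the event
    route_table_id = 'rtb-0123456789abcdef0'
    new_instance_id = 'i-0123456789abcdef0'

    # Repoint the default route at the replacement gateway instance
    ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock='0.0.0.0/0',
        InstanceId=new_instance_id
    )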

2. ELB as Route Table target
Supporting a load balancer as a routing target may not seem natural as a network solution; there would need to be an internal implementation that forwards traffic to the resolved load balancer and the instances behind it.
This type of capability would allow users to fully benefit from the scalability and resiliency of the load balancer, and have "native" high availability without the need for a self-maintained layer of Lambda checks and actions.

An example that this can be done can be found with Azure: a User Defined Route (UDR) can point to an Azure Load Balancer (ALB). This enables the route table to send traffic to a cluster of gateway nodes behind the load balancer, which leads to simple and elegant resiliency.

3. Native Transit VPC
In large scale enterprise use of AWS, as the number of VPCs grows, a transit VPC can really help scale by consolidating connectivity. Currently, there is a Cisco CSR based solution, but any third party appliance adds maintenance overhead and introduces bottlenecks.

The ideal solution would be an AWS native transit capability that users can define themselves, much like peering connections.

I hope these requirements are echoed by the user community.

Sunday, September 11, 2016

AWS VPC VGW Multipath Routing - difference between Direct Connect and VPN



VPC VGW multi-path scenario
To connect a VPC to enterprise networks or other VPCs, we use Direct Connect or VPN. It is common to have multiple connection paths from a VPC, and routing outbound from a VPC is controlled by the VGW. The question is: how does the VGW, an AWS internal logical router, handle multi-path routing?

Multi-path is a requirement for high availability, and load sharing across multiple paths is often desirable. How the VGW handles multi-path routing actually differs by connection type: Direct Connect supports ECMP, while VPN does not (for VGWs created after Oct 2015).

Direct Connect
Direct Connect supports configuring redundant paths as Active/Active (BGP multipath), and the VGW routes traffic over multiple equal cost paths. As a result, we can leverage all the bandwidth provisioned for DX.


VPN
With VPN, the VGW currently does not support BGP multipath; it chooses one BGP path only.

What if we use static routes instead of BGP; can static routes be used to load share traffic across multiple paths?
In the scenario shown in the diagram, there are dual VPN connections going to two remote CGWs, each with redundant tunnels. If equal static routes are defined, does the VGW route traffic (ECMP) out of multiple paths?
  • A VGW created prior to Oct 28, 2015 supports static multipath.
  • A VGW created after Oct 28, 2015 selects only one active path out of the multiple paths defined.

The scenario was tested with a new VGW in one VPC and a pair of customer VPN appliances in another VPC. With 4 tunnels/paths, all traffic appeared to go to one tunnel only. AWS support confirmed the behavior: the VGW selects only one path.

Why AWS should support VPN multipath
With VPN, it may be desirable to spread load across multiple customer gateways, because those gateways may be Cisco or Palo Alto appliances with licensed throughput caps. It is better to spread load across multiple destinations than to send all traffic to one while the other paths sit idle.

Hopefully AWS will bring consistent multipath routing to VPN, with BGP multipath and static ECMP.