Saturday, May 10, 2014

AWS automation – CloudFormation bootstrapping early lessons – Part 2

I shared some lessons from the initial learning curve towards more sophisticated CloudFormation capabilities in part 1 of this post. While it is easy to get started by mimicking an existing design, it takes a more in-depth understanding of bootstrapping to design for your specific target behavior and to troubleshoot effectively.

Build Incrementally
It may be tempting to develop the full template and scripts all at once, and to test the full feature set against the target design. If you are lucky, everything works the first time. More often than not, however, the many components involved mean that some troubleshooting will be needed. At that point, running the whole system every time you change a snippet is time-consuming and often counterproductive to isolating the root cause.

In other words, break the solution down into logical components and build incrementally. Start testing and troubleshooting early, at the component level. Once the components have been tested, it is much easier to assemble them into a complete system.

Logically, an incremental build may flow like this:
  • Develop and validate a basic template that creates target resources
  • Verify that the template launches target instance(s) and/or auto-scaling groups, ELBs, etc.
  • Instance installs specified software and packages successfully
  • Instance can access external data store (such as S3) and create local file structure per design
  • Instance can run the specified command/script/code
  • The specified command/script/code performs the desired function
  • CloudFormation receives a signal upon completion
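
The first step in the list above, validating a basic template, can be sketched in Python. This is only a minimal sketch: it assumes the boto3 SDK is installed and AWS credentials are configured for the remote check, and the template and the resource name BootstrapBucket are hypothetical placeholders of mine, not part of any real design.

```python
import json

# Hypothetical minimal template: a single resource, just enough to validate.
BASIC_TEMPLATE = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Step 1 of an incremental build: one resource only",
    "Resources": {
        "BootstrapBucket": {"Type": "AWS::S3::Bucket"},
    },
}


def local_check(template):
    """Cheap local sanity check before calling AWS; returns a list of problems."""
    problems = []
    resources = template.get("Resources")
    if not isinstance(resources, dict) or not resources:
        problems.append("template needs a non-empty Resources section")
        return problems
    for name, body in resources.items():
        if not isinstance(body, dict) or "Type" not in body:
            problems.append("resource %s has no Type" % name)
    return problems


def validate_with_aws(template):
    """Ask CloudFormation to validate the template (needs boto3 and credentials)."""
    import boto3  # imported here so the local check works without the SDK

    client = boto3.client("cloudformation")
    return client.validate_template(TemplateBody=json.dumps(template))


if __name__ == "__main__":
    print(local_check(BASIC_TEMPLATE) or "local check passed")
```

Once this smallest template validates and launches, add the next item from the list and test again, rather than adding several pieces at once.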

Think Modular
An incremental approach also encourages the development of reusable code. For example, you may find it beneficial to capture a specific feature in a utility template that has been tested and proven. In the future, a new app can call this utility as a nested template, passing in parameters.
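
As a rough illustration of calling a utility template as a nested stack, here is a sketch that builds a parent template in Python. The logical names UtilityStack and UtilityTemplateURL and the AppName parameter are hypothetical; the only CloudFormation feature relied on is the AWS::CloudFormation::Stack resource type with its TemplateURL and Parameters properties.

```python
import json


def parent_template(app_name):
    """Build a parent template that calls a tested utility template as a nested stack."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            # Location of the already-tested utility template, supplied at launch.
            "UtilityTemplateURL": {
                "Type": "String",
                "Description": "S3 URL of the utility template",
            },
        },
        "Resources": {
            "UtilityStack": {
                "Type": "AWS::CloudFormation::Stack",
                "Properties": {
                    "TemplateURL": {"Ref": "UtilityTemplateURL"},
                    # Hypothetical parameter expected by the utility template.
                    "Parameters": {"AppName": app_name},
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(parent_template("my-new-app"), indent=2))
```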

Disable Rollback
By default, CloudFormation rolls back the stack if an error occurs during stack creation. For troubleshooting, it is often not sufficient to look at the CloudFormation event log alone; you also need to preserve the failed instances in order to collect more detailed clues. Therefore, it is essential to set DisableRollback to true (or, if creating the stack from the console, to expand “advanced option” and deselect the default rollback option).

After you have examined the failed instances, you can manually delete the stack, which cleans up the unwanted instances. You can then modify your code and repeat the stack creation process.
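
If you create stacks from a script rather than the console, the same cycle might look like the sketch below. It assumes boto3 and configured credentials; the helper names are my own, while create_stack, its DisableRollback argument, and delete_stack are part of the boto3 CloudFormation client.

```python
import json


def create_stack_args(stack_name, template):
    """Arguments for create_stack with rollback disabled, so failed instances survive."""
    return {
        "StackName": stack_name,
        "TemplateBody": json.dumps(template),
        "DisableRollback": True,
    }


def create_for_debugging(stack_name, template):
    """Create the stack; on failure it stays in place for inspection."""
    import boto3  # needs the SDK and AWS credentials

    client = boto3.client("cloudformation")
    return client.create_stack(**create_stack_args(stack_name, template))


def delete_when_done(stack_name):
    """Manually delete the stack after examining the failed instances."""
    import boto3

    boto3.client("cloudformation").delete_stack(StackName=stack_name)
```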

Troubleshoot on the instance
If things don’t work as expected, the most specific and definitive information is always on the instance itself. Log on to the instance using your credentials.

Check the instance logs, for example the cfn-init logs. On Linux: /var/log/cfn-init.log; on Windows: C:\cfn\log\cfn-init.log
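
A small helper can save some scrolling through these logs. The sketch below simply pulls out lines containing a few keywords; the keyword list is my own assumption about what is worth looking at, not a documented cfn-init log format, so adjust it to what you actually see in your logs.

```python
import os

# The two log locations mentioned above.
LOG_PATHS = ["/var/log/cfn-init.log", r"C:\cfn\log\cfn-init.log"]

# Assumed keywords; tune these to your own logs.
KEYWORDS = ("error", "fail", "traceback")


def suspicious_lines(lines, keywords=KEYWORDS):
    """Return (line_number, text) for every line containing one of the keywords."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append((number, line.rstrip()))
    return hits


def scan_first_existing(paths=LOG_PATHS):
    """Scan the first log file that exists on this instance."""
    for path in paths:
        if os.path.exists(path):
            with open(path, errors="replace") as handle:
                return path, suspicious_lines(handle)
    return None, []


if __name__ == "__main__":
    found, hits = scan_first_existing()
    print(found or "no cfn-init log found")
    for number, text in hits:
        print("%d: %s" % (number, text))
```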

Take out the guessing
While your final product should be concise and elegant, you should feel free to generate additional information and output to help pinpoint the issue during development and troubleshooting. Why not make it obvious and easy for yourself?

You can apply any development technique here. For example, insert lines into your script or code that print to a log file. I also find it more efficient to test the script directly on the instance, which often reveals issues without the lengthy cycle of deleting and recreating stacks every time you make a change. Because the instance is already in the target VPC, you can use the command line directly to simulate the bootstrapping process.
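
Here is one way the "print to a log file" idea could look if your bootstrap logic is written in Python. It is a sketch under my own assumptions: the log file name and the step names are hypothetical, and each step is just a command run through the standard subprocess module with its output and exit code recorded.

```python
import logging
import os
import subprocess
import sys
import tempfile

# Hypothetical location for the extra debugging log.
DEFAULT_LOG = os.path.join(tempfile.gettempdir(), "bootstrap-debug.log")


def make_logger(path=DEFAULT_LOG):
    """Create a logger that writes timestamped lines to the given file."""
    logger = logging.getLogger("bootstrap")
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def run_step(logger, name, command):
    """Run one bootstrap step and record its output and exit code."""
    logger.info("starting step %s: %s", name, command)
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    logger.debug("output of %s: %s", name, result.stdout.strip())
    logger.info("step %s exited with code %d", name, result.returncode)
    return result.returncode


if __name__ == "__main__":
    log = make_logger()
    run_step(log, "python-version", [sys.executable, "--version"])
```

Running the same file by hand on the instance gives you the same log you would get during bootstrapping, which makes the two easy to compare.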

Tune timeout
A WaitCondition lets CloudFormation receive a signal back from the instance. If you have experienced long delays before a WaitCondition reports failure, check its timeout value. A typical bootstrapping operation takes no more than 5 minutes, so there is no point waiting much longer. By decreasing the timeout to less than 10 minutes, you will save a lot of time and frustration.
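
For reference, a WaitCondition with a short timeout could be generated like this. The logical names BootstrapHandle, BootstrapWait and AppInstance are hypothetical, and the 480-second default is only an example that fits the "less than 10 minutes" guideline; the Timeout property itself is expressed in seconds.

```python
def wait_condition_resources(timeout_seconds=480, depends_on="AppInstance"):
    """Build WaitCondition resources that time out in under 10 minutes."""
    if not 0 < timeout_seconds < 600:
        raise ValueError("keep the timeout between 1 and 599 seconds")
    return {
        "BootstrapHandle": {"Type": "AWS::CloudFormation::WaitConditionHandle"},
        "BootstrapWait": {
            "Type": "AWS::CloudFormation::WaitCondition",
            # Start waiting only once the (hypothetical) instance resource exists.
            "DependsOn": depends_on,
            "Properties": {
                "Handle": {"Ref": "BootstrapHandle"},
                "Timeout": str(timeout_seconds),  # seconds, passed as a string
                "Count": 1,
            },
        },
    }
```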

Watch external dependencies
Often, a script that runs well locally does not work during bootstrapping. Consider the external conditions the instance relies on, and treat them as necessary conditions for bootstrapping to run successfully:
  • Internet access from VPC
  • Security groups and policies applied to the instance
  • Instance role and access privilege
  • DNS
  • External data store access protection 
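
Some of the conditions above, such as DNS and outbound network access, can be checked from the instance before the real bootstrap work starts. The following is a minimal preflight sketch using only the Python standard library; the function names are my own, and checks for instance roles or bucket policies are not covered here.

```python
import socket


def dns_resolves(hostname):
    """True if the instance can resolve the hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def tcp_reachable(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(checks):
    """Run named checks (label -> zero-argument callable) and collect the results."""
    return {label: check() for label, check in checks.items()}


if __name__ == "__main__":
    results = preflight({
        "dns: s3.amazonaws.com": lambda: dns_resolves("s3.amazonaws.com"),
        "https: s3.amazonaws.com": lambda: tcp_reachable("s3.amazonaws.com", 443),
    })
    for label, ok in results.items():
        print("%-28s %s" % (label, "ok" if ok else "FAILED"))
```

A failed DNS or HTTPS check here points at the VPC, routing or security group setup rather than at your script.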

The more sophisticated automation capabilities become, the more components are involved in a complete sequence of events. Later, one process may pass variables to another. There will be more error handling, more nested templates, parameters, code, conditions, and so on. But every journey starts somewhere, and the lessons learned from bootstrapping provide a good first step.
