Monday, October 12, 2015

Evaluating Amazon Machine Learning - in a Kaggle competition

Cloud changes IT in many ways. A new class of platform, database, messaging, and app services has emerged to enable the rapid delivery of cloud-native apps. IT architecture can no longer be satisfied with delivering compute, network, and storage; it must expand “up the stack,” putting more capabilities more rapidly into the hands of developers and business users.

A prime example of new IT capabilities in demand is Big Data and machine learning. With elasticity and on-demand computing, cloud has dramatically lowered the cost of entry. With emerging open source tool sets (e.g., the DMLC libraries, Jupyter, Anaconda, Python...), even individuals can now perform analytics on large data sets at a fraction of the cost of traditional methods (e.g., a SAS grid).

To gain insight and bridge the gap between the IT and data science communities, I experimented with the Amazon Machine Learning (AML) service and compared it with a custom-built open source tool set. By entering a Kaggle competition, I was also able to benchmark the results against the real world.

The particular Kaggle competition I used has the goal of predicting a hazard score from a dataset of property information. The hazard score to be predicted is a numeric value.

Machine Learning “as a service” test
AWS has delivered a service that puts modeling and predictive analytics into the hands of someone who is neither an IT specialist nor a data scientist. Its documentation provides enough information to build a model and run analytics, and it requires no prior modeling skills.

The first step is creating a data source. From the input data, AWS infers attribute types and creates a schema, which the user can further modify.
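As a sketch, the inferred schema is a JSON document along these lines. The attribute names below are hypothetical stand-ins for the competition's property fields, not actual AML output:

```python
import json

# Hypothetical schema in Amazon ML's data-schema JSON format; the user
# can edit attribute types (e.g., NUMERIC vs. CATEGORICAL) before training.
schema = {
    "version": "1.0",
    "targetAttributeName": "Hazard",  # the value to predict
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "Id", "attributeType": "CATEGORICAL"},
        {"attributeName": "PropertyAge", "attributeType": "NUMERIC"},   # hypothetical field
        {"attributeName": "RoofType", "attributeType": "CATEGORICAL"},  # hypothetical field
        {"attributeName": "Hazard", "attributeType": "NUMERIC"},
    ],
}
print(json.dumps(schema, indent=2))
```

Correcting an attribute that AML guessed wrong (say, an integer-coded category inferred as NUMERIC) is done by editing exactly this kind of document before training.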

The second step is to build an ML model from the data source. Amazon supports only three model types (binary classification, multiclass classification, and numerical regression). Since the prediction target is a numeric value, the only applicable Amazon model is the regression model. Note that real-world data science involves many more models, attribute-selection techniques, and sequencing variations than the three model types AWS offers, which makes data science as much an art as a science.
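The same step can also be driven through the AML API. A hedged sketch using boto3's `machinelearning` client (the IDs and the small helper are hypothetical; my experiment was run interactively, not with this code):

```python
def choose_model_type(target_is_numeric, n_classes=None):
    """Map a prediction target onto the only three model types AML offers."""
    if target_is_numeric:
        return "REGRESSION"
    return "BINARY" if n_classes == 2 else "MULTICLASS"


def create_hazard_model(data_source_id):
    """Sketch: create the regression model via the AML API (training is asynchronous)."""
    import boto3  # imported here so the sketch is readable without AWS credentials
    client = boto3.client("machinelearning")
    client.create_ml_model(
        MLModelId="ml-hazard-regression",  # hypothetical ID
        MLModelName="Hazard regression",
        MLModelType=choose_model_type(target_is_numeric=True),
        TrainingDataSourceId=data_source_id,
    )
```

Because the hazard score is numeric, `choose_model_type` can only ever return `"REGRESSION"` here; the other two branches exist to show how little room the service leaves for model selection.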

AWS’s built-in regression evaluation uses the residual distribution to assess the model. In this particular case, the model has a tendency toward negative residuals, which indicates overestimation (the actual target tends to be smaller than the predicted target).
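The residual check itself is simple; a toy illustration with hypothetical numbers (not the competition data):

```python
import statistics

# residual = actual - predicted; a distribution centred below zero
# means the model overestimates on average.
actual    = [2.0, 1.0, 3.0, 2.0, 1.0]
predicted = [2.5, 1.8, 3.1, 2.6, 1.4]  # hypothetical model output
residuals = [a - p for a, p in zip(actual, predicted)]

print(statistics.mean(residuals))  # negative mean -> overestimation
```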

To further evaluate the model’s performance, I used it to calculate hazard scores for the real data set in the Kaggle competition. After the competition closed, the AWS ML model obtained a score of 0.343, ranking 1830th out of a total of 2236 submissions. The winning submission scored 0.397.
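For context on what that score means: assuming the competition used the normalized Gini coefficient (consistent with the score range above, though the metric is not named in this post), a minimal pure-Python sketch of the metric looks like this:

```python
def gini(actual, predicted):
    """Kaggle-style Gini coefficient of a ranking induced by `predicted`."""
    # Sort actual values by predicted value, descending; Python's stable
    # sort breaks ties by original index, as in the common implementation.
    order = sorted(range(len(actual)), key=lambda i: -predicted[i])
    total = sum(actual)
    cum, gini_sum = 0.0, 0.0
    for i in order:
        cum += actual[i]
        gini_sum += cum
    return (gini_sum / total - (len(actual) + 1) / 2.0) / len(actual)

def normalized_gini(actual, predicted):
    """Normalize so a perfect ranking scores 1.0."""
    return gini(actual, predicted) / gini(actual, actual)
```

Under this metric, only the ordering of the predictions matters, not their absolute values, so a model can overestimate systematically (as the residual analysis suggested) and still score reasonably.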

Machine Learning “on a server” test
For comparison, I built a custom server on AWS infrastructure and deployed a set of data science tools and libraries from the open source community at no additional cost. For the Kaggle competition, I used an emerging ML model called XGBoost (developed by Tianqi Chen, a PhD student at the University of Washington). The resulting score is 0.392, which ranks 299th out of 2236.

For cost comparison, running evaluations on Amazon ML was quite expensive: I ran it only a few times with about 50,000 records and ended up spending over $50. The cost of the custom server is almost negligible, as a mid-sized EC2 instance is quite adequate for running the XGBoost Python code.

Amazon Machine Learning “as a service” delivers a very easy-to-use tool. It frees users from having to build, scale, and maintain machine learning infrastructure. However, in its current form it suits only a narrow set of problems that match the simple models provided. As large enterprises typically face more sophisticated analytical challenges, such as those represented in Kaggle competitions, AML is of limited value to the data science community.

On the other hand, as data science is being revolutionized by open source, there seem to be huge opportunities for Amazon and AML to improve.

I haven’t found much benchmarking work out there. Here are a couple of posts comparing AML with others, including Google Prediction and Azure Machine Learning.