Tuesday, August 14, 2018

Managing GPU memory constraints for gradient boosting models

Gradient boosting works by constructing hundreds of trees, find best split by eval gradient/hessian of features is the most expensive and time consuming task. It has been an active research area to do that in parallel on GPU. The follow papers provide detailed descriptions of GPU implementation and benchmarks on xgboost and lightgbm.

With data size over 1-10 million range, I have observed GPU acceleration speeding up training time by up to 5-10x, while offering comparable accuracy, which is a big boost to data science work. However, the most significant constraint with the use of GPU for machine learning is GPU memory, which prevents modeling with larger data sets.

Know how much memory your GPU has

Whether you are on AWS, Google or Azure, chances are we have the same memory constraint based on Nvidia GPU, the most common hardware for machine learning is Tesla V100 with 16G GPU memory, which can fit the ballpark data size of around 10 million records. Actual milage would vary with data set and how you tuned it, more on this later.

It should be noted although multi-GPU option exists, it does nothing to alleviate per GPU memory constraint, since data set needs to be fit on a single GPU.

Earlier this year, Nvidia announced that all new Tesla V100 comes with 32G memory, which double the amount of memory per GPU. Effectively, it should also double the data set that can fit to around 20 million records. However, cloud vendors have not confirmed timeline. I have not seen it available yet.

Gradient Boosting parameters affecting GPU memory footprint

With implementation such as GPU version of xgboost, certain parameters affect memory allocation, therefore determines whether your data set will fit (otherwise you will get a memory error). The following parameters have an effect on whether your data set will fit on limited GPU memory, so you will need to budget available memory and allocate wisely. For example, you don't want to set parameters values unnecessarily high and causing over-allocation in certain dimensions.

Max Bin

Research in gradient boosting shows histogram method, which turns continuous feature value into discrete points for eval, can be equally effective, and it leads to more efficient implementation on GPU. Max_bin is a parameter that defines the maximum number of discrete points to evaluate the feature on.

Reducing from default value of 256 to 64, 32, or even 16 will reduce GPU memory required. You can run comparison tests on smaller data sets and compare evaluation metrics in order to determine impact on model performance.

Max depth, Boost round, Early Stopping

These parameters determines the number of trees to build and depth, which directly affects amount of memory allocated, so you want to put some thoughts into adjusting. Maybe start with an iterative approach and setting parameters how enough to fit the model first, then combine with parameter optimization, push the boundary of maximum memory utilization with fine tuning.

CPU Predictor

Even with GPU based training, you can still set CPU to be used for prediction, which reduces GPU memory required.


Using smaller GPU to fit larger data set

Saturday, April 21, 2018

Resolving Compiler issues with XgBoost GPU install on Amazon Linux

GPU accelerated xgboost has shown performance improvements especially on data set with large number of features, using 'gpu_hist' tree_method. More information can be found here:


The installation on ubuntu based distribution is straight forward. Best results are obtained with latest generation of Nvidia GPU (AWS P3)

However, when compiling on Amazon Linux (including deep learning image), the following error is seen at "cmake" step:
cmake .. -DUSE_CUDA=ON
-- The C compiler identification is GNU 7.2.1
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) (found version "1.0")
-- Found OpenMP_CXX: -fopenmp (found version "3.1") 

CMake Error at /home/ec2-user/anaconda3/envs/python3/share/cmake-3.9/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) (found
  version "1.0")

This can be a frustrating error to correct, if focusing on troubleshooting OpenMP. Turns out, OpenMP version support is "embedded" in the compilers (gcc and g++). So different versions of compilers come with different versions of OpenMP implementation. In this case, the error message indicates OpenMP_C is version 1.0, while OpenMP_CXX is version 3.1. Xgboost GPU will not compile when versions mismatch.

The mismatch is a result of Amazon Linux comes with mismatched C and C++ compiler versions:
$ gcc --version
gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)
$ g++ --version

g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)

To fix this problem, remove default older version of cmake, remove gcc (twice since it has two packages), reinstall gcc and gcc-c++:
sudo yum remove cmake
sudo yum remove gcc –y
sudo yum remove gcc –y
sudo yum install gcc48 gcc48-cpp –y
sudo yum install gcc-c++

Also reinstall cmake (using your preferred method)
conda install -c anaconda cmake

Now cmake finds matching versions for xgboost GPU compile to proceed:
-- Found OpenMP_C: -fopenmp (found version "3.1")
-- Found OpenMP_CXX: -fopenmp (found version "3.1")

The lessons learned here: Amazon Linux may not come perfectly set up, check compiler environment before installing new software, especially when GPU and parallel processing is involved.

Sunday, March 18, 2018

SageMaker model deployment - 3 simple steps

AWS SageMaker is a platform designed to support full lifecycle of data science process, from data preparation, to model training, to deployment. Having clean separation yet easy pipelining between model training and deployment is one of its greatest strength.  A model can be developed using a training instances and saved as files. The deployment process can retrieve model artifacts saved in S3, and deploy a run time environment as HTTP endpoints. Finally, any application can send REST queries and get prediction results back from deployed endpoints.

While simple in concept, information regarding the practical implementation of SageMaker model deployment and prediction queries is currently lacking and scattered. It is easier to grasp in the simple 3 step process contained in a notebook.

1. create deployment model

We assume a model has been built (trained), with results saved in S3. A deployment model is defined with both model artifacts and algorithm containers.

2. configure deployment instances

Next, define the size and number of deployment instances, which will host the run time for deployment model service endpoints.

3. deploy to service endpoints

Finally, create service endpoints, wait for completion, and model deployment is finished, now ready to service prediction requests.

The complete deployment process can be visualized as follows:

The complete sample notebook can be seen here: