Tuesday, August 14, 2018

Managing GPU memory constraints for gradient boosting models

Gradient boosting works by constructing hundreds of trees, find best split by eval gradient/hessian of features is the most expensive and time consuming task. It has been an active research area to do that in parallel on GPU. The follow papers provide detailed descriptions of GPU implementation and benchmarks on xgboost and lightgbm.

With data size over 1-10 million range, I have observed GPU acceleration speeding up training time by up to 5-10x, while offering comparable accuracy, which is a big boost to data science work. However, the most significant constraint with the use of GPU for machine learning is GPU memory, which prevents modeling with larger data sets.

Know how much memory your GPU has

Whether you are on AWS, Google or Azure, chances are we have the same memory constraint based on Nvidia GPU, the most common hardware for machine learning is Tesla V100 with 16G GPU memory, which can fit the ballpark data size of around 10 million records. Actual milage would vary with data set and how you tuned it, more on this later.

It should be noted although multi-GPU option exists, it does nothing to alleviate per GPU memory constraint, since data set needs to be fit on a single GPU.

Earlier this year, Nvidia announced that all new Tesla V100 comes with 32G memory, which double the amount of memory per GPU. Effectively, it should also double the data set that can fit to around 20 million records. However, cloud vendors have not confirmed timeline. I have not seen it available yet.

Gradient Boosting parameters affecting GPU memory footprint

With implementation such as GPU version of xgboost, certain parameters affect memory allocation, therefore determines whether your data set will fit (otherwise you will get a memory error). The following parameters have an effect on whether your data set will fit on limited GPU memory, so you will need to budget available memory and allocate wisely. For example, you don't want to set parameters values unnecessarily high and causing over-allocation in certain dimensions.

Max Bin

Research in gradient boosting shows histogram method, which turns continuous feature value into discrete points for eval, can be equally effective, and it leads to more efficient implementation on GPU. Max_bin is a parameter that defines the maximum number of discrete points to evaluate the feature on.

Reducing from default value of 256 to 64, 32, or even 16 will reduce GPU memory required. You can run comparison tests on smaller data sets and compare evaluation metrics in order to determine impact on model performance.

Max depth, Boost round, Early Stopping

These parameters determines the number of trees to build and depth, which directly affects amount of memory allocated, so you want to put some thoughts into adjusting. Maybe start with an iterative approach and setting parameters how enough to fit the model first, then combine with parameter optimization, push the boundary of maximum memory utilization with fine tuning.

CPU Predictor

Even with GPU based training, you can still set CPU to be used for prediction, which reduces GPU memory required.


Using smaller GPU to fit larger data set