Saturday, April 21, 2018

Resolving Compiler issues with XgBoost GPU install on Amazon Linux

GPU-accelerated xgboost has shown performance improvements, especially on data sets with a large number of features, using the 'gpu_hist' tree_method. More information can be found here:

The installation on Ubuntu-based distributions is straightforward. Best results are obtained with the latest generation of Nvidia GPUs (AWS P3).
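For reference, the usual GPU build sequence looks like this (a sketch, assuming the CUDA toolkit and build tools are already installed; the cmake step is the one that fails below on Amazon Linux):

```shell
# Standard xgboost GPU build (sketch; requires CUDA toolkit, git, cmake, make)
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost && mkdir build && cd build
cmake .. -DUSE_CUDA=ON
make -j4
```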

However, when compiling on Amazon Linux (including the deep learning image), the following error is seen at the "cmake" step:
cmake .. -DUSE_CUDA=ON
-- The C compiler identification is GNU 7.2.1
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) (found version "1.0")
-- Found OpenMP_CXX: -fopenmp (found version "3.1") 

CMake Error at /home/ec2-user/anaconda3/envs/python3/share/cmake-3.9/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) (found
  version "1.0")

This can be a frustrating error to correct if you focus on troubleshooting OpenMP itself. It turns out OpenMP version support is "embedded" in the compilers (gcc and g++): different compiler versions ship different versions of the OpenMP implementation. In this case, the error message indicates OpenMP_C is version 1.0 while OpenMP_CXX is version 3.1, and xgboost GPU will not compile when the versions mismatch.

The mismatch occurs because Amazon Linux ships with mismatched C and C++ compiler versions:
$ gcc --version
gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)
$ g++ --version
g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)

To fix this problem, remove the default older version of cmake, remove gcc (twice, since it has two packages), then reinstall gcc and gcc-c++:
sudo yum remove cmake
sudo yum remove gcc -y
sudo yum remove gcc -y
sudo yum install gcc48 gcc48-cpp -y
sudo yum install gcc-c++
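Before rerunning cmake, a quick sanity check (a sketch) confirms the two front ends now agree:

```shell
# Both compilers should now report the same version.
gcc --version | head -n 1
g++ --version | head -n 1
test "$(gcc -dumpversion)" = "$(g++ -dumpversion)" && echo "gcc and g++ versions match"
```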

Also reinstall cmake (using your preferred method):
conda install -c anaconda cmake

Now cmake finds matching versions for xgboost GPU compile to proceed:
-- Found OpenMP_C: -fopenmp (found version "3.1")
-- Found OpenMP_CXX: -fopenmp (found version "3.1")

The lesson learned here: Amazon Linux may not come perfectly set up, so check the compiler environment before installing new software, especially when GPU and parallel processing are involved.

Sunday, March 18, 2018

SageMaker model deployment - 3 simple steps

AWS SageMaker is a platform designed to support the full lifecycle of the data science process, from data preparation, to model training, to deployment. Having a clean separation, yet easy pipelining, between model training and deployment is one of its greatest strengths. A model can be developed on a training instance and saved as files. The deployment process then retrieves the model artifacts saved in S3 and deploys a runtime environment as HTTP endpoints. Finally, any application can send REST queries to the deployed endpoints and get prediction results back.

While simple in concept, information on the practical implementation of SageMaker model deployment and prediction queries is currently lacking and scattered. It is easier to grasp as a simple 3-step process contained in a notebook.

1. create deployment model

We assume a model has already been built (trained), with results saved in S3. A deployment model is defined by both the model artifacts and the algorithm container.

2. configure deployment instances

Next, define the size and number of deployment instances that will host the runtime for the deployment model's service endpoints.

3. deploy to service endpoints

Finally, create the service endpoints and wait for completion; model deployment is then finished and ready to serve prediction requests.
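The three steps above can be sketched with the AWS CLI (a hedged outline; every name, the role ARN, container image URI, and S3 path below are placeholders to replace with your own values):

```shell
# 1. Create a deployment model from saved artifacts + algorithm container.
aws sagemaker create-model \
    --model-name my-xgboost-model \
    --execution-role-arn arn:aws:iam::123456789012:role/MySageMakerRole \
    --primary-container Image=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-xgboost:latest,ModelDataUrl=s3://my-bucket/output/model.tar.gz

# 2. Configure the deployment instances (size and count).
aws sagemaker create-endpoint-config \
    --endpoint-config-name my-xgboost-config \
    --production-variants VariantName=AllTraffic,ModelName=my-xgboost-model,InitialInstanceCount=1,InstanceType=ml.m4.xlarge,InitialVariantWeight=1.0

# 3. Create the service endpoint and wait until it is in service.
aws sagemaker create-endpoint \
    --endpoint-name my-xgboost-endpoint \
    --endpoint-config-name my-xgboost-config
aws sagemaker wait endpoint-in-service --endpoint-name my-xgboost-endpoint
```

Once the endpoint is InService, an application can query it, for example with `aws sagemaker-runtime invoke-endpoint --endpoint-name my-xgboost-endpoint --content-type text/csv --body '1.0,2.0' out.json`.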

The complete deployment process can be visualized as follows:

The complete sample notebook can be seen here:

Sunday, February 4, 2018

Auto starting R Studio on AWS Deep Learning server

As an enhancement to machine learning servers built on AWS or Azure, it is often necessary to set up an R development environment to meet the needs of the data science community.

Adapt this for your specific environment. Here we assume the AWS deep learning conda image (Ubuntu). Specifically, we use the "python3" virtual environment (source activate python3). One reason to use this environment is that it is already set up to run Jupyter Notebook (see auto start jupyter), so we can add an additional R kernel to it. We then have a consolidated image that can be offered to both Python and R users.

The easiest way to install R is with conda:
conda install r r-essentials
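To expose this R installation as the additional Jupyter kernel mentioned above, a sketch (assuming the IRkernel package came with r-essentials into the active "python3" environment):

```shell
# Register the R kernel with Jupyter for the current user,
# then verify it appears alongside the Python kernels.
R -e 'IRkernel::installspec(user = TRUE)'
jupyter kernelspec list
```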

RStudio is a popular R development environment. Follow the instructions to install RStudio Server, for example:
sudo apt-get install gdebi-core
sudo gdebi rstudio-server-1.1.419-i386.deb

The above procedure also sets up auto start of RStudio Server by adding /etc/systemd/system/rstudio-server.service. However, because the only available procedure installs RStudio with "sudo" into the default system environment, it cannot find R, which has been installed into a different (conda) environment. As a result, RStudio fails to start, with an error indicating it cannot find R:
rstudio-server verify-installation
Unable to find an installation of R on the system (which R didn't return valid output); Unable to locate R binary by scanning standard locations

This is easily fixed by specifying the exact path to R for RStudio (replace the path with your own installation of R):
sudo sh -c 'echo "rsession-which-r=/home/ubuntu/anaconda3/envs/python3/bin/R" >> /etc/rstudio/rserver.conf'

Restart the instance; RStudio Server now starts successfully. Log in with Linux credentials at:
http://(server IP):8787