Friday, February 7, 2020

Predicting Market Rank for Airbnb Listings

Motivation

There are several reasons that make Airbnb data challenging and rewarding to work with:
  • Unlike Kaggle, where objectives and metrics are predefined, open-ended problem definition is a critical data science skill - how do you identify a valuable business objective and build an analytical framework and modeling solutions around it?
  • Rich information, including structured data, text, and images, to be assessed and narrated - how do you deal with having too much data and missing certain data at the same time?
  • Regular data updates from sources such as insideairbnb.com, which provide critical feedback and enable iterative optimization
  • Behind the data are places, people, and their diverse cultures, waiting to be interpreted and uncovered. Enhancing the Airbnb experience can enrich people's lives, which makes it a meaningful and fascinating goal for data scientists

Why predict Market Rank instead of Price?

Airbnb listings contain a rich array of information, including structured data, text, and images. The convenience and regular updates of data from sources such as insideairbnb.com make them intriguing data science projects. Let's start by building baseline models.

Model design and training target

The better a listing does, the higher its market rank should be. How do we define and obtain "market rank" labels for supervised training? We are looking for data that tells us how popular a listing is. Although we don't have actual booking information, we do have availability and review data that is updated monthly. Let's look at the pros and cons of these two key metrics:
  • Review rate: for the same target time range, count how many reviews are added and compute a rate of new reviews
  • Availability: changes in a listing's future availability over the target booking window can serve as a proxy for bookings
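To make the review-rate idea concrete, here is a minimal sketch of computing it from two monthly snapshots of the same listing. The snapshot values are hypothetical; only the listing ID and dates come from the project described below.

```python
from datetime import date

# Two monthly snapshots of the same listing (hypothetical review counts):
# cumulative number of reviews at time A and time B.
snapshot_a = {"listing_id": 8570847, "date": date(2019, 3, 6), "number_of_reviews": 120}
snapshot_b = {"listing_id": 8570847, "date": date(2019, 12, 5), "number_of_reviews": 147}

def review_rate(snap_a, snap_b):
    """Reviews added per 30 days between two snapshots of one listing."""
    days = (snap_b["date"] - snap_a["date"]).days
    added = snap_b["number_of_reviews"] - snap_a["number_of_reviews"]
    return added / days * 30

rate = review_rate(snapshot_a, snapshot_b)  # ~2.96 new reviews per 30 days
```

A rate like this, rather than the raw review count, lets listings of different ages be compared on equal footing.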








Developing Baseline Models

For the prototype, I used data for Airbnb listings in Los Angeles, with 2019-03-06 as time A, 2019-12-05 as time B, and the 30 days following 2019-12-05 as the target booking window. For simplicity, I used XGBoost with a minimal amount of data processing and tuning.








  • Entire home/apt is the room type most preferred by guests
  • Simple features such as a coffee machine or self check-in can boost a listing
  • Feature impacts differ between the two models, making them complementary to each other



Market Rank Illustrated

Here we evaluate the models using a reserved test set of listings in the Los Angeles area. After combining scores from the booking model and the review model, we obtain a city-wide ranking score ("market_rank_city") in percentile form. For example, listing ID 8570847 is in the top 98.6% in terms of market competitiveness, while listing ID 14014479 is in the bottom 6.9%.
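The score combination and percentile conversion can be sketched in a few lines. The averaging scheme and all score values here are illustrative assumptions (the two real listing IDs are joined by invented ones to make a small population):

```python
# Hypothetical per-listing scores from the two models (0-1 range).
scores = {
    8570847:  {"booking": 0.95, "review": 0.92},
    14014479: {"booking": 0.10, "review": 0.05},
    5222051:  {"booking": 0.60, "review": 0.55},
    9370585:  {"booking": 0.40, "review": 0.70},
}

# Combine the two model scores with a simple average (one possible scheme),
# then rank city-wide.
combined = {lid: (s["booking"] + s["review"]) / 2 for lid, s in scores.items()}

def market_rank_city(listing_id):
    """Percentile rank: fraction of listings scoring at or below this one."""
    mine = combined[listing_id]
    at_or_below = sum(1 for v in combined.values() if v <= mine)
    return at_or_below / len(combined)
```

With a real test set the population would be all listings in the city, so the percentile becomes a fine-grained competitiveness score.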








Model performance on new listings

Note that we include historical information, such as the growth rate of reviews and booking activity, to predict future outcomes. This is not a form of data leakage; rather, it reflects how a potential guest actually evaluates a listing. Consequently, the model is more neutral on new listings, for which many features are missing. This, too, is a true reflection of reality.

What is next

I have shared the journey of a data science project using real-world data: starting from a meaningful objective, researching available data, experimenting with model combination, and arriving at promising results. The baseline models developed here only scratch the surface of what is possible. Thanks to the vast amount and dynamic nature of Airbnb data, further improvements may come from more data scrubbing, feature engineering, and algorithm tuning. Adding image and text information also makes for exciting exploration with deep learning.

Tuesday, August 14, 2018

Managing GPU memory constraints for gradient boosting models

Gradient boosting works by constructing hundreds of trees; finding the best split by evaluating the gradient/hessian of features is the most expensive and time-consuming task. Doing this in parallel on a GPU has been an active research area. The following papers provide detailed descriptions of GPU implementations and benchmarks for xgboost and lightgbm.

With data sizes in the 1-10 million range, I have observed GPU acceleration speeding up training by 5-10x while offering comparable accuracy, which is a big boost to data science work. However, the most significant constraint on using GPUs for machine learning is GPU memory, which prevents modeling with larger data sets.

Know how much memory your GPU has

Whether you are on AWS, Google, or Azure, chances are you face the same memory constraint based on Nvidia GPUs. The most common hardware for machine learning is the Tesla V100 with 16 GB of GPU memory, which can fit a ballpark data size of around 10 million records. Actual mileage will vary with the data set and how you tune it; more on this later.

It should be noted that although a multi-GPU option exists, it does nothing to alleviate the per-GPU memory constraint, since the data set needs to fit on a single GPU.

Earlier this year, Nvidia announced that all new Tesla V100s come with 32 GB of memory, doubling the amount of memory per GPU. Effectively, it should also roughly double the data set size that fits, to around 20 million records. However, cloud vendors have not confirmed a timeline, and I have not seen it available yet.

Gradient Boosting parameters affecting GPU memory footprint

With implementations such as the GPU version of xgboost, certain parameters affect memory allocation and therefore determine whether your data set will fit (otherwise you will get a memory error). The following parameters have an effect on whether your data set will fit in limited GPU memory, so you will need to budget available memory and allocate wisely. For example, you don't want to set parameter values unnecessarily high and cause over-allocation in certain dimensions.

Max Bin

Research in gradient boosting shows that the histogram method, which turns continuous feature values into discrete points for evaluation, can be equally effective, and it leads to a more efficient implementation on GPU. Max_bin is the parameter that defines the maximum number of discrete points used to evaluate a feature.
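As a toy illustration of what the histogram method does (a simplified sketch, not xgboost's actual internals), we can quantile-bin a continuous feature into at most max_bin discrete values:

```python
def quantile_bins(values, max_bin):
    """Return sorted cut points splitting `values` into up to max_bin buckets."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for b in range(1, max_bin):
        cut = ordered[b * n // max_bin]
        if not cuts or cut != cuts[-1]:  # skip duplicate cut points
            cuts.append(cut)
    return cuts

def bin_index(value, cuts):
    """Map a raw feature value to its histogram bin index."""
    return sum(1 for c in cuts if value >= c)

# Hypothetical nightly prices; with max_bin=4, split candidates drop from
# every distinct value to just 3 cut points.
prices = [50, 75, 80, 120, 150, 200, 320, 999]
cuts = quantile_bins(prices, max_bin=4)  # [80, 150, 320]
```

The gradient/hessian statistics are then accumulated per bin, so the split search scales with max_bin rather than with the number of distinct feature values, which is what saves both time and memory.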

Reducing it from the default value of 256 to 64, 32, or even 16 will reduce the GPU memory required. You can run comparison tests on smaller data sets and compare evaluation metrics to determine the impact on model performance.


Max depth, Boost round, Early Stopping

These parameters determine the number of trees to build and their depth, which directly affects the amount of memory allocated, so put some thought into adjusting them. Start with an iterative approach: set parameters low enough for the model to fit in memory first, then combine with parameter optimization and push toward maximum memory utilization with fine tuning.

CPU Predictor

Even with GPU-based training, you can still set the CPU to be used for prediction, which reduces the GPU memory required:

'predictor':'cpu_predictor',
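Putting the knobs above together, a memory-conscious xgboost parameter dict might look like the following sketch. The specific values are illustrative assumptions, not tuned recommendations:

```python
# Parameter choices that trade a little accuracy/speed for GPU memory:
# histogram training on GPU, a reduced max_bin, shallower trees, and
# CPU-side prediction.
params = {
    "tree_method": "gpu_hist",     # GPU histogram algorithm
    "max_bin": 64,                 # down from the default 256 to save memory
    "max_depth": 6,                # shallower trees allocate less
    "predictor": "cpu_predictor",  # predict on CPU, freeing GPU memory
}

# Typical usage (assuming xgboost is installed and dtrain is a DMatrix):
# booster = xgboost.train(params, dtrain, num_boost_round=500,
#                         early_stopping_rounds=20, evals=[(dvalid, "valid")])
```

With early stopping, num_boost_round acts as a ceiling rather than a fixed tree count, which helps keep the memory budget predictable.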



Using a smaller GPU to fit a larger data set



Saturday, April 21, 2018

Resolving Compiler issues with XgBoost GPU install on Amazon Linux

GPU-accelerated xgboost has shown performance improvements, especially on data sets with a large number of features, using the 'gpu_hist' tree_method. More information can be found here:

http://xgboost.readthedocs.io/en/latest/gpu/

The installation on Ubuntu-based distributions is straightforward. Best results are obtained with the latest generation of Nvidia GPUs (AWS P3):
http://xgboost.readthedocs.io/en/latest/build.html#building-with-gpu-support

However, when compiling on Amazon Linux (including the deep learning image), the following error is seen at the "cmake" step:
cmake .. -DUSE_CUDA=ON
-- The C compiler identification is GNU 7.2.1
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) (found version "1.0")
-- Found OpenMP_CXX: -fopenmp (found version "3.1") 
...

CMake Error at /home/ec2-user/anaconda3/envs/python3/share/cmake-3.9/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) (found
  version "1.0")


This can be a frustrating error to correct if you focus on troubleshooting OpenMP itself. It turns out that OpenMP version support is "embedded" in the compilers (gcc and g++), so different compiler versions come with different versions of the OpenMP implementation. In this case, the error message indicates OpenMP_C is version 1.0 while OpenMP_CXX is version 3.1. Xgboost GPU will not compile when the versions mismatch.

The mismatch is a result of Amazon Linux shipping with mismatched C and C++ compiler versions:
$ gcc --version
gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)
$ g++ --version

g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)

To fix this problem, remove the default older version of cmake, remove gcc (twice, since it has two packages), then reinstall gcc and gcc-c++:
sudo yum remove cmake
sudo yum remove gcc -y
sudo yum remove gcc -y
sudo yum install gcc48 gcc48-cpp -y
sudo yum install gcc-c++

Also reinstall cmake (using your preferred method):
conda install -c anaconda cmake

Now cmake finds matching versions, and the xgboost GPU compile can proceed:
-- Found OpenMP_C: -fopenmp (found version "3.1")
-- Found OpenMP_CXX: -fopenmp (found version "3.1")

The lesson learned here: Amazon Linux may not come perfectly set up, so check the compiler environment before installing new software, especially when GPU and parallel processing are involved.

Sunday, March 18, 2018

SageMaker model deployment - 3 simple steps


AWS SageMaker is a platform designed to support the full lifecycle of the data science process, from data preparation to model training to deployment. Clean separation, yet easy pipelining, between model training and deployment is one of its greatest strengths. A model can be developed using a training instance and saved as files. The deployment process retrieves model artifacts saved in S3 and deploys a runtime environment as HTTP endpoints. Finally, any application can send REST queries and get prediction results back from the deployed endpoints.

While simple in concept, practical information on SageMaker model deployment and prediction queries is currently lacking and scattered. It is easier to grasp as the simple 3-step process contained in a notebook.

1. create deployment model

We assume a model has been built (trained), with results saved in S3. A deployment model is defined with both model artifacts and algorithm containers.



2. configure deployment instances

Next, define the size and number of deployment instances that will host the runtime for the deployment model's service endpoints.
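A sketch of the step 2 endpoint configuration, again assuming the boto3 SageMaker client; the instance type and count here are illustrative:

```python
# Endpoint configuration: which model to host, on what hardware, and how
# many instances behind the endpoint.
endpoint_config_request = {
    "EndpointConfigName": "xgboost-demo-config",
    "ProductionVariants": [
        {
            "VariantName": "AllTraffic",
            "ModelName": "xgboost-demo-model",
            "InstanceType": "ml.m4.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
}

# sm = boto3.client("sagemaker")
# sm.create_endpoint_config(**endpoint_config_request)
```

Keeping the configuration separate from the endpoint itself is what lets you later swap models or resize instances by pointing the same endpoint at a new configuration.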



3. deploy to service endpoints

Finally, create the service endpoints and wait for completion. Model deployment is then finished and ready to serve prediction requests.
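A sketch of step 3: create the endpoint, wait until it is InService, then query it. The boto3 calls are shown as comments, and the endpoint name and CSV payload are placeholders:

```python
# CreateEndpoint request: binds an endpoint name to the configuration
# defined in step 2.
endpoint_request = {
    "EndpointName": "xgboost-demo-endpoint",
    "EndpointConfigName": "xgboost-demo-config",
}

# sm = boto3.client("sagemaker")
# sm.create_endpoint(**endpoint_request)
# sm.get_waiter("endpoint_in_service").wait(EndpointName="xgboost-demo-endpoint")
#
# Once InService, any application can query it via the runtime client:
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="xgboost-demo-endpoint",
#     ContentType="text/csv",
#     Body=b"1.0,2.0,3.0",
# )
# prediction = response["Body"].read()
```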


















The complete deployment process can be visualized as follows:












The complete sample notebook can be seen here: