Sunday, March 1, 2020

Boosting Machine Learning Models with Explainable AI (XAI) - Insights on Airbnb listings

With a typical machine learning model, traditional correlation and feature importance analysis often have limited value. In a data scientist’s toolkit, are there reliable, systematic, model-agnostic methods that accurately measure each feature’s impact on a prediction? The answer is yes.
Here we use a model built on Airbnb data to illustrate:
  • Explainable AI (XAI) technologies
  • What XAI can do for global and local explanation
  • What XAI can do for model enhancement

XAI — a brief overview

As AI gains traction in more applications, Explainable AI (XAI) is becoming a critical component for explaining models with clarity and deploying them with confidence. XAI technologies are maturing for both machine learning and deep learning. Here are a couple of algorithm-neutral methods that are practical to use today:

SHAP

SHAP (SHapley Additive exPlanations) was developed by Scott Lundberg at the University of Washington. SHAP computes Shapley values from game theory by treating each feature value of an instance as a “player” in a game where the prediction is the payout. A prediction can then be explained by computing the contribution of each feature to it. SHAP has these desirable properties:
1. Local accuracy: the sum of the feature attributions is equal to the output of the model we are trying to explain
2. Missingness: features that are already missing have no impact
3. Consistency: changing a model so a feature has a larger impact on the model will never decrease the attribution assigned to that feature.
SHAP supports tree ensembles, deep learning and other models, and it can be used for both global and local explanation. For details, please refer to Scott Lundberg’s SHAP project.
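The local accuracy property can be checked by hand on a tiny example. The sketch below computes exact Shapley values by enumerating every coalition of features. It is a simplification, not the SHAP library's algorithm: features outside a coalition are set to a single assumed baseline listing, and the toy model, its coefficients and both listings are made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a coalition value function over n_features players."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when it joins coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy "model": a booking score from three made-up features.
def toy_model(x):
    superhost, price, entire_home = x
    return 0.2 + 0.3 * superhost - 0.001 * price + 0.1 * entire_home * superhost


baseline = (0, 100, 0)  # assumed reference listing
instance = (1, 80, 1)   # listing being explained


def value_fn(coalition):
    # Features in the coalition take the instance's value; the rest stay at baseline.
    x = tuple(instance[j] if j in coalition else baseline[j] for j in range(3))
    return toy_model(x)


phi = shapley_values(value_fn, 3)
base_value = value_fn(frozenset())
prediction = toy_model(instance)
print(phi, base_value + sum(phi), prediction)
```

The base value plus the three attributions adds up to the model output for the listing, which is what local accuracy states. The interaction term between superhost and entire home is split evenly between those two features.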

LIME

Local Interpretable Model-Agnostic Explanations (LIME) is based on the concept of surrogate models. When interpreting a black box model, LIME tests how the predictions change when the input data is perturbed, and trains local surrogate models on the perturbed samples, weighted by their proximity to the original instance. Individual predictions of “black box” models can then be explained with these local, interpretable surrogate models.
For details, please refer to the LIME paper.
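The surrogate idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the LIME library itself: it perturbs one instance with Gaussian noise, weights each sample with an exponential kernel, and fits a weighted linear model by solving the normal equations. The black box function, the noise scale and the kernel width are all assumptions chosen for the example.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def local_surrogate(black_box, instance, n_samples=500, scale=0.1,
                    kernel_width=0.25, seed=0):
    """Fit a weighted linear model around `instance`.

    Returns [intercept, coef_1, ..., coef_d] for features centred on the instance.
    """
    rng = random.Random(seed)
    d = len(instance)
    # Accumulate the weighted normal equations (X^T W X) beta = X^T W y.
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xtwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        # Perturb the instance and query the black box.
        delta = [rng.gauss(0.0, scale) for _ in range(d)]
        sample = [v + dv for v, dv in zip(instance, delta)]
        y = black_box(sample)
        # Closer samples get larger weights.
        dist2 = sum(dv * dv for dv in delta)
        w = math.exp(-dist2 / kernel_width ** 2)
        row = [1.0] + delta
        for i in range(d + 1):
            xtwy[i] += w * row[i] * y
            for j in range(d + 1):
                xtwx[i][j] += w * row[i] * row[j]
    return solve_linear_system(xtwx, xtwy)


# Hypothetical black box: non-linear in the first feature, linear in the second.
def black_box(x):
    return x[0] ** 2 + 3.0 * x[1]


coefs = local_surrogate(black_box, [1.0, 2.0])
print(coefs)
```

Around the point (1, 2) the fitted coefficients come out close to the local slopes of the black box (about 2 for the first feature and 3 for the second), even though the first feature is non-linear globally.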

Airbnb booking rate model

The model used here predicts Airbnb booking rate. It is trained with data for Los Angeles area listings, obtained from insideairbnb.com. For simplicity, I use a subset of features to train an XGBoost model.

Insight with global explanation

The SHAP summary plot shows the top feature contributions. It also shows the distribution of data points and provides visual indicators of how feature values affect predictions. Red indicates a higher feature value and blue a lower one. On the x-axis, a higher SHAP value (to the right) corresponds to a higher predicted value (the listing is more likely to get booked), and a lower SHAP value (to the left) corresponds to a lower predicted value (the listing is less likely to get booked).
Here are a few insights gained with global feature analysis:

Who are the most successful hosts?

Using dependence plots, we can examine the relationship between feature values and the predicted outcome. In the first diagram, as the number of listings a host has increases, we see a decreasing trend in SHAP values. In the second diagram, the x-axis shows the host listings count and the color shows the count of “entire home” listings.
We can probably derive these types of hosts:
  • hosts with a single or a few listings: these are individuals and families, and their listings are generally attractive, likely due to focus and personal care
  • hosts with 15–60 listings: these hosts have the least attractive listings; are they small hotel or motel type properties that rent out rooms?
  • hosts with more than 150 listings: in the second diagram, we can see that as the number of host listings increases above 50, the predicted booking rate increases substantially (reversing the earlier trend). Further, those listings are almost all “entire home”. At the top of the range, hosts with over 100 “entire home” listings achieve a booking rate of 75% and above, which is far superior to anybody else. Are those professionally managed Airbnb property companies?

Higher cleaning fee or higher price?

Given the choice, should a host charge more on the nightly price or on the cleaning fee? The dependence plot shows the feature interaction between price and cleaning fee. Red indicates a higher cleaning fee. Along the x-axis, as price increases, the predicted booking rate decreases, which is expected. Further, we see that listings with a higher cleaning fee (red dots) tend to stay above those with a lower cleaning fee (blue dots).
Therefore, a higher cleaning fee is actually favorable when it comes to predicted booking rate. A host who shifts more of the cost to the cleaning fee probably wins by encouraging guests to stay longer and by making the listing price seem cheaper on the front page.

More reviews or higher review rating?

In the diagram, we see that an increase in review rating (x-axis) is associated with a higher predicted booking rate. An increase in the number of reviews, however, does not correspond to a better booking rate (red dots are scattered vertically). A host is much better off getting a few good reviews than having a lot of mediocre reviews.

Insight for Local Explanation

A SHAP force plot can be used to explain individual predictions.
For example, we can see that there is a base value (bias term) of 0.01249. Features in red push that value to the right and features in blue push it to the left, giving a combined output of 0.58. The effect of each top feature on the prediction is therefore quantified with local accuracy. This particular listing has a number of strong feature values (superhost, low price, entire home, recent calendar updates) which make it favorable for booking.
The listing below has a number of feature values (a high price of $390, a long minimum stay of 30 days, a calendar that has not been updated for a long time) that make it less likely to be booked.
The LIME method can be used to explain individual predictions as well. It quantitatively shows the effect of the top features (orange is positive, blue is negative).
The first example shows a listing with mostly positive feature values (it is an entire home offered by a superhost, with lots of reviews).
The second example shows a listing with negative feature values (it requires a minimum stay of 30 nights and charges a cleaning fee of $150).

Insight for Model Improvement

By examining the global and local impact of features, we can often reveal unexpected patterns of data and gain new insights. With further analysis, we may find the root cause to be one of the following:
  • deficiency with business analysis
  • error with data collection
  • data processing improvement (impute and scale)
  • or, the unusual pattern is a true reflection of new knowledge to learn
The diagram below shows that there are mainly two types of listings with a high “calendar_updated” value (red dots). One group, on the leftmost side of the x-axis, has 0 reviews in the last twelve months; these are essentially stale listings with a negative SHAP value and therefore a low predicted booking rate. The other red dots are scattered in the upper area, which indicates that they have a higher SHAP value and are more likely to get booked. Those are listings that are consistently available and require few calendar updates from hosts. This provides a clue for feature engineering, with the goal of distinguishing stable listings from stale ones in the training data.
Another example shown here is a Skater Partial Dependence Plot, which shows the interaction of the latitude and longitude features, with the vertical axis indicating their effect on the prediction. Visually, this 3D diagram can be superimposed on a map of the LA area, which clearly shows the central and north areas being more popular and the south being the least likely to be booked. This insight cannot be gained by analyzing individual features.
Making corrections and adjustments and gaining new knowledge are part of the iterative model lifecycle, which should lead to incremental improvements. XAI can uncover hidden clues and provide critical evidence for that.

GitHub

A simplified version of the model and XAI code can be found here: 

What is next

Airbnb listings have rich and informative features such as images and text. Incorporating those into models can greatly enhance predictive performance. XAI with deep learning and vision should be both challenging and rewarding.

Friday, February 7, 2020

Predicting Market Rank for Airbnb Listings

Motivation

There are several reasons that make Airbnb data both challenging and rewarding to work with:
  • Unlike Kaggle, where objectives and metrics are predefined, the problem here is open-ended, and defining it is a critical data science skill - how do we identify a valuable business objective and create an analytical framework and modeling solutions around it?
  • Rich information, including structured data, text and images, to be assessed and narrated - how do we deal with having too much data and missing certain data at the same time?
  • Regular data updates from sources such as insideairbnb.com, which provide critical feedback and enable iterative optimization
  • Behind the data are places, people and their diverse cultures, to be interpreted and uncovered. Enhancing the Airbnb experience can bring enrichment to humanity, which makes it a meaningful and fascinating goal for data scientists

Why predict Market Rank instead of Price?

Airbnb listings contain a rich array of information including structured data, text and images. The convenience and updated availability of data from sources such as insideairbnb.com make them intriguing data science projects. Let’s start by building baseline models.

Model design and training target

The better a listing does, the higher its market rank should be. How do we define and obtain “market rank” information for supervised training? We are looking for data that tells us how popular a listing is. Although we don’t have actual booking information, we have availability and review data that is updated monthly. Let’s look at pros and cons of these two key metrics:
  • For the same target time range, look at how many new reviews were added and calculate a review rate
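As a rough sketch of the review rate idea, the snippet below turns two review-count snapshots into new reviews per 30 days. The normalization to 30 days and the review counts are assumptions made for illustration; the post does not give the exact formula used.

```python
from datetime import date


def review_rate(reviews_at_a, reviews_at_b, date_a, date_b, days_per_period=30):
    """New reviews per `days_per_period` days between two data snapshots."""
    elapsed_days = (date_b - date_a).days
    if elapsed_days <= 0:
        raise ValueError("date_b must be later than date_a")
    # Guard against counts that shrink, for example when reviews are removed.
    new_reviews = max(reviews_at_b - reviews_at_a, 0)
    return new_reviews * days_per_period / elapsed_days


# Hypothetical listing: 40 reviews in the first snapshot, 58 in the second.
rate = review_rate(40, 58, date(2019, 3, 6), date(2019, 12, 5))
print(round(rate, 2))
```

A rate like this is comparable across listings even when snapshots are taken at uneven intervals.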

Developing Baseline Models

For the prototype, I used data for Airbnb listings in Los Angeles, with 2019-03-06 as time A, 2019-12-05 as time B, and the 30 days following 2019-12-05 as the target booking time window. For simplicity, I used XGBoost with a minimal amount of data processing and tuning.

  • Entire home/apt is most preferred by guests
  • Simple features such as coffee machine, self check-in can boost a listing
  • Feature impacts differ between the two models, which makes them complementary to each other

Market Rank Illustrated

Here we evaluate the models using a reserved test set of listings in the Los Angeles area. After combining scores from the booking model and the review model, we obtain a city-wide ranking score (“market_rank_city”) in percentile form. For example, listing ID 8570847 is at the 98.6th percentile in terms of market competitiveness, while listing ID 14014479 is at the 6.9th percentile.
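One way to turn two model scores into a percentile rank is sketched below. The equal-weight average and the sample scores are assumptions for illustration; the post does not state how the booking and review scores were actually combined.

```python
from bisect import bisect_right


def percentile_ranks(scores):
    """Map each score to the percentage of listings scoring at or below it."""
    ordered = sorted(scores)
    n = len(ordered)
    # bisect_right counts how many scores are <= the given score.
    return [100.0 * bisect_right(ordered, s) / n for s in scores]


def market_rank(booking_scores, review_scores, booking_weight=0.5):
    """Blend the two model scores, then rank the blend within the city."""
    combined = [
        booking_weight * b + (1.0 - booking_weight) * r
        for b, r in zip(booking_scores, review_scores)
    ]
    return percentile_ranks(combined)


# Hypothetical model outputs for four listings.
booking_scores = [0.9, 0.2, 0.5, 0.7]
review_scores = [0.8, 0.1, 0.6, 0.3]
ranks = market_rank(booking_scores, review_scores)
print(ranks)
```

With this definition a higher percentile means a more competitive listing, and the best listing in the city scores 100.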

Model performance on new listings

Note that we include historical information, such as the rate at which reviews and booking activity increase, to predict future outcomes. This is not a form of data leakage. Rather, it is a true reflection of how a potential guest evaluates a listing. Consequently, the model is more neutral on new listings because many of their features are missing. This is also a true reflection of reality.

What is next

I have shared the journey of a data science project utilizing real world data, starting from a meaningful objective, through researching available data and experimenting with model combination, to promising results. The baseline models developed only scratch the surface of what is possible. Thanks to the vast amount and dynamic nature of Airbnb data, further improvements may come from more data scrubbing, feature engineering and algorithm tuning. Adding images and text information also makes for exciting exploration with deep learning.

Tuesday, August 14, 2018

Managing GPU memory constraints for gradient boosting models

Gradient boosting works by constructing hundreds of trees. Finding the best split by evaluating the gradient and hessian statistics of each feature is the most expensive and time-consuming task, and doing it in parallel on GPU has been an active research area. The following papers provide detailed descriptions of GPU implementations and benchmarks for xgboost and lightgbm.
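To make the split search concrete, here is a minimal single-threaded sketch of scanning one feature's histogram for the best split. It uses the regularised gain from the XGBoost paper with the complexity penalty gamma left out; the gradient and hessian sums per bin are made-up numbers. A GPU implementation evaluates many such candidates in parallel, which this sketch does not attempt to show.

```python
def best_split(grad_hist, hess_hist, reg_lambda=1.0):
    """Scan histogram bins left to right and return (best_gain, best_bin).

    A split at bin b sends bins 0..b to the left child and the rest to the right.
    """
    g_total, h_total = sum(grad_hist), sum(hess_hist)

    def score(g, h):
        # Structure score of a leaf holding gradient sum g and hessian sum h.
        return g * g / (h + reg_lambda)

    best_gain, best_bin = 0.0, None
    g_left = h_left = 0.0
    for b in range(len(grad_hist) - 1):
        g_left += grad_hist[b]
        h_left += hess_hist[b]
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                      - score(g_total, h_total))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin


# Hypothetical per-bin sums of gradients and hessians for one feature.
grad_hist = [-4.0, -3.0, 2.0, 5.0]
hess_hist = [2.0, 2.0, 2.0, 2.0]
gain, split_bin = best_split(grad_hist, hess_hist)
print(gain, split_bin)
```

The scan picks the boundary after the second bin, where the negative and positive gradients separate most cleanly.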

With data sizes in the range of 1 to 10 million records, I have observed GPU acceleration speeding up training by up to 5-10x while offering comparable accuracy, which is a big boost to data science work. However, the most significant constraint on the use of GPUs for machine learning is GPU memory, which prevents modeling with larger data sets.

Know how much memory your GPU has

Whether you are on AWS, Google or Azure, chances are you face the same memory constraint, because they are all based on Nvidia GPUs. The most common hardware for machine learning is the Tesla V100 with 16 GB of GPU memory, which can fit a ballpark data size of around 10 million records. Actual mileage will vary with the data set and how you tune the model; more on this later.

It should be noted that although a multi-GPU option exists, it does nothing to alleviate the per-GPU memory constraint, since the data set needs to fit on a single GPU.

Earlier this year, Nvidia announced that all new Tesla V100s come with 32 GB of memory, which doubles the amount of memory per GPU. Effectively, it should also double the size of the data set that can fit, to around 20 million records. However, cloud vendors have not confirmed a timeline, and I have not seen it available yet.

Gradient Boosting parameters affecting GPU memory footprint

With an implementation such as the GPU version of xgboost, certain parameters affect memory allocation and therefore determine whether your data set will fit on limited GPU memory (otherwise you will get a memory error). The parameters below have this effect, so you will need to budget the available memory and allocate it wisely. For example, you don't want to set parameter values unnecessarily high and cause over-allocation in certain dimensions.

Max Bin

Research in gradient boosting shows that the histogram method, which turns continuous feature values into discrete points for evaluation, can be equally effective, and it leads to a more efficient implementation on GPU. max_bin is the parameter that defines the maximum number of discrete points on which a feature is evaluated.

Reducing max_bin from the default value of 256 to 64, 32, or even 16 will reduce the GPU memory required. You can run comparison tests on smaller data sets and compare evaluation metrics to determine the impact on model performance.
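The effect of the bin count can be illustrated with a simplified quantile binning in plain Python. This is not xgboost's own sketching algorithm, only an approximation of the idea: a continuous feature is replaced by a small integer bin index, and a smaller max_bin means fewer distinct values to store and evaluate. The helper names and the sample prices are made up.

```python
from bisect import bisect_right


def quantile_cut_points(values, max_bin):
    """Pick up to max_bin - 1 cut points at evenly spaced quantiles of the data."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for k in range(1, max_bin):
        cut = ordered[min(k * n // max_bin, n - 1)]
        # Skip duplicate cut points caused by repeated values.
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


def to_bin_index(value, cuts):
    """Bin index in 0..len(cuts): the number of cut points that are <= value."""
    return bisect_right(cuts, value)


# Hypothetical nightly prices.
prices = [35, 49, 60, 75, 80, 95, 110, 150, 220, 390]
cuts = quantile_cut_points(prices, max_bin=4)
binned = [to_bin_index(p, cuts) for p in prices]
print(cuts, binned)
```

With max_bin=4 the ten prices collapse into four bins, so split finding only needs to consider three candidate thresholds for this feature.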


Max depth, Boost round, Early Stopping

These parameters determine the number of trees to build and their depth, which directly affects the amount of memory allocated, so adjusting them deserves some thought. One option is an iterative approach: start by setting the parameters low enough to fit the model, then combine this with parameter optimization and fine-tuning to push toward the boundary of maximum memory utilization.
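A quick back-of-the-envelope calculation shows why depth matters so much. The sketch below only counts the maximum number of nodes in full binary trees; it is not an estimate of actual GPU memory, which depends on the implementation, and the depths and round count are example values.

```python
def max_nodes_per_tree(max_depth):
    """A full binary tree of depth d has 2**(d + 1) - 1 nodes."""
    return 2 ** (max_depth + 1) - 1


def max_total_nodes(max_depth, num_boost_round):
    """Upper bound on nodes across all boosting rounds."""
    return num_boost_round * max_nodes_per_tree(max_depth)


for depth in (6, 10, 14):
    print(depth, max_nodes_per_tree(depth), max_total_nodes(depth, 500))
```

Each extra level of depth roughly doubles the upper bound, while the number of rounds scales it linearly. Early stopping can only lower the number of rounds actually built.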

CPU Predictor

Even with GPU-based training, you can still set the CPU to be used for prediction, which reduces the GPU memory required.

'predictor':'cpu_predictor',
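Putting the memory-related settings together, a parameter dictionary might look like the sketch below. The specific values are illustrative only, and the 'predictor' and 'gpu_hist' settings apply to the xgboost releases this post was written against; newer releases may handle device selection differently, so check the documentation for your version.

```python
# Illustrative values only; tune them against your own data and GPU.
params = {
    "tree_method": "gpu_hist",     # histogram-based tree construction on GPU
    "max_bin": 64,                 # fewer bins than the default of 256
    "max_depth": 6,                # shallower trees need less memory
    "predictor": "cpu_predictor",  # run prediction on CPU to save GPU memory
}

# A dictionary like this is what would be passed as the first argument
# to xgboost.train, together with the training data and number of rounds.
print(params)
```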



Using smaller GPU to fit larger data set



Saturday, April 21, 2018

Resolving Compiler issues with XgBoost GPU install on Amazon Linux

GPU-accelerated xgboost has shown performance improvements, especially on data sets with a large number of features, using the 'gpu_hist' tree_method. More information can be found here:

http://xgboost.readthedocs.io/en/latest/gpu/

Installation on Ubuntu-based distributions is straightforward. The best results are obtained with the latest generation of Nvidia GPUs (AWS P3).
http://xgboost.readthedocs.io/en/latest/build.html#building-with-gpu-support

However, when compiling on Amazon Linux (including the deep learning image), the following error is seen at the "cmake" step:
cmake .. -DUSE_CUDA=ON
-- The C compiler identification is GNU 7.2.1
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) (found version "1.0")
-- Found OpenMP_CXX: -fopenmp (found version "3.1") 
...

CMake Error at /home/ec2-user/anaconda3/envs/python3/share/cmake-3.9/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) (found
  version "1.0")


This can be a frustrating error to correct if you focus on troubleshooting OpenMP itself. It turns out that OpenMP support is "embedded" in the compilers (gcc and g++), so different compiler versions come with different versions of the OpenMP implementation. In this case, the error message indicates that OpenMP_C is version 1.0 while OpenMP_CXX is version 3.1. Xgboost GPU will not compile when the versions are mismatched.

The mismatch arises because Amazon Linux comes with mismatched C and C++ compiler versions:
$ gcc --version
gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)
$ g++ --version

g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
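A small check like the one below could catch this kind of mismatch before starting a build. It is a sketch that only parses the first line of each compiler's `--version` output, using the two lines shown above as sample input; the helper names are made up.

```python
import re


def parse_version(version_line):
    """Extract the first dotted version number, e.g. '7.2.1' -> (7, 2, 1)."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_line)
    if match is None:
        raise ValueError(f"no version number found in: {version_line!r}")
    return tuple(int(part) for part in match.groups())


def compilers_match(gcc_line, gxx_line):
    """True when the C and C++ compilers report the same version."""
    return parse_version(gcc_line) == parse_version(gxx_line)


# Sample input: the first lines of `gcc --version` and `g++ --version`.
gcc_line = "gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)"
gxx_line = "g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)"

print(parse_version(gcc_line), parse_version(gxx_line))
print(compilers_match(gcc_line, gxx_line))
```

On a live machine the two lines could be collected with `subprocess.run(["gcc", "--version"], capture_output=True, text=True)` and the equivalent call for g++, taking the first line of each `stdout`.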

To fix this problem, remove the default older version of cmake, remove gcc (twice, since it has two packages), and reinstall gcc and gcc-c++:
sudo yum remove cmake
sudo yum remove gcc -y
sudo yum remove gcc -y
sudo yum install gcc48 gcc48-cpp -y
sudo yum install gcc-c++

Also reinstall cmake (using your preferred method):
conda install -c anaconda cmake

Now cmake finds matching versions for xgboost GPU compile to proceed:
-- Found OpenMP_C: -fopenmp (found version "3.1")
-- Found OpenMP_CXX: -fopenmp (found version "3.1")

The lesson learned here: Amazon Linux may not come perfectly set up, so check the compiler environment before installing new software, especially when GPU and parallel processing are involved.