Friday, February 7, 2020

Predicting Market Rank for Airbnb Listings


There are several reasons that make Airbnb data challenging and rewarding to work with:
  • Unlike Kaggle competitions, where objectives and metrics are predefined, the problem here is open-ended. Defining it is a critical data science skill: how do you identify a valuable business objective and build an analytical framework and modeling solution around it?
  • The data is rich, spanning structured fields, text and images, all to be assessed and narrated. How do you deal with having too much data while still missing certain data?
  • Regular data updates provide critical feedback and enable iterative optimization
  • Behind the data are places, people and their diverse cultures, waiting to be interpreted and uncovered. Enhancing the Airbnb experience can enrich people's lives, which makes it a meaningful and fascinating goal for data scientists

Why predict Market Rank instead of Price?

Airbnb listings contain a rich array of information including structured data, text and images. The convenience and regular availability of updated data make them intriguing data science projects. Let’s start by building baseline models.

Model design and training target

The better a listing does, the higher its market rank should be. How do we define and obtain “market rank” information for supervised training? We are looking for data that tells us how desirable a listing is, or in other words, how fast it gets booked relative to peers. Although we don’t have actual booking information, we have availability and review data that is updated monthly. Let’s look at pros and cons of these two key metrics:
  • For the same target time range, look at how many reviews are added and calculate a review rate
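The review-based metric can be sketched as follows. This is a minimal illustration, not the code behind the models: the function name `review_rate`, the snapshot dates, and the review counts are all hypothetical, and the normalization to "reviews per 30 days" is an assumption.

```python
from datetime import date


def review_rate(reviews_start: int, reviews_end: int,
                start: date, end: date) -> float:
    """Reviews added per 30 days between two data snapshots (hypothetical helper)."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end snapshot must be later than start snapshot")
    # Review counts can shrink if reviews are removed; clamp at zero (assumption).
    added = max(reviews_end - reviews_start, 0)
    return added * 30 / days


# Made-up example: a listing goes from 40 to 43 reviews over a 30-day window.
rate = review_rate(40, 43, date(2019, 12, 5), date(2020, 1, 4))
print(rate)
```

Normalizing by window length keeps rates comparable when snapshots are not exactly one month apart.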

Developing Baseline Models

For the prototype, I used data for Airbnb listings in Los Angeles, with 2019-03-06 as time A, 2019-12-05 as time B, and the 30 days following 2019-12-05 as the target booking time window. For simplicity, I used XGBoost with a minimal amount of data processing and tuning.
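As a rough sketch of the time setup, the dates below come from the text, while the helper `booking_rate` and its treatment of every unavailable day as booked are assumptions made for illustration only (hosts can also block days, so this overstates bookings).

```python
from datetime import date, timedelta

TIME_A = date(2019, 3, 6)    # earlier snapshot
TIME_B = date(2019, 12, 5)   # later snapshot
TARGET_DAYS = 30             # length of the target booking window

# The target window runs from time B to time B + 30 days.
target_end = TIME_B + timedelta(days=TARGET_DAYS)


def booking_rate(available_days: int, window_days: int = TARGET_DAYS) -> float:
    """Fraction of the target window that is not available (hypothetical helper).

    Assumption: an unavailable day counts as booked.
    """
    if not 0 <= available_days <= window_days:
        raise ValueError("available_days must lie within the window")
    return (window_days - available_days) / window_days


print(target_end)        # end of the target window
print(booking_rate(6))   # 6 of 30 days still available
```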

  • Entire home/apt is most preferred by guests
  • Simple features such as a coffee machine or self check-in can boost a listing
  • Feature impacts differ between the two models, which makes them complementary to each other

Market Rank Illustrated

Here we evaluate the models using a reserved test set of listings in the Los Angeles area. After combining scores from the booking model and the review model, we obtain a city-wide ranking score (“market_rank_city”) in percentile form. For example, listing ID 8570847 ranks at the 98.6th percentile in terms of market competitiveness, while listing ID 14014479 ranks at the 6.9th percentile.
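One way to turn two model scores into a percentile rank is sketched below. How the scores are actually combined is not stated here, so the plain average is an assumption, as are the listing IDs and score values, which are invented for illustration.

```python
from bisect import bisect_right


def market_rank_city(booking_scores: dict, review_scores: dict) -> dict:
    """Percentile rank (0-100] of each listing's combined score.

    Assumptions: both models score the same listings on comparable scales,
    and the combined score is their simple average.
    """
    combined = {
        listing_id: (booking_scores[listing_id] + review_scores[listing_id]) / 2
        for listing_id in booking_scores
    }
    ordered = sorted(combined.values())
    n = len(ordered)
    # Percentile = share of listings whose combined score is <= this one.
    return {
        listing_id: 100 * bisect_right(ordered, score) / n
        for listing_id, score in combined.items()
    }


# Made-up scores for four hypothetical listings.
booking = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.7}
review = {"a": 0.8, "b": 0.1, "c": 0.6, "d": 0.3}
ranks = market_rank_city(booking, review)
print(ranks)
```

A weighted average or a rank-based blend would be equally plausible choices; the percentile step stays the same either way.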

Model performance on new listings

Note that we include historical information, such as the rate of increase in reviews and booking activity, to predict future outcomes. This is not a form of data leakage. Rather, it is a true reflection of how a potential guest evaluates a listing. Consequently, the model is more neutral on new listings because many features are missing for them. This is also a true reflection of reality.

What is next

I have shared the journey of a data science project using real-world data: starting from a meaningful objective, researching the available data, experimenting with model combination, and arriving at promising results. The baseline models developed here only scratch the surface of what is possible. Thanks to the vast amount and dynamic nature of Airbnb data, further improvements may come from more data scrubbing, feature engineering and algorithm tuning. Adding image and text information also makes for exciting exploration with deep learning.