enabling data science: NLP based query of Airbnb properties with Asymmetric Semantic Search

Motivation – Why search Airbnb reviews?

Guest ratings and reviews are the most valuable assets Airbnb has. For prospective guests, learning from experiences shared by others makes booking more predictable and dependable. However, there are often too many reviews to read one by one. More importantly, Airbnb currently supports property search only through predefined “filters”, so the rich information contained in reviews goes unused in the search process.

Why enable search specifically for reviews?

Reviews written by other guests are more subjective and trustworthy
Reviews contain rich information; their free-text form captures an unlimited variety of experiences, stories, and emotions
Untapped potential to expand the power of search, e.g., by matching properties to guest interests that no predefined filter captures

How would the search feature work?

For example, a prospective guest may be searching for a place to stay in LA. In addition to applying the map-based selection and structured filters currently supported, they could enter a search query describing the desired property, such as: “quiet and spacious place close to beach in a safe neighborhood”.

Using the query, the search feature finds relevant reviews among the current candidate listings, scores them by contextual similarity, and returns the top-ranked listings with the closest-matching reviews.

Data Processing

Raw user review data, updated monthly, is available from insideairbnb.com. To use it in a query application, we process the text so that it is ready for search. At run time, as a user searches in a particular geography and applies filters, a candidate set of listings with their review data is generated dynamically and loaded into memory to support the user query.
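The processing step might look like the minimal sketch below. It assumes the reviews file is a CSV with `listing_id` and `comments` columns and that review text may contain `<br/>` tags and HTML entities; the function names `clean_review` and `load_candidate_reviews` are hypothetical and not taken from the prototype.

```python
import csv
import html
import re


def clean_review(text):
    """Normalize one raw review: decode entities, drop <br/> tags, collapse whitespace."""
    text = html.unescape(text)
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()


def load_candidate_reviews(csv_file, candidate_listing_ids):
    """Keep only non-empty reviews that belong to the user's candidate listings.

    csv_file is any open text file (or file-like object) with a header row.
    The column names 'listing_id' and 'comments' are assumptions.
    """
    candidates = {str(listing_id) for listing_id in candidate_listing_ids}
    reviews = []
    for row in csv.DictReader(csv_file):
        if row["listing_id"] not in candidates:
            continue
        text = clean_review(row.get("comments") or "")
        if text:  # skip reviews that are empty after cleaning
            reviews.append({"listing_id": row["listing_id"], "text": text})
    return reviews
```

Filtering by listing before any embedding work keeps the in-memory set small, which is what makes per-query loading practical.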

Asymmetric Semantic Search

To support queries, we apply the concept of semantic search. Specifically, using state-of-the-art NLP transformer models, we embed each review as a vector in a shared vector space. The search then simply compares the query embedding with the review embeddings and finds the closest matches using a scoring function such as cosine similarity, thus finding reviews with high semantic overlap with the query.
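To make the scoring step concrete, here is a small worked example of cosine similarity using only the standard library. The three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embeddings (illustrative values only).
query_vec = [1.0, 1.0, 0.0]
review_vecs = {
    "review_a": [2.0, 2.0, 0.0],  # same direction as the query, larger magnitude
    "review_b": [0.0, 1.0, 1.0],  # partial overlap with the query
    "review_c": [0.0, 0.0, 3.0],  # orthogonal to the query
}

# Rank review names from most to least similar to the query.
ranked = sorted(
    review_vecs,
    key=lambda name: cosine_similarity(query_vec, review_vecs[name]),
    reverse=True,
)
```

Because cosine similarity depends only on direction, `review_a` scores highest even though its vector is twice as long as the query vector.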

In choosing the transformer model to test with, it is important to recognize that the query type in this case is “asymmetric”: users usually enter a short query, which is matched against reviews that are often longer paragraphs. We therefore use MS MARCO models, which are trained on real user search queries from the Bing search engine.
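A minimal sketch of how such a model could be wired in is shown below. It assumes the third-party `sentence-transformers` package and the `msmarco-distilbert-base-v4` checkpoint; the post does not name the exact model used, so both are assumptions. `ReviewIndex` and `load_msmarco_encoder` are hypothetical names. The encoder is passed in as a plain function so the ranking logic can be tried without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple

# An encoder maps a batch of texts to unit-length embedding vectors.
Encoder = Callable[[Sequence[str]], List[List[float]]]


def load_msmarco_encoder(model_name: str = "msmarco-distilbert-base-v4") -> Encoder:
    """Build an encoder from sentence-transformers (assumed to be installed)."""
    from sentence_transformers import SentenceTransformer  # lazy third-party import

    model = SentenceTransformer(model_name)

    def encode(texts: Sequence[str]) -> List[List[float]]:
        # Normalized embeddings make the dot product equal to cosine similarity.
        return model.encode(list(texts), normalize_embeddings=True).tolist()

    return encode


class ReviewIndex:
    """Embeds long reviews once, then answers short queries against them."""

    def __init__(self, reviews: Sequence[str], encode: Encoder):
        self.reviews = list(reviews)
        self.encode = encode
        self.vectors = encode(self.reviews)  # computed once, reused per query

    def search(self, query: str, top_k: int = 3) -> List[Tuple[float, str]]:
        query_vec = self.encode([query])[0]
        scored = [
            (sum(q * r for q, r in zip(query_vec, vec)), text)
            for vec, text in zip(self.vectors, self.reviews)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]
```

With the package installed, usage would be along the lines of `ReviewIndex(review_texts, load_msmarco_encoder()).search("quiet and spacious place close to beach in a safe neighborhood")`.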

System Design

As illustrated, a simple prototype design consists of:

Selected listings – user preselection (filter) of prospective listings
Candidate reviews – corresponding reviews for the selected listings
Review embeddings – use a transformer to generate sentence embeddings for the candidate reviews
User query – perform an asymmetric semantic search against the review embeddings
Result – select the top K matches and present the corresponding listings to the user
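The final step, turning scored reviews into a ranked list of listings, could be sketched as follows. The post does not say how review scores are rolled up per listing; using each listing's single best-matching review is an assumption, and `rank_listings` is a hypothetical name.

```python
def rank_listings(scored_reviews, top_k=3):
    """Return the top_k listings, each with its best-matching review.

    scored_reviews: iterable of (listing_id, review_text, score) tuples.
    Assumption: a listing's score is the score of its best-matching review.
    """
    best = {}
    for listing_id, text, score in scored_reviews:
        if listing_id not in best or score > best[listing_id][0]:
            best[listing_id] = (score, text)
    # Sort listings by their best review score, highest first.
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(listing_id, score, text) for listing_id, (score, text) in ranked[:top_k]]
```

Other roll-ups, such as averaging the top few review scores per listing, would reward consistency across reviews rather than a single strong match.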

Sample Result

Here are some of the top reviews found to semantically match the sample query above. Since reviews don’t change, they are pre-transformed to vector form for the search. As a result, the query is fast even with tens of thousands of candidate reviews, and the results are highly relevant. Clearly, this represents a powerful new search capability that is currently unavailable.

Going Beyond Search

While Airbnb is used here as an illustration, the approach of combining unstructured search, such as NLP semantic search, with structured feature search is generally applicable. Imagine a few scenarios:

Amazon – enabling users to search products via user reviews: “the best portable fishing pole others have taken on a flight to Alaska”
Netflix – voice-enabled search based on user experience: “looking for a Halloween movie for a 3-year-old that is scary but fun”

Going beyond search and query applications, here are a few additional business ideas to explore:

Summary and Highlights – Continued advancement in text summarization models (think GPT-2) makes it highly feasible to extract the most critical information from large bodies of text. For an Airbnb listing, that could mean condensing dozens of reviews into a highly concise title and summary describing the property. Unlike owner-provided descriptions, it would contain subjective opinions from real user experience, both pros and cons, which are highly valuable in helping users book.

Feature Extraction – Currently Airbnb has a structured and predefined set of amenities presented as a checklist. However, each property is unique, and NLP models can uncover unique features that have been noted by guests. Imagine a property-specific feature area in addition to the standard list of amenities. For example: “beach chairs”, “children’s splash pool”, “watching sunset”, “hear ocean wave”, “farm animal”. Since these features are not predefined, there is no limit to what can be uncovered. The additional richness of information would create value for both hosts and guests.

Recommendation – Airbnb has access to guests’ information and trip history. With guests’ text query history added as the search feature becomes available, a much more detailed and personal profile can be developed. Intrinsic and “soft” features are best uncovered with text and deep learning models. Clearly, the next generation of recommendations must go beyond rigid filters and be built on “personality” and family profiles.

Prototype code can be found on GitHub. I look forward to testing and applying more AI technology to everyday living in innovative ways.

Thursday, November 3, 2022