Tuesday, December 27, 2022

An Iterative AI-Based Approach to Pediatric Self-Screening Diagnostics

Opportunity for AI

It is well known that children get sick more often than adults as a natural part of developing their own immune systems. Consequently, they need diagnostics much more frequently, which puts the burden on parents. According to data from the National Health Interview Survey, in 2019 more than one in four children (26.4%) had one or more visits to an urgent care center or retail health clinic in the past 12 months.

Unnecessary visits to the ER represent an enormous waste of valuable medical resources, not to mention the economic and societal cost and the impact on the well-being of each family. According to a multi-year analysis of children’s ER visits in hospitals in Italy, 75.8% of the visits were unnecessary. Most of the ER visits resulted from an independent decision by the parents (97.2%), especially in the evening and at night on Saturdays/Sundays/holidays (69.7%).

A distinguishing characteristic of children’s diagnostics is the high percentage of cases resulting from common illnesses. In the above study, the most common trigger for parents’ decision to visit the ER was fever (51.4%).

The opportunity is ripe for an AI system that can identify, with a certain degree of accuracy, the most common illnesses that do not require an ER visit. There is no need to provide diagnostics for complex diseases. That is a fundamental consideration when it comes to collecting data for, designing, and building such a system.

Proposed Design

Illustrated below is a proposed design which is based on the following principles:

  • Simulating the knowledge and iterative diagnostic process of a physician
  • Utilizing a combination of AI technologies, including NLP and vision models
  • Focusing on a well-defined target outcome (diagnosis of common childhood illnesses)

Here are highlights of how the system works:

  • Allow the user to start with a free-text description of the symptoms and illness
  • With a predefined list of symptoms, use a transformer model to do Reverse Asymmetric Semantic Search against the user query, resulting in a list of matching symptoms (binary features)
  • With well-defined symptom-disease data (see next section), use a classification model to predict a list of candidate diseases based on the symptoms
  • Based on the candidates, identify additional information to probe the user for, and iteratively predict and probe until a threshold is met
  • Incorporate additional models into the system (image, video)
  • Arrive at a diagnosis (or no diagnosis), and present the result and recommendation to the user
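
The iterative predict/probe step can be sketched roughly as follows. This is only an illustration, not the prototype's code: the toy disease table, the scoring function standing in for the trained classifier, the question-selection rule, and the names score_candidates, next_question and diagnose are all assumptions made for this example.

```python
# Hypothetical toy knowledge base: disease -> typical symptoms.
DISEASE_SYMPTOMS = {
    "common cold": {"runny nose", "cough", "sore throat"},
    "flu": {"fever", "cough", "body aches"},
    "ear infection": {"fever", "ear pain"},
}

def score_candidates(present, absent):
    """Stand-in for the trained classifier: normalized score per disease."""
    raw = {}
    for disease, symptoms in DISEASE_SYMPTOMS.items():
        hits = len(symptoms & present)
        misses = len(symptoms & absent)
        raw[disease] = max(hits - misses, 0) + 0.01  # small floor avoids a zero total
    total = sum(raw.values())
    return {disease: value / total for disease, value in raw.items()}

def next_question(scores, present, absent):
    """Pick the first unasked symptom of the highest-ranked candidate that still has one."""
    asked = present | absent
    for disease in sorted(scores, key=scores.get, reverse=True):
        remaining = sorted(DISEASE_SYMPTOMS[disease] - asked)
        if remaining:
            return remaining[0]
    return None

def diagnose(initial_symptoms, answer_fn, threshold=0.8, max_questions=5):
    """Predict, probe, and repeat until the threshold is met or questions run out."""
    present, absent = set(initial_symptoms), set()
    for asked in range(max_questions + 1):
        scores = score_candidates(present, absent)
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            return best, scores[best]
        symptom = next_question(scores, present, absent)
        if symptom is None or asked == max_questions:
            break
        if answer_fn(symptom):
            present.add(symptom)
        else:
            absent.add(symptom)
    return None, scores[best]  # no diagnosis reached

answers = {"fever", "ear pain"}  # hypothetical user who confirms only these symptoms
result = diagnose({"fever"}, lambda symptom: symptom in answers)
```

In a fuller system, the scoring function would be replaced by the symptom-disease classifier, and the question-selection rule could favor whichever symptom best separates the leading candidates.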


Prototype - Proof of Concept

Illustrated below is a prototype built to demonstrate the concept and how different AI components work together to form a more sophisticated system.

The prototype is built around two machine learning models. An NLP model turns free-text input into symptom features, which are then fed into a symptom-disease predictor to get a diagnosis.

More details are provided in the following sections.

Sample Disease/symptoms Data

The sample dataset used is shown below. Each patient case is diagnosed with a disease and lists the observed symptoms.

To prepare data for training, we encode each patient case, with disease as the prediction target, and symptoms as encoded features.
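
As a minimal sketch of this encoding step, each case can be turned into a row of binary symptom features plus a disease label. The three cases below are made up for illustration and are not the sample dataset, and encode_cases is a hypothetical helper name.

```python
# Hypothetical sample cases; the real dataset is not reproduced here.
cases = [
    {"disease": "flu", "symptoms": ["fever", "cough", "body aches"]},
    {"disease": "common cold", "symptoms": ["cough", "runny nose"]},
    {"disease": "ear infection", "symptoms": ["fever", "ear pain"]},
]

def encode_cases(cases):
    """Return (symptom vocabulary, binary feature rows, disease labels)."""
    vocabulary = sorted({s for case in cases for s in case["symptoms"]})
    features = [
        [1 if symptom in case["symptoms"] else 0 for symptom in vocabulary]
        for case in cases
    ]
    labels = [case["disease"] for case in cases]
    return vocabulary, features, labels

vocabulary, features, labels = encode_cases(cases)
```

Each column corresponds to one symptom in the vocabulary, so the same column order has to be reused when encoding a new query case.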

Predictor model

We can use gradient boosting to train a classification model that predicts disease based on symptoms. Here is the outcome of a LightGBM model, showing both the overall accuracy and the accuracy per disease.
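
For reference, here is one way the two kinds of accuracy could be computed from true and predicted labels. The labels are invented for illustration and are not the prototype's results. "Per-disease accuracy" is taken here to mean the fraction of each disease's cases that were predicted correctly, which is an assumption about how the figure was produced.

```python
from collections import Counter

def accuracy_report(y_true, y_pred):
    """Return overall accuracy and the fraction of each disease's cases predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    overall = sum(correct.values()) / len(y_true)
    per_disease = {disease: correct[disease] / n for disease, n in totals.items()}
    return overall, per_disease

# Made-up labels, only to exercise the function.
y_true = ["flu", "flu", "common cold", "ear infection"]
y_pred = ["flu", "common cold", "common cold", "ear infection"]
overall, per_disease = accuracy_report(y_true, y_pred)
```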

User Query

With this component, we use a transformer model to vectorize all the symptoms and the incoming natural-language query. By performing asymmetric semantic search of the query against each symptom, we get a list of “activated”, or matched, symptoms.

We apply a chosen threshold to the matching scores above to generate the symptom features for the query case. We are now ready to make a prediction using the previously trained disease predictor model.
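
The thresholding step might look like the sketch below. The three-dimensional vectors are toy stand-ins for real transformer embeddings, cosine similarity is assumed as the matching score, and the 0.5 threshold and the name activate_symptoms are illustrative choices rather than values from the prototype.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def activate_symptoms(query_vector, symptom_vectors, threshold=0.5):
    """Mark a symptom 1 when its similarity to the query meets the threshold, else 0."""
    return {
        name: int(cosine(query_vector, vector) >= threshold)
        for name, vector in symptom_vectors.items()
    }

# Toy vectors standing in for transformer embeddings (assumption).
symptom_vectors = {"fever": [1.0, 0.0, 0.0], "cough": [0.0, 1.0, 0.0], "rash": [0.0, 0.0, 1.0]}
query_vector = [0.9, 0.6, 0.0]
symptom_features = activate_symptoms(query_vector, symptom_vectors)
```

Raising the threshold makes the match stricter: with these toy vectors, a threshold of 0.7 would keep "fever" but drop "cough".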

We then use the previously trained classification model to predict the disease:

Finally, we combine the prediction with other data to return complete diagnostic information to the user.


Sample code for prototype can be found here: https://github.com/seanxwang/pediatric_self_diagnosis


Reducing unnecessary ER visits by even a small percentage would translate to millions of visits avoided and potentially billions in economic value. An AI-based pediatric self-diagnostic system is both in demand and viable, and a simple prototype demonstrates its feasibility. When implementing a more complete version of the proposed system, the “toy” data can be replaced with a professional-quality data source such as PedAM. While there is more work and user evaluation to be done to bring such a system to a “production-ready” state, its potential to serve consumers and society is significant.

(This article is based on work with the Professional Master's Program at the University of Washington Computer Science and Engineering Department)

Thursday, November 3, 2022

NLP based query of Airbnb properties with Asymmetric Semantic Search

Motivation – Why search Airbnb reviews?

Guest ratings and reviews are the most valuable assets Airbnb has. For prospective guests, learning from the experience shared by others makes booking more predictable and dependable. However, there are often too many reviews to read one by one. More importantly, Airbnb currently only supports property search based on predefined “filters”, so the rich information contained in reviews cannot be fully utilized in the search process.

Why enable search specifically for reviews?

  • Reviews written by other guests are more subjective and trustworthy
  • Reviews contain rich information; their free-text form captures an unlimited variety of experiences, stories, and emotions
  • Untapped potential to expand the power of search, e.g., by matching guests’ interests that are not covered by predefined filters with properties

How would the search feature work?

For example, a prospective guest may be searching for a place to stay in LA. In addition to applying the map-based selection and structured-feature filtering currently supported, the guest could enter a search query describing the desired property, such as: “quiet and spacious place close to beach in a safe neighborhood”.

Using the query, the search feature finds relevant reviews among the current candidate listings, scores them by contextual similarity, and returns the top-ranked listings with the closest-matching reviews.

Data Processing

From insideairbnb.com, we can get raw user review data that is updated monthly. To use it with a query application, we apply data processing to get the text data ready for search. At run time, as the user searches in a particular geography and applies filters, a set of candidate listings with their review data is dynamically generated and can be loaded into memory to support user queries.

Asymmetric Semantic Search

To support queries, we apply the concept of semantic search. Specifically, utilizing state-of-the-art NLP transformer models, we can embed reviews as vectors in a shared vector space. The subsequent search simply compares the query embedding with the review embeddings and finds the closest matches with a scoring function such as cosine similarity, thus finding reviews with a high semantic overlap with the query.
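
To make the scoring concrete, here is a small sketch of a cosine-similarity search over precomputed review vectors. The vectors are toy values standing in for transformer embeddings, and build_index and search are hypothetical helper names. Normalizing the review vectors once up front means the query-time score reduces to a dot product.

```python
import heapq
import math

def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def build_index(review_vectors):
    """Normalize review embeddings once, ahead of query time."""
    return {review_id: normalize(v) for review_id, v in review_vectors.items()}

def search(index, query_vector, top_k=2):
    """Return the top_k (review_id, cosine score) pairs, best first."""
    q = normalize(query_vector)
    scored = (
        (sum(a * b for a, b in zip(q, vector)), review_id)
        for review_id, vector in index.items()
    )
    return [(review_id, score) for score, review_id in heapq.nlargest(top_k, scored)]

# Toy vectors standing in for review and query embeddings (assumption).
review_vectors = {"r1": [0.9, 0.1, 0.0], "r2": [0.1, 0.9, 0.1], "r3": [0.7, 0.6, 0.0]}
index = build_index(review_vectors)
results = search(index, [1.0, 0.2, 0.0])
```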

In choosing the transformer model to test with, it is important to recognize that the type of query in this case is “asymmetric”: users usually enter a short query, which is matched against reviews that are often longer paragraphs. We therefore utilize MS MARCO models, which are trained on real user search queries from the Bing search engine.

System Design

As illustrated, a simple prototype design consists of:

  • Selected listings – user preselection (filter) of prospective listings
  • Candidate reviews – corresponding reviews for the selected listings
  • Review embeddings – use a transformer to generate sentence embeddings for the candidate reviews
  • User query – perform asymmetric semantic search against the review embeddings
  • Result – select the top K matches and present the corresponding listings to the user
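
The final step, going from scored reviews to listings, could be sketched as below. Keeping only the best-scoring review per listing is an assumption about how matches are grouped, and the listing IDs, review texts, scores, and the name top_listings are invented for illustration.

```python
def top_listings(scored_reviews, top_k=3):
    """Keep each listing's best-scoring review, then return the top_k listings.

    scored_reviews: iterable of (listing_id, review_text, score) tuples.
    """
    best = {}
    for listing_id, text, score in scored_reviews:
        if listing_id not in best or score > best[listing_id][1]:
            best[listing_id] = (text, score)
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
    return [(listing_id, text, score) for listing_id, (text, score) in ranked[:top_k]]

# Made-up scored reviews, as they might come out of the semantic search step.
scored_reviews = [
    ("A", "Quiet street, short walk to the beach.", 0.82),
    ("A", "Host was friendly.", 0.31),
    ("B", "Spacious and in a safe area.", 0.77),
    ("C", "Noisy at night.", 0.12),
]
matches = top_listings(scored_reviews, top_k=2)
```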


Sample Result

Here are some of the top reviews found to semantically match the sample query above. Since reviews don’t change, they are pre-transformed to vector form for the search. As a result, the query is fast even with tens of thousands of candidate reviews, and the results are highly relevant. Clearly, this represents a powerful new search capability that is currently unavailable.

Going Beyond Search

While Airbnb is used here as an illustrative example, the approach of combining unstructured search, such as NLP semantic search, with structured feature search is generally applicable. Imagine a few scenarios:

  • Amazon – enabling users to search for products via user reviews: “the best portable fishing pole others have taken on a flight to Alaska”
  • Netflix – voice-enabled search based on user experience: “looking for a Halloween movie for a 3-year-old that is scary but fun”

Going beyond search and query applications, here are a few additional business ideas to explore:


Summary and Highlights – Continued advancement in text summarization models (think GPT-2) makes it highly feasible to extract the most critical information from large bodies of text. For an Airbnb listing, that could mean condensing dozens of reviews into a highly concise title and subject that describe the property. Unlike owner-provided descriptions, it will contain subjective opinions from real user experience, both pros and cons, which is highly valuable in assisting users with booking.


Feature Extraction – Currently Airbnb has a structured and predefined set of amenities presented as a checklist. However, each property is unique, and NLP models can uncover unique features that have been noted by guests. Imagine a property-specific feature area in addition to the standard list of amenities. For example: “beach chairs”, “children’s splash pool”, “watching sunset”, “hear ocean wave”, “farm animal”. Since the list is not predefined, there is no limit to what can be uncovered. The additional richness of information would create value for both hosts and guests.


Recommendation – Airbnb has access to each guest’s information and trip history. Once the feature becomes available, it will also have the guest’s text query history, from which a much more detailed and personal profile can be developed. Intrinsic and “soft” features are best uncovered with text and deep learning models. Clearly, the next generation of recommendations must go beyond rigid filters and be built on “personality” and family profiles.


Prototype code can be found on GitHub. I look forward to testing and applying more AI technology to everyday living in innovative ways.