The Answers algorithm is constantly improving, in ways big and small. A few of the highlights from the Orion release include:
- More Direct Answers
- Better People Name Searching
- Improvements to NLP Filtering
- Healthcare-Specific Lemmatization
More Direct Answers
We’ve updated the Answers algorithm to show more direct answers by using BERT. Previously, we used a heuristic to decide whether to show a direct answer: a query had to contain the name of a field (like “birthday” or “price”) and resolve to only one entity.
But this didn’t work for queries like “how many calories in tomato soup” when there were multiple types of tomato soup, or “what is amy gillespie’s phone number” when there were multiple people named “amy” in your dataset.
Now, instead of requiring that the search resolve to only one entity, we trained BERT to detect whether or not the query is “direct-answerable”. For example, “where did max go to college” is a “direct-answerable” query - the user is looking for a highly specific factoid, and our algorithm can now detect this dynamically. This allows us to show direct answers in more situations, even when there are multiple entities to choose between - as long as the model is confident that the user is seeking a direct answer.
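To make the difference concrete, here is a minimal sketch of the two decision rules in Python. It is an illustration under stated assumptions, not the actual Answers implementation: the field names, function names, and the `toy_classifier` stand-in for the trained BERT model are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical field names for illustration; not the real Answers schema.
FIELD_NAMES = {"birthday", "price", "phone number", "calories"}


def legacy_should_show_direct_answer(query: str, matched_entities: Sequence[str]) -> bool:
    """Old heuristic: the query names a field AND resolves to exactly one entity."""
    q = query.lower()
    mentions_field = any(name in q for name in FIELD_NAMES)
    return mentions_field and len(matched_entities) == 1


def should_show_direct_answer(
    query: str,
    matched_entities: Sequence[str],
    is_direct_answerable: Callable[[str], bool],
) -> bool:
    """New rule (sketch): any number of matched entities is fine, as long as
    the classifier judges the query to be direct-answerable."""
    return len(matched_entities) > 0 and is_direct_answerable(query)


def toy_classifier(query: str) -> bool:
    """Stand-in for the trained BERT model (assumption): treats queries that
    start with a question word as direct-answerable."""
    words = query.lower().split()
    return bool(words) and words[0] in {"what", "where", "when", "who", "how"}
```

With two matching entities, for example `["Tomato Soup", "Creamy Tomato Soup"]`, the legacy rule rejects “how many calories in tomato soup”, while the classifier-based rule accepts it.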
Better People Name Searching
We’ve improved the way Answers handles searching for people’s names. A big challenge here is that many people’s names are also place names, so Answers uses an algorithm called named entity recognition (NER) to disambiguate them. Here’s an example:
The NER algorithm uses context to understand that “jones” refers to a person in one example and to a location in the other.
The challenge arises with single-word queries, where there isn’t enough context to know whether the user is talking about a person or a place. In these cases, Answers used to default to location searches, but user testing showed that this was a bad policy, since it too often returned faraway locations with few or no results.
Now, the default assumption for single-word queries is that the user is looking for a person, so the algorithm first checks if a person of that name exists before defaulting to a location search. If no person with that name exists, we continue to fall back to a location search.
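The person-first fallback for single-word queries can be sketched as follows. This is a simplified illustration, assuming people are available as a list of full-name strings; the function name and the plain token match are hypothetical and stand in for whatever lookup Answers actually performs.

```python
from typing import Iterable


def interpret_single_word_query(query: str, people_names: Iterable[str]) -> str:
    """Return "person" if any known person has the query as part of their
    name; otherwise fall back to "location"."""
    term = query.strip().lower()
    for name in people_names:
        # Compare against each token of the name (first name, last name, ...).
        if term in name.lower().split():
            return "person"
    # No person with that name exists: fall back to a location search.
    return "location"
```

Given `["Amy Jones"]`, the query “jones” is treated as a person search. Given `["Amy Gillespie"]`, the same query falls back to a location search.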
Improvements to NLP Filtering
We’ve also made a number of updates to the way that we infer NLP filters. A few include:
- Dynamic Typo Tolerance: Changing the number of accepted typos based on the length of the filter. The longer the filter, the more typo-tolerant the algorithm should be. In most cases, this will mean less typo tolerance, so that words like “cancer” and “cancel” are not mistakenly matched.
- Improved Lemmatization: We’ve improved the way we do lemmatization for filter matching as well, which works hand-in-hand with more conservative typo tolerance. This ensures that words like “cancel” and “cancer” aren’t matched, but words like “hospital” and “hospitals” are.
- Reducing Stop Words Importance: We’ve reduced the amount of weight that we give to non-salient stop words, like “with” or “the”, in filter matching, ensuring fewer false positives.
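As a rough illustration of the first and third items, the sketch below scales the number of accepted typos with the length of the filter value and down-weights stop words when scoring token overlap. The thresholds, weights, stop-word list, and function names are assumptions chosen for the example, not the values Answers uses.

```python
STOP_WORDS = {"with", "the", "a", "an", "of", "in"}  # hypothetical list
STOP_WORD_WEIGHT = 0.1  # hypothetical weight for stop words


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def max_typos(filter_length: int) -> int:
    """Longer filters tolerate more typos (thresholds are illustrative)."""
    if filter_length < 8:
        return 0
    if filter_length < 12:
        return 1
    return 2


def matches_filter(query_term: str, filter_value: str) -> bool:
    """True if the query term is within the allowed typo budget of the filter."""
    q, f = query_term.lower(), filter_value.lower()
    return edit_distance(q, f) <= max_typos(len(f))


def weighted_overlap(query: str, filter_value: str) -> float:
    """Share of the filter's token weight found in the query, with stop
    words counting for much less than salient words."""
    query_tokens = set(query.lower().split())
    filter_tokens = filter_value.lower().split()

    def weight(token: str) -> float:
        return STOP_WORD_WEIGHT if token in STOP_WORDS else 1.0

    total = sum(weight(t) for t in filter_tokens)
    matched = sum(weight(t) for t in filter_tokens if t in query_tokens)
    return matched / total if total else 0.0
```

Under these illustrative thresholds, “cancel” does not match the six-letter filter “cancer”, while the misspelling “cardiolgy” still matches the longer filter “cardiology”. A query that shares only “the” with a filter scores close to zero.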
Healthcare-Specific Lemmatization
“Lemmatization” is the process of resolving words to their root forms - like understanding that “runner” and “running” are different forms of the same word. Answers takes a fairly conservative approach to lemmatization, because too much lemmatization can lead to a lot of false positives.
There are plenty of standard stemming / lemmatization libraries that we have experimented with for Answers, but none of them work well for healthcare. Specifically, it’s critical that words like “cardiology” and “cardiologist” are related, but most standard libraries don’t handle this case.
To solve this problem, we’ve built our own custom lemmatizer that is healthcare-specific. This gives us full control over the exact lemmatization we perform and allows us to optimize it for healthcare-related topics. Overall, this is a much more scalable approach than using synonyms to solve issues like this. Going forward, Answers will use custom lemmatization for healthcare-related terms to ensure that all the different forms of medical terms are understood.
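One way to picture a conservative, domain-specific lemmatizer is as a small table of suffix rewrite rules. The sketch below is only an illustration of that idea: the rule table and the function name are hypothetical, and the real Answers lemmatizer may work quite differently.

```python
# Hypothetical suffix rules, checked in order: (suffix, replacement).
# Longer suffixes come before the shorter suffixes they contain.
HEALTHCARE_SUFFIX_RULES = [
    ("ologists", "ology"),
    ("ologist", "ology"),
    ("ologies", "ology"),
    ("iatricians", "iatrics"),
    ("iatrician", "iatrics"),
    ("iatric", "iatrics"),
]


def lemmatize_healthcare_term(word: str) -> str:
    """Map a healthcare term to a shared root form; leave other words as-is."""
    w = word.lower()
    for suffix, replacement in HEALTHCARE_SUFFIX_RULES:
        if w.endswith(suffix):
            return w[: -len(suffix)] + replacement
    # Conservative default: no rule matched, so do not alter the word.
    return w
```

Here “cardiologist” and “cardiology” both resolve to “cardiology”, while unrelated words such as “cancer” and “cancel” are left untouched and so stay distinct. Because a rule covers a whole family of terms, it avoids listing each pair as a synonym.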