A very broad Search question, but maybe someone has an answer or can point me in the right direction with this one:
Where can I find more detailed info on how Search handles combinations of words in search queries that could be treated as either single tokens or phrases?
I know we can define experience-specific Custom Phrases (for text search, phrase match, and NLP filters), but this is more about generic, non-brand-specific queries.
A client tried to find out about MIME types, typed 'mime type' into the yext.com search, and got back a lot of results matching only the 'type' token (e.g. results about entity types).
They would like to understand better how Search decides whether something is a token or a phrase, and whether the behavior differs between languages (German, for example, has many long compound words that more readily function as single tokens).
Does this all depend on which known/trained phrases are stored in our data models for a specific language?
Is there any way for users to tell Search that they want to enter a phrase instead of single tokens (e.g. via syntax like on Google, by putting the phrase in quotation marks)? If not, is this planned for the future?
Hi Hauke, all good questions! Let me go through them one by one.
The search you provided, 'mime type', would be treated as two tokens by Search. The reason you're only seeing results for 'type' and not 'mime' is almost certainly that there was data matching 'type' but no data matching 'mime' anywhere on yext.com.
Search tokenizes the entire query by default, so a query is broken down into its individual tokens. The exception, as you correctly pointed out, is if you configure a phrase query. Consider 'big mac' as an example: as a standard query, it is tokenized into 'big' and 'mac'. If it's configured as a phrase query, Search evaluates 'big mac' as a single unit.
Tokenization and the rules we use are definitely language dependent. One rule we use to split tokens is identifying whitespace, but Japanese, for example, doesn't use whitespace, so we need a special tokenizer for it. As you mentioned, German has its own intricacies and is a language we provide first-class support for, so we have a German-specific tokenizer. For languages we don't have first-class support for (i.e. anything other than English, Spanish, German, French, Italian, or Japanese), we use a standard tokenizer.
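As a rough mental model of that dispatch (the tokenizer names and logic here are my own illustration, not real internals):

```python
# Languages with first-class support, per the list above
FIRST_CLASS_LANGS = {"en", "es", "de", "fr", "it", "ja"}

def pick_tokenizer(lang: str) -> str:
    """Return an (illustrative) tokenizer name for a language code."""
    if lang not in FIRST_CLASS_LANGS:
        return "standard"            # generic fallback tokenizer
    if lang == "ja":
        return "japanese-segmenter"  # no whitespace between words
    if lang == "de":
        return "german"              # long compound words need special handling
    return "default-whitespace"      # en, es, fr, it

print(pick_tokenizer("de"))  # -> german
print(pick_tokenizer("pt"))  # -> standard
```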
We definitely have some additional phrase configuration syntax planned for the future. In particular, we want to support the exact match “” syntax that you called out - we will be sure to update you when that’s ready.
Thanks so much for your reply!
One follow-up question on the tokenization in general just to clarify:
Is there a library of generic phrases that works for all clients (in a given language)? Or do queries all get broken down wherever there's a whitespace? And if there is a library, can we see these phrases somewhere in the debug overview (search logs)?
Use case: Known phrases in a language should work without having to add a lot of custom phrases manually.
And one question on the language-specific tokenizers:
Is there any way to have insight into which words get split into tokens? Do they show up as individual tokens in the debug overview?
Curious to know more about the details, as we are currently QAing search quality for a couple of German Search experiences and are looking for the best ways to improve it and provide internal feedback.
Hi Hauke, thanks for the follow up.
To your first point, tokenization is just a way of segmenting text into individual "units". It isn't necessarily used for known-phrase recognition or matching. As an example, the phrase "new york" would be tokenized into "new" and "york". Tokenization happens based on a set of defined rules and patterns (whitespace, special characters, etc.), rather than a pre-generated library of phrases.
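A minimal rule-based tokenizer sketch, just to illustrate "rules and patterns, not a phrase library" (this is an assumption for demonstration, not the real tokenizer):

```python
import re

def tokenize(text: str) -> list[str]:
    # Runs of word characters are tokens; whitespace and
    # special characters act as boundaries.
    return re.findall(r"\w+", text.lower())

print(tokenize("new york"))   # -> ['new', 'york']
print(tokenize("mime-type"))  # -> ['mime', 'type'] (hyphen is a boundary too)
```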
To your second point, for all tokenizers, the search log debug view does indeed give you insight into which words get split into tokens, in the "Search Factors" section.
Let me know if this helps answer your question!
Hi Chris, thanks for the input!
If I understand this correctly, your 'big mac' example and the 'mime type' example would always be split into individual tokens. Does that mean a search would bring up results that match only one of the tokens, unless a custom phrase or query rule has been manually added?
And how does this relate to known-phrase recognition and matching? Is this applied on top, or does it only come in with the NLP filter (or a specific algorithm)? Can clients get any insight into what the known phrases are? And does this show up in the search log debug as well?
All great questions! Yes, all queries are always split into individual tokens, unless configured as a custom phrase, in which case the phrase is treated as a single unit, or 'token'.
To your first question, usually there needs to be a token match between the query and a searchable field, but that isn't always the case. For instance, if you configured semantic text search, results will often be returned that don't include a query token but are semantically similar. You're correct about query rules, and a custom phrase is treated as one big token.
This relates to phrase matching (available via the phraseMatch searchable field option) in that the query tokens (or custom phrase) must match the entirety of the field value configured with phraseMatch. So if the name field is set to phraseMatch, the entity's name is 'Nike Dunk Low', and the user searches for 'Nike Dunk', the result will not be returned.
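In other words, the constraint behaves roughly like this (a simplified sketch under the assumption that normalization is just lowercasing and trimming; the real comparison may normalize more):

```python
def phrase_match(query: str, field_value: str) -> bool:
    # The query must match the ENTIRE field value,
    # not just a prefix or substring of it.
    return query.lower().strip() == field_value.lower().strip()

print(phrase_match("Nike Dunk", "Nike Dunk Low"))      # False: partial match
print(phrase_match("Nike Dunk Low", "Nike Dunk Low"))  # True: full match
```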
By known-phrase recognition I'm assuming you're referring to named entity recognition (NER)? Currently, I believe this is only used in English for NLP filtering on builtin.location, where the model identifies key location objects in a given search query. I don't think the named entities are publicly available, but you can demo the model here. You can see the NLP filter value applied on builtin.location in both the Search Log and the API response (in the appliedQueryFilters object).
Hi Chris, thanks so much for the insights - this helps a lot!
Especially the info on NER. Looking forward to seeing what comes with the next product releases!