Many of the document titles on our site refer to parameter values that have units (e.g., voltage, current, or power). In a title these are written as, for example, “5 V”, “1 A”, or “100 W”. Tokenizing those is problematic, since “5”, “1”, or “100”, without any additional context, could mean virtually anything. I have 538 documents as of today, with a myriad of combinations of these terms in the titles. Bottom line is I have to figure out how to make a query like “5 V 1 A charger design” return the documents that say “5 V” AND “1 A” AND “charger” in the title.
It seems to me there are a number of ways to go about this, but so far I haven’t come up with one that works well enough that I’d consider the problem fixed. Here’s what I’ve considered:
- Custom Phrases - I could, in theory, make every one of those value/unit pairs a custom token. That doesn’t seem practical, though, as there would be hundreds of them (1 V, 2 V, etc.) and even then it wouldn’t work for a search containing something like “3.3 V” (unless of course that was a custom phrase as well).
- Stop Words - One of the problems is that “5 W”, for example, gets tokenized into “5” and “W”, which generates matches on pretty much every document that has a stand-alone “W” in the title, even if the context is “100 W”. Not helpful. Making “V”, “A”, “W”, etc. into stop words prevents that, but has other implications.
- Synonyms - This helps if somebody searches for “volts”, as I can make a synonym with “v”. But it doesn’t really help with the problem of tokenizing the value/unit pair.
- Query Rules - I can’t see how these would help. While I could create regular expressions to match the input, it would be impractical to create enough of them to make a difference, and the number of filters needed to generate the “correct” results would be just as large, if not larger (assuming I could even figure out how to build them, since regex isn’t allowed in filters).
- “Fix” the Data - Given the limitations of the system’s configurability, this may be the only way. I could, for example, use a Connector to “massage” the document titles (using a transform) and remove the spaces from the value/unit pairs, converting “5 V” to “5V”. Values like “5V” do not appear to be separated into multiple tokens. That’s not going to be trivial, but it might work.
- Add Additional Fields - Each one of these documents is linked to an entity that has an extensive set of metadata describing the thing in the document. In raw form those data are just numbers, but I could attach them as value/unit pairs in text fields instead. That combined with the stop words might give me a fighting chance of generating relevancy. Not sure what weight the engine would give to a value in a linked field vs., say, the name, though. And there’s no way to affect field weighting in the algo.
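For what it’s worth, here is a rough sketch of what the “Fix the Data” transform might look like. It’s written in Python purely for illustration; I don’t know yet what language or hooks the Connector transform actually supports, and the unit list, the pattern, and the `join_value_units` name are all my own placeholders rather than anything the platform provides.

```python
import re

# Hypothetical unit list (an assumption); extend it to cover whatever
# units actually appear in the titles.
UNITS = ("mV", "kV", "V", "mA", "A", "mW", "kW", "W")

# Put longer units first so that "mV" is tried before "V".
_UNIT_ALT = "|".join(sorted(map(re.escape, UNITS), key=len, reverse=True))

# A number (optionally with a decimal part), whitespace, then a unit that
# is not followed by another letter (so "5 Volt" is left alone).
_PAIR = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?)\s+(" + _UNIT_ALT + r")(?![A-Za-z])"
)


def join_value_units(text: str) -> str:
    """Remove the whitespace between a value and its unit: '5 V' -> '5V'."""
    return _PAIR.sub(r"\1\2", text)


if __name__ == "__main__":
    print(join_value_units("5 V 1 A charger design"))
    print(join_value_units("3.3 V regulator"))
```

Because the pattern matches any number rather than an enumerated list, it would cover “3.3 V” without extra configuration, which is the part that makes it scale. The catches I can see: it is case-sensitive (a lowercase “5 v” is left untouched), a stray capital “A” after a number (“Rev 2 A new design”) would be joined by mistake, and presumably the same normalization would have to be applied to incoming queries as well, if the platform allows that.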
I’m still searching for a “good” answer - that is, one that’s effective AND scales with minimal-to-no intervention as this library of documents grows. If anyone out there has any ideas I’m all ears. Thanks in advance.