Searching for Values with Units

In the titles of many of the relevant documents on our site we often refer to parameter values that have units (e.g., voltage, current or power). In the title this would be written as, for example, “5 V” or “1 A” or “100 W”. Tokenizing those is problematic, since “5” or “1” or “100”, without any additional context, could mean virtually anything. I have 538 documents as of today, with a myriad of combinations of these terms in the titles. Bottom line is I have to figure out how to make a query like “5 V 1 A charger design” display the documents that say “5 V” AND “1 A” AND “charger” in the title.

It seems to me there are a number of ways to go about this, but so far I haven’t come up with one that works well enough that I’d consider the problem fixed. Here’s what I’ve considered:

  1. Custom Phrases - I could, in theory, make every one of those value/unit pairs a custom token. That doesn’t seem practical, though, as there would be hundreds of them (1 V, 2 V, etc.) and even then it wouldn’t work for a search containing something like “3.3 V” (unless of course that was a custom phrase as well).
  2. Stop Words - One of the problems is that “5 W”, for example, gets tokenized into “5” and “W”, which generates matches on pretty much every document that has a stand-alone “W” in the title, even if the context is “100 W”. Not helpful. Making “V”, “A”, “W”, etc. into stop words prevents that, but has other implications.
  3. Synonyms - This helps if somebody searches for “volts”, as I can make a synonym with “v”. But it doesn’t really help with the problem of tokenizing the value/unit pair.
  4. Query Rules - I can’t see how these would help because, while I could create regular expressions to match the input, it would be impractical to create enough of them to make a difference, and the number of filters needed to generate the “correct” results (if I could even figure out how to make those, since regex isn’t allowed in filters) would be just as large, if not larger.
  5. “Fix” the Data - Given the limitations of the system’s configurability, this may be the only way. I could, for example, use a Connector to “massage” the document titles (using a transform) and remove the spaces from the value/unit pairs, converting “5 V” to “5V”, for example. Values like “5V” do not appear to be separated into multiple tokens. That’s not going to be trivial, but it might work.
  6. Add Additional Fields - Each one of these documents is linked to an entity that has an extensive set of metadata describing the thing in the document. In raw form those data are just numbers, but I could attach them as value/unit pairs in text fields instead. That combined with the stop words might give me a fighting chance of generating relevancy. Not sure what weight the engine would give to a value in a linked field vs., say, the name, though. And there’s no way to affect field weighting in the algo.
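
As a rough illustration of option 5, here is a minimal Python sketch of the kind of rewrite a title transform would need to perform. It is not Connector syntax; the function name `join_value_units` is hypothetical, and the unit list (V, A, W) is taken from the examples above. It only shows the pattern: a number, whitespace, then a single-letter unit that stands alone.

```python
import re

# Assumption: only the single-letter units V, A and W matter, as in the post.
# The lookbehind keeps the match from starting in the middle of a number or word.
VALUE_UNIT = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)\s+([VAW])\b")

def join_value_units(title: str) -> str:
    """Collapse 'value<space>unit' pairs such as '5 V' into '5V'."""
    return VALUE_UNIT.sub(r"\1\2", title)

print(join_value_units("5 V 1 A charger design"))
```

A title that happens to contain a number followed by the article “A” would also be collapsed, so a pattern like this would need checking against real titles before being relied on.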

I’m still searching for a “good” answer - that is, one that’s effective AND scales with minimal-to-no intervention as this library of documents grows. If anyone out there has any ideas I’m all ears. Thanks in advance.

Surprised at the lack of response, despite this being a little out there in terms of use case. Nobody? Bueller?

Hi @Chick_Webb ,

Apologies for the lack of response here. This is a pretty challenging problem, and it’s hard to think of a perfect solution without fully understanding how your data is modeled in the KG, but I might be able to provide some high level advice.

I do think this is an appropriate use case for phrase-based searching. One way to do that would be to list every voltage / wattage / etc. in a custom phrase, but if those values are stored in a separate field you could also search that field with Phrase Match (which achieves largely the same effect).

For example, you could create a c_voltage field, and store a list of text values like [“5”, “5V”, “5 V”, “5 Volts”, …]. You could use this field to store synonyms directly, which would be easier than creating synonym sets. For example, while we could create synonyms from “V” to “Volts”, that wouldn’t account for a query like “5V” that does not have a space.

This list of values could likely be set up during the Connectors flow using Transforms.
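
To make the idea concrete, here is a small Python sketch of how such a list could be generated from a raw number and a unit. It is only an illustration of the logic, not Transform syntax; the helper `value_variants` and the `UNIT_NAMES` table are hypothetical names.

```python
# Assumption: the spelled-out unit names are the ones used in this thread.
UNIT_NAMES = {"V": "Volts", "A": "Amps", "W": "Watts"}

def value_variants(value: float, unit: str) -> list[str]:
    """Return the text variants a searcher might type for one value/unit pair."""
    number = f"{value:g}"  # 5.0 -> '5', 3.3 -> '3.3'
    return [
        number,
        f"{number}{unit}",
        f"{number} {unit}",
        f"{number} {UNIT_NAMES[unit]}",
    ]

print(value_variants(5, "V"))
```

For a 5 V design this yields the same four variants as the example list above.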

Hope this helps!
Alex

Thanks for chiming in @Alex_Yang. I actually kicked this over to Support 10 days or so ago and hadn’t gotten a response yet. So I kept thinking/working on it, and here’s what I sent over by way of an update the other day, even though I hadn’t heard back:

I’ve been thinking about and working on this in the meantime. One of the vexing cases that I’ve been dealing with is the lack of matches on those value-space-unit queries. A search for “65 W usb charger” for example, should prominently display the numerous designs we have for this use case, but did not. After studying up on the way the search algos work and doing a lot of test searches, it appeared to me that the many fields we have with values in them just don’t have much effect, and that I would need to create more “text” for the search.

I started with our Designs type. Documents are associated one-to-one with Design Entities, and much of the interesting technical information regarding the circuit described in the associated Document is in there. Design Entities are not a vertical, but they are linked to the Document. So the first thing I did was modify the Design to create a new field - “searchableName” - and then I modified the Connector for that Entity to take data from the various fields in the Entity and stuff that searchableName field with them in a “semi-formatted” way.

So, if the title of the Design is “DER-943 - 60 W General Purpose/Notebook PC Power Supply using InnoSwitch4-CZ and ClampZero”, then the contents of the searchableName field end up as “Name - DER943 DER-943 - 60 W General Purpose/Notebook PC Power Supply using InnoSwitch4-CZ and ClampZero, Product - NN4075C-H180, CPZ1075M, Minimum Input Voltage (Vin(min)) - 90V 90Volts, Maximum Input Voltage (Vin(max)) - 265V 265Volts, Number of Outputs - 1, Output Voltage - 20V 20Volts , Output Power (W(out)) - 60W 60Watts, Application - Chargers & Adapters,Computers & Servers, RDK Available”. There’s a lot going on there, but basically I’m trying to create a bunch of potential text/phrase matches for the name, the product(s) that are in the design, and the parameters (Volts, Amps, Watts). That field (c_linkedDesign.c_searchableName) is part of the searchable fields for the Document.
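
The parameter portion of that string follows a regular “label - valueUnit valueUnitname” pattern. A simplified Python sketch of the assembly is below; it is not the actual Connector transform, it omits the Product and Application parts, and the names `param_text` and `searchable_name` are hypothetical.

```python
# Assumption: unit names are spelled as in the example string above.
UNIT_WORDS = {"V": "Volts", "A": "Amps", "W": "Watts"}

def param_text(label: str, value: float, unit: str) -> str:
    """Format one parameter, e.g. 'Output Voltage - 20V 20Volts'."""
    number = f"{value:g}"
    return f"{label} - {number}{unit} {number}{UNIT_WORDS[unit]}"

def searchable_name(name: str, params: list[tuple[str, float, str]]) -> str:
    """Join the name and formatted parameters into one searchable string."""
    parts = [f"Name - {name}"] + [param_text(*p) for p in params]
    return ", ".join(parts)

print(searchable_name(
    "DER-943 - 60 W Power Supply",  # shortened title, for illustration only
    [("Output Voltage", 20, "V"), ("Output Power (W(out))", 60, "W")],
))
```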

That seemed to help a bit as I could see Documents appearing in search results because of matches with the contents of that field, but I was still vexed by seemingly random results generated by the value-space-unit problem. So I turned to Custom Phrases. But I didn’t want to create every possible combination of number-space-unit for, say, 1-100 x 3 (V, A, W). However, I figured I could create ones for all of the existing Designs since those data were already in the Design Entity. So I created another field in the Design called customPhrases and in the Connector I set the value of that to all of the V/A/W values for the design. The one above, for instance, came out to “90 V 265 V 20 V 60 W”. I then exported that field from the Design Entities, and with a little clever copy/paste in Word/Excel was able to pretty quickly reduce the thousands of those values to a small set (355) of unique ones in JSON format. It was pretty easy to copy/paste those into the platform.
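
The Word/Excel de-duplication step could also be scripted, which might ease the ongoing maintenance as new Designs are added. The Python sketch below assumes the exported customPhrases values are available as plain strings like the one above; the function name `unique_phrases` is hypothetical, and the plain JSON list it prints is an assumption about layout, not the platform’s required format.

```python
import json
import re

# Matches 'number<space>unit' pairs such as '90 V' or '3.3 V'.
PAIR = re.compile(r"(\d+(?:\.\d+)?) ([VAW])\b")

def unique_phrases(field_values: list[str]) -> list[str]:
    """Collect the distinct value/unit phrases found across all exported fields."""
    seen = set()
    for value in field_values:
        for number, unit in PAIR.findall(value):
            seen.add(f"{number} {unit}")
    # Sort by unit first, then numerically by value.
    return sorted(seen, key=lambda p: (p.split()[1], float(p.split()[0])))

exported = ["90 V 265 V 20 V 60 W", "20 V 65 W"]
print(json.dumps(unique_phrases(exported)))
```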

So, now I’ve got about a thousand Custom Phrases, including ones for those values. For Documents, this seems to have helped quite a bit and I’m mostly getting rational results. Although how a 5 W design sneaks into 6th place when I search for “65 W charger” mystifies me. See screenshot. Still some work to do, apparently. And, yeah, I know that data quality is part of the problem.

As you can see it’s similar to the suggestion that you had and I think pretty much functionally the same. (If there’s some reason you think your solution is better, please comment; I’m happy to rework this if it’ll help.) There are some downsides to this, mainly the ongoing maintenance of those Custom Phrases. It also doesn’t work at all for Amps (abbreviated “A”), since obviously that’s a stop word.

Following up, I don’t know if it’s worth tweaking this further at this point. Does it matter much, for example, which field the text/phrase match is on? I haven’t seen anything that suggests it in the documentation, but perhaps you know something I don’t. Similarly, if the field has a lot of unrelated information in it, does that affect the weight of a match? Again, I don’t think so, but please correct me if I’m wrong.