Web Crawler - Random important elements in large element

Simon_Suewer · April 13, 2022, 12:44pm

Hello, I have two questions.
I hope you can help me.

With the web crawler, selectors like :first-child do not seem to work properly. What could be the reason for this?
For hotels, this application has an info page that comes from an external service, so a web crawler must be used, e.g. https://www.chiemsee-alpenland.de/chiemsee/ukv/house/Bad-Feilnbach-Pension-Gaestehaus-Huber-DEU00000060002247028.

Here you can see a category “Equipment & Information”.
This contains several relevant pieces of data, which I would like to use as separate attributes for the graph. However, the data is in random order and the individual items do not have their own selectors. Is the NLP good enough to use all the information as one attribute? The language is German.

With kind regards
Simon

Kristy_Huang · April 15, 2022, 9:02pm

Hi Simon,

Yes, I see that with the way this page is set up, it is a bit difficult to separate the different headers under the “Equipment & Information” section into different fields since the headers all use the same selector.

If I understand correctly, as an example, you would want a setup like below:

Meals field with a text list containing “Shopping service before arrival” and “breakfast”
Breakfast field with a text list containing “Bread service”, “breakfast buffet”, and “Regional specialties”


However, using the selector .tp-characteristics__text as you have done in your account would pull in all the attributes together in one long string, i.e. “Shopping service before arrival, breakfast, Bread service, breakfast buffet, Regional specialties”. You could use a “Split to Column” transform to separate these into separate columns (and thus fields) with a comma as the separator. However, each attribute would be mapped to a separate field. Without some kind of separator in between the “meals” and “breakfast” categories, you won’t be able to group them automatically.
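To make the limitation concrete, here is a minimal sketch in plain Python. It is not the crawler's actual "Split to Column" transform, and the helper name split_to_columns is made up for illustration. It only assumes that the selector returns the single comma-separated string quoted above.

```python
from typing import List

# Illustration only: plain Python, not the Yext transform itself.
# This is the string that the single selector returns in the example above.
crawled = (
    "Shopping service before arrival, breakfast, Bread service, "
    "breakfast buffet, Regional specialties"
)


def split_to_columns(value: str, separator: str = ",") -> List[str]:
    """Split one crawled string into trimmed, non-empty column values.

    Hypothetical helper that mimics splitting on a separator.
    """
    return [part.strip() for part in value.split(separator) if part.strip()]


columns = split_to_columns(crawled)
for index, value in enumerate(columns, start=1):
    print(f"column {index}: {value}")
```

The result is a flat list of five values. Nothing in the string marks where the "meals" values end and the "breakfast" values begin, which is why the grouping cannot be recovered automatically.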

The other option is to have all attributes under the same field, which I think is what your question about NLP is referring to. If you expect people to search your Answers experience for something like “hotels with breakfast”, you can add a text search to this field. I’d caution against using an NLP filter since there are similar attributes such as “breakfast” and “breakfast buffet” - an NLP filter would narrow down to the one best match and create a black and white filter from there. Check out the Searchable Fields Best Practices unit to learn more.

You could also add a searchable facet to this attributes field (with or without using text search as well) so that users can check off the features they want. See an example with the Publishers vertical in the Hitchhikers search.
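As a rough illustration of the facet idea, the sketch below uses plain Python and made-up hotel data. It is not how Answers implements facets. It assumes each hotel stores all of its attributes in one list field and that a facet option matches on the exact attribute value.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical data: hotel names and attribute lists are invented for this sketch.
hotels: Dict[str, List[str]] = {
    "Hotel A": ["breakfast", "Bread service"],
    "Hotel B": ["breakfast buffet", "Regional specialties"],
    "Hotel C": ["breakfast", "breakfast buffet"],
}


def facet_counts(entities: Dict[str, List[str]]) -> Counter:
    """Count how many entities carry each attribute value (the facet options)."""
    counts: Counter = Counter()
    for attributes in entities.values():
        counts.update(set(attributes))  # count each value once per entity
    return counts


def filter_by_facets(entities: Dict[str, List[str]], selected: Iterable[str]) -> List[str]:
    """Return the entities that have every selected attribute value."""
    wanted = set(selected)
    return [name for name, attributes in entities.items() if wanted <= set(attributes)]


options = facet_counts(hotels)
matches = filter_by_facets(hotels, ["breakfast"])
print(options)
print(matches)
```

Under these assumptions, "breakfast" and "breakfast buffet" stay separate options, so the user decides which one they mean instead of a filter picking a single best match.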

If you have another idea to accomplish what you want, but the crawler product can’t support it, feel free to submit a product request to the Ideas board. We appreciate any feedback!
