I’ve got a couple of questions regarding the Web Crawler introduced with the Spring '21 Release:
The first two questions are actually questions asked by customers during our regional Release Webinar:
- Are there any short-term plans to add the option to set the user-agent string for the crawler?
- Are there any short-term plans to add the option to set user/password credentials to accommodate crawling pages that require a login?
My next question is about configuring a Data Connector that extracts and processes crawled data in the staging area for loading into the Knowledge Graph:
Identifying suitable CSS and/or XPath selectors is not really straightforward. Do we have in-depth documentation or best practices for finding them? Could we have a webinar or workshop specifically on this topic, or tackle it in detail in one of the next Office Hours events?
Thank you, Stefan
I have a few additional questions about the Crawler and about configuring a Data Connector for a Crawler, building on Stefan’s questions above:
- What are some ways to troubleshoot Crawlers that seem to be stuck “In Progress”? I have three set up (all in one account) that still say “In Progress” but have neither successfully crawled any pages nor failed to crawl any.
- Is there a way to add Profile Language as a field when uploading data through a Data Connector from a Crawler? This is not a field that would be scraped from a page, but I need to pull it in for the proper locale.
Hi Stefan and Adrienne,
Thank you for your questions. Please see the answers below:
We expect to allow setting a custom user-agent later this year by specifying optional headers for a crawler.
Similar to #1, we expect to allow username/password credentials to be passed the same way later this year.
Adrienne, do you have a link to the crawler in question? There might be something blocking crawls on the specific website.
Similar to #1, we expect to allow setting generic headers for a crawler that will include use cases like specifying locale later this year.
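For context, the three features above (custom user-agent, username/password credentials, and locale) all map to standard HTTP request headers. Here is a minimal sketch, using Python, of what such a header set could look like; the values and crawler name are hypothetical placeholders, and the actual configuration fields in the product may differ:

```python
import base64

# Hypothetical credentials -- the real crawler settings UI may differ.
username, password = "crawler-user", "s3cret"
token = base64.b64encode(f"{username}:{password}".encode()).decode()

headers = {
    # Identifies your crawler to the target site (custom user-agent)
    "User-Agent": "AcmeCrawler/1.0 (+https://example.com/bot-info)",
    # HTTP Basic authentication for login-protected pages
    "Authorization": f"Basic {token}",
    # Locale hint for sites that serve language-specific content
    "Accept-Language": "de-DE",
}
```

Sites that require a login via an HTML form (rather than HTTP Basic auth) would need a different mechanism, so check which scheme the target site uses before relying on the `Authorization` header.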
Adding another question here:
I’m having trouble finding the right CSS/XPath to crawl an image with my connector.
Can you tell me what the right CSS for this image (the FB buttons) would be?
@Adrienne_Williams right now, we use the Account’s Primary Language as the Profile Language for each new entity created via Data Connectors. You can go to Account Settings to view your account’s Primary Language. If you were looking to use a different Profile Language, you can update the Primary Language for your entities after they are created (either one at a time or in bulk!). We will add support for explicitly specifying a Profile Language in Connectors in the future.
@Laura_Ameskamp this specific case is not supported just yet, but will be in the coming weeks. The image URLs on that page are “relative URLs” e.g. “/uploads/6IahYIWy/493x0_764x0/neonbrand-I6wCDYW6ij8-unsplash.jpg”, so we need to do a bit of work to support turning these into absolute URLs. I will report back when this work is complete!
@Stefan_Heidbrink, for your original question, we will be adding more documentation around CSS and XPath Selectors to our module, but in the meantime, here are some resources that should help you out (all courtesy of @Jamie_O_Brien):
- CSS Selectors Reference
- XPath Syntax
- XPath cheatsheet
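To make the selector idea concrete, here is a small, self-contained sketch of extracting an image URL with an XPath expression. The page fragment is invented for illustration, and Python's built-in `xml.etree.ElementTree` is used only because it supports a subset of XPath out of the box; real HTML pages are rarely well-formed XML, so a proper HTML parser would be needed in practice:

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical page fragment standing in for a crawled page.
html = """
<div class="profile">
  <h1>Store Name</h1>
  <img class="logo" src="/uploads/logo.jpg"/>
  <img class="social" src="/uploads/fb-button.png"/>
</div>
"""

root = ET.fromstring(html)

# XPath: find the <img> whose class attribute is "social", then read its src.
# The equivalent CSS selector would be: img.social
social_src = root.find(".//img[@class='social']").get("src")
print(social_src)  # -> /uploads/fb-button.png
```

A good workflow for finding selectors like this is to open the page in a browser, right-click the element, choose “Inspect”, and read off a class or id attribute that uniquely identifies it.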
Thank you, @Liz_Frailey, for following up.
Sorry, I cannot mark two answers as “solution”.
Anyway, after playing around a bit with a couple of website samples, my findings are:
a) you need some practice - it gets easier once you’ve reached a certain level of experience (sorry for stating the obvious)
b) some web pages are just not meant to be “extractable” - you will need to apply some manual editing after creating entities, or consider another approach altogether, with manual copying/pasting as a last resort
I have two more questions related to the Crawler capabilities!
- Does the crawler duplicate content that is already in the KG? How can we avoid this?
- Is there a way to export the information that the Crawler scraped and that is selected through Data Connectors (say, in an Excel file) before it gets created as entities in the Knowledge Graph?
Thanks so much for all your helpful answers above!
The Crawler itself won’t duplicate content, because the crawler is only the tool that scrapes the website; the Connector is what converts crawled content into entities. The key to making sure data is not duplicated in the KG is to ensure that the Connector configuration is correctly mapped to the appropriate Entity ID. Entity IDs are the unique identifiers by which our systems know whether to add, update, or delete records, so we trigger updates to existing entities based on IDs. This means that whenever you are adjusting Connector configurations, you want to use extra caution around the IDs and their mapping; otherwise there can be unintended consequences, such as entities not being updated, or the wrong entities being updated.
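The ID-keyed add-or-update behavior described above can be sketched in a few lines. This is only an illustration of the logic, assuming entities are records keyed by an ID field; the entity names and IDs are invented, and this is not the actual Connector implementation:

```python
# Existing entities in the Knowledge Graph, keyed by Entity ID.
knowledge_graph = {
    "store-001": {"id": "store-001", "name": "Downtown Store"},
}

# Records produced by a crawl + Connector run (hypothetical data).
crawled = [
    {"id": "store-001", "name": "Downtown Store (renovated)"},  # same ID -> update in place
    {"id": "store-002", "name": "Airport Store"},               # new ID  -> add as new entity
]

for entity in crawled:
    # Keying on the ID is what prevents duplicates: an existing ID is
    # overwritten (updated) rather than added a second time.
    knowledge_graph[entity["id"]] = entity

print(sorted(knowledge_graph))  # -> ['store-001', 'store-002']
```

The flip side, as noted above, is that if the Connector maps the ID field incorrectly (or the IDs change between runs), every crawl produces IDs the system has never seen, and each run adds new entities instead of updating the existing ones.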
Currently there is not a way to export before creating entities. That can only be done once the entities have been created.
Let me know if this answers your questions or if I can add any more details!
Thanks for the context here, Calvin. For the duplicated content, what if the entities already existed in the Knowledge Graph prior to using the Crawler and have different Entity IDs?
Following up on this duplicate question - I have a few crawlers running for a custom demo and I’m encountering a LOT of duplicates (as many as 200 for a given entity). Is there a way to clean this up efficiently?