Crawler doesn't update content correctly

Hello.

We have several problems with a crawler we have created:

  • The first problem is that it does not update the entities correctly. If the crawled website updates its content, the change is not reflected in the entity. We have even tried deleting the entity and running the crawler again, and it recreates the entity with the previous content, not the new one.
  • The second problem is that we cannot tell it in which language it has to create the new entities.

Could you tell me how to proceed? Thank you.

Juanma

Hi Juan,

Apologies you’re having issues. We’re happy to help dig into this. To determine if this is a bug, we’d like to better understand your crawler/connector settings to see if it’s a configuration issue. Could you send us a link to the account configuration?

For languages, are you saying the same crawler is pulling in information in different languages so the entities need to have different language profiles depending on the content that is crawled?

Best,
Alyssa

Hi, Alyssa.

My account ID is: 3437198
https://www.yext.com/s/3437198/account/personalSettings

The client’s URL is: https://www.ie.edu/

Right now they only have the Answers product activated in English but the crawler is creating the entities in Spanish.

Thank you!

Hi Juan,

Thanks for sending over this additional info!

So, for the first question of the entities not updating correctly - do you have an example of an entity that was not updated as expected?

It looks like your Crawler and Connector are configured correctly: the Crawler is running daily, and the Connector is set to ‘Auto’, so it runs after each crawl. Also, from a quick look at the Connector Summary, it looks like the Connector ran today, added 1 new entity, and deleted 2 entities:

[Screenshot: Connector Summary]

So, it does look like it’s working. But if you have an example of a specific entity that does not match the site it is crawled from, let us know and we can take a look!

For your question about the languages - that is correct. By default, when new entities are created, the system uses your account’s default primary language, which you can find in Account Settings.

In your instance, the Account Primary Language is set to Spanish - so entities created by the Connector will be in Spanish.

You may already be familiar with the process of managing multiple language profiles, but you can always adjust this after an entity is created by changing the language profile of that specific entity. This unit has all the details.

We also do have an item on our roadmap to support explicitly specifying a language in the Connectors flow.

Thank you for your response.

Update problem:
Here are more details. The client’s website updates the date of the events on the same page. The URL of the page stays the same, but they modify one element in the HTML: the date.

The problem is:

  • The crawler goes through and creates the entity.
  • The client updates the date on their website.
  • The crawler runs again and does not modify the date in the entity.

Here are some examples:

https://www.ie.edu/search-results/?query=events+october&referrerPageUrl=https%3A%2F%2Fwww.ie.edu%2F&tabOrder=.%2Findex.html%2Cproducts%2Cfaqs%2Cevents%2Clinks%2Coffices&verticalUrl=events.html

Language problem:

So far we have been doing it this way, editing the language of all elements in bulk. But we can’t be doing this every time the crawler creates new elements, which is every day.
Is there any way to change the main language of my account to English? This field doesn’t let me edit it:

[Screenshot: the account's primary language field, not editable]

Thank you again!
Juanma

Hi Juanma,

I believe the issue may relate to the CSS selector currently being used to pull in the Date information.

The current CSS selector is: .info-evento .info-evento__section:nth-child(5) span:nth-child(2). This does not seem to be targeting the correct content in many cases.

I think the correct CSS selector for the Date would look like this: span[itemprop=startDate].

I came up with this by inspecting the page and looking at the properties for the Date span, which you can see highlighted on the right.
[Screenshot: page inspector with the Date span highlighted]

This should fix the issue where it seems like the Connector isn’t updating the Date field!
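
To illustrate why a positional selector can pick up the wrong content while an attribute selector does not, here is a minimal sketch. The markup is hypothetical and heavily simplified (it is not the client's real page), the dates and the "CET" value are made up, and it uses the limited XPath support in Python's standard library as a stand-in for CSS selectors, so treat it as an illustration of the idea rather than of how the Crawler evaluates selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified event markup. Values are invented for illustration.
PAGE_A = """
<div class="info-evento">
  <div class="info-evento__section"><span>Format</span><span>Online</span></div>
  <div class="info-evento__section"><span>Date</span><span itemprop="startDate">2021-12-02</span></div>
  <div class="info-evento__section"><span>Timezone</span><span>CET</span></div>
</div>
"""

# Same kind of page, but without the "Format" section, so every
# remaining section sits one position earlier.
PAGE_B = """
<div class="info-evento">
  <div class="info-evento__section"><span>Date</span><span itemprop="startDate">2021-12-03</span></div>
  <div class="info-evento__section"><span>Timezone</span><span>CET</span></div>
</div>
"""


def by_position(html):
    """Positional lookup: second span of the second section (roughly nth-child style)."""
    root = ET.fromstring(html.strip())
    node = root.find("./div[2]/span[2]")
    return node.text if node is not None else None


def by_attribute(html):
    """Attribute lookup: the span marked itemprop="startDate", wherever it sits."""
    root = ET.fromstring(html.strip())
    node = root.find(".//span[@itemprop='startDate']")
    return node.text if node is not None else None


if __name__ == "__main__":
    for name, page in (("PAGE_A", PAGE_A), ("PAGE_B", PAGE_B)):
        print(name, "position:", by_position(page), "| attribute:", by_attribute(page))
```

On the second sample page the positional lookup lands on the timezone value, while the attribute lookup still finds the date on both pages.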

Best,
DJ

Hello, DJ.

Thank you for replying.

We are not using the date field because, as you say, we did not always get the correct value:

The field we are displaying in the card, which is not updated, is the HTML that encompasses all these elements:

In the preview items, none appear with a date that has already passed, though I can’t see them all. But this is the field that is not being updated.

Right now there are none that are wrong because none have been updated:

I don’t know if you have changed the way you create entities. They have all been deleted and recreated. This would be perfect if it were not for the fact that they were created in Spanish, so no entity was shown on the website. I had to change them all to English.

At this point, we come back to the other problem: we can’t be logging in every day to change the language of the entities. That defeats the purpose of automating this task.

Thank you for your understanding and time.
Juanma

Hey Juanma,

I have changed the Primary Language of your account to English!

Regarding the three incorrect date values you highlighted, where it shows a timezone instead of a date, the CSS selector I shared in my previous post should remedy that issue!

I also took a closer look at that “Informacion” field, which you are populating with the HTML that encompasses all the event information. It seems that this field is successfully updating, as the “Women Leadership Talks” event now shows a date of December 2nd in the Answers experience, matching the website, instead of the outdated October date it was showing initially.

In general, the unique identifier used as the Entity ID is how the connector decides whether to create, update or delete entities during each run. If you continue to see entities being deleted and recreated without any change to the Entity ID mapping in your crawler settings, or changes to the client’s pages themselves, we can take another look at what may be causing that.
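
As a rough illustration of ID-based reconciliation, the sketch below compares stored entities with freshly crawled records keyed by entity ID. This is not the Connector's actual implementation; the function name `plan_sync`, the IDs, and the field values are all hypothetical. It only shows why a stable ID leads to an update, while a changed ID mapping leads to a delete plus a create.

```python
from typing import Dict, List


def plan_sync(existing: Dict[str, dict], crawled: Dict[str, dict]) -> Dict[str, List[str]]:
    """Decide what to create, update, or delete by comparing entity IDs."""
    # IDs seen in the crawl but not stored yet.
    create = sorted(crawled.keys() - existing.keys())
    # IDs stored but no longer produced by the crawl.
    delete = sorted(existing.keys() - crawled.keys())
    # IDs present on both sides whose content differs.
    update = sorted(
        eid for eid in crawled.keys() & existing.keys()
        if crawled[eid] != existing[eid]
    )
    return {"create": create, "update": update, "delete": delete}


# Hypothetical stored entities.
existing = {
    "event-1": {"info": "old date"},
    "event-2": {"info": "unchanged"},
}

# Same IDs, one page changed its date: the entity is updated in place.
same_ids = {
    "event-1": {"info": "new date"},
    "event-2": {"info": "unchanged"},
}

# The ID mapping changed: the old entities are deleted and new ones created.
new_ids = {
    "evt-1": {"info": "new date"},
    "evt-2": {"info": "unchanged"},
}

if __name__ == "__main__":
    print(plan_sync(existing, same_ids))
    print(plan_sync(existing, new_ids))
```

With a stable ID, a changed date shows up as a single update; with a changed ID scheme, every entity is deleted and recreated, and the recreated entities get the account's default language again.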

Best,
DJ


Hi, DJ.

Thank you for your response.

Everything seems to be fine. We’ll check it over the next few days, and I’ll let you know if it doesn’t update.

Thank you!
