Web Crawler Performance and Redirects

  1. In situations where the crawl strategy is All Pages for a domain, is there a feature/option to limit the number of URLs the data crawler will crawl per second?
  2. Will the Crawler follow redirects?

Hi Lenny,

  1. No, at this point we do not have a feature that lets you limit crawls per second, though it is something we may consider adding in the future (a client-side workaround is sketched after this list).
  2. Yes, the crawler will follow redirects!
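
For anyone running a crawl outside the platform, a per-second limit can be approximated client-side. Here is a minimal Python sketch, not the platform's implementation: the `MAX_REQUESTS_PER_SECOND` value and `crawl` helper are illustrative, and the redirect following is simply the `requests` library's default behavior.

```python
import time

import requests

MAX_REQUESTS_PER_SECOND = 2  # illustrative limit; the platform has no such setting

def crawl(urls):
    """Fetch each URL no faster than MAX_REQUESTS_PER_SECOND, following redirects."""
    min_interval = 1.0 / MAX_REQUESTS_PER_SECOND
    for url in urls:
        started = time.monotonic()
        # requests follows redirects by default (allow_redirects=True)
        response = requests.get(url, allow_redirects=True, timeout=10)
        print(response.status_code, response.url)
        # Sleep off whatever remains of this request's time slot
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)

crawl(["https://example.com/"])  # placeholder URL
```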

Thanks,
Jamie

Hi - following up on this question, are we able to see which pages were redirected during the crawl? Will this be called out anywhere in the platform?

We are planning to handle redirects more gracefully in the crawler and to indicate a specific status for redirected pages in the UI. We are hoping to complete this work in roughly the next few sprints.
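
In the meantime, redirects can be spotted with a standalone check outside the platform. Below is a minimal Python sketch using the `requests` library, whose `response.history` attribute lists the intermediate responses when a request was redirected (the URL is a placeholder):

```python
import requests

def redirect_report(urls):
    """Print the original and final URL for any page that redirected."""
    for url in urls:
        response = requests.get(url, timeout=10)
        if response.history:  # non-empty only when redirects occurred
            hops = " -> ".join(r.url for r in response.history)
            print(f"redirected: {hops} -> {response.url}")
        else:
            print(f"direct:     {url}")

redirect_report(["https://example.com/"])  # placeholder URL
```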

Hi @Calvin_Casalino,
Similar question to the above: Does the Crawler recognize if a page or URL has been deleted? What would happen if an old page that was previously scraped and added as an entity were deprecated? Would the Crawler be able to tell that it leads to a 404 page?

Any insight into how this might be handled would be much appreciated!

Thanks,
Adrienne

Hey Adrienne,

If a page or URL has been deleted, the crawler will run into a 404 page and mark that task as failed. As a result, any connectors based on this crawler will simply no longer consider that 404 page.
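
As a rough illustration of that behavior (the `crawl_task` function and its status strings are hypothetical, not the crawler's actual internals):

```python
import requests

def crawl_task(url):
    """Hypothetical sketch: fetch a page, failing the task on a 404."""
    response = requests.get(url, timeout=10)
    if response.status_code == 404:
        # The page was deleted: the task fails, and connectors built on this
        # crawler simply stop seeing the URL in the crawl results.
        return {"url": url, "status": "failed", "reason": "404 Not Found"}
    return {"url": url, "status": "succeeded", "content": response.text}
```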

In the future, we will add functionality to handle these deleted pages more gracefully on the connector side.

Thanks,
Calvin