Web Crawler Performance and Redirects

  1. In situations where the crawl strategy is All Pages for a domain, is there a feature/option to limit the number of URLs the data crawler will crawl per second?
  2. Will the Crawler follow redirects?

Hi Lenny,

  1. No, at this point we do not have a feature that lets you limit crawls per second, though it is something we may consider adding in the future (a client-side workaround is sketched after this list).
  2. Yes, the crawler will follow redirects!
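
For anyone running a crawl outside the platform, a per-second limit can be approximated client-side. Here is a minimal Python sketch, not the platform's implementation: the `MAX_REQUESTS_PER_SECOND` value and `crawl` helper are illustrative, and the redirect following is simply the `requests` library's default behavior.

```python
import time

import requests

MAX_REQUESTS_PER_SECOND = 2  # illustrative limit; the platform has no such setting

def crawl(urls):
    """Fetch each URL no faster than MAX_REQUESTS_PER_SECOND, following redirects."""
    min_interval = 1.0 / MAX_REQUESTS_PER_SECOND
    for url in urls:
        started = time.monotonic()
        # requests follows redirects by default (allow_redirects=True)
        response = requests.get(url, allow_redirects=True, timeout=10)
        print(response.status_code, response.url)
        # Sleep off whatever remains of this request's time slot
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)

crawl(["https://example.com/"])  # placeholder URL
```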

Thanks,
Jamie

Hi - following up on this question, are we able to see which pages were redirected during the crawl? Will this be called out anywhere in the platform?

We are planning to handle redirects more gracefully in the crawler and to indicate a specific status for redirected pages in the UI. We are hoping to complete this work in roughly the next few sprints.
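
In the meantime, redirects can be spotted with a standalone check outside the platform. Below is a minimal Python sketch using the `requests` library, whose `response.history` attribute lists the intermediate responses when a request was redirected (the URL is a placeholder):

```python
import requests

def redirect_report(urls):
    """Print the original and final URL for any page that redirected."""
    for url in urls:
        response = requests.get(url, timeout=10)
        if response.history:  # non-empty only when redirects occurred
            hops = " -> ".join(r.url for r in response.history)
            print(f"redirected: {hops} -> {response.url}")
        else:
            print(f"direct:     {url}")

redirect_report(["https://example.com/"])  # placeholder URL
```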

Hi @Calvin_Casalino,
Similar question to the above: Does the Crawler recognize if a page or URL has been deleted? What would happen if an old page that was previously scraped and added as an entity were deprecated? Would the Crawler be able to tell that it leads to a 404 page?

Any insight into how this might be handled would be much appreciated!

Thanks,
Adrienne

Hey Adrienne,

If a page or URL has been deleted, the crawler will run into a 404 page and mark that task as failed. As a result, any connectors based on this crawler will simply no longer consider that 404 page.
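
As a rough illustration of that behavior (the `crawl_task` function and its status strings are hypothetical, not the crawler's actual internals):

```python
import requests

def crawl_task(url):
    """Hypothetical sketch: fetch a page, failing the task on a 404."""
    response = requests.get(url, timeout=10)
    if response.status_code == 404:
        # The page was deleted: the task fails, and connectors built on this
        # crawler simply stop seeing the URL in the crawl results.
        return {"url": url, "status": "failed", "reason": "404 Not Found"}
    return {"url": url, "status": "succeeded", "content": response.text}
```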

In the future, we will add functionality to handle these deleted pages more gracefully on the connector side.

Thanks,
Calvin