Crawler Product | Yext Hitchhikers Platform
Overview
The Crawler is a method of extracting data from websites to be brought into Yext. Data is scraped by a crawler and can then be ingested into Yext via a connector. Crawlers can be customized to scrape data from specified web pages on a one-time or ongoing basis.
This doc outlines the configuration settings to set up a crawler. For information on using a crawler as a source when building a connector, see the Crawler Source reference.
Basic Configuration
These settings apply to all crawlers.
Duplicate URLs
If a URL is encountered more than once in a single crawl, only the first instance of that URL will be crawled.
Schedule
Crawlers can be scheduled to run Once, Daily, or Weekly. The default option is to run once.
If a daily or weekly schedule is set, a new crawl will be initiated exactly one day (or one week) after the previous crawl has finished.
All crawlers set to Daily or Weekly will have their schedules automatically changed to Once after the 14th day of inactivity.
A crawler is considered inactive if:
- The crawler is not linked to a connector
- The crawler is linked to a manually-run connector, but the connector has not been run in the past 14 days
- The crawler is linked to a manually-run connector, but the crawler configuration has not been viewed in the past 14 days
Inactive crawlers that have been automatically reset to Run Once will remain in the platform, and can still be viewed and run manually. You can also re-add a daily or weekly schedule to the crawler, but it will be reset back to Run Once if the crawler is inactive again.
Supported File Types
The current supported file types are HTML and PDF.
Choose from the following options:
- All File Types: All supported file types will be crawled if encountered. This setting will automatically include any file types that become supported in the future.
- Select File Types: Only the file types selected will be crawled.
Note that the crawler is still able to detect and spider to URLs picked up from an unselected file type, without actually crawling that file. For example, if you want to only crawl PDFs, but you have PDFs linked on an HTML webpage, you can still select “PDF only.” The crawler will spider through the HTML links to find the PDF files, but will then crawl only the PDFs.
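As an illustration of that distinction, here is a minimal sketch (not the crawler's actual logic) that decides by file extension which discovered URLs get crawled when only PDF is selected. The should_ingest helper and the example URLs are hypothetical; a real crawler may also rely on content types.

```python
from urllib.parse import urlparse

SELECTED_FILE_TYPES = {".pdf"}  # "Select File Types" with only PDF chosen


def should_ingest(url: str) -> bool:
    """Crawl only the selected file types; other pages can still be spidered for links."""
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in SELECTED_FILE_TYPES)


print(should_ingest("https://www.yext.com/whitepaper.pdf"))  # True: crawled
print(should_ingest("https://www.yext.com/blog"))            # False: only spidered for links
```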
Rate Limit
The rate limit is the maximum number of tasks the crawler can execute on a site at one time.
- Default: 100
- Minimum: 1
- Maximum: 1,500
The rate limit should be set based on the maximum number of concurrent scrapers your site can handle without impacting site performance.
Websites often have a pre-set rate limit. When too many requests are received, a 429 error will occur. This error will be surfaced in the "Status" column on the Crawl Details page.
If this error occurs, lower the rate limit on the crawler. If your site can handle a faster rate limit, you can increase the rate limit to increase the speed of the crawl.
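Conceptually, the rate limit caps how many requests are in flight at once. The sketch below is only an illustration of that idea, not the crawler's implementation: it uses a thread pool sized to a hypothetical RATE_LIMIT to fetch a list of URLs and flags any 429 responses.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError
from urllib.request import urlopen

RATE_LIMIT = 100  # maximum number of concurrent fetch tasks (the crawler's default)


def fetch(url: str) -> int:
    """Fetch a single URL and return its HTTP status code."""
    try:
        with urlopen(url) as response:
            return response.status
    except HTTPError as err:
        if err.code == 429:
            # The site is rate limiting us; lowering RATE_LIMIT reduces the pressure.
            print(f"429 received for {url}; consider lowering the rate limit")
        return err.code


def fetch_all(urls: list[str]) -> list[int]:
    # The thread pool caps the number of requests in flight at RATE_LIMIT.
    with ThreadPoolExecutor(max_workers=RATE_LIMIT) as pool:
        return list(pool.map(fetch, urls))
```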
Blacklisted URLs
Specify any URLs that should be excluded from being crawled, even if they would fall under the chosen crawl strategy.
You can provide a list of exact URLs or use regex notation to specify URLs of a certain pattern.
To learn more about regex notation, see Mozilla’s developer cheat sheet.
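For instance, a blocklist could mix exact URLs with regex patterns. The sketch below is a hypothetical illustration of how such a list might be evaluated; the example URLs and patterns are assumptions, not defaults.

```python
import re

# Hypothetical blocklist: one exact URL and one regex pattern for a URL family.
EXACT_BLACKLIST = {"https://www.yext.com/careers"}
PATTERN_BLACKLIST = [re.compile(r"https://www\.yext\.com/blog/\d{4}/.*")]


def is_blacklisted(url: str) -> bool:
    """Return True if the URL should be excluded from the crawl."""
    if url in EXACT_BLACKLIST:
        return True
    return any(pattern.fullmatch(url) for pattern in PATTERN_BLACKLIST)


print(is_blacklisted("https://www.yext.com/blog/2023/04/use-cases-for-ai"))  # True
print(is_blacklisted("https://www.yext.com/blog"))                           # False
```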
Source Type
The crawler supports using a sitemap or specific domains to determine which pages to crawl. The settings below depend on the chosen source type.
Sitemap Source Type
Selecting Sitemap as the source will require a sitemap URL (.xml format). You will also choose whether or not to check the lastmod tag on URLs in the sitemap.
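For context, a sitemap is an XML file of url entries, each with a loc and an optional lastmod. The sketch below is illustrative only (the embedded XML is a made-up example and the parse_sitemap helper is hypothetical); it shows how those fields can be read, which is the information a lastmod check relies on.

```python
import xml.etree.ElementTree as ET

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.yext.com/blog</loc>
    <lastmod>2023-05-01</lastmod>
  </url>
  <url>
    <loc>https://www.yext.com/platform</loc>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> list[tuple[str, str | None]]:
    """Return (URL, lastmod) pairs from a sitemap; lastmod is None when absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_node in root.findall("sm:url", NS):
        loc = url_node.findtext("sm:loc", namespaces=NS)
        lastmod = url_node.findtext("sm:lastmod", namespaces=NS)
        entries.append((loc, lastmod))
    return entries


for loc, lastmod in parse_sitemap(SITEMAP_XML):
    # A crawler configured to check lastmod could skip URLs whose lastmod
    # has not changed since the previous crawl.
    print(loc, lastmod)
```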
Domain Source Type
Selecting Domain as the source will require you to choose a crawl strategy, specify a start URL (pages or domains to crawl from based on the chosen crawl strategy), and specify a sub-pages URL structure.
Pages or Domains to Crawl
A list of start domains for the crawler to spider through.
The start URL is the first URL that will be crawled. It is also the starting point for any subsequent pages crawled by a crawler. You can configure a crawler with more than one start URL.
The root domain of a start URL is referred to as the start domain.
Start URLs can be as broad or specific as you choose. All of these are examples of valid start URLs:
- yext.com
- blog.yext.com
- https://www.yext.com
- https://www.yext.com/blog/2023/05/why-tech-leaders-are-embracing-custom-built-dxp
The crawler will resolve the HTTPS protocol: you can include https:// in your start URL(s), but it is not necessary for the crawler to function.
The specific pages to crawl from the start URL are determined by the crawl strategy.
Crawl Strategy
The crawl strategy determines which pages from the start URL should be crawled. A crawler can crawl every page on a start domain, only certain sub-pages, or specific URLs.
The process of detecting URLs to crawl is called spidering. The crawler detects pages to crawl via URLs referenced in href tags within the HTML of your site (e.g., <a href="www.yext.com">). From there, the crawler spiders to each URL stored in an href tag and repeats the process, up to 100,000 URLs (see the Connector Limits reference).
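To make the spidering step concrete, here is a minimal sketch, using only Python's standard library and not the crawler's actual code, that collects href values from a page's HTML and resolves them into absolute URLs to visit next.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the URLs referenced in href attributes of <a> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


page_html = '<p>Read the <a href="/blog/2023/04/use-cases-for-ai">latest post</a>.</p>'
collector = HrefCollector("https://www.yext.com/blog")
collector.feed(page_html)
print(collector.links)  # ['https://www.yext.com/blog/2023/04/use-cases-for-ai']
```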
Below are the available crawl strategy options:
All Pages: Crawls all pages on the same root domain and subdomain as the start URL. This will include any pages at a higher subdirectory level than the start URL.
For example, for the start URL www.yext.com/blog, the subdomain is www., the root domain is yext.com, and the subdirectory is /blog. This means that selecting the All Pages crawl strategy for this start URL will crawl all pages on www.yext.com, not just sub-pages under www.yext.com/blog.
Here is how some example URLs would be crawled with the All Pages strategy, for the start URL www.yext.com/blog:
Page URL | Crawled | Reason
---|---|---
www.yext.com | Yes | Subdomain and root domain match the start URL
www.yext.com/blog/2023/04/use-cases-for-ai | Yes | Subdomain and root domain match the start URL
www.yext.com/platform | Yes | Subdomain and root domain match the start URL
help.yext.com | No | Subdomain does not match the start URL
Sub-Pages: Crawls all pages at a lower subdirectory level than the start URL. The pages must also be on the same subdomain and root domain as the start URL.
Here is how the same example URLs from above would be crawled with the Sub-Pages strategy, for the start URL www.yext.com/blog:
Page URL | Crawled | Reason
---|---|---
www.yext.com | No | Not a sub-page of the start URL
www.yext.com/blog/2023/04/use-cases-for-ai | Yes | Sub-page of the start URL
www.yext.com/platform | No | Not a sub-page of the start URL
help.yext.com | No | Subdomain does not match the start URL
Specific Pages: Only the URLs listed as start URLs will be crawled. No other pages will be spidered to or crawled from the start URLs.
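The three strategies above can be summarized as simple URL checks. The following sketch is a simplified illustration (assuming fully qualified URLs and a hypothetical in_scope helper), not the crawler's implementation.

```python
from urllib.parse import urlparse

START_URL = "https://www.yext.com/blog"


def in_scope(url: str, strategy: str, start_url: str = START_URL) -> bool:
    """Illustrative scope check for the three crawl strategies."""
    start, candidate = urlparse(start_url), urlparse(url)
    if strategy == "specific_pages":
        # Only the listed start URLs themselves are crawled.
        return candidate.netloc == start.netloc and candidate.path == start.path
    if strategy == "all_pages":
        # Subdomain and root domain must match; any path is allowed.
        return candidate.netloc == start.netloc
    if strategy == "sub_pages":
        # Same host, and the path must sit under the start URL's subdirectory.
        return candidate.netloc == start.netloc and candidate.path.startswith(
            start.path.rstrip("/") + "/"
        )
    raise ValueError(f"unknown strategy: {strategy}")


print(in_scope("https://www.yext.com/platform", "all_pages"))                       # True
print(in_scope("https://www.yext.com/platform", "sub_pages"))                       # False
print(in_scope("https://www.yext.com/blog/2023/04/use-cases-for-ai", "sub_pages"))  # True
print(in_scope("https://help.yext.com/anything", "all_pages"))                      # False
```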
Sub-Page URL Structure
Adding a sub-page URL structure allows the crawler to spider to sub-pages with a different URL structure than the start URL.
Use this setting to crawl sub-pages from multiple subdirectories on one web domain using a single crawler. For example, if you wanted to capture sub-pages under www.yext.com/blog and www.yext.com/faq, but not www.yext.com/products, you should specify a sub-page URL structure.
Here is how some example URLs would be crawled from the start URL www.yext.com/blog, with the Sub-Pages crawl strategy and a sub-page URL structure of www.yext.com/faq/*:
Page URL | Crawled | Reason
---|---|---
www.yext.com | No | Crawl strategy (not a sub-page of the start URL)
www.yext.com/blog/2023/04/use-cases-for-ai | Yes | Crawl strategy (sub-page of the start URL)
www.yext.com/faq/posts/2023 | Yes | Sub-page URL structure
help.yext.com | No | Crawl strategy (subdomain does not match the start URL)
www.yext.com/products/posts/content | No | Both settings (not a sub-page of the start URL, does not match the sub-page URL structure)
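A sub-page URL structure behaves like a wildcard pattern over the URL. The sketch below is an illustrative approximation using shell-style wildcards; the pattern list mirrors the example above, but the matches_sub_page_structure helper itself is hypothetical.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Hypothetical configuration mirroring the example above.
SUB_PAGE_PATTERNS = ["www.yext.com/faq/*"]


def matches_sub_page_structure(url: str) -> bool:
    """Return True if the URL matches a configured sub-page URL structure."""
    parsed = urlparse(url)
    bare = parsed.netloc + parsed.path  # compare without the protocol
    return any(fnmatch(bare, pattern) for pattern in SUB_PAGE_PATTERNS)


print(matches_sub_page_structure("https://www.yext.com/faq/posts/2023"))          # True
print(matches_sub_page_structure("https://www.yext.com/products/posts/content"))  # False
```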
Query Parameter Settings
Designate whether query parameters should be ignored when differentiating between crawled URLs.
For example, if query parameters are ignored, the crawler would see www.yext.com/blog and www.yext.com/blog?utm_source=google as the same URL. This would prevent the crawler from crawling that page more than once.
Choose from the following settings:
- All: All query parameters are ignored
- None: No query parameters are ignored
- Specific Parameters: The parameters specified in this setting will be ignored.
Note that selecting None will override any Specific Parameters settings.
Here is how some example URLs would be crawled with all query parameters ignored, with the start URL www.yext.com/blog:
Page URL | Crawled | Reason | Resulting Crawled URL
---|---|---|---
www.yext.com/blog?utm_source=google | Yes | Matches the start URL | www.yext.com/blog
www.yext.com/blog?page=11&language=en | No | With query parameters ignored, resulting URL is a duplicate of prior crawled URL | N/A
With the specific query parameter language ignored:
Page URL | Crawled | Reason | Resulting Crawled URL
---|---|---|---
www.yext.com/blog?utm_source=google | Yes | Query parameter not ignored | www.yext.com/blog?utm_source=google
www.yext.com/blog?language=en&page=11 | Yes | With language ignored, resulting URL is not a duplicate of other crawled URLs | www.yext.com/blog?page=11
www.yext.com/blog?language=en&utm_source=google | No | With language ignored, resulting URL is a duplicate of prior crawled URL | N/A
With none of the query parameters ignored:
Page URL | Crawled | Reason | Resulting Crawled URL
---|---|---|---
www.yext.com/blog?utm_source=google | Yes | Seen as a distinct URL | www.yext.com/blog?utm_source=google
www.yext.com/blog?language=en&page=11 | Yes | Seen as a distinct URL | www.yext.com/blog?language=en&page=11
www.yext.com/blog?language=en&utm_source=google | Yes | Seen as a distinct URL | www.yext.com/blog?language=en&utm_source=google
Max Depth (Domain Source Only)
The number of levels past the start URL for the crawler to spider to.
- Default: 10
- Minimum: 0
- Maximum: 100
For example: if the start domain is www.yext.com/blog (depth = 0) and the specified Max Depth is 1, the crawler will spider to any URLs directly linked on the page www.yext.com/blog (for example, www.yext.com/blog/2023). With that, the crawler has reached the max depth of 1, and it would not continue to spider to any further URLs linked on www.yext.com/blog/2023.
If duplicate URLs are present within a single crawl, the system will consider the first instance of that URL as the recorded depth. For example, if yext.com/blog/2023 was encountered first at a depth of 2, and then later in the crawl at a depth of 9 (via a different spidering path), the system would consider the depth of that URL to be 2.
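This depth behavior can be modeled as a breadth-first traversal that stops expanding links once the configured depth is reached and records each URL at the depth where it was first seen. The sketch below uses a made-up link graph for illustration; it is not the crawler's implementation.

```python
from collections import deque

# Hypothetical link graph standing in for pages discovered by spidering.
LINKS = {
    "www.yext.com/blog": ["www.yext.com/blog/2023", "www.yext.com/blog/authors"],
    "www.yext.com/blog/2023": ["www.yext.com/blog/2023/04"],
    "www.yext.com/blog/2023/04": ["www.yext.com/blog"],
}

MAX_DEPTH = 1


def crawl_to_depth(start_url: str, max_depth: int = MAX_DEPTH) -> dict[str, int]:
    """Breadth-first spidering that records the depth at which each URL is first seen."""
    depths = {start_url: 0}           # first-seen depth per URL (start URL is depth 0)
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            continue                  # do not spider past the configured max depth
        for linked in LINKS.get(url, []):
            if linked not in depths:  # duplicates keep their first recorded depth
                depths[linked] = depths[url] + 1
                queue.append(linked)
    return depths


print(crawl_to_depth("www.yext.com/blog"))
# {'www.yext.com/blog': 0, 'www.yext.com/blog/2023': 1, 'www.yext.com/blog/authors': 1}
```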
Crawler Status and Performance
Crawler Overview Page
The Status bar displays details for the most recent crawl.
The Pages tab displays all unique pages crawled in all crawls. Specific details for a given crawled page reflect the most recent crawl of that page.
Crawl Details Page
Displays details for a single crawl. The pages listed here are all pages crawled for that individual crawl.
Limitations
For a full list of system limitations, see the Connector System Limits reference.