Crawler Product | Yext Hitchhikers Platform
Overview
The Crawler extracts data from websites so it can be brought into Yext. Data is scraped by a crawler and can then be ingested into Yext via a connector. Crawlers can be customized to scrape data from specified web pages on a one-time or ongoing basis.
This doc outlines the configuration settings to set up a crawler. For information on using a crawler as a source when building a connector, see the Crawler Source reference.
Configuration
Start URL
The start URL is the first URL that will be crawled. It is also the starting point for any subsequent pages crawled by a crawler. You can configure a crawler with more than one start URL.
The root domain of a start URL is referred to as the start domain.
Start URLs can be as broad or specific as you choose. All of these are examples of valid start URLs:
- yext.com
- blog.yext.com
- https://www.yext.com
- https://www.yext.com/blog/2023/05/why-tech-leaders-are-embracing-custom-built-dxp
The crawler will resolve the HTTPS protocol; you can include https:// in your start URL(s), but it is not necessary in order for the crawler to function.
The specific pages to crawl from the start URL are determined by the crawl strategy.
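As a rough illustration of how a start URL might be normalized and its start domain derived, here is a minimal Python sketch. The helper names are hypothetical, and the root-domain logic is deliberately naive (a production implementation would consult a public-suffix list):

```python
from urllib.parse import urlparse

def normalize_start_url(url: str) -> str:
    """Prepend https:// when no scheme is given, mirroring how the
    crawler resolves the protocol for you."""
    return url if url.startswith(("http://", "https://")) else f"https://{url}"

def start_domain(url: str) -> str:
    """Naive root-domain extraction: keep the last two host labels.
    Real implementations use a public-suffix list instead."""
    host = urlparse(normalize_start_url(url)).hostname or ""
    return ".".join(host.split(".")[-2:])

print(normalize_start_url("blog.yext.com"))        # https://blog.yext.com
print(start_domain("https://www.yext.com/blog"))   # yext.com
```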
Schedule
Crawlers can be scheduled to run Once, Daily, or Weekly. The default option is to run once.
If a daily or weekly schedule is set, a new crawl will be initiated exactly one day (or one week) after the previous crawl has finished.
All crawlers set to Daily or Weekly will have their schedules automatically changed to Once after the 14th day of inactivity.
A crawler is considered inactive if:
- The crawler is not linked to a connector
- The crawler is linked to a manually-run connector, but the connector has not been run in the past 14 days
- The crawler is linked to a manually-run connector, but the crawler configuration has not been viewed in the past 14 days
Inactive crawlers that have been automatically reset to Run Once will remain in the platform, and can still be viewed and run manually. You can also re-add a daily or weekly schedule to the crawler, but it will be reset back to Run Once if the crawler is inactive again.
Crawl Strategy
The crawl strategy determines which pages from the start URL should be crawled. A crawler can crawl every page on a start domain, only certain sub-pages, or specific URLs.
The process of detecting URLs to crawl is called spidering. The crawler detects pages to crawl via URLs referenced in href attributes within the HTML of your site (e.g., <a href="www.yext.com">). From there, the crawler spiders to each URL stored in an href attribute and repeats the process, up to 100,000 URLs (see the Crawler Limits reference).
All crawlers spider to find URLs in the same way. However, the crawl strategy will determine whether a certain URL detected during the spidering process should or should not be crawled.
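Conceptually, spidering amounts to pulling href values out of each fetched page and enqueuing any URL the crawl strategy allows. The sketch below is a simplified illustration using only the Python standard library, not Yext's actual crawler; the fetch() and allows() callables are hypothetical stand-ins for page retrieval and the configured crawl strategy:

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def spider(start_urls, fetch, allows, max_urls=100_000):
    """Breadth-first spidering: fetch a page, extract hrefs, and queue
    any URL the crawl strategy allows, up to the crawler's URL limit."""
    queue = deque(start_urls)
    seen = set(start_urls)
    while queue and len(seen) < max_urls:
        url = queue.popleft()
        parser = LinkExtractor()
        parser.feed(fetch(url))          # fetch() returns the page HTML
        for link in parser.links:
            if link not in seen and allows(link):
                seen.add(link)
                queue.append(link)
    return seen
```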
Below are the available crawl strategy options:
All Pages: Crawls all pages whose subdomain and root domain match those of the provided start URL, including pages that are at a higher subdirectory level than the start URL.
For example, for the start URL www.yext.com/blog, the subdomain is www., the root domain is yext.com, and the subdirectory is /blog. This means that selecting the All Pages crawl strategy for this start URL will crawl all pages on www.yext.com, not just subpages under www.yext.com/blog.
These URLs would be crawled:
- www.yext.com
- www.yext.com/blog/2023/04/use-cases-for-ai
- www.yext.com/platform
This URL would not be crawled:
- help.yext.com (the help. subdomain in this URL does not match the www. subdomain of the start URL)
Sub-Pages: Crawls all pages whose subdomain and root domain match those of the provided start URL and that are subpages of the start URL.
For the start URL www.yext.com/blog:
These URLs would be crawled:
- www.yext.com/blog
- www.yext.com/blog/2023/07/the-stages-of-search-denial
These URLs would not be crawled:
- www.yext.com
- www.yext.com/platform/content
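The two strategies above can be thought of as host and path checks against the start URL. Here is a minimal sketch of that idea; the helper names are hypothetical and the matching is simplified relative to the platform's exact rules:

```python
from urllib.parse import urlparse

def same_host(url: str, start_url: str) -> bool:
    """All Pages: the subdomain and root domain must match the start URL."""
    return urlparse(url).hostname == urlparse(start_url).hostname

def is_subpage(url: str, start_url: str) -> bool:
    """Sub-Pages: same host, and the path must sit under the start URL's path."""
    start = urlparse(start_url)
    candidate = urlparse(url)
    return (candidate.hostname == start.hostname
            and candidate.path.rstrip("/").startswith(start.path.rstrip("/")))

print(same_host("https://www.yext.com/platform", "https://www.yext.com/blog"))    # True
print(is_subpage("https://www.yext.com/platform", "https://www.yext.com/blog"))   # False
print(is_subpage("https://www.yext.com/blog/2023", "https://www.yext.com/blog"))  # True
```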
Subpage URL Structure (optional): Adding a subpage URL structure allows the crawler to spider to subpages that have a different URL structure than the start URL.
An example use case would be to capture sub-pages from two specific subdirectories on a website in one crawler (for example, if you wanted to capture sub-pages under www.yext.com/blog and www.yext.com/products but not www.yext.com/faq).
Specify the subpage structure using wildcard notation (e.g., www.yext.com/faq/posts/*).
For example, with a start URL of www.yext.com/blog/posts, adding a subpage URL structure of www.yext.com/faq/posts/* would produce these results:
These URLs would be crawled:
- www.yext.com/blog/posts/role-of-a-customer
- www.yext.com/faq/posts/2023
These URLs would not be crawled:
- www.yext.com/products/posts/content
- www.yext.com/faq
Adding a subpage URL structure will only broaden the scope of crawled URLs, not limit it. If you want to limit the scope, use the blacklisted URLs setting.
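One way to picture the wildcard notation is shell-style pattern matching, as in the sketch below. This is an illustration only; the platform's exact matching rules may differ, and strip_scheme is a hypothetical helper:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def strip_scheme(url: str) -> str:
    """Compare URLs without their scheme so 'www.yext.com/...' patterns apply."""
    parsed = urlparse(url if "://" in url else f"https://{url}")
    return f"{parsed.hostname}{parsed.path}"

def matches_subpage_structure(url: str, pattern: str) -> bool:
    """True when the URL fits a wildcard pattern such as 'www.yext.com/faq/posts/*'."""
    return fnmatch(strip_scheme(url), pattern)

print(matches_subpage_structure("https://www.yext.com/faq/posts/2023",
                                "www.yext.com/faq/posts/*"))   # True
print(matches_subpage_structure("https://www.yext.com/faq",
                                "www.yext.com/faq/posts/*"))   # False
```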
Specific Pages: Only the pages explicitly listed as a start URL will be crawled. The crawler will not spider to any subpages of the start URL(s).
File Types
Choose from the following options:
All File Types: All supported file types will be crawled if encountered. The current supported file types are HTML and PDF, but this setting will automatically include any file types that become supported in the future.
Select File Types: Only the file types selected will be crawled.
Note that the crawler is still able to detect and spider to URLs picked up from an unselected file type, without actually crawling that file. For example, if you want to only crawl PDFs, but you have PDFs linked on an HTML webpage, you can still select “PDF only.” The crawler will spider through the HTML links to find the PDF files, but will then crawl only the PDFs and exclude all HTML files.
Domains
A comma-separated list of start domains for the crawler to spider through.
Blacklisted URLs
Specify any URLs that should be excluded from being crawled, even if they would fall under the start URL and chosen crawl strategy.
For example, if the start URL is www.yext.com/blog and the crawl strategy is set to Sub-Pages, we can expect that the URL www.yext.com/blog/post/2023 would be crawled. To exclude this URL from the crawl, it should be added to the blacklist.
You can provide a list of exact URLs to blacklist, or use regex notation to specify URLs of a certain pattern. To learn more about regex notation, see Mozilla’s developer cheat sheet.
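A blacklist check might combine exact matches with regex patterns, roughly as in this sketch (illustrative only; the example pattern and helper are hypothetical):

```python
import re

def is_blacklisted(url: str, exact: set, patterns: list) -> bool:
    """Exclude a URL if it is listed exactly or matches any regex pattern."""
    return url in exact or any(re.search(p, url) for p in patterns)

exact_urls = {"https://www.yext.com/blog/post/2023"}
regex_patterns = [r"/blog/post/\d{4}"]  # any year-stamped blog post

print(is_blacklisted("https://www.yext.com/blog/post/2023", exact_urls, []))          # True
print(is_blacklisted("https://www.yext.com/blog/post/2022", set(), regex_patterns))   # True
print(is_blacklisted("https://www.yext.com/blog", exact_urls, regex_patterns))        # False
```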
Query Parameter Settings
Designate whether query parameters should be ignored when differentiating between crawled URLs.
- All: All query parameters are ignored
- None: No query parameters are ignored (even if parameters are listed under Specific Parameters in this setting)
- Specific Parameters: The parameters specified in this setting will be ignored.
Note that selecting None overrides any Specific Parameters settings: even if query parameters are entered under Specific Parameters, selecting None means those parameters will not be ignored.
This setting is most useful in the context of pages with duplicate content. For example, if https://www.yext.com?test=1 and https://www.yext.com?test=2 have the same content, ignoring the test parameter avoids crawling both of these URLs, since they resolve to the same page.
Example: A crawler starts on www.yext.com/blog and spiders to the following pages:
- www.yext.com/blog?utm_source=google
- https://yext.com?page=11&language=en
If the crawler is set to ignore All query parameters:
- Only https://yext.com will be crawled (www.yext.com/blog?utm_source=google becomes a duplicate of the already-crawled start URL)
If the crawler is set to ignore Specific Parameters and the language parameter (&language=en) is specified:
- www.yext.com/blog?utm_source=google will be crawled (utm_source is not an ignored parameter)
- https://yext.com?page=11 will be crawled (the ignored language parameter is dropped)
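Under the hood, ignoring query parameters can be thought of as normalizing each URL before comparing it against already-crawled URLs. The sketch below is an illustration only, not the platform's implementation; normalize_url and its mode values are hypothetical:

```python
from urllib.parse import urlparse, urlencode, urlunparse, parse_qsl

def normalize_url(url: str, mode: str, ignored=frozenset()) -> str:
    """Drop query parameters according to the crawler's setting:
    'all' removes every parameter, 'none' keeps them all, and
    'specific' removes only the parameters listed in `ignored`."""
    parsed = urlparse(url)
    if mode == "all":
        query = ""
    elif mode == "none":
        query = parsed.query
    else:  # 'specific'
        kept = [(k, v) for k, v in parse_qsl(parsed.query) if k not in ignored]
        query = urlencode(kept)
    return urlunparse(parsed._replace(query=query))

print(normalize_url("https://yext.com?page=11&language=en", "all"))
# https://yext.com
print(normalize_url("https://yext.com?page=11&language=en", "specific", {"language"}))
# https://yext.com?page=11
```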
Advanced Settings
Rate Limit
The rate limit is the maximum number of concurrent tasks allowed for our scrapers to execute on a site.
- Default: 100
- Minimum: 1
- Maximum: 1,500
The rate limit should be set based on the maximum number of concurrent scrapers your site can handle without impacting site performance.
Websites often have a pre-set rate limit, and when too many requests are received, a 429 error will occur. This error is surfaced in the Crawler "Status" column and signifies that a lower rate limit is necessary.
If your site can handle a faster rate limit, you can increase the value to increase the speed of the crawl.
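Conceptually, the rate limit caps how many page fetches run at the same time. The asyncio sketch below illustrates that idea with a semaphore; it is not Yext's implementation, and fetch_page is a hypothetical placeholder:

```python
import asyncio

RATE_LIMIT = 100  # maximum concurrent fetch tasks, per the crawler setting

async def fetch_page(url: str) -> str:
    """Placeholder for an HTTP fetch; sleep stands in for network time."""
    await asyncio.sleep(0.1)
    return f"<html>contents of {url}</html>"

async def crawl(urls):
    semaphore = asyncio.Semaphore(RATE_LIMIT)

    async def bounded_fetch(url: str) -> str:
        # Only RATE_LIMIT fetches may hold the semaphore at any moment.
        async with semaphore:
            return await fetch_page(url)

    return await asyncio.gather(*(bounded_fetch(u) for u in urls))

# asyncio.run(crawl(["https://www.yext.com/blog", "https://www.yext.com/platform"]))
```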
Max Depth
The number of levels past the start URL for the crawler to spider to.
- Default: 10
- Minimum: 0
- Maximum: 100
For example: if the start URL is www.yext.com/blog (depth = 0) and the specified Max Depth is 1, the crawler will spider to any URLs directly linked on the page www.yext.com/blog (for example, www.yext.com/blog/2023). At that point, the crawler has reached the max depth of 1 and will not continue to spider to any further URLs linked on www.yext.com/blog/2023.
If duplicate URLs are present within a single crawl, the system records the depth at which the URL was first encountered. For example, if yext.com/blog/2023 was encountered first at a depth of 2, and then later in the crawl at a depth of 9 (via a different spidering path), the system would consider the depth of that URL to be 2.
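The depth rule can be pictured as a breadth-first traversal in which a URL's depth is fixed the first time it is seen, as in this simplified sketch (fetch_links is a hypothetical function returning the URLs linked from a page):

```python
from collections import deque

def crawl_to_depth(start_url: str, fetch_links, max_depth: int = 10) -> dict:
    """Breadth-first crawl that records each URL at the depth where it was
    first encountered and never spiders past max_depth."""
    depths = {start_url: 0}          # first-seen depth per URL
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = depths[url]
        if depth >= max_depth:
            continue                 # reached max depth; do not spider further
        for link in fetch_links(url):
            if link not in depths:   # later, deeper sightings keep the first depth
                depths[link] = depth + 1
                queue.append(link)
    return depths
```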
Additional Notes
Crawler Overview Page
The Status bar displays details for the most recent crawl.
The Pages tab displays all unique pages crawled in all crawls. Specific details for a given crawled page reflect the most recent crawl of that page.
Crawl Details Page
Displays details for a single crawl. The pages listed here are all pages crawled for that individual crawl.
Duplicate URLs
If a URL is encountered more than once in a single crawl, we will only crawl the first instance of that URL. The first instance of a duplicate URL is also the only one that will be listed on the Pages tab of the Crawler Overview page.
Limitations
For a full list of system limitations, see the Connector System Limits reference.