Create a Crawler | Yext Hitchhikers Platform
What You’ll Learn
In this section, you will learn:
- How to create a Crawler
- An overview of Crawler settings
- Additional details on the Crawler functionality
The Crawler is one of the Sources in the Data Connectors framework that will help you scrape web pages to extract information to store in Yext Content. Before you can add a Data Connector using the Crawler, you have to create a Crawler.
Once you create a Crawler in your account, it will try to scrape web pages on the specified domain. You can then proceed to the Add Data flow to parse through the raw HTML and convert the scraped data into entities in the Yext platform.
Whitelist the Yext Crawler
Before setting up a crawler for your website, you need to ensure that the Yext Crawler is whitelisted to access your web pages. We ask that you whitelist both our Crawler’s user agent and its IP addresses.
The Yext Crawler uses the following user agent:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/87.0.4280.88 YextBot/Java Safari/537.36
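If your site filters traffic by user agent, the check might look roughly like the sketch below. This is only an illustration: the function name `is_yext_crawler` is hypothetical, and matching on the `YextBot` token (rather than the full string) is an assumption made here so that the check would survive a change in the HeadlessChrome version number. Your firewall or CDN will have its own way of expressing such a rule.

```python
# Hypothetical helper: decide whether a request's User-Agent header
# looks like the Yext Crawler. Matching on the "YextBot" token is an
# assumption of this sketch, not a rule published in this guide.
YEXT_UA_TOKEN = "YextBot"


def is_yext_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent string contains the YextBot token."""
    return YEXT_UA_TOKEN.lower() in (user_agent or "").lower()


# The user agent string listed in this guide.
yext_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "HeadlessChrome/87.0.4280.88 YextBot/Java Safari/537.36"
)

print(is_yext_crawler(yext_ua))
print(is_yext_crawler("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"))
```

A user agent can be spoofed by any client, which is why allowing the Crawler's IP addresses as well gives a stronger guarantee.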
The Yext Crawler uses the following IP addresses:
Create a Crawler
- Click Content in the navigation bar.
- Click Configuration.
- Click Crawlers under the Data Ingestion & Processing section.
- Click on the + New Crawler button.
- Enter a name for your Crawler. We recommend something that clearly designates the site you plan to crawl, so you can easily distinguish it from other Crawlers you may set up.
- Select the schedule of how often you would like the crawler to run: Once, Daily, or Weekly.
- Select your Crawl Strategy — whether you want to crawl all pages, sub pages, or specific pages.
- Select your File Type: this is where you specify which file types the crawler should crawl if it encounters them.
  - If you select All File Types, all supported file types will be crawled, including any added in the future.
  - If you select Select File Types, you can choose HTML, PDF, or both, and the crawler will only crawl the selected file types.
Pages or Domains to Crawl
Enter the Pages or Domains you would like to crawl. Note that when you enter a domain, the crawler will crawl that domain along with any pages on the same domain that it can reach by following links (spidering).
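To make the idea of spidering within one domain concrete, here is a minimal sketch of how links found on a page could be filtered down to those on the same domain. It is not the Yext Crawler's implementation. The function name `same_domain_links` is hypothetical, and comparing exact hostnames is an assumption: this guide does not say how subdomains are treated.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(page_url, hrefs):
    """Resolve each href against page_url and keep only same-host links.

    Assumption of this sketch: "same domain" means an exact hostname
    match, so a different subdomain is treated as a different domain.
    """
    base_host = urlparse(page_url).hostname
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # handles relative links
        if urlparse(absolute).hostname == base_host:
            kept.append(absolute)
    return kept


links = same_domain_links(
    "https://www.example.com/a/",
    ["/b", "c", "https://other.example.org/x"],
)
print(links)
```

In this example the two relative links resolve to pages on `www.example.com` and are kept, while the link to `other.example.org` is dropped.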
In this section you will also acknowledge that this crawler functionality may only be used on websites or domains that you own and operate, and it may not be used on any third-party websites or domains.
URLs and Query Parameters to Omit from Crawls
In the Blacklisted URLs field you have the option to add any domains that you want to exclude from the crawl.
You can also select query parameters to ignore when crawling pages.
By default, this is set to ‘None’. If you want to exclude parameters, click on the Ignore Query Parameters field and choose to ignore ‘All’ or ‘Specific Parameters’. Then, in the Ignore Query Parameters List, enter the query parameter you would like to omit from the crawl. You can add additional parameters by clicking on the + Add Another Ignored Query Parameter link.
This is especially helpful for omitting parameters that would duplicate data already crawled under another parameter.
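The effect of ignoring query parameters can be pictured as URL normalization: two URLs that differ only in an ignored parameter collapse to the same URL. The sketch below illustrates that idea under stated assumptions. The function `strip_query_params` and its `ignore` and `names` arguments are hypothetical names that mirror the ‘None’, ‘All’, and ‘Specific Parameters’ options; this guide does not describe how the Crawler handles the setting internally.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def strip_query_params(url, ignore="none", names=()):
    """Return url with query parameters removed according to `ignore`.

    ignore="none"     -> keep every parameter
    ignore="all"      -> drop every parameter
    ignore="specific" -> drop only the parameters listed in `names`
    """
    parts = urlparse(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    if ignore == "all":
        kept = []
    elif ignore == "specific":
        kept = [(k, v) for k, v in params if k not in names]
    else:
        kept = params
    return urlunparse(parts._replace(query=urlencode(kept)))


url = "https://example.com/p?id=1&utm_source=x"
print(strip_query_params(url))
print(strip_query_params(url, "all"))
print(strip_query_params(url, "specific", {"utm_source"}))
```

With `utm_source` ignored, `https://example.com/p?id=1&utm_source=x` and `https://example.com/p?id=1` normalize to the same URL, so the page would only need to be crawled once.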
Advanced Crawler Settings
Here you can add a rate limit, which lets you set the maximum number of concurrent requests that can run against your site. This helps ensure that your site is not crawled too quickly and is not overwhelmed by requests from crawlers.
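A limit on concurrent requests behaves like a counter of available slots: a request may only start when a slot is free. The sketch below shows that general technique with a semaphore. It is a generic illustration, not the Crawler's actual mechanism, and `crawl_with_limit`, `fetch`, and `max_concurrent` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def crawl_with_limit(urls, fetch, max_concurrent):
    """Call fetch(url) for every URL with at most max_concurrent in flight."""
    limiter = threading.Semaphore(max_concurrent)

    def limited_fetch(url):
        # Block here until one of the max_concurrent slots is free.
        with limiter:
            return fetch(url)

    # The pool may hold more threads than the limit; the semaphore is
    # what actually caps the number of simultaneous fetches.
    with ThreadPoolExecutor(max_workers=max(len(urls), 1)) as pool:
        return list(pool.map(limited_fetch, urls))


# Stand-in for a real HTTP request, so the sketch runs without a network.
def fake_fetch(url):
    return f"fetched {url}"


pages = crawl_with_limit(["/a", "/b", "/c"], fake_fetch, max_concurrent=2)
print(pages)
```

A lower limit makes a crawl take longer but puts less load on the site at any given moment.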
Once you have entered all of the relevant details, click Save Crawler.
View the Details of a Crawler
Once your crawler has been created, you can click View Details to view the settings, as well as a history of its previous and active crawls.
The Settings tab gives you an overview of your Crawler’s configuration.
The Crawls tab provides a list of all crawls completed by this Crawler. This includes the status, start date, end date, and the number of successfully and unsuccessfully crawled pages.
You can also click into any individual crawl to see exactly which pages were successfully crawled during that crawl and whether there were any failures.
The Pages tab gives you a complete list of all crawled pages and their status.
From this page, you can also click View HTML to view the raw HTML of the crawled page.
Managing a Crawler
From the Settings tab in the Crawler Details you can easily make adjustments to any of the Crawler settings.
Delete or Disable a Crawler
From the Crawler’s page, you have the option to Disable or Delete any Crawler.
Disabling a Crawler will halt any in-progress crawls, but you can enable it again at any time. Deleting a Crawler will permanently remove it, as well as delete any raw HTML data that was found by the crawler. Data that is already saved within existing entities in Content will not be removed.
Crawler Best Practices
If your business has multiple brands or domains, you can make as many crawlers in your account as you’d like. We recommend making a unique crawler with its own settings for each brand or domain you want to crawl.
Crawler Functionality and Nuances
Some helpful information to know about the Crawler’s functionality:
- The Crawler can only crawl public sites, and cannot crawl internal or locked parts of your website.
- The Crawler only crawls desktop versions of sites. We may allow specifying which version of the site to crawl in the future.