Create a Crawler | Hitchhikers Platform

What You’ll Learn

In this section, you will learn:

  • How to create a Crawler
  • An overview of Crawler settings
  • Additional details on the Crawler functionality

Overview

The Crawler is one of the Sources in the Data Connectors framework that will help you scrape web pages to extract content to store in the Knowledge Graph. Before you can add a Data Connector using the Crawler, you have to create a Crawler.

Once you create a Crawler in your account, it will try to scrape web pages on the specified domain. You can then proceed to the Add Data flow to parse through the raw HTML and convert the scraped data into entities in the Knowledge Graph.

Whitelist the Yext Crawler

Before setting up a crawler for your website, you need to ensure that the Yext Crawler is properly whitelisted to access your web pages. We ask that you whitelist both our Crawler’s user agent and its IP addresses.

User Agent:

The Yext Crawler uses the following user agent:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/87.0.4280.88 YextBot/Java Safari/537.36

IPs:

The Yext Crawler uses the following IP addresses:

  • 54.204.19.87
  • 50.19.160.200
  • 34.198.218.97
  • 54.221.171.225
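
If you manage your own allowlist, a quick sanity check like the following can confirm that a request matches both the user agent and an IP from the lists above. This is purely an illustrative Python sketch, not a Yext-provided tool:

```python
# Illustrative check: does a request look like it came from the Yext Crawler?
# The user agent token and IPs are taken from the lists above.
YEXT_BOT_UA_TOKEN = "YextBot"
YEXT_CRAWLER_IPS = {
    "54.204.19.87",
    "50.19.160.200",
    "34.198.218.97",
    "54.221.171.225",
}

def is_yext_crawler(user_agent: str, source_ip: str) -> bool:
    """Return True only when both the user agent and the source IP match."""
    return YEXT_BOT_UA_TOKEN in user_agent and source_ip in YEXT_CRAWLER_IPS

ua = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) HeadlessChrome/87.0.4280.88 "
      "YextBot/Java Safari/537.36")
print(is_yext_crawler(ua, "54.204.19.87"))  # True
```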

Create a Crawler

To create a Crawler:

  1. Hover over Knowledge Graph in the top navigation bar and click on the Configuration sub-tab.
  2. Click Crawlers in the sidebar.
  3. Click on the + New Crawler button.

Basic Information

  1. Enter a name for your Crawler. We recommend something that clearly designates the site you plan to crawl, so you can easily distinguish it from other Crawlers you may set up.
  2. Select how often you would like the crawler to run: Once, Daily, or Weekly.
  3. Select your Crawl Strategy — whether you want to crawl all pages, sub pages, or specific pages.
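
The three Crawl Strategy options roughly correspond to different URL-matching rules. The sketch below is illustrative only — the strategy names and example paths are placeholders, not the platform’s internal identifiers:

```python
def matches_strategy(url_path: str, strategy: str, targets: list[str]) -> bool:
    """Decide whether a path is in scope under a (hypothetical) strategy name."""
    if strategy == "all_pages":       # crawl every page on the domain
        return True
    if strategy == "sub_pages":       # crawl the targets and anything beneath them
        return any(url_path.startswith(t) for t in targets)
    if strategy == "specific_pages":  # crawl only the exact pages listed
        return url_path in targets
    return False

print(matches_strategy("/blog/post-1", "sub_pages", ["/blog"]))  # True
print(matches_strategy("/about", "specific_pages", ["/about"]))  # True
```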

Pages or Domains to Crawl

Enter the Pages or Domains you would like to crawl. Note that when you enter a domain, any pages on that domain that can be reached by spidering will also be crawled.
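
The same-domain rule can be sketched as a simple host comparison. This is an illustrative Python snippet (example.com is a placeholder domain), not the Crawler’s actual logic:

```python
from urllib.parse import urlparse

def in_crawl_scope(link: str, crawl_domain: str) -> bool:
    """A discovered link is in scope when its host matches the configured domain."""
    return urlparse(link).netloc == crawl_domain

print(in_crawl_scope("https://www.example.com/about", "www.example.com"))  # True
print(in_crawl_scope("https://other.com/page", "www.example.com"))         # False
```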

In this section you will also acknowledge that this crawler functionality may only be used on websites or domains that you own and operate, and it may not be used on any third-party websites or domains.

URLs and Query Parameters to Omit from Crawls

In the Blacklisted URLs field, you have the option to add any domains that you want to exclude from the crawl.

You can also select query parameters to ignore when crawling pages.

By default, this is set to ‘None’. If you want to exclude parameters, click on the Ignore Query Parameters field and select whether to ignore ‘All’ or ‘Specific Parameters’. Then, in the Ignore Query Parameters List, enter the query parameter you would like to omit from the crawl. You can add additional parameters by clicking on the + Add Another Ignored Query Parameter link.

This is especially helpful for omitting parameters that would cause the same content to be crawled more than once under different URLs.
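
To see why ignoring a parameter prevents duplicates, consider how two URLs that differ only in an ignored parameter collapse to the same normalized URL. The Python sketch below is illustrative; the URLs and parameter names are made up:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def strip_ignored_params(url: str, ignored: set[str]) -> str:
    """Drop ignored query parameters so equivalent URLs collapse to one."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in ignored]
    return urlunparse(parts._replace(query=urlencode(kept)))

a = strip_ignored_params("https://www.example.com/p?id=1&utm_source=mail", {"utm_source"})
b = strip_ignored_params("https://www.example.com/p?id=1&utm_source=ad", {"utm_source"})
print(a == b)  # True: both normalize to https://www.example.com/p?id=1
```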

Once you have entered all of the relevant details, click Save Crawler.

View the Details of a Crawler

Once your crawler has been created, you can click View Details to view the settings, as well as a history of its previous and active crawls.

The Settings tab gives you an overview of your Crawler’s configuration.

The Crawls tab provides a list of all crawls completed by this Crawler. This includes the status, start date, end date, and the number of successfully and unsuccessfully crawled pages.

You can also click into any individual crawl to see exactly which pages were successfully crawled during that specific crawl and whether there were any failures.

The Pages tab gives you a complete list of all crawled pages and their statuses.

From this page, you can also click View HTML to view the raw HTML of the crawled page.

Managing a Crawler

From the Settings tab in the Crawler Details, you can easily make adjustments to any of the Crawler settings. You can make these adjustments in the table format, or by editing the configuration in the JSON editor.

Delete or Disable a Crawler

From the Crawler’s page, you have the option to Disable or Delete any Crawler.

Disabling a Crawler will halt any in-progress crawls, but you can enable it again at any time. Deleting a Crawler will permanently remove it, as well as delete any raw HTML data that was found by the crawler. Data saved within entities in the Knowledge Graph will not be removed.

Crawler Best Practices

If your business has multiple brands or domains, you can make as many crawlers in your account as you’d like. We recommend making a unique crawler with its own settings for each brand or domain you want to crawl.

Crawler Functionality and Nuances

Some helpful information to know about the Crawler’s functionality:

  • The Crawler can only crawl public sites, and cannot crawl internal or locked parts of your website.
  • The Crawler only crawls desktop versions of sites. We may allow specifying which version of the site to crawl in the future.
  • Crawlers are limited to 100,000 pages and 10 page depth levels.
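
The page and depth limits can be pictured as a breadth-first traversal that stops following links past the depth cap and stops entirely at the page cap. This is an illustrative sketch, not the Crawler’s actual implementation:

```python
from collections import deque

def crawl_order(start, links, max_pages=100_000, max_depth=10):
    """Breadth-first page visit order honoring page and depth limits."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        page, depth = queue.popleft()
        order.append(page)
        if depth >= max_depth:
            continue  # don't follow links past the depth limit
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

# Tiny placeholder link graph: "/" links to "/a" and "/b"; "/a" links to "/a/1".
site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": []}
print(crawl_order("/", site, max_depth=1))  # ['/', '/a', '/b']
```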