Step 3: Domains Source Type Settings
If you’ve chosen Domains as your source type, choose your crawl strategy and specify the pages to include in your crawl. Then, choose how you want to handle query parameters on any crawled URLs.
If you’re using the Sitemap source type, continue to the next step of this guide.
Crawl Strategy
The crawl strategy determines which pages are crawled from your specified web domain.
A crawler can crawl all pages on the domain, sub-pages (e.g., if the specified domain is www.galaxygrill.com, then www.galaxygrill.com/locations would be a sub-page), or only specific URLs.
From the Crawl Strategy dropdown, choose from the following options:
- All Pages
- Sub Pages
- Specific URLs
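To make the difference between these strategies concrete, here is a rough Python sketch (purely illustrative, not the crawler's actual logic) of how each strategy might decide whether a discovered URL is eligible. The names start_url and specific_urls stand in for the values you configure:

```python
from urllib.parse import urlparse

def should_crawl(url: str, strategy: str, start_url: str, specific_urls: set[str]) -> bool:
    """Illustrative only: decide whether a discovered URL falls under a strategy."""
    start, found = urlparse(start_url), urlparse(url)
    if strategy == "All Pages":
        # Any page on the same domain is eligible.
        return found.netloc == start.netloc
    if strategy == "Sub Pages":
        # Only pages at or below the start URL's path are eligible.
        return found.netloc == start.netloc and found.path.startswith(start.path)
    if strategy == "Specific URLs":
        # Only the URLs listed explicitly are eligible.
        return url in specific_urls
    return False

# With start URL https://www.galaxygrill.com, "All Pages" and "Sub Pages" would both
# allow https://www.galaxygrill.com/locations; "Specific URLs" only allows listed URLs.
```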
Which Pages or Domains Would You Like to Crawl?
Under the Pages or domains to crawl option, specify the URL(s) where you want the crawler to begin crawling and discovering other pages to crawl. These URLs are also referred to as start domains or start URLs.
Sub-Pages URL Structure
You can use the Sub-pages URL structure setting if you want to crawl pages on certain sub-directories of a web domain while excluding others. This setting is optional and works with all crawl strategies.
Specify your sub-page URL structure with wildcard notation (e.g., www.galaxygrill.com/faq/*).
For example, if your start URL is www.galaxygrill.com and you want to crawl pages under www.galaxygrill.com/faq but not www.galaxygrill.com/locations, you could set your sub-pages URL structure to www.galaxygrill.com/faq/*.
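The crawler applies the wildcard pattern for you, but as an illustration of how a pattern like www.galaxygrill.com/faq/* selects URLs, here is a short Python sketch using fnmatch-style matching (whether the crawler matches wildcards exactly this way is an assumption):

```python
from fnmatch import fnmatch

pattern = "www.galaxygrill.com/faq/*"

urls = [
    "www.galaxygrill.com/faq/hours",
    "www.galaxygrill.com/faq/menu/allergens",
    "www.galaxygrill.com/locations",
]

for url in urls:
    # fnmatch treats "*" as "match anything", similar to the wildcard notation above.
    print(url, "->", "crawled" if fnmatch(url, pattern) else "skipped")
```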
Which URLs Should Be Omitted from the Crawl?
In this section, specify any URLs you want to blacklist and configure your query parameter settings. Both of these are optional.
Blacklisted URLs
Exclude specific URLs from a crawl, even if they match your chosen crawl strategy and other settings. Enter each URL to blacklist on a separate line. You can also use wildcard notation here.
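As a hypothetical sketch of the effect (again using fnmatch-style wildcards, with example patterns that are not from this guide), blacklisted patterns drop URLs that would otherwise be crawled:

```python
from fnmatch import fnmatch

blacklist = ["www.galaxygrill.com/faq/archived/*", "www.galaxygrill.com/careers"]

def is_blacklisted(url: str) -> bool:
    # A URL is excluded if it matches any blacklisted pattern,
    # even when it matches the crawl strategy and sub-page settings.
    return any(fnmatch(url, pattern) for pattern in blacklist)

print(is_blacklisted("www.galaxygrill.com/faq/archived/2019-menu"))  # True: excluded
print(is_blacklisted("www.galaxygrill.com/faq/hours"))               # False: still crawled
```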
Query Parameter Settings
Choose how query parameters should be handled on any crawled URLs.
For example, if query parameters are ignored, the crawler would see www.yext.com/blog and www.yext.com/blog?utm_source=google as the same URL. In this case, ignoring the query parameters would prevent the crawler from crawling that page more than once.
Choose from the following settings:
- All: All query parameters are ignored.
- None: No query parameters are ignored.
- Specific Parameters: The parameters specified in this setting are ignored.
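To make the three options concrete, here is an illustrative Python sketch (not the crawler's actual code) of how a URL could be normalized under each setting before URLs are compared:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url: str, setting: str, specific: set[str] = frozenset()) -> str:
    """Illustrative only: drop ignored query parameters before comparing URLs."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    if setting == "All":
        params = []  # ignore every query parameter
    elif setting == "Specific Parameters":
        params = [(k, v) for k, v in params if k not in specific]  # ignore only the listed ones
    # "None": keep all parameters as-is
    return urlunsplit(parts._replace(query=urlencode(params)))

url = "https://www.yext.com/blog?utm_source=google"
print(normalize(url, "All"))                                  # https://www.yext.com/blog
print(normalize(url, "None"))                                 # unchanged
print(normalize(url, "Specific Parameters", {"utm_source"}))  # https://www.yext.com/blog
```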
Advanced Crawler Settings
Set the rate limit and maximum depth of pages to be crawled from the start URL.
The rate limit determines how many tasks the crawler can execute on a site at one time without impacting site performance.
By default, this is set to 100. You may need to consult your web team to determine an ideal rate limit for your site.
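Conceptually, the rate limit caps how much work runs against your site at once. As a rough, hypothetical sketch of that idea (not the crawler's implementation), a semaphore sized to the rate limit would bound concurrent fetches:

```python
import asyncio

RATE_LIMIT = 100  # the default value described above; illustrative

async def fetch(url: str, limiter: asyncio.Semaphore) -> None:
    async with limiter:            # at most RATE_LIMIT fetches run concurrently
        await asyncio.sleep(0.1)   # stand-in for the real HTTP request
        print("fetched", url)

async def crawl(urls: list[str]) -> None:
    limiter = asyncio.Semaphore(RATE_LIMIT)
    await asyncio.gather(*(fetch(url, limiter) for url in urls))

asyncio.run(crawl([f"https://www.galaxygrill.com/page/{i}" for i in range(250)]))
```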
The max depth is the number of levels past the start URL that the crawler will spider to. For example, if your start URL is www.galaxygrill.com and this page links to www.galaxygrill.com/faq, the faq page would be one level beyond the start URL.
By default, the max depth is set to 10.
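To illustrate what depth means here, the sketch below walks links breadth-first and stops following them once the configured max depth is reached. The get_links helper is hypothetical; a real crawler would fetch the page and extract its links:

```python
from collections import deque

MAX_DEPTH = 10  # the default described above

def get_links(url: str) -> list[str]:
    """Hypothetical helper: return the URLs linked from a page."""
    return []

def crawl(start_url: str, max_depth: int = MAX_DEPTH) -> None:
    seen = {start_url}
    queue = deque([(start_url, 0)])   # the start URL is depth 0
    while queue:
        url, depth = queue.popleft()
        print(f"depth {depth}: {url}")
        if depth == max_depth:
            continue                  # do not follow links beyond the max depth
        for link in get_links(url):
            if link not in seen:      # e.g. www.galaxygrill.com/faq would be depth 1
                seen.add(link)
                queue.append((link, depth + 1))

crawl("https://www.galaxygrill.com")
```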
Save Crawler
Once you’ve configured your crawler settings, click Save Crawler at the bottom of the screen. Then, continue to the Create Connector step of this guide.