Yext Crawler

The Yext Crawler helps you automatically populate your Knowledge Graph based on websites that you can crawl.
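For orientation, here is a minimal sketch of a crawler configuration assembled from the fields documented below. The field values are illustrative, and any surrounding resource envelope (file name, schema pointer, account wiring) is omitted because it is not covered by this reference:

```json
{
  "$id": "example-crawler",
  "name": "Example Site Crawler",
  "enabled": true,
  "crawlSchedule": "weekly",
  "crawlStrategy": "subPages",
  "domains": ["https://www.example.com"],
  "rateLimit": 100,
  "maxDepth": 10
}
```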

$id (string)

The unique identifier for the Yext Crawler resource.

name (string) Required

The display name of the crawler.

enabled (boolean)

Default: true

If true, the crawler will run according to its crawl schedule.

crawlSchedule (enum of string)

Default: "weekly"

Defines how often the crawler will index the website.

Must be one of:

  • "once"
  • "daily"
  • "weekly"

crawlStrategy (enum of string)

Default: "subPages"

Specifies the crawl strategy of the crawler.

Must be one of:

  • "allPages"
  • "subPages"
  • "specificPages"

domains (array of string)

A list of domains or URLs to crawl, e.g. https://www.example.com

Must contain a minimum of 1 item

Each item of this array must be:

Type: string

ignoreQueryParameterOption (enum of string)

Default: "none"

Option for ignoring query parameters when differentiating crawled URLs.

Must be one of:

  • "none"
  • "all"
  • "specificParameters"

ignoreQueryParametersList (array of string)

Any query parameters specified in the list will be ignored when differentiating crawled URLs.

Each item of this array must be:

Type: string
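The two query-parameter fields work together. A sketch that ignores only specific parameters (the parameter names below are examples, not defaults):

```json
{
  "ignoreQueryParameterOption": "specificParameters",
  "ignoreQueryParametersList": ["utm_source", "utm_campaign", "sessionid"]
}
```

With this configuration, https://www.example.com/page?utm_source=ad and https://www.example.com/page would be treated as the same URL when differentiating crawled pages.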

blacklistedUrls (array of string)

Any URLs that match any regex rule in the list will be omitted from the crawl.

Each item of this array must be:

Type: string
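Because each entry is a regex rule, literal dots and other metacharacters should be escaped. A sketch that omits login and print-view pages (the patterns are examples):

```json
{
  "blacklistedUrls": [
    "https://www\\.example\\.com/login.*",
    ".*\\?print=true.*"
  ]
}
```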

subPagesUrlStructures (array of string)

Specified wildcard URLs will also be considered when using the Sub Pages crawl strategy, e.g. www.yext.com/bad-website/blog/*

Each item of this array must be:

Type: string
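A sketch combining the Sub Pages strategy with an extra wildcard structure, so the crawl also follows pages under a path outside the root URL (the domains and paths are illustrative):

```json
{
  "crawlStrategy": "subPages",
  "domains": ["https://www.example.com/products"],
  "subPagesUrlStructures": ["www.example.com/blog/*"]
}
```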

headers (array of object)

Custom header values that will be passed to each crawled page.

Each item of this array must be:

Type: object, with two required string fields: the header's name and its value.
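A minimal sketch of a headers entry. This reference does not name the object's two required string fields, so the keys below are an assumption, shown only to illustrate the shape:

```jsonc
{
  "headers": [
    {
      // "name" and "value" are assumed keys; the schema above only
      // specifies two required strings (the header name and its value)
      "name": "X-Crawler-Token",
      "value": "example-token"
    }
  ]
}
```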

fileTypes

Default: "allTypes"

Specifies which file types, if encountered, to crawl. The value must take one of the following two forms:

  • Type: array of enum (of string). Specifies which file types should be crawled. Must contain a minimum of 1 item, and each item must be one of:
      • "HTML"
      • "PDF"
  • Type: const. Specific value: "allTypes". If selected, all supported file types will be crawled, including any added in the future.
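A sketch of the two accepted shapes. To crawl only specific file types:

```json
{ "fileTypes": ["HTML", "PDF"] }
```

Or, to crawl every supported file type, including any added in the future:

```json
{ "fileTypes": "allTypes" }
```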

rateLimit (integer)

Default: 100

Specifies the maximum number of concurrent crawls.

Value must be greater than or equal to 1 and less than or equal to 15000

maxDepth (integer)

Default: 10

Specifies the number of levels past your root URLs for the crawler to index.

Value must be greater than or equal to 0 and less than or equal to 100
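As a closing sketch, a conservative configuration tuned well inside the documented bounds (1 to 15000 for rateLimit, 0 to 100 for maxDepth); the values are illustrative:

```json
{
  "rateLimit": 25,
  "maxDepth": 3
}
```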