Crawler Source | Yext Hitchhikers Platform

Use the Crawler source on a connector to ingest data from webpages crawled by the Yext Crawler. For information about configuring a crawler, see the Crawler Product reference.

The connector will pull in data from the most recently completed crawl of the selected crawler.

Options for configuration are outlined below.

File Types

Select the types of files (HTML and/or PDF) to bring in via the connector.

This setting does not have to match the crawler's own file type setting, but the connector can only ingest file types that the crawler actually crawled.

For example: if a crawler is configured to crawl both HTML and PDF files, a connector using that crawler as a source can be set to ingest only HTML files. However, if the crawler was set to crawl only HTML files, a connector configured to ingest only PDFs would bring no new data into Yext Content.

URLs

Choose from one of the following configuration options:

All URLs Crawled

All URLs present in the most recently completed crawl.

Specific URLs or URL Patterns

Specify a comma-separated list of exact URLs or wildcard URL patterns. Matching URLs will be ingested if they are present in the most recent crawl.

For example: If https://faqs.yext.com is specified, then:

  • https://faqs.yext.com/blogs/1 will be included
  • https://pages.yext.com/blogs/1 will not be included
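The matching behavior in the example above can be sketched in Python. This is a simplified illustration (not the Yext implementation): it assumes a spec containing `*` acts as a wildcard pattern and any other spec acts as a URL prefix.

```python
from fnmatch import fnmatchcase

def url_matches(url: str, spec: str) -> bool:
    """Match a crawled URL against one entry from the comma-separated list.

    Assumption for illustration only: a spec containing '*' is treated as a
    wildcard pattern; any other spec is treated as a prefix, mirroring the
    https://faqs.yext.com example above.
    """
    if "*" in spec:
        return fnmatchcase(url, spec)
    return url.startswith(spec)

# With "https://faqs.yext.com" specified:
url_matches("https://faqs.yext.com/blogs/1", "https://faqs.yext.com")   # True
url_matches("https://pages.yext.com/blogs/1", "https://faqs.yext.com")  # False
```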

Page Type

List Page

Multiple entities are contained on a single URL. This option is only supported for HTML file types (see PDF Support below).

The entity container is specified by a CSS or XPath expression. This should correspond to an outer container that includes all the information to extract for each entity.
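To illustrate the idea of an entity container, here is a minimal sketch using Python's standard-library XML parser on hypothetical list-page markup (the markup, class names, and field structure are invented for this example; real pages would be selected via the expression you configure):

```python
import xml.etree.ElementTree as ET

# Hypothetical list-page markup: each <li class="entity"> is one
# entity container holding all the fields to extract for that entity.
html = """
<ul>
  <li class="entity"><h2>Store A</h2><p>123 Main St</p></li>
  <li class="entity"><h2>Store B</h2><p>456 Oak Ave</p></li>
</ul>
"""

root = ET.fromstring(html)
# An XPath-style expression selecting the outer container for each entity;
# field selectors are then evaluated relative to each container.
containers = root.findall(".//li[@class='entity']")
names = [c.find("h2").text for c in containers]
# names == ["Store A", "Store B"]
```

Choosing an outer container that wraps all of an entity's fields matters because every field selector is resolved within that container, one entity per match.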

Detail Page

Each page corresponds to the data for a single entity.

PDF Support

If the File Type settings for the connector include PDFs (i.e., the chosen setting is PDF Only or PDF and HTML files), the Page Type setting must be Detail Page.

This is because each PDF is treated as a single entity when its data is extracted and formatted into structured entities: the PDF's metadata is read into fields, and the entire body of the PDF is ingested as unstructured content.

HTML-specific selectors, such as CSS or XPath selectors, are not compatible with PDFs. If you create a connector that allows both HTML and PDF file types, you can still add these selectors, but the corresponding fields will be blank for PDFs.

Selectors

Extract data from the crawled webpages using the following methods, depending on the file types included in the connector:

HTML

  • CSS Path
  • XPath
  • Page ID — this corresponds to the unique PageID as referenced in the crawler
  • Page URL
  • Page Title — this is extracted from the <title> element in the HTML <head> of the crawled page
  • Cleaned Body Content
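As an illustration of what the Page Title selector extracts, the sketch below pulls the `<title>` text from a page using only Python's standard library (the sample markup is hypothetical; this is a concept demo, not the crawler's internals):

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collect the text inside the <title> tag, as the Page Title selector does."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Hypothetical crawled page.
page = "<html><head><title>FAQ | Yext</title></head><body>...</body></html>"
parser = TitleExtractor()
parser.feed(page)
# parser.title == "FAQ | Yext"
```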

PDF

  • Page ID — this corresponds to the unique PageID as referenced in the crawler
  • Page URL
  • Page Title — this is extracted from the <title> HTML element within the metadata of the crawled page
  • Cleaned Body Content
  • Author
  • Created Date

If both HTML and PDF file types are selected as supported file types, all HTML-specific selectors will be available but will return blank values for any PDFs. The Author and Created Date selectors will not be available.