Crawler PDF Support (Spring '22 Release)

We now support ingesting PDFs so content from PDFs can be stored in the Yext Knowledge Graph and searched by Answers via Document Search.

When you create a Crawler, you will see a new File Types field. Here, you can select either All File Types or Select File Types.

  • If you select All File Types, all supported file types will be crawled, including any added in the future.
  • If you select Select File Types, you can then choose HTML, PDF, or both, and the Crawler will crawl only the selected file types.
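
The difference between the two options can be pictured with a small sketch. This is only an illustrative Python model, not the Yext API: the names `SUPPORTED_FILE_TYPES` and `file_types_to_crawl`, and the mode strings "ALL" and "SELECT", are hypothetical.

```python
# Illustrative model only; not the Yext API. All names here are hypothetical.
SUPPORTED_FILE_TYPES = {"HTML", "PDF"}  # the file types named in this release


def file_types_to_crawl(mode, selected=frozenset(), supported=SUPPORTED_FILE_TYPES):
    """Return the set of file types a crawler with this setting would crawl."""
    if mode == "ALL":
        # All File Types: everything supported, so newly supported types
        # are picked up automatically when `supported` grows.
        return set(supported)
    if mode == "SELECT":
        # Select File Types: only the explicitly chosen types.
        return set(selected) & set(supported)
    raise ValueError(f"unknown mode: {mode!r}")


print(sorted(file_types_to_crawl("ALL")))               # ['HTML', 'PDF']
print(sorted(file_types_to_crawl("SELECT", {"HTML"})))  # ['HTML']
```

In this model, a crawler left on "SELECT" with only HTML chosen never picks up PDFs until its selection is changed.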

All existing Crawlers will be set to Select File Types with only HTML selected. To include PDFs in an existing Crawler, you’ll need to update its settings and initiate a new crawl.

Under basic settings, you can select All File Types, or select specific file types to crawl.

[Screenshots: All File Types; Selected File Types]

If PDF is a selected file type, your Crawler will now include any PDFs it encounters! On the “Pages” tab of your Crawler, you can filter by file type and view the PDFs that were crawled.

Add Crawled PDFs to the Knowledge Graph via a Crawler-Source Connector:
To create a Connector that uses a Crawler as its source, navigate to the Connectors flow, select Site Crawler as the source, and select the file types to include. You can include only HTML files, only PDF files, or both!

Note that this setting is not directly tied to your Crawler settings. For example, if your Crawler is set to crawl both PDFs and HTML but you only want HTML data in this Connector, you can select only HTML as the file type. Likewise, if your Crawler is set to crawl only HTML, you can still choose PDF & HTML as file types in the Connector; no PDFs will appear, because none were crawled. This can be useful if you later change your Crawler settings to include PDFs, since your Connector configuration will already be ready to go.
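
One way to picture how the two settings interact is as a filter applied to whatever the Crawler has already collected. The sketch below is an illustrative Python model under that assumption, not the Yext API; `pages_for_connector`, the page dictionaries, and the example.com URLs are all hypothetical.

```python
# Illustrative model only; not the Yext API. All names and URLs are hypothetical.
def pages_for_connector(crawled_pages, connector_file_types):
    """Keep only crawled pages whose file type the Connector includes."""
    return [page for page in crawled_pages if page["file_type"] in connector_file_types]


# A Crawler set to HTML only produces no PDF pages...
html_only_crawl = [{"url": "https://example.com/faq", "file_type": "HTML"}]
# ...while a Crawler set to HTML and PDF can produce both.
html_and_pdf_crawl = html_only_crawl + [
    {"url": "https://example.com/guide.pdf", "file_type": "PDF"},
]

# Connector includes PDF & HTML, Crawler is HTML only: no PDFs appear.
print(len(pages_for_connector(html_only_crawl, {"HTML", "PDF"})))     # 1
# Same Connector after the Crawler starts crawling PDFs: PDFs flow through.
print(len(pages_for_connector(html_and_pdf_crawl, {"HTML", "PDF"})))  # 2
# Connector includes only HTML: crawled PDFs are left out.
print(len(pages_for_connector(html_and_pdf_crawl, {"HTML"})))         # 1
```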

After configuring your crawler settings, you will select your page type. Currently, a PDF cannot be broken out into multiple structured entities, so Detail Page is the only option if you choose PDF or PDF & HTML. Each PDF creates one entity containing all the supported data from that PDF.

Next, add the relevant selectors to extract the appropriate data from the PDF. Right now, we recommend extracting the relevant metadata as well as the body content to create entities in the Knowledge Graph. HTML-specific selectors, such as CSS or XPath selectors, are not compatible with PDFs. If you create a Connector that allows both HTML & PDF file types, you can still add these selectors; the corresponding fields will simply be blank for PDFs.
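
The blank-field behavior for mixed HTML & PDF Connectors can be sketched as follows. This is an illustrative Python model, not the Yext API: `build_fields`, the selector type labels ("METADATA", "BODY", "CSS", "XPATH"), and the stand-in `fake_extract` function are hypothetical names chosen for the example.

```python
# Illustrative model only; not the Yext API. All names here are hypothetical.
HTML_ONLY_SELECTOR_TYPES = {"CSS", "XPATH"}


def build_fields(file_type, selectors, extract):
    """Map each field to its extracted value, or "" when the selector
    is HTML-specific and the document is a PDF."""
    fields = {}
    for field_name, (selector_type, expression) in selectors.items():
        if file_type == "PDF" and selector_type in HTML_ONLY_SELECTOR_TYPES:
            fields[field_name] = ""  # HTML-only selector: blank for PDFs
        else:
            fields[field_name] = extract(selector_type, expression)
    return fields


selectors = {
    "name": ("METADATA", "title"),     # metadata selector
    "body": ("BODY", ""),              # body content selector
    "summary": ("CSS", "div.summary"), # HTML-specific selector
}


def fake_extract(selector_type, expression):
    # Stand-in for real extraction; returns a placeholder value.
    return f"<{selector_type} value>"


print(build_fields("PDF", selectors, fake_extract)["summary"] == "")  # True
print(build_fields("HTML", selectors, fake_extract)["summary"])       # <CSS value>
```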

To use this feature during the Early Access period, turn on the Spring '22 Release: Crawler PDF Support (early access) account feature.

To learn more about Crawlers, visit the Yext Site Crawler & Crawler Connector training module.