Yext Site Crawler

In this release, we are launching the Yext Crawler! The Yext Crawler is a Source in the new Data Connectors framework that scrapes web pages and extracts their content for storage in the Knowledge Graph.

For example, let’s say you want to include your blog posts in your Answers experience, but you manage and store them in WordPress. You can now set up a crawler that pulls in new blog posts as you publish them in WordPress, so you can start including them in your Answers search.

When you create a crawler in your account, it will try to scrape web pages on the specified domain. You can then proceed to the Add Data flow to parse through that raw HTML to convert the scraped data into entities in the Knowledge Graph.

To leverage the Yext Crawler to create entities in the Knowledge Graph, you will need to:

  1. Create a Crawler
  2. Add a new Connector by selecting the Crawler in the Add Data flow
  3. Configure the connector by choosing things like how often you want it to run (e.g., once or on a recurring basis)
  4. Map the data from the crawl to your entity schema to update the Knowledge Graph
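
As a rough illustration of the mapping step, the sketch below turns one crawled page into an entity-shaped record. It is not the Connector's actual implementation: the function `map_to_entity`, the entity field names `name` and `description`, and the sample values are all hypothetical, chosen only to show the idea of pairing extracted values with schema fields.

```python
# Hypothetical mapping from extracted values to entity fields.
# Keys are labels for values pulled from a crawled page; values are
# assumed entity field names, not fields confirmed by the product.
FIELD_MAPPING = {
    "Page Title": "name",
    "Body Content": "description",
}


def map_to_entity(crawled_page, mapping):
    """Build an entity-shaped dict from one crawled page.

    Extracted values with no entry in the mapping are ignored, and
    mapped sources that are missing from the page are skipped.
    """
    return {
        entity_field: crawled_page[source]
        for source, entity_field in mapping.items()
        if source in crawled_page
    }


crawled_page = {
    "Page Title": "My First Post",
    "Body Content": "Welcome to the blog.",
    "URL": "https://example.com/blog/my-first-post",
}
entity = map_to_entity(crawled_page, FIELD_MAPPING)
```

Here `entity` holds only the two mapped fields; the unmapped `URL` value is left out.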

Each crawler records a history of its active and previous crawls. For each crawl, you can see the status, start date, end date, and the number of pages that were crawled successfully and unsuccessfully. You can also click into any individual crawl to see exactly which pages were crawled and whether there were any failures.

After you set up your crawler, you can configure your Connector. Connectors can be as simple or as sophisticated as you need: you can specify a CSS or XPath selector to pull particular elements from the page, or you can use built-in selectors like Page Title and Body Content to pull the text off the page. You can also customize the settings to run daily, weekly, or monthly, and to crawl pages or sub-pages. Additionally, you can blacklist URLs that should not be crawled.
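
To make the selector and blacklist ideas concrete, here is a minimal sketch using only the Python standard library. It approximates what selectors like Page Title and Body Content might pull from raw HTML, and shows one possible way to skip blacklisted URLs. The names `PageExtractor`, `extract_page`, and `is_blacklisted`, along with the prefix-matching rule, are assumptions for illustration; the Crawler's real extraction and matching behavior may differ.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the <title> text and the visible text inside <body>."""

    SKIP_TAGS = {"script", "style"}  # non-visible content to ignore

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title_parts.append(text)
        elif self._in_body and self._skip_depth == 0:
            self.body_parts.append(text)


def extract_page(html):
    """Return the page title and body text as plain strings."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title_parts),
        "body": " ".join(parser.body_parts),
    }


def is_blacklisted(url, blacklist_prefixes):
    """Assumed rule: a URL is skipped if it starts with a listed prefix."""
    return any(url.startswith(prefix) for prefix in blacklist_prefixes)


sample_html = """
<html>
  <head><title>My First Post</title></head>
  <body>
    <h1>Hello</h1>
    <script>var x = 1;</script>
    <p>Welcome to the blog.</p>
  </body>
</html>
"""
page = extract_page(sample_html)
blacklist = ["https://example.com/blog/drafts/"]
skip_draft = is_blacklisted("https://example.com/blog/drafts/post-1", blacklist)
skip_live = is_blacklisted("https://example.com/blog/post-1", blacklist)
```

In this sketch, `page["title"]` is the text of the `<title>` element and `page["body"]` is the visible body text with script content dropped. A custom CSS or XPath selector would narrow the extraction to particular elements instead of the whole body.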

To learn more, visit the Data Connectors & the Crawler training module.