Crawler Sub Pages Wildcard Support (September '21 Release)

Caroline_Gould · September 14, 2021, 7:57pm

Crawling a specific set of sub pages is a commonly used function of the crawler, especially when customers have large websites with excess pages. Our current Sub Pages crawl strategy crawls the page URLs a user inputs, as well as any pages that share the same beginning URL path as an inputted page. For example, if you input http://www.yext.com/blog, we would crawl that page and any page it links to with the same beginning URL path, such as: The Overlooked Importance of Customer Service Representatives | Yext.
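
To make the "same beginning URL path" rule concrete, here is a minimal Python sketch. It is not the crawler's actual implementation: the function name shares_start_path, the segment-by-segment comparison, and the sample post URL are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def shares_start_path(seed_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url sits at or under seed_url's path.

    Hypothetical helper: it assumes "same beginning URL path" means the
    same host plus a path that starts with the seed's path segments.
    """
    seed = urlsplit(seed_url)
    cand = urlsplit(candidate_url)
    if seed.netloc.lower() != cand.netloc.lower():
        return False
    seed_path = seed.path.rstrip("/")
    # Match the seed page itself, or anything below it in the path.
    return cand.path == seed_path or cand.path.startswith(seed_path + "/")


seed = "http://www.yext.com/blog"
# "some-post" is a made-up page name used only for this example.
print(shares_start_path(seed, "http://www.yext.com/blog/some-post"))
print(shares_start_path(seed, "http://www.yext.com/products/blog/some-post"))
```

Under these assumptions the first check passes and the second fails, because /products/blog/... does not begin with /blog.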

Sometimes, however, the base of the sub pages you want to crawl does not match the top-level URL. To continue our example, if we wanted to crawl a list of our product-related blogs from http://www.yext.com/products/blog/*, the current strategy would not crawl the desired sub pages in the format of Yext Blog | News and Stories from Yext*, since the bases of the URLs do not match exactly. As a result, we would not be able to pull in all of the blog entities that we are looking for.

With the September Release, admins can now specify an optional wildcard URL path for the crawler to spider, in addition to the pages that match the inputted URL’s base. In our example, you would specify the sub page structure yext.com/products/blog/*, allowing the crawler to crawl all desired blog pages.
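
As a rough illustration of how a wildcard structure such as yext.com/products/blog/* could be tested against a URL, here is a small Python sketch. The matching rules are assumptions, not documented crawler behavior: the helper names normalize, matches_structure, and matches_any are hypothetical, the scheme and a leading "www." are ignored, and * is treated as "any run of characters".

```python
import re
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL or structure to host + path (assumed comparison form)."""
    if "://" not in url:
        url = "http://" + url  # let urlsplit see a host for scheme-less input
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path


def matches_structure(structure: str, url: str) -> bool:
    """Return True if url fits the wildcard structure ('*' matches anything)."""
    pattern = re.escape(normalize(structure)).replace(r"\*", ".*")
    return re.fullmatch(pattern, normalize(url)) is not None


def matches_any(structures: list[str], url: str) -> bool:
    """Return True if url fits at least one of several structures."""
    return any(matches_structure(s, url) for s in structures)


structures = ["yext.com/products/blog/*"]
# "some-post" is a made-up page name used only for this example.
print(matches_any(structures, "http://www.yext.com/products/blog/some-post"))
print(matches_any(structures, "http://www.yext.com/about"))
```

With these assumptions, the product blog page is accepted and the unrelated page is rejected. How the crawler itself treats schemes, "www.", or trailing slashes may differ.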

You can add this new wildcard sub page specification in the “Sub Pages URL Structure” field on the Settings page of the crawler you are interested in updating. Note that you are able to add multiple sub page URL structures to this field.

To learn more about the Crawler, visit the Yext Site Crawler & Crawler Connector training module.
