Crawling a specific set of sub-pages is a commonly used crawler feature, especially for customers with large websites that contain many pages. With the current sub-pages crawl strategy, the crawler visits the page URLs a user inputs, as well as any linked pages that share the same beginning URL path. For example, if you input http://www.yext.com/blog, we crawl that page and any page it links to under the same path, such as the blog post "The Overlooked Importance of Customer Service Representatives."
Sometimes, however, the base path of the desired sub-pages does not match the top-level URL. To continue our example, inputting http://www.yext.com/blog would not crawl our product-related blog posts under http://www.yext.com/products/blog/*, since the beginning URL paths do not match exactly. As a result, we would not be able to pull in all of the blog entities we are looking for.
With the September Release, admins can now specify an optional wildcard URL path for the crawler to spider, in addition to pages that match the inputted URL's beginning path. In our example, you would add yext.com/products/blog/* to the sub-page structure, allowing the crawler to crawl all desired blog pages.
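To make the matching behavior concrete, here is a minimal sketch of the two rules described above: a discovered link qualifies if it shares the entry URL's beginning path, or if it matches one of the optional wildcard patterns. This is an illustrative example only, not Yext's actual implementation; the function name and signature are assumptions for demonstration.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_crawl(url, entry_url, extra_patterns=()):
    """Illustrative sketch (not Yext's implementation) of sub-page matching.

    A discovered URL qualifies if it shares the entry URL's beginning
    path, or if it matches an optional wildcard pattern such as those
    entered in the "Sub Pages URL Structure" field.
    """
    entry = urlparse(entry_url)
    link = urlparse(url)
    # Rule 1: same host and same beginning URL path as the entry URL.
    if link.netloc == entry.netloc and link.path.startswith(entry.path):
        return True
    # Rule 2: match against any extra wildcard patterns,
    # e.g. "yext.com/products/blog/*".
    target = link.netloc.removeprefix("www.") + link.path
    return any(fnmatch(target, pattern) for pattern in extra_patterns)

# Without the wildcard, product blog pages are skipped:
should_crawl("http://www.yext.com/products/blog/post-1",
             "http://www.yext.com/blog")                      # False
# With the wildcard specified, they qualify:
should_crawl("http://www.yext.com/products/blog/post-1",
             "http://www.yext.com/blog",
             extra_patterns=["yext.com/products/blog/*"])     # True
```

The second call shows why the new setting matters: the product blog post fails the beginning-path check against /blog, but the wildcard pattern picks it up.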
You can add this new wildcard specification in the "Sub Pages URL Structure" field on the Settings page of the crawler you want to update. Note that you can add multiple sub-page URL structures to this field.
To learn more about the Crawler, visit the Yext Site Crawler & Crawler Connector training module.