How to correctly use Blacklisted URLs on Crawler?

Mikel_Torres · November 4, 2021, 2:49pm

Hey there,

I’m trying to crawl a website that has 2 language versions.
The default version, in English, doesn’t include any language slug in the path, but the Spanish version lives entirely under /es/.

Example:
https://mydomain.com/products/whatever (English)
https://mydomain.com/es/productos/whatever (Spanish)

As I want to create multi-language profiles, I’m trying to create a separate Crawler for each language version of the site. I see that there’s a Blacklisted URLs option when creating the Crawler. However, neither of the options I’ve tried worked for me:

Whole domain with slug and wildcard: https://mydomain.com/es/*
Regexp of the relative path

It’s a bit confusing because the tooltip (which states that a regex should be used) doesn’t match the placeholder (where a full URL is used).

Could you please let me know exactly what I should enter in the Blacklisted URLs field to filter the pages that include the path /es?

Thanks,
Mikel

DJ_Corbett · November 4, 2021, 10:01pm

Hi Mikel,

Welcome to the Community!

I think your first approach of the slug and wildcard makes the most sense here!

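The regex just needs a slight tweak: use .* instead of a bare * at the end of the URL pattern. In a regex, * only repeats the character before it, so /es/* matches "/es" followed by zero or more slashes, while .* matches any sequence of characters. With this change, your Blacklisted URLs expression would look like: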

https://www.mydomain.com/es/.*
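
To see why the trailing .* matters, here is a minimal Python sketch. It is only an illustration: the `is_blacklisted` helper is hypothetical, and it assumes the crawler requires the pattern to match the whole URL, which this thread does not confirm. The host is taken from the example URLs in the question (no www), since the host in the pattern has to match the URLs being crawled.

```python
import re


def is_blacklisted(url: str, pattern: str) -> bool:
    """Hypothetical check: assumes the pattern must match the entire URL."""
    return re.fullmatch(pattern, url) is not None


english = "https://mydomain.com/products/whatever"
spanish = "https://mydomain.com/es/productos/whatever"

# Glob-style attempt: in a regex, "/*" means "zero or more slashes".
glob_style = "https://mydomain.com/es/*"
# Regex version: ".*" means "any sequence of characters".
regex_style = "https://mydomain.com/es/.*"

for label, pattern in [("glob-style", glob_style), ("regex", regex_style)]:
    print(label, "| spanish:", is_blacklisted(spanish, pattern),
          "| english:", is_blacklisted(english, pattern))
```

Under that assumption, the glob-style pattern never excludes the Spanish page, while the regex version excludes it and leaves the English page alone.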

Let me know if this ends up working for you!

Also, we appreciate your feedback on the confusion caused by the discrepancy between the tooltip and the placeholder example. I’ll make sure this feedback is passed along!

Best,
DJ

Mikel_Torres · November 5, 2021, 2:15pm

Hey DJ,

It worked, thanks so much for your help!

Zach_Shearer · March 9, 2022, 3:59pm

@DJ_Corbett, I’m working on my configuration now. Per this conversation, it appears that the “blacklistedUrls” option in the crawler configuration will accept a regular expression. Is this still correct? I believe that the training mentions asterisk wildcards, but I don’t recall it mentioning full regular expressions as possible values. Is the same true for the “domains” setting?

Micaela_Luders · March 10, 2022, 9:47pm

Hi Zach,

The Blacklisted URLs parameter does accept regular expressions, whereas the Subpages URLs parameter accepts wildcards. The information within the tooltip for each parameter is up-to-date and accurate!

The asterisk wildcards are mentioned in the training as part of setting up a connector, but the use of regex is not, so I see the confusion. I will pass on that feedback for future training updates!
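
As a rough illustration of why the two kinds of field behave differently, the sketch below reads the same string once as a wildcard and once as a regex. Python's `fnmatch` module is used here purely as a stand-in for wildcard matching; how the Subpages URLs field actually interprets wildcards is an assumption, not something confirmed in this thread.

```python
import fnmatch
import re

url = "https://mydomain.com/es/productos/whatever"
pattern = "https://mydomain.com/es/*"

# Wildcard reading (shell-style stand-in): "*" matches any run of characters.
as_wildcard = fnmatch.fnmatchcase(url, pattern)

# Regex reading: "*" only repeats the preceding "/", so the rest of the path is not covered.
as_regex = re.fullmatch(pattern, url) is not None

print(as_wildcard, as_regex)  # True False
```

The same text matches as a wildcard but not as a regex, which is why a pattern copied from a wildcard field may need .* before it works in a regex field.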
