How to correctly use Blacklisted URLs on Crawler?

Hey there,

I’m trying to crawl a website that has two language versions.
The default one, in English, has no language prefix in its paths, while the Spanish one sits entirely under the path /es/

Example:
https://mydomain.com/products/whatever (English)
https://mydomain.com/es/productos/whatever (Spanish)

Because I want to create multi-language profiles, I’m trying to create a separate Crawler for each language version of the site. I see that there’s a Blacklisted URLs option when creating the Crawler. However, the options I’ve tried:

  1. Whole domain with slug and wildcard: https://mydomain.com/es/*
  2. Regexp of relative path

Didn’t work for me.
It’s a bit confusing because the tooltip (which says a regex should be used) doesn’t match the placeholder (which shows a full URL).

Could you please let me know exactly what I should enter in the Blacklisted URLs field to filter out the pages under the path /es/?

Thanks,
Mikel


Hi Mikel,

Welcome to the Community!

I think your first approach of the slug and wildcard makes the most sense here!

The expression just needs a slight tweak: use .* instead of just * at the end of the URL pattern. In a regex, * on its own only repeats the character before it, while .* matches any run of characters. With this change, your Blacklisted URLs expression would look like:

https://www.mydomain.com/es/.*
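
To illustrate why the trailing .* matters, here is a minimal sketch using Python's re module. It is only an illustration: the crawler's actual regex engine isn't documented in this thread, and is_blacklisted is a hypothetical helper that assumes the pattern has to match the whole URL.

```python
import re

# Hypothetical helper, not part of the product: assumes the crawler
# tests the entire URL against the Blacklisted URLs pattern.
def is_blacklisted(url: str, pattern: str) -> bool:
    return re.fullmatch(pattern, url) is not None

spanish = "https://www.mydomain.com/es/productos/whatever"
english = "https://www.mydomain.com/products/whatever"

glob_style = "https://www.mydomain.com/es/*"    # in a regex, "/*" means zero or more "/"
regex_style = "https://www.mydomain.com/es/.*"  # ".*" means any run of characters

print(is_blacklisted(spanish, glob_style))   # False
print(is_blacklisted(spanish, regex_style))  # True
print(is_blacklisted(english, regex_style))  # False
```

Note that the unescaped dots in the domain also match any character in a regex. Writing them as \. would be stricter, though whether that is needed here is a judgment call.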

Let me know if this ends up working for you!

Also, we appreciate your feedback on the confusion caused by the discrepancy between the tooltip and the placeholder example. I’ll make sure this feedback is passed along!

Best,
DJ

Hey DJ,

It worked, thanks so much for your help!


@DJ_Corbett, I’m working on my configuration now. Per this conversation, it appears that the “blacklistedUrls” option in the crawler configuration will accept a regular expression. Is this still correct? I believe that the training mentions asterisk wildcards, but I don’t recall it mentioning full regular expressions as possible values. Is the same true for the “domains” setting?

Hi Zach,

The Blacklisted URLs parameter does accept regular expressions, but the Subpages URLs parameter accepts wildcards. The information within the tooltip for each parameter is up-to-date and accurate!
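
As a rough sketch of the difference, the same string can be read as a wildcard or as a regex, with different results. Python's fnmatch module stands in for wildcard matching here purely as an assumption; the exact wildcard syntax of the Subpages URLs parameter isn't spelled out in this thread.

```python
import re
from fnmatch import fnmatchcase

url = "https://www.mydomain.com/es/productos/whatever"
pattern = "https://www.mydomain.com/es/*"

# Wildcard (glob) reading: "*" stands for any run of characters.
wildcard_hit = fnmatchcase(url, pattern)

# Regex reading of the very same string: "*" only repeats the preceding "/".
regex_hit = re.fullmatch(pattern, url) is not None

print(wildcard_hit)  # True
print(regex_hit)     # False
```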

The asterisk wildcards are mentioned in the training as part of setting up a connector, but the use of regex is not, so I see where the confusion comes from. I will pass that feedback on for future training updates!
