How to correctly use Blacklisted URLs on Crawler?

Hey there,

I’m trying to crawl a website that has 2 language versions.
The default version, in English, has no language prefix in its paths, while the Spanish version lives entirely under the path /es/.

Example:
https://mydomain.com/products/whatever (English)
https://mydomain.com/es/productos/whatever (Spanish)

I want to create multi-language profiles, so I’m trying to set up a separate Crawler for each language version of the site. I see that there’s a Blacklisted URLs option when creating the Crawler. However, the options I’ve used:

  1. Whole domain with slug and wildcard: https://mydomain.com/es/*
  2. Regexp of relative path

Didn’t work for me.
It’s a bit confusing because the tooltip (which states a regex should be used) doesn’t match the placeholder (where a full URL is used).

Could you please let me know exactly what I should enter in the Blacklisted URLs field to filter the pages that include the path /es?

Thanks,
Mikel


Hi Mikel,

Welcome to the Community!

I think your first approach of the slug and wildcard makes the most sense here!

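Since the field expects a regular expression, the pattern just needs a small tweak: use .* instead of * at the end of the URL. In a regex, * on its own only repeats the preceding character (here the final /), whereas .* matches any sequence of characters. With this change, your Blacklisted URLs entry would look like: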

https://www.mydomain.com/es/.*
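
If it helps to see why the trailing * alone falls short, here is a minimal Python sketch of the difference. It is only an illustration: it assumes the Crawler tests each pattern against the full URL, which isn't confirmed in this thread, and is_blacklisted is a hypothetical helper, not part of the product. The host is the one from the question's examples, so swap in your real one (including www if your site uses it).

```python
import re

def is_blacklisted(url, patterns):
    """Return True if the whole URL matches any of the blacklist regexes.

    Assumption: full-URL matching. The real Crawler may match differently.
    """
    return any(re.fullmatch(p, url) for p in patterns)

english = "https://mydomain.com/products/whatever"
spanish = "https://mydomain.com/es/productos/whatever"

# Glob-style wildcard: as a regex, "/*" means "zero or more slashes".
glob_style = [r"https://mydomain\.com/es/*"]
# Regex-style wildcard: ".*" means "any sequence of characters".
regex_style = [r"https://mydomain\.com/es/.*"]

print(is_blacklisted(spanish, glob_style))   # False: nothing after /es/ is allowed
print(is_blacklisted(spanish, regex_style))  # True: Spanish page is filtered
print(is_blacklisted(english, regex_style))  # False: English page is kept
```

The dots in the host are escaped here so they match a literal "."; left unescaped they match any character, which still works but is looser than intended.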

Let me know if this ends up working for you!

Also, we appreciate your feedback on the confusion caused by the discrepancy between the tooltip and the placeholder example. I’ll make sure this feedback is passed along!

Best,
DJ

Hey DJ,

It worked, thanks so much for your help!
