Hi Everyone,
My client reached out because they are noticing that their URL structure has a ? after it upon launching Yext Answers (e.g. on their homepage it is: www.mydomain.com?).
I wanted to better understand how our robots.txt file works in terms of disallowing query strings. Is it an issue that their URLs now have this structure? Can you elaborate on how the robots.txt file prevents our Answers production page from being crawled?
Thanks!
Alyssa
Hi Alyssa,
There are a few questions here, so I’ll answer them one-by-one!
How does the robots.txt file work?
The robots.txt file lets you specify which user-agents your rules apply to, the sitemap that lists the URLs on your domain, and which URL patterns crawlers should stay away from.
For a client-hosted (iframe) implementation, your robots.txt file might look like this.
User-agent: *
Sitemap: https://domain.com/sitemap.xml
Disallow: /
The most important step here is the Disallow statement. The / indicates that nothing on this domain should be crawled. We do this because we do not want the iFrame source URL to index organically; rather, we want the page that the client places the iFrame on to be indexed and ranked.
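If you want to sanity-check a rule like this yourself, here is a minimal sketch using Python's standard-library robots.txt parser. It assumes the same placeholder domain as the example above; the helper name is_crawl_allowed and the ?query=test parameter are made up for illustration, and real crawlers may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the example robots.txt above (domain.com is a placeholder).
ROBOTS_TXT = """\
User-agent: *
Sitemap: https://domain.com/sitemap.xml
Disallow: /
"""


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "Disallow: /" matches every path, with or without a query string.
    print(is_crawl_allowed(ROBOTS_TXT, "https://domain.com/"))
    print(is_crawl_allowed(ROBOTS_TXT, "https://domain.com/?query=test"))
```

Both checks should come back False, because every path on the domain starts with /.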
How are query parameters handled in indexing?
It is true that URLs with query parameters can be indexed by Google. However, in your case, you do not want these to rank independently in Google.
You can achieve this by setting the canonical URL of the page to the URL without any query parameters. For example, the canonical URL of this page is below.
<link rel="canonical" href="https://hitchhikers.yext.com/community/t/robots-txt-file-and-url-query-strings/525">
Even if query parameters are added, this tag tells search engines that this is the preferred URL to index.
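To illustrate the idea, here is a rough sketch of how a canonical URL could be derived by dropping the query string. The function names canonical_url and canonical_link_tag are hypothetical and not part of Yext Answers; this only shows the string handling, not how your site actually generates the tag.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> tag for the cleaned-up URL."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}">'


if __name__ == "__main__":
    # A bare trailing "?" (an empty query string) is removed as well.
    print(canonical_url("https://www.mydomain.com?"))
    print(canonical_link_tag("https://www.mydomain.com/?query=test"))
```

With this approach, www.mydomain.com? and www.mydomain.com?query=test both point back to the same canonical page.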
Hope this helps!
That’s very helpful! Thank you Amani!