How can I crawl specific pages of a PDF? I would like each PDF page to be its own entity.
I tried adding a page number parameter to the crawler’s sub-URL, but it didn’t work.
Can you help me?
http://www.example.com/myfile.pdf#page=*
Hello!
Does each page in the PDF live on a different URL? If that’s the case, the crawler will spider to each individual URL, as long as that URL is linked via an href attribute on a page in a crawled domain. If a URL is not linked, you’d have to manually add it to the crawler, one URL for each individual page.
However, if there is not a unique URL for each page (and the whole file simply lives on the same URL), then this wouldn’t be possible: our crawler would pick up only a single URL, and the Connector can only ingest a URL hosting a single PDF file as a single entity.
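To illustrate why the #page= pattern from the question still counts as one URL: the part after # is a fragment, which is handled by the PDF viewer on the client side and is not sent in the HTTP request. Here is a minimal Python sketch using only the standard library. The page range 1 to 3 is made up for illustration, and the sketch shows general URL behavior, not the internals of our crawler.

```python
from urllib.parse import urldefrag

# Hypothetical page-anchored URLs, following the pattern from the question.
page_urls = [f"http://www.example.com/myfile.pdf#page={n}" for n in range(1, 4)]

# urldefrag() splits off the "#..." fragment. The fragment is never part of
# the HTTP request, so this is the resource that actually gets fetched.
fetched = {urldefrag(u).url for u in page_urls}

print(fetched)  # {'http://www.example.com/myfile.pdf'}
```

All three page links collapse to the same address, so from a crawler’s point of view there is only one document to pick up.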
If you share your crawler configuration, I can try to take a closer look and see what may or may not be possible!
Hope this helps,
Rachel