Crawl PDF data for each specific page

Nguyen_Thi_Thuy · July 1, 2022, 2:13am

How can I crawl a specific PDF page? How can I make each PDF page its own entity?
I tried adding a page number parameter to the sub-URL of the crawler, but it didn't work.
Can you help me?
http://www.example.com/myfile.pdf#page=*

Rachel_Adler · July 5, 2022, 3:00pm

Hello!

Does each page in the PDF live on a different URL? If that’s the case, the crawler will spider to each individual URL as long as that URL is linked via an href on a crawled domain. If the URL is not linked, you’d have to manually add each URL that should be crawled for each individual page.

However, if there is not a unique URL for each page (and the file simply all lives on the same URL) then this wouldn’t be possible, as our crawler would only pick up a single URL and the Connector can only ingest a URL hosting a single PDF file as a single entity.
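
To illustrate why the `#page=` approach from the question likely doesn't produce separate pages: everything after `#` is a URL fragment, which is handled on the client side and is not sent to the server in the HTTP request. Below is a minimal Python sketch, assuming the crawler resolves URLs the way a standard HTTP client does (`resource_url` is a hypothetical helper name used only for illustration, not part of the crawler):

```python
from urllib.parse import urldefrag


def resource_url(url: str) -> str:
    """Return the URL without its fragment, i.e. the part actually requested over HTTP."""
    return urldefrag(url).url


# Hypothetical per-page URLs built from the example URL in the question.
page_urls = [f"http://www.example.com/myfile.pdf#page={n}" for n in range(1, 4)]

# Once the fragment is dropped, every variant collapses to the same resource.
distinct = {resource_url(u) for u in page_urls}
print(distinct)
```

All three `#page=N` variants reduce to the same underlying URL, which matches the single-URL, single-entity behavior described above.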

If you share your crawler configuration, I can try to take a closer look and see what may or may not be possible!

Hope this helps,
Rachel
