Setting Up Crawler for PDFs

I have a directory on our website that holds all of the PDF documents that are attached to our web pages. I want to crawl that directory and pick up all of those PDFs so that I can turn them into Entities. The only thing I’ve found so far that works is to list the full path to each PDF in the Domains field of the crawler settings. Obviously, that’s no good because every time we add a PDF we’re going to have to update the crawler.

Assume a single directory, for example, https://foobar.com/foo/bar, and files in bar named 1.pdf, 2.pdf… nn.pdf. How would you set up a crawler to grab the PDFs in that directory, one at a time, so that each (1.pdf, 2.pdf, etc.) gets a separate Page ID? Specifically, what would you put in the Domain, Strategy, File Types, and Sub-Pages fields to make that work?

@Chick_Webb can you post the URL(s) you’ve entered into the crawler config and an example URL that isn’t getting picked up? (This will help anyone responding to assist.)

My second paragraph above describes the form of the URLs for the directory and the files that are in it. As to what I’ve tried, it’s pretty much every combination of domains/subdomains/strategies that you can imagine. For example,

Domain - https://foobar.com/foo/bar
Strategy - Sub Pages
File Type - PDF
Sub Pages - /*.pdf

OR

Domain - https://foobar.com/foo/bar/*
Strategy - All Pages
File Type - PDF
Sub Pages -

Nothing works other than listing, in the Domain field, every full URL, as in:

https://foobar.com/foo/bar/1.pdf
https://foobar.com/foo/bar/2.pdf
etc.

Which, as I said, is not a solution, since it would require updating the Crawler every time a new document gets added to our website.

Hey @Chick_Webb,
Are the relevant links to the PDFs contained in href tags of a crawled page (like a start Domain)?
The Crawler works by spidering through the site, following the URLs contained in the href tags on each page it fetches. If the PDF links are there, they should be visible to the crawler.
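If you want to sanity-check that on your end, here’s a rough Python sketch of the same idea (using requests and BeautifulSoup; the page URL is just a placeholder, not anything from your config). It prints every .pdf href a spider would actually see on a given page:

```python
# Quick check: list every .pdf link a spider would see in the href tags of one page.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page_url = "https://foobar.com/foo/some-page"  # placeholder: a crawled page that should link to PDFs

html = requests.get(page_url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a", href=True):
    href = urljoin(page_url, a["href"])  # resolve relative links the way a crawler would
    if href.lower().endswith(".pdf"):
        print(href)
```

If a PDF never shows up in output like that for any page, the crawler won’t discover it either.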

Yes, the links are there. The PDF files are at a different path than the actual pages, so I’ve added that path as a Sub-Path. It crawls all of the HTML pages, as far as I can tell, but not the PDFs.

What’s odd is that some PDFs have been pulled in (53, to be exact), but I was expecting several hundred. I don’t think I’m bumping up against a resource limit. I’m at a bit of a loss at this point.

Hmm, it’s hard for me to troubleshoot without looking at the crawler directly. Can you provide your account ID?
If some PDFs are crawled and some aren’t, it’s likely that the PDFs are not properly linked to via an href tag in a crawled HTML page.
If you can share a sample page that should have been crawled but was not, as well as the crawled page that links to it, that should enable me to see what’s going on.
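In the meantime, if it’s useful, here’s a rough diagnostic sketch along those lines (Python with requests and BeautifulSoup; the start URL and the expected-PDF list are placeholders, not pulled from your setup). It follows href links the way a crawler would and reports which expected PDFs are never linked from any crawled HTML page:

```python
# Rough diagnostic: spider HTML pages from a start URL by following href tags,
# collect every .pdf link found, and report expected PDFs that were never linked.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://foobar.com/foo"  # placeholder start page
EXPECTED_PDFS = {
    "https://foobar.com/foo/bar/1.pdf",
    "https://foobar.com/foo/bar/2.pdf",
    # ...add the rest of the PDFs you expect to be crawled
}

def spider(start_url: str, max_pages: int = 500) -> set[str]:
    """Follow same-host href links breadth-first; return all .pdf URLs seen."""
    host = urlparse(start_url).netloc
    to_visit, seen_pages, pdf_links = [start_url], set(), set()
    while to_visit and len(seen_pages) < max_pages:
        url = to_visit.pop(0)
        if url in seen_pages:
            continue
        seen_pages.add(url)
        try:
            resp = requests.get(url, timeout=30)
        except requests.RequestException:
            continue
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue  # only parse HTML pages for links
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc != host:
                continue  # stay on the same host
            if link.lower().endswith(".pdf"):
                pdf_links.add(link)
            elif link not in seen_pages:
                to_visit.append(link)
    return pdf_links

if __name__ == "__main__":
    found = spider(START_URL)
    for missing in sorted(EXPECTED_PDFS - found):
        print("never linked from a crawled page:", missing)
```

Anything it reports as never linked is a PDF the crawler has no way to discover, regardless of the Domain/Strategy settings.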

@Rachel_Adler, thank you very much for the reply and offer. And for suffering my ignorance. :slight_smile: What seemed so straightforward at the start - point the crawler at a page and let 'er rip - turns out of course to be a bit more nuanced. I’ve done some additional digging and think I’ve got a pretty good handle on what’s going on here, and what I need to ultimately do if I want to be able to use the content from the PDFs in search.

I’ve still got some challenges, like how to link the PDF Entities created from the crawl to the Document Entities they’re associated with, but I think I’ll be able to work through them.