I love that Yext has added PDF support to the crawler, and it’s ingesting PDF content as expected. By default, however, there doesn’t seem to be a limit on content length, which can be problematic for longer PDF documents. I can, of course, configure the connector to truncate the contents, which works pretty well, but it’s a trade-off.
In a perfect world, I would have all of the PDF’s contents in the index, so the PDF can be found for all relevant search terms. But including all of the PDF contents can be taxing at query time, and it’s not something we need for presentation on the frontend. Would it be possible in the future to have a field associated with an entity in the Knowledge Graph that is used for indexing but is not returned in Answers API queries? I’m thinking of either a setting on the field in the Knowledge Graph or a filtering mechanism at query time.
To my knowledge, there is no mechanism to filter the fields that you get back in an Answers query, but I could be wrong.
I’m thinking of an API query parameter, similar to the “filters” parameter, that would let one suppress fields from the response. That would give us the benefit of full-document indexing while keeping response payloads small, even when a query matches a large document.
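
To make the request concrete, here is a rough Python sketch of the behavior I have in mind, written as a client-side helper. Everything in it is an assumption for illustration: the `results` / `data` response shape, the `c_pdfBody` field name, and the `suppress_fields` helper are hypothetical, not part of the actual Answers API. Ideally the same thing would happen server-side, so the large field never goes over the wire.

```python
from typing import Any


def suppress_fields(response: dict[str, Any], exclude: set[str]) -> dict[str, Any]:
    """Return a copy of `response` with the excluded fields dropped from each result.

    Assumes a hypothetical response shape: {"results": [{"data": {...}}, ...]}.
    The input is not modified.
    """
    trimmed_results = []
    for result in response.get("results", []):
        # Keep every field except the ones the caller asked to suppress.
        data = {k: v for k, v in result.get("data", {}).items() if k not in exclude}
        trimmed_results.append({**result, "data": data})
    return {**response, "results": trimmed_results}


# Hypothetical response containing a large, index-only PDF body field.
response = {
    "results": [
        {"data": {"id": "doc-1", "name": "Manual", "c_pdfBody": "very long text"}}
    ]
}
trimmed = suppress_fields(response, {"c_pdfBody"})
```

Doing this on the client only trims what the frontend has to handle; the payload is still transferred in full, which is why a query parameter or a field-level setting would be the real fix.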