Posted to solr-user@lucene.apache.org by Dustin Lebsock <Du...@bentley.com> on 2020/03/24 16:22:12 UTC

Searching individual pages in solr

Hi!

I'm looking for some guidance on engineering a solution for searching individual pages of PDF documents. I currently have a SolrCloud setup that uses an external Tika server to extract text from PDFs. I'd like to be able to search within individual pages as well as across the overall documents themselves (for example, titles that link to an external repo). I'm having trouble coming up with a clean solution.

I ran across a Stack Overflow discussion about this here:
https://stackoverflow.com/a/50160163

I can't really see the pros and cons of indexing a single document with multiple fields (one per page) versus indexing each page separately and using group queries. What does the Solr community recommend?

Thank you for all the help!

Dustin Lebsock

Re: Searching individual pages in solr

Posted by Erick Erickson <er...@gmail.com>.
Well, given the structure of an inverted index, how would you have a clue what page the hit was on? You could conceivably index enough data with payloads and the like, but that’d cause a lot more bloat than just indexing each page.

Using grouping would allow you to show, say, the top three pages from the books with the highest score on an individual page basis.
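
A rough sketch of what such a grouped request might look like, assuming each page is indexed as its own Solr document and carries a single-valued string field naming its parent document. The field names `doc_id`, `page`, and `text` and the helper `grouped_page_query` are hypothetical; `group`, `group.field`, and `group.limit` are Solr's standard result-grouping parameters.

```python
from urllib.parse import urlencode


def grouped_page_query(user_query, pages_per_doc=3, docs=10):
    """Build /select parameters that group page hits by parent document.

    Assumes a hypothetical schema where every page document has a
    single-valued string field `doc_id` identifying the PDF it came from.
    """
    return {
        "q": user_query,
        "group": "true",              # turn on result grouping
        "group.field": "doc_id",      # one group per parent PDF (assumed field)
        "group.limit": pages_per_doc, # top N pages kept inside each group
        "rows": docs,                 # with grouping, rows counts groups
        "fl": "id,doc_id,page,score",
    }


params = grouped_page_query("text:pipeline")
# Print the query string that would be sent to a collection's /select handler.
print("/select?" + urlencode(params))
```

By default the groups are ordered by the score of their best page, which is what gives "the books with the highest score on an individual page basis".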

But there are complications (aren’t there always?). Consider a page with one sentence. Indexed as an individual document, it might score quite high even if not the best choice. Or any embedded illustrations, what do you do with those? Index the caption as a part of the text? Ignore the caption? Etc.

I’d certainly start with a doc-per-page. Not quite sure what I’d do with the title and such, but that depends on your use-case.
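
A minimal sketch of the doc-per-page shape, assuming the page texts have already been split out of the Tika output (how that is done is not covered here). The field names and the helper `pages_to_solr_docs` are hypothetical, and copying the title onto every page document is only one possible way to handle it.

```python
def pages_to_solr_docs(doc_id, title, pages):
    """Turn a list of page texts into one Solr document per page.

    `doc_id`, `page`, `title`, and `text` are assumed field names, not
    anything prescribed by Solr or Tika.
    """
    docs = []
    for number, text in enumerate(pages, start=1):
        docs.append({
            "id": f"{doc_id}_p{number}",  # unique key per page
            "doc_id": doc_id,             # parent PDF, usable for grouping
            "page": number,               # 1-based page number of the hit
            "title": title,               # denormalized onto each page (one option)
            "text": text,
        })
    return docs


docs = pages_to_solr_docs("manual-42", "Pump Manual", ["Intro text", "Install steps"])
for d in docs:
    print(d["id"], d["page"])
```

Each hit then carries its own page number, so nothing has to be recovered from the inverted index after the fact.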

Best,
Erick
