Posted to solr-user@lucene.apache.org by debdoot <de...@gmail.com> on 2012/08/06 15:13:37 UTC

Returning page numbers where match occurs

Suppose we are providing search over large text documents (e.g., Word,
PPT). It would be nice to have the highlighter component return the page
numbers where matches are found, so that they can be included in the
search result summaries. What is the most efficient way to accomplish this?

Thanks
Debdoot



--
View this message in context: http://lucene.472066.n3.nabble.com/Returning-page-numbers-where-match-occurs-tp3999370.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Returning page numbers where match occurs

Posted by debdoot <de...@gmail.com>.
Thanks a lot Jack for your prompt reply! The JIRA issue indeed talks about
what I want to accomplish. I will try out Tricia's solution.

As for your question about whether I want "real" page numbers: yes, ideally
I want real page numbers (and am willing to put in the additional
parsing effort to get them). For a start, even integer page numbers will
do. Do you have a simpler solution in mind for that case than the one
described in SOLR-380?

The text from the documents (for which I want page numbers) is stored
in certain fields of the Solr/Lucene document that I index, i.e., a
many-to-one relation exists between the office documents and Solr documents
in my index.

Regards,
Debdoot




Re: Returning page numbers where match occurs

Posted by Jack Krupansky <ja...@basetechnology.com>.
There is an old, open Jira issue, SOLR-380 - "There's no way to convert
search results into page-level hits of a "structured document".", but it
has seen no recent activity. It does have a lot of interesting commentary,
though. I wouldn't get my hopes up.

See:
https://issues.apache.org/jira/browse/SOLR-380

The short answer is that you would have to re-parse the document yourself 
since Tika/POI called from SolrCell simply parses the document into a 
linear, unstructured stream of text, with no markers for pages. The SOLR-380 
Jira issue may give you some clues.
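
As a rough illustration of the "re-parse it yourself" idea, here is a
minimal Python sketch. It assumes you have already obtained the text of each
page separately through your own parsing (how you do that is outside this
sketch), and it uses plain substring matching as a stand-in for the Solr
highlighter. All function names are hypothetical; none of this is Solr or
Tika API.

```python
from bisect import bisect_right


def build_page_starts(pages):
    """Return the offset at which each page starts once pages are joined
    with a single separator character."""
    starts = []
    offset = 0
    for text in pages:
        starts.append(offset)
        offset += len(text) + 1  # +1 accounts for the "\n" separator
    return starts


def page_for_offset(starts, offset):
    """Map a character offset in the joined text to a 1-based page number."""
    return bisect_right(starts, offset)


def pages_with_match(pages, term):
    """Return the 1-based page numbers whose text contains `term`
    (case-insensitive), in order of first occurrence."""
    needle = term.lower()
    if not needle:
        return []
    # Lowercase first so offsets are computed on the text actually searched.
    lowered = [p.lower() for p in pages]
    starts = build_page_starts(lowered)
    text = "\n".join(lowered)
    found = []
    pos = text.find(needle)
    while pos != -1:
        page = page_for_offset(starts, pos)
        if page not in found:
            found.append(page)
        pos = text.find(needle, pos + 1)
    return found
```

The same offset table could in principle be stored alongside the indexed
field, so that match offsets reported at query time are translated to pages
without touching the original file again.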

I do have a related question: would you want strictly integer page numbers,
where the first page of any front matter is "1", or the actual literal page
numbers (e.g. "iii" or "A-1")? The former is simpler but incorrect if the
user thinks they can simply look for that page number in the document.
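
To make the difference concrete, here is a small Python sketch that turns a
physical (integer) page position into a printed-style label. It assumes,
purely for illustration, that the front matter is numbered in lowercase
Roman numerals and that the body restarts at 1. Real documents can use other
schemes (such as "A-1"), in which case the labels would have to be read from
the file itself. The function names are hypothetical.

```python
def to_roman(n):
    """Convert a positive integer to a lowercase Roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def page_label(physical_page, front_matter_pages):
    """Map a 1-based physical page position to a printed-style label.

    Assumption: the first `front_matter_pages` pages are labeled i, ii,
    iii, ... and the body that follows is labeled 1, 2, 3, ...
    """
    if physical_page < 1:
        raise ValueError("physical_page must be >= 1")
    if physical_page <= front_matter_pages:
        return to_roman(physical_page)
    return str(physical_page - front_matter_pages)
```

With four pages of front matter, physical page 3 is labeled "iii" while
physical page 5 is labeled "1", which is why a plain integer can point the
user at the wrong printed page.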

-- Jack Krupansky
