Posted to solr-user@lucene.apache.org by AHMET ARSLAN <io...@yahoo.com> on 2008/04/14 17:58:27 UTC

capturing page numbers

I have extracted text from .pdf files and I also
inserted the page numbers of the .pdf file into the text. My
document looks something like:

  <content>
   <page no="2"> ..Some Text..</page>
   <page no="3"> ..Some Text..</page>
   ..................................
   ...........................</page>
  </content>

I indexed my data using Solr and I am making
highlighted queries (hl.fragsize=200&hl.snippets=5).
Currently I am displaying just the snippets to the user,
however I also want to capture the page number of the
corresponding snippet. I will give a link to jump to that
page in the original pdf file.

Is there a way that I can find out which page the
snippet was extracted from, by using the <page> tags?

Any ideas or help would be appreciated.

Thank you...



RE: capturing page numbers

Posted by "Binkley, Peter" <Pe...@ualberta.ca>.
Tricia Williams is working on this problem in
https://issues.apache.org/jira/browse/SOLR-380, and there is a patch you
can try (instructions at
https://issues.apache.org/jira/browse/SOLR-380?focusedCommentId=12541699#action_12541699).
It uses Lucene payloads to carry the page
information, and requires a current version of Lucene.

The alternative is to index each page as its own Solr document, and make
your application do the work of grouping the results from each source
pdf.
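
A rough sketch of that page-per-document alternative, in Python, may help.
It is not from the SOLR-380 patch and it does not talk to Solr at all: it
only shows splitting the <content> markup into one record per <page> and
regrouping hits by source pdf afterwards. The field names (id, source_pdf,
page, text, snippet), the "report.pdf" name and both helper functions are
assumptions made up for illustration, and the sample assumes the extracted
text is well-formed XML.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_pages(source_pdf, content_xml):
    """Turn one <content> document into one index-ready dict per <page>.

    The field names are hypothetical; adapt them to your own schema.
    """
    root = ET.fromstring(content_xml)
    docs = []
    for page in root.iter("page"):
        page_no = int(page.get("no"))
        docs.append({
            "id": f"{source_pdf}#page={page_no}",  # unique key per page
            "source_pdf": source_pdf,              # used later for grouping
            "page": page_no,
            "text": (page.text or "").strip(),
        })
    return docs


def group_by_source(hits):
    """Group per-page hits by source pdf, pages in ascending order."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["source_pdf"]].append((hit["page"], hit["snippet"]))
    return {pdf: sorted(pages) for pdf, pages in grouped.items()}


sample = """<content>
 <page no="2">first page text</page>
 <page no="3">second page text</page>
</content>"""

page_docs = split_pages("report.pdf", sample)

# Stand-in for search results: pretend every page matched and the
# highlighter returned the page text as its snippet.
hits = [
    {"source_pdf": d["source_pdf"], "page": d["page"], "snippet": d["text"]}
    for d in page_docs
]
results = group_by_source(hits)
```

Because every hit carries its own page field, the page number for a snippet
comes straight from the matching record, and the application can build the
jump link from source_pdf and page without parsing the <page> tags at query
time.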

Peter
