Posted to users@pdfbox.apache.org by Arovit Narula <ar...@gmail.com> on 2010/04/30 16:32:45 UTC

pdfbox

Hello,
I am a software engineering student from India. I am building a search
engine for PDF documents and plan to use your parser, PDFBox, for this
purpose. I need to know the page number of every word. Is there a way to
extract the page number along with each word? Please help me with this.
Your reply would be highly appreciated.

Regards,
Arovit Narula

Re: pdfbox

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

Adam@swmc.com wrote:
> Take a look at the examples (src\main\java\org\apache\pdfbox\examples) and 
> utils (src\main\java\org\apache\pdfbox\util) for examples with text 
> extraction.  
The PDFTextStripper class lets you define the start and the end page of the 
extraction. If you extract the text of your PDFs page by page, you will always 
know the page number of every word you've extracted.
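
To illustrate the page-by-page idea, here is a minimal sketch of the indexing 
side only. It assumes you already have one String per page, for example 
collected by calling setStartPage(p) and setEndPage(p) on a PDFTextStripper 
and then getText(document) for each p from 1 to document.getNumberOfPages(). 
The class name PageWordIndex, the build method and the simple tokenization 
rule are hypothetical and not part of PDFBox.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of PDFBox: maps each word to the pages it occurs on.
class PageWordIndex {

    // pageTexts.get(0) is assumed to hold the text of page 1, and so on.
    static Map<String, SortedSet<Integer>> build(List<String> pageTexts) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            int pageNumber = i + 1; // report 1-based page numbers
            // Assumption: a "word" is a run of letters or digits.
            for (String token : pageTexts.get(i).split("[^\\p{L}\\p{N}]+")) {
                if (token.isEmpty()) {
                    continue; // split() yields an empty first token if the text starts with a separator
                }
                String word = token.toLowerCase(Locale.ROOT);
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(pageNumber);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted page by page from a PDF.
        List<String> pages = Arrays.asList("Hello world", "hello PDF");
        System.out.println(build(pages)); // prints {hello=[1, 2], pdf=[2], world=[1]}
    }
}
```

Lower-casing the words and keeping the page numbers in a sorted set is just 
one possible design; a real search engine would more likely hand the per-page 
text to an indexing library.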

BR
Andreas Lehmkühler



Re: pdfbox

Posted by Ad...@swmc.com.
Take a look at the examples (src\main\java\org\apache\pdfbox\examples) and 
utils (src\main\java\org\apache\pdfbox\util) for examples with text 
extraction.  You may also be interested in Lucene, the search engine 
that PDFBox is used with.  See 
src\main\java\org\apache\pdfbox\searchengine\lucene for more info on that.

---- 
Thanks,
Adam