Posted to users@pdfbox.apache.org by "Duseja, Sushil" <su...@fiserv.com> on 2008/11/10 15:39:03 UTC

Text Extraction

Hello,

Can anyone please tell me how to extract text from a PDF file (with
multiple forms) using PDFBox? Is creating and accessing bookmarks the
way to go? If possible, please point me to some working examples.

Thanks.

Re: Text Extraction

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

Please don't cross-post to pdfbox-dev. All devs are expected to also be
on the user list. Thanks.

If your PDF actually contained PDF forms (which it doesn't), you could
use the ExtractFDF or ExtractXFDF tool to extract the form data. But
your Tax.pdf has the form mixed with the form data as normal text, and
there are no structure tags that identify particular values. The only
thing you can do is use the ExtractText tool, as suggested earlier, and
try to construct rules that find the values you're looking for in the
extracted text. I don't expect that to work reliably, though. The
alternative is to get your PDF producer to generate either PDF forms or
structure tags in the content. The latter is probably more difficult,
and I don't know whether PDFBox would be of any help in extracting the
values. PDF forms are most probably the way to go.
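
To illustrate what "constructing rules" on the extracted text might look
like, here is a minimal sketch in plain Java. It assumes the text has
already been extracted (for example with the ExtractText tool) and that
a label such as "Account Number" is followed by its value on the same
line, optionally after a colon. The class name, method name, rule and
sample text are all hypothetical and not taken from Tax.pdf; a real
document will likely need different and more tolerant rules.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the value that follows a label on the same line.
class LabelValueFinder {

    // Assumed rule: label, optional colon, then the value up to the end of the line.
    public static Optional<String> findValue(String extractedText, String label) {
        Pattern rule = Pattern.compile(
                Pattern.quote(label) + "[ \\t]*:?[ \\t]*(\\S.*?)[ \\t]*$",
                Pattern.MULTILINE);
        Matcher m = rule.matcher(extractedText);
        // Only the first occurrence of the label is considered.
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up stand-in for text produced by a text extraction step.
        String sample = "Tax Statement 2008\n"
                + "Account Number: 12345-678\n"
                + "Name: J. Doe\n";
        // prints 12345-678
        System.out.println(findValue(sample, "Account Number").orElse("(not found)"));
    }
}
```

A rule like this breaks as soon as the label and the value end up on
different lines or in a different order in the extracted text, which is
one reason the approach is unlikely to be reliable.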

On 13.11.2008 13:28:58 Duseja, Sushil wrote:
> Thank you very much for the response.
> 
> I have gone through the links mentioned below; however that didn't help
> me.
> 
> The pdf I want to extract the text from, contains multiple forms. I have
> attached a sample pdf for your kind reference.
> 
> Please advise as to how I can fetch a particular value (ex. Account
> Number).
> 
> Thanks again.
> 
> 
> -----Original Message-----
> From: Jeremias Maerki [mailto:dev@jeremias-maerki.ch] 
> Sent: Thursday, November 13, 2008 5:36 PM
> To: pdfbox-users@incubator.apache.org
> Subject: Re: Text Extraction
> 
> Have you looked at the documentation already?
> 
> 0.7.3 release:
> http://pdfbox.org/userguide/text_extraction.html
> 
> Development code:
> http://incubator.apache.org/pdfbox/userguide/text_extraction.html
> 
> You can also look at the "ExtractText" tool's source code for another
> working example to extract text from a PDF.
> 
> On 13.11.2008 11:27:04 Duseja, Sushil wrote:
> > Can anyone kindly respond to my question below?
> >
> > Thanks!
> >
> > -----Original Message-----
> > From: Duseja, Sushil
> > Sent: Monday, November 10, 2008 8:09 PM
> > To: pdfbox-users@incubator.apache.org
> > Subject: Text Extraction
> >
> > Hello,
> >
> > Can anyone please let me know as to how can I extract text from a pdf
> > file (with multiple forms) using PDFBox? Is creating and accessing
> > bookmarks the way to go? If possible, please point me to some working
> > examples.
> >
> > Thanks.
>
> Jeremias Maerki

Jeremias Maerki

RE: Text Extraction

Posted by "Duseja, Sushil" <su...@fiserv.com>.

Thank you very much for the response.

I have gone through the links mentioned below; however, they didn't
help me.

The PDF I want to extract the text from contains multiple forms. I have
attached a sample PDF for your reference.

Please advise how I can fetch a particular value (e.g. Account Number).

Thanks again.

-----Original Message-----
From: Jeremias Maerki [mailto:dev@jeremias-maerki.ch] 
Sent: Thursday, November 13, 2008 5:36 PM
To: pdfbox-users@incubator.apache.org
Subject: Re: Text Extraction

Have you looked at the documentation already?

0.7.3 release:
http://pdfbox.org/userguide/text_extraction.html

Development code:
http://incubator.apache.org/pdfbox/userguide/text_extraction.html

You can also look at the "ExtractText" tool's source code for another
working example to extract text from a PDF.

On 13.11.2008 11:27:04 Duseja, Sushil wrote:
> Can anyone kindly respond to my question below?
>
> Thanks!
>
> -----Original Message-----
> From: Duseja, Sushil
> Sent: Monday, November 10, 2008 8:09 PM
> To: pdfbox-users@incubator.apache.org
> Subject: Text Extraction
>
> Hello,
>
> Can anyone please let me know as to how can I extract text from a pdf
> file (with multiple forms) using PDFBox? Is creating and accessing
> bookmarks the way to go? If possible, please point me to some working
> examples.
>
> Thanks.

Jeremias Maerki

Re: Text Extraction

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

Have you looked at the documentation already?

0.7.3 release:
http://pdfbox.org/userguide/text_extraction.html

Development code:
http://incubator.apache.org/pdfbox/userguide/text_extraction.html

You can also look at the "ExtractText" tool's source code for another
working example to extract text from a PDF.

On 13.11.2008 11:27:04 Duseja, Sushil wrote:
> Can anyone kindly respond to my question below?
>
> Thanks!
>
> -----Original Message-----
> From: Duseja, Sushil
> Sent: Monday, November 10, 2008 8:09 PM
> To: pdfbox-users@incubator.apache.org
> Subject: Text Extraction
>
> Hello,
>
> Can anyone please let me know as to how can I extract text from a pdf
> file (with multiple forms) using PDFBox? Is creating and accessing
> bookmarks the way to go? If possible, please point me to some working
> examples.
>
> Thanks.

Jeremias Maerki