Posted to users@pdfbox.apache.org by Niranjan Rao <ni...@fiport.com> on 2011/08/16 04:03:28 UTC

Text extraction problem using PDFBox

Hi there,

I am completely new to PDFBox, so chances are high that I am missing 
something. Apologies also for attaching the screenshot and for the large 
mail size. I am not sure what protocol this list follows, as this is my 
first mail.

I am working with a PDF file and trying to extract some text using 
PDFTextStripper as well as my own code based on the text stripper. A lot 
of text is missing from the extracted output.

I tried the PDFReader application, and even it is missing the text in 
question. I have confirmed that it is text, as I am able to copy it from 
Adobe's PDF viewer.

I would have loved to attach the PDF, but unfortunately it's a credit 
card statement and cannot be distributed easily. I can try to debug the 
code if someone is willing to guide me. As far as I can see from my 
debugging so far, the operator hierarchy looks as follows; the missing 
text is the highlighted BT/q/q/Tj, which the PDFBox code is not reading. 
The text node Tj directly under BT is read properly. The screenshot was 
obtained using PDFEdit on Ubuntu.




This is on Ubuntu / Java 1.6 / PDFBox 1.6.0.

Regards,

Niranjan

Re: Text extraction problem using PDFBox

Posted by Niranjan Rao <nh...@gmail.com>.
I did some more debugging, and the problem seems to be the handling of 
the BI operator. If I understand the code correctly, it tries to read 
image stream data until it sees EI. However, the actual image stream 
seems to be much smaller, and I can see a lot of other operators 
embedded as part of the BI.
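
To illustrate what I mean, here is a minimal sketch of how an inline
image (BI ... ID <data> EI) is delimited in a content stream. This is not
PDFBox's actual parser; the class and method names (InlineImageScan,
findOperator, skipInlineImage) are made up for illustration, and it uses
only the JDK. It assumes EI counts as the terminator only when it is
surrounded by whitespace, which is a heuristic: binary image data can
contain the same bytes by chance.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: shows how BI ... ID <data> EI is delimited.
class InlineImageScan {

    /** True for the PDF whitespace bytes (NUL, TAB, LF, FF, CR, SPACE). */
    static boolean isWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    /**
     * Finds a two-byte operator (such as ID or EI) that is delimited by
     * whitespace or by the stream boundaries, searching from index 'from'.
     * Returns the index of its first byte, or -1 if it is not found.
     */
    static int findOperator(byte[] s, char c0, char c1, int from) {
        for (int i = Math.max(from, 0); i + 1 < s.length; i++) {
            boolean match = s[i] == c0 && s[i + 1] == c1;
            boolean wsBefore = i == 0 || isWhitespace(s[i - 1]);
            boolean wsAfter = i + 2 == s.length || isWhitespace(s[i + 2]);
            if (match && wsBefore && wsAfter) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns the index just past the EI that closes the inline image whose
     * BI operator starts at biIndex, or -1 if ID or EI cannot be found.
     */
    static int skipInlineImage(byte[] s, int biIndex) {
        int id = findOperator(s, 'I', 'D', biIndex + 2);
        if (id < 0) {
            return -1;
        }
        // Image data starts after "ID" plus a single whitespace byte.
        int ei = findOperator(s, 'E', 'I', id + 3);
        return ei < 0 ? -1 : ei + 2;
    }

    public static void main(String[] args) {
        String stream = "BI /W 1 /H 1 /BPC 8 /CS /G ID x EI BT (hidden text) Tj ET";
        byte[] bytes = stream.getBytes(StandardCharsets.ISO_8859_1);
        int end = skipInlineImage(bytes, 0);
        // Everything after 'end' is ordinary content that a text extractor should still see.
        System.out.println(stream.substring(end).trim());
    }
}
```

If the scanner picks a later EI than the real one, or never finds one,
every operator in between (including BT ... Tj ... ET text blocks) is
treated as image data and never reaches the text extraction code. That
would match what I am seeing.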

Thanks,

Niranjan
