Posted to users@pdfbox.apache.org by Niranjan Rao <ni...@fiport.com> on 2011/08/16 04:03:28 UTC
Text extraction problem using PDFBox
Hi there,
I am completely new to PDFBox, so chances are high that I am missing
something. I also apologise for attaching a screenshot and for the large
mail size. I am not sure what protocol this list follows, as this is my
first mail.
I am working with a PDF file and trying to extract some text using
PDFTextStripper as well as my own code based on the text stripper. A lot
of text is missing from the extracted output.
I tried the PDFReader application, and it is also missing the text in
question. I have confirmed that it is text, as I am able to copy it from
the Adobe PDF viewer.
I would have loved to attach the PDF, but unfortunately it's a credit
card statement and cannot be distributed easily. I can try to debug the
code if someone is willing to guide me. From my debugging so far (as far
as I can see), the operator hierarchy looks as shown in the attached
screenshot, where the missing text is highlighted. It sits at
BT/q/q/Tj, which the PDFBox code is not reading. The text node Tj
directly under BT is read properly. The screenshot was obtained using
PDFEdit on Ubuntu.
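To make the BT/q/q/Tj notation concrete, here is a minimal sketch in plain Java. It does not use PDFBox and is not how PDFBox parses content streams; the class name OperatorNesting, the method tjPaths, and the sample stream are all made up for illustration. It assumes a simplified content stream whose tokens are separated by whitespace and whose string operands contain no spaces.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class OperatorNesting {
    /** Returns one path such as "BT/q/q/Tj" for every Tj operator found. */
    static List<String> tjPaths(String contentStream) {
        Deque<String> open = new ArrayDeque<String>();   // currently open BT / q scopes
        List<String> paths = new ArrayList<String>();
        for (String token : contentStream.trim().split("\\s+")) {
            if (token.equals("BT") || token.equals("q")) {
                open.addLast(token);                     // scope opens
            } else if (token.equals("ET") || token.equals("Q")) {
                if (!open.isEmpty()) {
                    open.removeLast();                   // scope closes
                }
            } else if (token.equals("Tj")) {
                StringBuilder path = new StringBuilder();
                for (String scope : open) {              // outermost scope first
                    path.append(scope).append('/');
                }
                paths.add(path.append("Tj").toString());
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Hypothetical stream: one Tj directly under BT, one nested under q/q.
        String stream = "BT (Visible) Tj q q (Missing) Tj Q Q ET";
        System.out.println(tjPaths(stream));             // prints [BT/Tj, BT/q/q/Tj]
    }
}
```

In the sample stream, the first Tj corresponds to the text that is extracted and the second to the text that goes missing.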
This is on Ubuntu / Java 1.6 / PDFBox 1.6.0
Regards,
Niranjan
Re: Text extraction problem using PDFBox
Posted by Niranjan Rao <nh...@gmail.com>.
I did some more debugging, and the problem seems to be in the handling
of the BI operator. If I understand the code correctly, it tries to read
the image stream data until it sees EI. However, the actual image stream
seems to be much smaller than what gets read, and I can see a lot of
other operators embedded as part of BI.
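To illustrate what I mean by reading until EI, here is a minimal sketch in plain Java. It is not PDFBox's actual implementation; the class name InlineImageScan, the helper names, and the sample stream are hypothetical. It assumes the terminating EI is preceded by whitespace and followed by whitespace or end of stream, which is only a heuristic, since binary image data could contain the same byte pattern.

```java
import java.nio.charset.Charset;

public class InlineImageScan {
    /** PDF whitespace bytes: NUL, TAB, LF, FF, CR, SPACE. */
    static boolean isWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    /** Returns the index of the 'E' in the terminating EI, or -1 if none is found. */
    static int findEndOfInlineImage(byte[] s, int from) {
        for (int i = Math.max(from, 1); i + 1 < s.length; i++) {
            boolean marker = s[i] == 'E' && s[i + 1] == 'I';
            boolean before = isWhitespace(s[i - 1]);
            boolean after = i + 2 == s.length || isWhitespace(s[i + 2]);
            if (marker && before && after) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        // Hypothetical stream: the 4 image bytes contain "EI", but not as a delimited token.
        String stream = "BI /W 4 /H 1 /BPC 8 /CS /G ID \u00ffEI\u0001 EI BT (Text after image) Tj ET";
        byte[] bytes = stream.getBytes(latin1);
        int dataStart = stream.indexOf("ID ") + 3;       // first byte of image data
        int ei = findEndOfInlineImage(bytes, dataStart);
        // Everything after EI must still be parsed as ordinary operators.
        System.out.println(new String(bytes, ei + 2, bytes.length - ei - 2, latin1).trim());
    }
}
```

If the scan stops too late, or never finds EI, the operators that follow (including BT ... Tj ... ET) are consumed as image data, which would match the symptom of text going missing.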
Thanks,
Niranjan