Posted to dev@pdfbox.apache.org by -A <aa...@hrtmn.net> on 2014/07/07 19:50:10 UTC

Custom PDFTextStripper Warning (sometimes)

Hi everyone. I have written a program with two PDF-related requirements:


   1. It must be able to return all of the text from the file
   2. It must be able to find red text within the file


I have two different types of PDF files. One we can call a Job Output File,
which may or may not have red text in it. The other is a Job Location File
which contains a table with all of the locations of the Job Output Files.
Originally I wrote the program with a custom text stripper which simply
adds a boolean state flag to track whether it found red in a given file. I
then overrode the processTextPosition method as follows:

[I found this approach through research; if there is a better one, by all
means share it.]

    @Override
    protected void processTextPosition(TextPosition textPos)
    {
        try
        {
            PDGraphicsState graphicsState = getGraphicsState();

            // IF the current text contains RED
            if (graphicsState.getNonStrokingColor().getJavaColor().getRed() == 255)
            {
                this.hasRed = true;
            }
        }
        catch (IOException ioe)
        {
            ioe.printStackTrace();
        }
    }
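
A side note on the colour test above: getRed() == 255 is also true for white, yellow and magenta, since those colours have a full red channel too. Below is a minimal sketch of a stricter check, assuming java.awt.Color values such as the one getJavaColor() returns. The RedCheck class and its thresholds (200 and 60) are hypothetical choices for illustration, not part of PDFBox, and would need tuning for real files.

```java
import java.awt.Color;

/** Hypothetical helper (not part of PDFBox): decides whether an RGB colour reads as red. */
final class RedCheck {

    private RedCheck() {
    }

    /** True when red is at least minRed and both green and blue are at most maxOther. */
    static boolean isRed(Color c, int minRed, int maxOther) {
        return c.getRed() >= minRed
                && c.getGreen() <= maxOther
                && c.getBlue() <= maxOther;
    }

    /** Same check with assumed default thresholds (200 for red, 60 for green and blue). */
    static boolean isRed(Color c) {
        return isRed(c, 200, 60);
    }

    public static void main(String[] args) {
        System.out.println(isRed(Color.RED));    // true
        System.out.println(isRed(Color.WHITE));  // false, although getRed() == 255
        System.out.println(isRed(Color.YELLOW)); // false, although getRed() == 255
    }
}
```

Inside processTextPosition, the condition could then become something like RedCheck.isRed(graphicsState.getNonStrokingColor().getJavaColor()).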

If I run the program on a Job Output File it works flawlessly. If I run it
on a Job Location File (which will never have red in it), I get the
following warning:

org.apache.pdfbox.util.operator.pagedrawer.FillEvenOddRule process
WARNING: java.lang.ClassCastException: MyPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
java.lang.ClassCastException: MyPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
    at org.apache.pdfbox.util.operator.pagedrawer.FillEvenOddRule.process(FillEvenOddRule.java:56)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at MyPDFStripper.containsRed(IncrementalPDFStripper.java:68)


The program will generate NO warnings if I comment out the method call for
containsRed when passing it a Job Location File. Knowing this, I could get
around this warning rather easily by handling this case differently (which
it would be, but this is what testing is for; right?). But my question to
all of you is, why am I getting this? Is it because this Job Location File
has locations in a table that is throwing off the TextStripper? This is the
only difference between the files (neither contains images) that I can tell.


Thank you guys for your time!
Sincerely,
Aaron

Re: Custom PDFTextStripper Warning (sometimes)

Posted by -A <aa...@hrtmn.net>.
John:

Excellent! That fixed it. I appreciate the fast reply. I've been scouring
about for any PDFBox resources I could find and unfortunately have not
found much. If there are any sites or books that go over the API that you
would recommend, then by all means, please do.

Thanks again though!

-Aaron



Re: Custom PDFTextStripper Warning (sometimes)

Posted by John Hewson <jo...@jahewson.com>.
Hi Aaron

You’re using the operator classes from the “org.apache.pdfbox.util.operator.pagedrawer” package with your custom PDFTextStripper, but these classes are only for use with a PageDrawer. If you look at the top entry in the stack trace, “org.apache.pdfbox.util.operator.pagedrawer.FillEvenOddRule.process(FillEvenOddRule.java:56)”, you’ll see that the code at this line is:

PageDrawer drawer = (PageDrawer)context;

But your context class is a PDFTextStripper (or at least a subclass of it), not a PageDrawer. The solution is to not initialise your PDFTextStripper with the .properties file that maps the PageDrawer operators. Take a look at some of the PDFTextStripper subclasses already in PDFBox to see how this is done.
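
To illustrate why the warning only appears for some files, here is a simplified, self-contained sketch of operator dispatch. Every class name in it (OperatorDispatchSketch, Engine, Drawer, Stripper, Handler) is a hypothetical stand-in, not a PDFBox class; only the operator names are real PDF operators: "Tj" shows text and "f*" fills a path with the even-odd rule. The point it tries to show is that a handler only runs when the content stream actually contains its operator, so the failing cast is only reached for files that use f*.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-ins for a stream engine, a text stripper and a page drawer. */
final class OperatorDispatchSketch {

    /** A handler is called with the engine that is processing the stream (the "context"). */
    interface Handler {
        void process(Engine context);
    }

    static class Engine {
        final Map<String, Handler> handlers = new HashMap<>();
        final List<String> warnings = new ArrayList<>();
        int textRuns;

        /** Runs the mapped handler for each operator; unmapped operators are skipped. */
        void processStream(String... operators) {
            for (String op : operators) {
                Handler handler = handlers.get(op);
                if (handler != null) {
                    handler.process(this);
                }
            }
        }
    }

    static class Drawer extends Engine {
        int fills;
    }

    static class Stripper extends Engine {
    }

    /** Works with any engine. */
    static final Handler SHOW_TEXT = context -> context.textRuns++;

    /** Drawer-only handler: the cast fails when the context is a Stripper. */
    static final Handler FILL_EVEN_ODD = context -> {
        try {
            Drawer drawer = (Drawer) context;
            drawer.fills++;
        } catch (ClassCastException e) {
            // Recorded rather than rethrown, like the logged warning in the stack trace.
            context.warnings.add("f*: " + e.getClass().getName());
        }
    };

    /** Processes the operators with a Stripper and returns the warnings it collected. */
    static List<String> run(boolean mapFillOperator, String... operators) {
        Stripper stripper = new Stripper();
        stripper.handlers.put("Tj", SHOW_TEXT);
        if (mapFillOperator) {
            stripper.handlers.put("f*", FILL_EVEN_ODD);
        }
        stripper.processStream(operators);
        return stripper.warnings;
    }

    public static void main(String[] args) {
        System.out.println(run(true, "Tj", "Tj").size());  // 0: no fill operator in the stream
        System.out.println(run(true, "Tj", "f*").size());  // 1: the cast to Drawer fails
        System.out.println(run(false, "Tj", "f*").size()); // 0: f* is not mapped, so it is skipped
    }
}
```

With the drawer mapping in place, a text-only stream produces no warning, which would be consistent with the Job Output Files working. Presumably the table in the Job Location File is drawn with filled paths, but that is a guess without seeing the file.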

-- John
