
Posted to users@pdfbox.apache.org by Doug Sackin <ds...@gmail.com> on 2013/04/01 20:13:25 UTC

OutOfMemoryError from FlateFilter (could be PDFBOX-453 again)

I appear to have something similar to the bug identified and fixed in
PDFBOX-453 - FlateFilter.decode() throwing OutOfMemoryError.

I'm doing text extraction through Twister Data Framework using Tika 1.2
which calls PDFBox. I have PDFBox 1.7. My OS is Scientific Linux 5.8. Java
is JDK 1.6.0_37.

The offending exception is below:

Caused by: java.lang.OutOfMemoryError
    at java.util.zip.Inflater.inflateBytes(Native Method)
    at java.util.zip.Inflater.inflate(Inflater.java:238)
    at java.util.zip.Inflater.inflate(Inflater.java:256)
    at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:279)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
    at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
    at org.apache.pdfbox.pdmodel.common.COSStreamArray.getUnfilteredStream(COSStreamArray.java:196)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:108)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:253)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:237)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:217)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:448)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)

Before that, I have a long string of exceptions from PDFBox attempts on PDF
files, interspersed with "FlateFilter: stop reading corrupt stream due to a
DataFormatException". These are in the attached log file.

The other exceptions are IndexOutOfBoundsException, ClassCastException,
NegativeArraySizeException, NullPointerException, IOException (regarding
font(COSName}F2}) in map{}), and IllegalArgumentException. These may or may
not be related (the exceptions appear on different files), but I wonder
whether they corrupted the stream enough that PDFBox attempted to inflate
corrupt data.
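
For illustration, the defensive pattern hinted at by the "stop reading
corrupt stream" log message can be sketched with java.util.zip.Inflater
alone. This is a minimal sketch, not PDFBox's actual FlateFilter code: the
class BoundedInflate, its methods, and the maxBytes cap are hypothetical
names introduced here. It keeps whatever was decoded before a
DataFormatException and refuses to grow the output past a limit. It may not
prevent an OutOfMemoryError raised inside the native inflate call itself.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox: inflates zlib data defensively.
final class BoundedInflate {

    /**
     * Inflates zlib-wrapped data and returns whatever was decoded before the
     * stream ended, turned out to be corrupt, or reached maxBytes.
     */
    static byte[] inflate(byte[] compressed, int maxBytes) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished() && out.size() < maxBytes) {
                int room = Math.min(buffer.length, maxBytes - out.size());
                int n = inflater.inflate(buffer, 0, room);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input, or a preset dictionary is required
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            // Corrupt stream: keep the bytes decoded so far and stop reading.
        } finally {
            inflater.end(); // release the native zlib state
        }
        return out.toByteArray();
    }

    /** Compresses data with Deflater; used only by the demo in main. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] packed = deflate("hello hello hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(inflate(packed, 1024), StandardCharsets.UTF_8));
        System.out.println("capped length: " + inflate(packed, 5).length);
        System.out.println("corrupt length: " + inflate(new byte[] {1, 2, 3, 4}, 1024).length);
    }
}
```

The cap bounds Java heap use for the decoded bytes only; choosing it is a
judgment call that depends on the largest legitimate stream expected.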

If it is the same issue, it was reported to be fixed in 0.8. If it is a new
issue, is it possible to fix it? I cannot provide any of the source PDF
files (client data), but I am attaching the log output containing all of
the exception traces including the final OutOfMemoryError.

Thanks for any insights.

Doug

Re: OutOfMemoryError from FlateFilter (could be PDFBOX-453 again)

Posted by Doug Sackin <ds...@gmail.com>.

Per Maruan's suggestion, I tracked down the bad file and ran the
ExtractText command line utility on it. I only had access to pdfbox-1.7.1
on the system. The file definitely appears to be corrupt: I cannot open it
with either Adobe or PDFBox ExtractText.

Using java -jar pdfbox-app-1.7.1.jar ExtractText bad_file.pdf, I get:

java.io.IOException: Error: Header doesn't contain versioninfo
    at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:315)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1090)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1055)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:980)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:211)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:84)
    at org.apache.pdfbox.PDFBox.main(ExtractText.java:42)


Using java -jar pdfbox-app-1.7.1.jar ExtractText -nonSeq bad_file.pdf, I
get:

java.io.IOException: Error: Missing end of file marker '%%EOF'
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.getStartxrefOffset(NonSequentialPDFParser.java:456)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:233)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:574)
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1124)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:207)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:84)
    at org.apache.pdfbox.PDFBox.main(ExtractText.java:42)
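
Both errors above appear to concern structural markers a PDF is expected to
carry: a "%PDF-" header and a trailing "%%EOF". A minimal pre-screening
sketch follows, assuming Java 7 or later and assuming it is acceptable to
skip such files before they reach Tika. PdfPrescreen and its methods are
hypothetical names, not PDFBox or Tika API, and the 1024-byte search windows
are an assumption rather than the rule either parser applies.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical pre-screen, not part of PDFBox or Tika.
final class PdfPrescreen {

    /** True if "%PDF-" occurs within the first 1024 bytes (assumed window). */
    static boolean hasHeader(byte[] data) {
        return indexOf(data, "%PDF-", 0, Math.min(data.length, 1024)) >= 0;
    }

    /** True if "%%EOF" occurs within the last 1024 bytes (assumed window). */
    static boolean hasEofMarker(byte[] data) {
        return indexOf(data, "%%EOF", Math.max(0, data.length - 1024), data.length) >= 0;
    }

    /** Returns the first index of marker in data[from, to), or -1 if absent. */
    static int indexOf(byte[] data, String marker, int from, int to) {
        byte[] m = marker.getBytes(StandardCharsets.US_ASCII);
        for (int i = from; i + m.length <= to; i++) {
            int j = 0;
            while (j < m.length && data[i + j] == m[j]) {
                j++;
            }
            if (j == m.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfPrescreen <file>");
            return;
        }
        // Reads the whole file into memory; fine for a sketch, not for huge files.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("header=" + hasHeader(data) + " eof=" + hasEofMarker(data));
    }
}
```

A file that fails either check could be logged and skipped instead of being
handed to the text extraction pipeline.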

Now I will try a small app using the utility classes from the stack trace
to see if the same exception shows up.

Thank you for the tip.

Doug


On Tue, Apr 9, 2013 at 8:52 AM, Doug Sackin <ds...@gmail.com> wrote:

> Has anyone else encountered recent problems with FlateFilter and
> OutOfMemoryError? Is there any way to trap the problem before it results
> in an OutOfMemoryError?
>
> Thanks
>
> Doug

Re: OutOfMemoryError from FlateFilter (could be PDFBOX-453 again)

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi Doug,

Although this is not an immediate answer to your question, could you try the PDFBox command line tool ExtractText with your PDF and see whether it gives a similar result? Please also try it with the -nonSeq option. Ideally, try PDFBox 1.8.0 in addition to 1.7.0 to see whether the issue is already fixed.

BR
Maruan Sahyoun

On 09.04.2013 at 14:52, Doug Sackin <ds...@gmail.com> wrote:

> Has anyone else encountered recent problems with FlateFilter and
> OutOfMemoryError? Is there any way to trap the problem before it results
> in an OutOfMemoryError?
> 
> Thanks
> 
> Doug

Re: OutOfMemoryError from FlateFilter (could be PDFBOX-453 again)

Posted by Doug Sackin <ds...@gmail.com>.

Has anyone else encountered recent problems with FlateFilter and
OutOfMemoryError? Is there any way to trap the problem before it results
in an OutOfMemoryError?

Thanks

Doug
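
On the question of trapping the error: one option is to catch it at the
per-document boundary so that a single bad file does not end the whole
batch. A minimal sketch, assuming the extraction call can be wrapped in a
Callable; GuardedExtraction, Result, and run are hypothetical names, not
Tika or PDFBox API. Catching OutOfMemoryError is a last resort and only
reasonable when the failed allocation is confined to one task. Running each
file in a separate JVM process is likely the more robust form of isolation.

```java
import java.util.concurrent.Callable;

// Hypothetical wrapper, not part of Tika or PDFBox.
final class GuardedExtraction {

    /** Holds either the extracted text or the failure that stopped extraction. */
    static final class Result {
        final String text;
        final Throwable failure;

        Result(String text, Throwable failure) {
            this.text = text;
            this.failure = failure;
        }

        boolean ok() {
            return failure == null;
        }
    }

    /** Runs one extraction and converts any failure into a Result. */
    static Result run(Callable<String> extraction) {
        try {
            return new Result(extraction.call(), null);
        } catch (OutOfMemoryError e) {
            // Deliberately caught so the caller can skip this document.
            return new Result(null, e);
        } catch (Exception e) {
            // Parser exceptions (IOException, runtime exceptions, ...) end up here.
            return new Result(null, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real parse call; it simulates the failure from the trace.
        Result r = run(new Callable<String>() {
            public String call() {
                throw new OutOfMemoryError("simulated");
            }
        });
        System.out.println(r.ok() ? r.text : "skipped: " + r.failure);
    }
}
```

In a batch job the Callable would wrap the real parse call for one file, and
a failed Result would be logged with the file name before moving on.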

