Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2018/06/11 17:00:00 UTC

[jira] [Closed] (PDFBOX-4243) DataFormatException: "invalid stored block lengths" in FlateFilter

     [ https://issues.apache.org/jira/browse/PDFBOX-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-4243.
-----------------------------------
    Resolution: Won't Fix

I'll do it. The best approach is to run text extraction separately for each page. That way you lose only the page that fails and still get all the rest. (Apache Tika can do this too.)
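
To illustrate the per-page idea without depending on PDFBox, here is a minimal sketch that uses only java.util.zip. It models each page as an independently deflated stream and catches the DataFormatException per page, so one corrupt stream costs only that page. The class name PerPageInflate, its methods, and the hand-built corrupt bytes (a stored block whose length fields do not match) are assumptions made for this example and are not taken from the attached PDF. With PDFBox 2.0.x the same pattern would presumably mean calling PDFTextStripper.setStartPage and setEndPage with the same page number and catching the IOException around each getText call.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper: each "page" is modelled as one zlib stream.
class PerPageInflate {

    // Compress text into a zlib stream (stands in for a page content stream).
    static byte[] deflate(String text) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    // Decompress one stream; corrupt or truncated data raises DataFormatException.
    static String inflate(byte[] data) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated stream");
                }
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            inflater.end();
        }
    }

    // Process every page on its own; a failed page becomes null, the rest survive.
    static List<String> extractAll(List<byte[]> pages) {
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            try {
                texts.add(inflate(pages.get(i)));
            } catch (DataFormatException e) {
                System.err.println("Skipping page " + (i + 1) + ": " + e.getMessage());
                texts.add(null);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        List<byte[]> pages = new ArrayList<>();
        pages.add(deflate("text of page 1"));
        // zlib header (78 01), then a stored block whose LEN (05 00) and
        // NLEN (00 00) are not one's complements of each other.
        pages.add(new byte[] {0x78, 0x01, 0x01, 0x05, 0x00, 0x00, 0x00});
        pages.add(deflate("text of page 3"));
        System.out.println(extractAll(pages));
    }
}
```

Catching the exception inside the loop rather than around it is the whole point: the loop keeps going after a bad page instead of aborting the document.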

> DataFormatException: "invalid stored block lengths" in FlateFilter
> ------------------------------------------------------------------
>
>                 Key: PDFBOX-4243
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4243
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.8, 2.0.9
>         Environment: Java 8 update 172
> Tika parsers 1.17 (and 1.18)
> Windows 7 (and server 2012)
>            Reporter: Isabelle Giguere
>            Priority: Major
>         Attachments: IN_THE_UNITED_STATES_DISTRICT_COURT_(78).pdf
>
>
> The attached PDF document causes this exception. It is similar to PDFBOX-3546, but probably does not have the same root cause.
> Observed with Tika 1.17 + PDFBox 2.0.8, and with Tika 1.18 + PDFBox 2.0.9.
> {noformat}
> org.apache.tika.exception.TikaException: Unable to extract PDF content
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at test.conversion.tika.Conversion.parse(Conversion.java:56)
>     at test.conversion.tika.Conversion.main(Conversion.java:40)
> Caused by: java.io.IOException: java.util.zip.DataFormatException: invalid stored block lengths
>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
>     at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
>     at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:77)
>     at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
>     at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
>     at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:157)
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:91)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:493)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>     ... 6 more
> Caused by: java.util.zip.DataFormatException: invalid stored block lengths
>     at java.util.zip.Inflater.inflateBytes(Native Method)
>     at java.util.zip.Inflater.inflate(Inflater.java:259)
>     at java.util.zip.Inflater.inflate(Inflater.java:280)
>     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:108)
>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
>     ... 21 more
> {noformat}
> Thank you for looking into this.


