Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2014/04/28 23:11:18 UTC

[jira] [Created] (PDFBOX-2048) TextExtraction only working after uncompressing with pdftk

Tilman Hausherr created PDFBOX-2048:
---------------------------------------

             Summary: TextExtraction only working after uncompressing with pdftk
                 Key: PDFBOX-2048
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2048
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing, Rendering, Text extraction
    Affects Versions: 2.0.0
            Reporter: Tilman Hausherr


From Jonas Karlsson on the user list:
===
We have a user with PDFs generated by a commercial transcription service.
When we try to extract text from these pdfs, pdfbox returns a few empty
lines. We get this result both from our own code, and when using the
ExtractText command line tool.

If I specify the non-sequential parser, with the -nonSeq flag, the
following error is produced:

Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream

If I uncompress the file with pdftk, pdfbox is able to successfully extract
the text.
===

I will attach the file as "committers only". Please don't pass it around, and avoid quoting details from it. The file also fails to render. The lengths of the streams are 0.
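
For illustration only, here is a minimal sketch of the kind of fallback a parser can use when a stream's declared length cannot be trusted (for example, when it is 0): trust the declared length only if "endstream" actually follows it, and otherwise scan forward for the "endstream" keyword. This is not PDFBox's actual code; the class and method names (StreamLengthSketch, findStreamLength) are hypothetical, and the sketch assumes a buffer that holds a single stream object.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: recovers a stream's real length
// when the declared /Length value is wrong.
class StreamLengthSketch {

    // Index of the first occurrence of needle in haystack at or after 'from', or -1.
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Returns the number of data bytes between "stream" and "endstream".
    // Assumption: 'pdf' holds exactly one stream object.
    static int findStreamLength(byte[] pdf, int declaredLength) {
        byte[] startKw = "stream".getBytes(StandardCharsets.ISO_8859_1);
        byte[] endKw = "endstream".getBytes(StandardCharsets.ISO_8859_1);

        int kw = indexOf(pdf, startKw, 0);
        if (kw < 0) {
            throw new IllegalArgumentException("no 'stream' keyword found");
        }
        // Skip the end-of-line marker (CRLF or LF) that follows the keyword.
        int dataStart = kw + startKw.length;
        if (dataStart < pdf.length && pdf[dataStart] == '\r') dataStart++;
        if (dataStart < pdf.length && pdf[dataStart] == '\n') dataStart++;

        // Trust the declared length only if "endstream" follows the data,
        // allowing for one optional end-of-line marker in between.
        if (declaredLength > 0) {
            int q = dataStart + declaredLength;
            if (q < pdf.length && pdf[q] == '\r') q++;
            if (q < pdf.length && pdf[q] == '\n') q++;
            if (indexOf(pdf, endKw, q) == q) {
                return declaredLength;
            }
        }

        // Fallback: scan for "endstream". This can be fooled by binary data
        // that happens to contain the keyword, so it is only a heuristic.
        int end = indexOf(pdf, endKw, dataStart);
        if (end < 0) {
            throw new IllegalArgumentException("no 'endstream' keyword found");
        }
        // Drop one end-of-line marker that precedes the keyword.
        if (end > dataStart && pdf[end - 1] == '\n') end--;
        if (end > dataStart && pdf[end - 1] == '\r') end--;
        return end - dataStart;
    }
}
```

With a declared length of 0, the first check is skipped and the length comes entirely from the keyword scan.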



--
This message was sent by Atlassian JIRA
(v6.2#6252)