Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2014/04/28 23:11:18 UTC
[jira] [Created] (PDFBOX-2048) TextExtraction only working after
uncompressing with pdftk
Tilman Hausherr created PDFBOX-2048:
---------------------------------------
Summary: TextExtraction only working after uncompressing with pdftk
Key: PDFBOX-2048
URL: https://issues.apache.org/jira/browse/PDFBOX-2048
Project: PDFBox
Issue Type: Bug
Components: Parsing, Rendering, Text extraction
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
From Jonas Karlsson on the user list:
===
We have a user with PDFs generated by a commercial transcription service.
When we try to extract text from these PDFs, PDFBox returns only a few empty
lines. We get this result both from our own code and when using the
ExtractText command line tool.
If I specify the non-sequential parser, with the -nonSeq flag, the
following error is produced:
Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream
If I uncompress the file with pdftk, PDFBox extracts the text
successfully.
===
I will attach the file as "committers only". Please don't pass it around, and avoid quoting details from it. The file also fails to render. The declared lengths of the streams are 0.
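
For anyone who wants to check a file for this kind of mismatch without opening it in PDFBox, here is a minimal sketch of the idea. It is not PDFBox code and not what validateStreamLength does internally. It is a standalone Python illustration under simplifying assumptions: /Length is a direct integer (not an indirect reference), no nested dictionary follows /Length in the stream dictionary, and the stream data does not itself contain the bytes "endstream". The helper name find_stream_spans is hypothetical.

```python
import re

# Assumption: "/Length <int>" appears directly in the stream dictionary,
# with no nested dictionary between it and the closing ">>".
_STREAM_START = re.compile(rb"/Length\s+(\d+)[^>]*>>\s*stream\r?\n")


def find_stream_spans(data: bytes):
    """Return a list of (declared_length, measured_length) per stream.

    The measured length is found by scanning forward to the next
    "endstream" keyword, similar in spirit to a fallback for files
    whose /Length entry cannot be trusted.
    """
    spans = []
    for match in _STREAM_START.finditer(data):
        declared = int(match.group(1))
        start = match.end()
        end = data.find(b"endstream", start)
        if end == -1:
            continue  # truncated file: no closing keyword found
        # Drop the end-of-line marker that precedes "endstream".
        if end > start and data[end - 1:end] == b"\n":
            end -= 1
        if end > start and data[end - 1:end] == b"\r":
            end -= 1
        spans.append((declared, end - start))
    return spans


if __name__ == "__main__":
    # Made-up sample object, not taken from the attached file.
    sample = b"1 0 obj\n<< /Length 0 >>\nstream\nBT (Hi) Tj ET\nendstream\nendobj\n"
    for declared, measured in find_stream_spans(sample):
        flag = "MISMATCH" if declared != measured else "ok"
        print(f"declared={declared} measured={measured} {flag}")
```

A stream whose declared length is 0 but whose measured length is not is the pattern described above. A parser that trusts /Length would read an empty stream there, which would be consistent with getting only empty lines back.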