Posted to dev@pdfbox.apache.org by "Joel Hirsh (JIRA)" <ji...@apache.org> on 2014/10/29 22:39:35 UTC
[jira] [Created] (PDFBOX-2463) ExtractTextByArea mangling second half of this string - transposed, skipped, etc
Joel Hirsh created PDFBOX-2463:
----------------------------------
Summary: ExtractTextByArea mangling second half of this string - transposed, skipped, etc
Key: PDFBOX-2463
URL: https://issues.apache.org/jira/browse/PDFBOX-2463
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.7
Reporter: Joel Hirsh
A PDF snippet is being completely mangled by ExtractTextByArea. I have a large PDF file where this happens on every line.
Visually (and in Acrobat), the text reads:
12 Jun EP COPY WORKS LIMITED 503646200256 5637 3.70 11,252.49 OD
However, ExtractTextByArea produces:
12 Jun EP COPY WORKS LIMITED 503646200256 35 .6 70
11,
3 257 2.49
OD
So the first half of the string is OK, but starting at '5637', characters are skipped and other characters are inserted, leaving the rest of the line completely mangled.
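One way to narrow down what is happening to the second half is to compare the non-whitespace characters of the expected and extracted strings. The sketch below is only a diagnostic aid, not part of PDFBox: the class name ExtractionCheck and its methods are hypothetical, and the two strings are copied from the text quoted above.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a PDFBox class.
class ExtractionCheck {

    // Count how often each non-whitespace character occurs in s.
    static Map<Character, Integer> charCounts(String s) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : s.toCharArray()) {
            if (!Character.isWhitespace(c)) {
                counts.merge(c, 1, Integer::sum);
            }
        }
        return counts;
    }

    // True when both strings contain the same characters the same number
    // of times, ignoring whitespace and order.
    static boolean sameCharacters(String expected, String extracted) {
        return charCounts(expected).equals(charCounts(extracted));
    }

    public static void main(String[] args) {
        String expected =
            "12 Jun EP COPY WORKS LIMITED 503646200256 5637 3.70 11,252.49 OD";
        String extracted =
            "12 Jun EP COPY WORKS LIMITED 503646200256 35 .6 70\n11,\n3 257 2.49\nOD";
        System.out.println(sameCharacters(expected, extracted));
    }
}
```

If this prints true, every character is still present and only the order and line breaks differ, which would point toward how the glyphs are being positioned or sorted rather than toward lost text. If it prints false, characters really are being dropped or added.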
FWIW, I dumped the COSStrings in PDFStreamEngine and they all appear correct; nothing unusual.
Test file to be attached.