Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2017/06/20 17:15:00 UTC

[jira] [Comment Edited] (PDFBOX-3833) Characters in wrong order

    [ https://issues.apache.org/jira/browse/PDFBOX-3833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056038#comment-16056038 ] 

Tilman Hausherr edited comment on PDFBOX-3833 at 6/20/17 5:14 PM:
------------------------------------------------------------------

This is the output of PrintTextLocations:
{code}
String[45.56,191.7999 fs=1.0 xscale=12.0 height=9.156 space=12.000002 width=12.0]?
String[65.337204,191.7999 fs=1.0 xscale=12.0 height=9.156 space=12.000002 width=12.0]??
{code}
It should be 3 lines, one per character. Also suspicious: the Unicode page for this character,
http://www.fileformat.info/info/unicode/char/30fc/index.htm
mentions that ー (U+30FC) is a "modifier", and {{TextPosition.isDiacritic()}} returns true for it.
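
For illustration, a minimal JDK-only sketch (no PDFBox dependency) of how ー could end up being classified as a diacritic. It assumes that the diacritic test is keyed on Unicode general categories, which is not confirmed in this thread; the class {{ProlongedSoundMarkCheck}} and the helper {{looksLikeDiacritic}} are hypothetical names, not PDFBox API:

```java
// Sketch only: uses java.lang.Character, not PDFBox.
final class ProlongedSoundMarkCheck {

    // Hypothetical helper. ASSUMPTION: a single character counts as
    // "diacritic-like" when its Unicode general category is Mn, Sk or Lm.
    static boolean looksLikeDiacritic(String text) {
        if (text.length() != 1) {
            return false;
        }
        int type = Character.getType(text.charAt(0));
        return type == Character.NON_SPACING_MARK
                || type == Character.MODIFIER_SYMBOL
                || type == Character.MODIFIER_LETTER;
    }

    public static void main(String[] args) {
        // U+30BF, U+30FC, U+30F3 spell the word from the issue report;
        // U+0301 (combining acute accent) is included for comparison.
        String[] samples = {"\u30BF", "\u30FC", "\u30F3", "\u0301"};
        for (String s : samples) {
            char c = s.charAt(0);
            System.out.printf("U+%04X category=%d diacritic-like=%b%n",
                    (int) c, Character.getType(c), looksLikeDiacritic(s));
        }
    }
}
```

Under this assumption ー (general category Lm, "modifier letter") lands in the same bucket as a combining accent, while the other two Katakana do not. That might explain why two characters end up on one output line above.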


> Characters in wrong order
> -------------------------
>
>                 Key: PDFBOX-3833
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3833
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.5
>            Reporter: Christopher Creutzig
>         Attachments: ML_mathworks_unc2.pdf, PDFBOX-3833-reduced.pdf
>
>
> The attached pdf file (which is page 3 of https://jp.mathworks.com/tagteam/89688_93050v00_JP_machine_learning_section1_ebook.pdf) shows multiple problems when reading with PDFBox in standard settings. This bug report in particular is about the Katakana ー being misplaced.
> In the text block on the left, the second line starts with ターン. PDFTextStripper.getText returns text starting with タ ンー (i.e., adding a space after the first character and swapping the second and third one). This effect also happens at other places in the (complete) file.
> The PDF itself at this point has [<03BB>43.9 <0294>156 <03EF>-24.5 ...]TJ, listing the characters in the proper order. Copy&paste using Apple's Preview.App also preserves that order.


