Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (Jira)" <ji...@apache.org> on 2020/04/26 12:51:00 UTC

[jira] [Updated] (PDFBOX-4749) Text Extraction leads to weird result - toUnicodeCMap is 'AdHoc-UCS'

     [ https://issues.apache.org/jira/browse/PDFBOX-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-4749:
---------------------------------------
    Fix Version/s: 3.0.0 PDFBox

> Text Extraction leads to weird result - toUnicodeCMap is 'AdHoc-UCS'
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-4749
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4749
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.18
>            Reporter: Benoit Lacelle
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>         Attachments: PDFBOX-4749-reduced.pdf
>
>
> Consider the attached PDF, specifically the text on the first page:
> "Am Fährweg"
> The code for the first character 'A' is 65 and is decoded correctly, while the code for the fourth character 'F' is 70, which is decoded as 'c'.
> org.apache.pdfbox.pdmodel.font.PDFont.toUnicode(int) relies on a CMap named 'AdHoc-UCS', whose mapping is:
> {129=ü, 3= , 8=%, 9=&, 11=(, 12=), 15=,, 16=-, 17=., 18=/, 19=0, 20=1, 21=2, 22=3, 23=4, 24=5, 25=6, 26=7, 27=8, 28=9, 29=:, 34=?, 36=A, 37=B, 38=C, 39=D, 40=E, 41=F, 42=G, 43=H, 44=I, 46=K, 47=L, 48=M, 49=N, 50=O, 51=P, 53=R, 54=S, 55=T, 56=U, 57=V, 58=W, 59=X, 61=Z, 68=a, 69=b, 70=c, 71=d, 72=e, 73=f, 74=g, 75=h, 76=i, 78=k, 79=l, 80=m, 81=n, 82=o, 83=p, 85=r, 86=s, 87=t, 88=u, 89=v, 90=w, 93=z, 95=|, 108=ä, 124=ö}
> -> 'A' is decoded as 'A' because code 65 has no entry in the CMap, while code 70 for 'F' collides with the entry that maps 70 to 'c'.
> The document is correctly parsed in Acrobat Reader.
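
The collision described above can be reproduced with a small standalone sketch. It does not use PDFBox: the class AdHocCMapSketch, its methods, and the fallback rule (treat an unmapped code as its own character value) are assumptions made for illustration only, and only a handful of entries from the reported mapping are copied in.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not PDFBox code.
final class AdHocCMapSketch {
    // A few entries copied from the 'AdHoc-UCS' mapping quoted in the report.
    static final Map<Integer, String> TO_UNICODE = new HashMap<>();
    static {
        TO_UNICODE.put(36, "A");
        TO_UNICODE.put(41, "F");
        TO_UNICODE.put(48, "M");
        TO_UNICODE.put(70, "c");
        TO_UNICODE.put(80, "m");
    }

    // Assumed fallback: a code without a CMap entry is taken as its own
    // character value, which is why 65 still comes out as 'A'.
    static String toUnicode(int code) {
        String mapped = TO_UNICODE.get(code);
        return mapped != null ? mapped : String.valueOf((char) code);
    }

    // Decodes a sequence of character codes one by one.
    static String decode(int... codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Codes reported for 'A' (65) and 'F' (70); 109 and 32 are the
        // usual codes for 'm' and space, assumed here for "Am F".
        System.out.println(decode(65, 109, 32, 70)); // prints "Am c"
    }
}
```

Under these assumptions, code 65 misses the map and falls through unchanged, while code 70 hits the entry 70=c, which matches the symptom in the report.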


