Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2019/03/12 18:36:00 UTC

[jira] [Updated] (PDFBOX-4481) Text extraction error with Thai combined glyph depending on space after it

     [ https://issues.apache.org/jira/browse/PDFBOX-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-4481:
------------------------------------
    Labels: Thai  (was: )

> Text extraction error with Thai combined glyph depending on space after it
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-4481
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4481
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.14
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: Thai
>         Attachments: SO54981236-reduced.pdf, SO54981236-reduced.txt, SO54981236.pdf
>
>
> In the first extracted line of the reduced file, the "accent" (somebody please correct me on what that thing is called) is extracted separately from its base glyph. In the second line it is in the proper place. Content stream:
> {code}
> BT
>   1 0 0 1 67.3 756.98 Tm
>   [ (\000\203\000\227\000q) ] TJ
>   1 0 0 1 77.5 756.98 Tm
>   [ (\000\003) ] TJ
>   1 0 0 1 67.3 730 Tm
>   [ (\000\203\000\227\000q\000\003) ] TJ
> ET
> {code}
> The weird thing is that the "\003" is just a space. So when the space is in the same string as the combined glyph, the extraction works, and when the space is shown by a separate TJ, it doesn't.
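
To make the comparison concrete, here is a minimal Java sketch that decodes the literal strings from the content stream above into character codes. It assumes the font uses two-byte character codes, which the leading \000 bytes suggest but the issue does not state. The class TjStringCodes and its methods are hypothetical helpers for illustration and are not part of the PDFBox API. Under that assumption, both lines carry the same code sequence, so the only difference is whether the space is shown in the same TJ as the combined glyph.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
final class TjStringCodes {

    /** Decodes the body of a PDF literal string (outer parentheses removed) into raw bytes. */
    static byte[] decodeLiteral(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i++);
            if (c != '\\' || i >= body.length()) {
                out.write(c); // ordinary character: one byte
                continue;
            }
            if (body.charAt(i) >= '0' && body.charAt(i) <= '7') {
                // Octal escape: one to three octal digits form a single byte.
                int value = 0;
                for (int n = 0; n < 3 && i < body.length()
                        && body.charAt(i) >= '0' && body.charAt(i) <= '7'; n++) {
                    value = value * 8 + (body.charAt(i++) - '0');
                }
                out.write(value & 0xFF);
            } else {
                // Sketch only: \\, \( and \) map to themselves; \n, \r, \t etc. are not handled.
                out.write(body.charAt(i++));
            }
        }
        return out.toByteArray();
    }

    /** Groups bytes into two-byte big-endian character codes (assumed encoding). */
    static List<Integer> codes(byte[] bytes) {
        if (bytes.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < bytes.length; i += 2) {
            result.add(((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF));
        }
        return result;
    }

    /** Formats codes as four-digit hex values separated by spaces. */
    static String hex(List<Integer> codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First line of the content stream: two separate TJ operators.
        System.out.println(hex(codes(decodeLiteral("\\000\\203\\000\\227\\000q")))); // 0083 0097 0071
        System.out.println(hex(codes(decodeLiteral("\\000\\003"))));                 // 0003
        // Second line: the same codes in a single TJ.
        System.out.println(hex(codes(decodeLiteral("\\000\\203\\000\\227\\000q\\000\\003")))); // 0083 0097 0071 0003
    }
}
```

The sketch only inspects the string operands; it says nothing about how the text stripper groups or orders the glyphs, which is where the reported difference arises.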



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org