Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2019/03/12 19:13:00 UTC

[jira] [Commented] (PDFBOX-4481) Text extraction error with Thai combined glyph depending on space after it

    [ https://issues.apache.org/jira/browse/PDFBOX-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790892#comment-16790892 ] 

Tilman Hausherr commented on PDFBOX-4481:
-----------------------------------------

The problem in the reduced file is that the space has the same start position as the diacritic. I was able to fix text extraction for the reduced file by removing the "=" in {{if (tp2Xend <= thisXstart || tp2Xstart >= thisXend)}} in TextPosition.java, but this causes regressions with the complete file. So I'll have to create another reduced file and keep searching :(
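
To make the boundary case concrete, here is a minimal standalone sketch of the comparison. It is not the actual TextPosition.java code: the class and method names are hypothetical, the coordinates are illustrative (a zero-width diacritic is assumed), and dropping the "=" from both comparisons is an assumption about what the change looks like.

```java
final class OverlapSketch
{
    // Mirrors the quoted condition: ranges that merely touch at an edge
    // are treated as NOT overlapping.
    static boolean overlapsInclusive(float thisXstart, float thisXend,
                                     float tp2Xstart, float tp2Xend)
    {
        return !(tp2Xend <= thisXstart || tp2Xstart >= thisXend);
    }

    // Hypothetical variant with "=" removed from both comparisons:
    // ranges that touch at an edge are now treated as overlapping.
    static boolean overlapsStrict(float thisXstart, float thisXend,
                                  float tp2Xstart, float tp2Xend)
    {
        return !(tp2Xend < thisXstart || tp2Xstart > thisXend);
    }

    public static void main(String[] args)
    {
        // Illustrative values only: a space spanning 77.5..80.0 and a
        // zero-width diacritic sitting exactly at the space's start.
        float spaceStart = 77.5f, spaceEnd = 80.0f;
        float diaStart = 77.5f, diaEnd = 77.5f;

        System.out.println(overlapsInclusive(spaceStart, spaceEnd, diaStart, diaEnd)); // false
        System.out.println(overlapsStrict(spaceStart, spaceEnd, diaStart, diaEnd));    // true
    }
}
```

Note that the strict variant also reports an overlap for two ordinary neighbouring glyphs whose edges coincide (for example 67.3..77.5 followed by 77.5..80.0). Whether that is what triggers the regressions in the complete file is not established here.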

> Text extraction error with Thai combined glyph depending on space after it
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-4481
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4481
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.14
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: Thai
>         Attachments: SO54981236-reduced.pdf, SO54981236-reduced.txt, SO54981236.pdf
>
>
> In the first extracted line of the reduced file, the "accent" (somebody please correct me on what that thing is called) is extracted separately. On the second line it is in the proper place. Content stream:
> {code}
> BT
>   1 0 0 1 67.3 756.98 Tm
>   [ (\000\203\000\227\000q) ] TJ
>   1 0 0 1 77.5 756.98 Tm
>   [ (\000\003) ] TJ
>   1 0 0 1 67.3 730 Tm
>   [ (\000\203\000\227\000q\000\003) ] TJ
> ET
> {code}
> The weird thing is that the "\003" is just a space. So when the space is part of the same string, the extraction works, and when it is shown separately, it doesn't.
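
For readers who want to see which character codes the strings in the content stream above carry, here is a small sketch that decodes the octal escapes and pairs the bytes into codes. It assumes the font uses two-byte, big-endian character codes (suggested by the leading \000 bytes, but not confirmed in the issue). The class and method names are hypothetical, and only octal escapes and plain characters are handled.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

final class TwoByteCodeSketch
{
    // Decodes the body of a PDF literal string (the part between the
    // parentheses). Only octal escapes (\d, \dd, \ddd) and plain characters
    // are supported; any other escape is rejected.
    static byte[] decodeLiteral(String body)
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length())
        {
            char c = body.charAt(i++);
            if (c != '\\')
            {
                out.write(c);
                continue;
            }
            int value = 0;
            int digits = 0;
            while (digits < 3 && i < body.length()
                    && body.charAt(i) >= '0' && body.charAt(i) <= '7')
            {
                value = value * 8 + (body.charAt(i++) - '0');
                digits++;
            }
            if (digits == 0)
            {
                throw new IllegalArgumentException("unsupported escape at index " + i);
            }
            out.write(value); // only the low 8 bits are kept
        }
        return out.toByteArray();
    }

    // Groups the bytes into two-byte big-endian codes (assumption: 2-byte font).
    static List<Integer> toCodes(byte[] bytes)
    {
        if (bytes.length % 2 != 0)
        {
            throw new IllegalArgumentException("odd number of bytes");
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < bytes.length; i += 2)
        {
            codes.add(((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF));
        }
        return codes;
    }

    public static void main(String[] args)
    {
        // Java source needs doubled backslashes to pass a literal backslash.
        for (int code : toCodes(decodeLiteral("\\000\\203\\000\\227\\000q\\000\\003")))
        {
            System.out.printf("0x%04X%n", code);
        }
    }
}
```

Under that assumption the second line's string yields the codes 0x0083, 0x0097, 0x0071 and 0x0003, the last being the code described above as the space. The first line shows the same three codes in one TJ and 0x0003 in a separate TJ at x = 77.5.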
