Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2019/03/05 17:19:00 UTC

[jira] [Created] (PDFBOX-4481) Text extraction error with Thai combined glyph depending on space after it

Tilman Hausherr created PDFBOX-4481:
---------------------------------------

             Summary: Text extraction error with Thai combined glyph depending on space after it
                 Key: PDFBOX-4481
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4481
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.14
            Reporter: Tilman Hausherr
         Attachments: SO54981236-reduced.pdf, SO54981236-reduced.txt, SO54981236.pdf

In the first extracted line of the reduced file, the "accent" (somebody please correct me on what that mark is called) is extracted separately from the glyph it belongs to. In the second line it is in the proper place. Content stream:
{code}
BT
  1 0 0 1 67.3 756.98 Tm
  [ (\000\203\000\227\000q) ] TJ
  1 0 0 1 77.5 756.98 Tm
  [ (\000\003) ] TJ
  1 0 0 1 67.3 730 Tm
  [ (\000\203\000\227\000q\000\003) ] TJ
ET
{code}
The weird thing is that the "\003" is just a space. So when the space is in the same TJ string as the Thai glyphs, the extraction works, and when it is shown by a separate TJ, it doesn't.
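
To make the comparison concrete, here is a minimal sketch that decodes the string operands from the content stream above into character codes. It is not PDFBox code. The class and method names (TjStringCodes, decodeLiteral, toCodes, codesOf) are made up for illustration, and it assumes the font uses a 2-byte encoding, which the "\000" prefixes suggest but the report does not state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class TjStringCodes {

    // Decode a PDF literal string body (without the enclosing parentheses)
    // into raw byte values. Only octal escapes (\ddd) and plain ASCII
    // characters are handled, which is all the strings in this report use.
    static List<Integer> decodeLiteral(String body) {
        List<Integer> bytes = new ArrayList<>();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '\\') {
                // Read up to three octal digits after the backslash.
                int value = 0;
                int digits = 0;
                i++;
                while (i < body.length() && digits < 3
                        && body.charAt(i) >= '0' && body.charAt(i) <= '7') {
                    value = value * 8 + (body.charAt(i) - '0');
                    i++;
                    digits++;
                }
                bytes.add(value & 0xFF);
            } else {
                bytes.add((int) c);
                i++;
            }
        }
        return bytes;
    }

    // Pair the bytes into big-endian 2-byte character codes.
    // ASSUMPTION: the font uses a 2-byte encoding.
    static List<Integer> toCodes(List<Integer> bytes) {
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i + 1 < bytes.size(); i += 2) {
            codes.add((bytes.get(i) << 8) | bytes.get(i + 1));
        }
        return codes;
    }

    // Concatenate the codes of one or more TJ string operands.
    static List<Integer> codesOf(String... literals) {
        List<Integer> all = new ArrayList<>();
        for (String s : literals) {
            all.addAll(toCodes(decodeLiteral(s)));
        }
        return all;
    }

    public static void main(String[] args) {
        // First line: glyphs and space shown by two separate TJ operators.
        List<Integer> line1 = codesOf("\\000\\203\\000\\227\\000q", "\\000\\003");
        // Second line: the same data in a single TJ string.
        List<Integer> line2 = codesOf("\\000\\203\\000\\227\\000q\\000\\003");
        System.out.println(line1);               // [131, 151, 113, 3]
        System.out.println(line2);               // [131, 151, 113, 3]
        System.out.println(line1.equals(line2)); // true
    }
}
```

Under that assumption both lines carry the same four codes (0x0083, 0x0097, 0x0071, 0x0003). The only difference is that the first line positions the space with its own Tm and TJ, which suggests the problem lies in how the separately shown space is handled rather than in the glyph data.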


