Posted to dev@pdfbox.apache.org by "Michael Klink (JIRA)" <ji...@apache.org> on 2018/06/04 15:55:00 UTC

[jira] [Created] (PDFBOX-4236) PDFTextStripper diacritic merge sometimes chooses wrong base glyph

Michael Klink created PDFBOX-4236:
-------------------------------------

             Summary: PDFTextStripper diacritic merge sometimes chooses wrong base glyph
                 Key: PDFBOX-4236
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4236
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 3.0.0 PDFBox
            Reporter: Michael Klink
         Attachments: SA-U-NA.png, pattern3.pdf

While answering [this Stack Overflow question|https://stackoverflow.com/q/50664162/1729265] I noticed that text extraction from the example file pattern3.pdf exposes an error in the diacritic merging code: the wrong base glyph is chosen.

From the bottom of [my answer|https://stackoverflow.com/a/50679508/1729265] there:

{quote}By the way, your test file exposes an error in the PDFBox determination of the base glyph to merge a diacritic with: The "स[1434]ु[1441]न[1418]" is meant to be rendered as "सुन", i.e. the vowel sign u "ु" is combined with the letter sa "स", but PDFBox combines it with the subsequent letter na "न" as "सनु".

The cause is that it determines the letter to combine the diacritic with by its origin which here indeed is in the range of the latter letter na "न", but as the vowel sign glyph is rendered before its origin (it is drawn in an area with a negative x coordinate), PDFBox determines the wrong association.
{quote}
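
To illustrate the difference, here is a minimal, self-contained sketch. It is not the PDFBox code: the Glyph class, both selection methods, and all coordinates are hypothetical and made up for this example. It only contrasts choosing the base glyph by the diacritic's origin with choosing it by the horizontal range in which the diacritic is actually drawn.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not PDFBox code: contrasts two ways of picking
// the base glyph a diacritic should be merged with.
final class DiacriticBaseSketch {

    // Hypothetical glyph model with made-up horizontal coordinates.
    static final class Glyph {
        final String text;
        final double originX; // x position of the glyph origin
        final double left;    // leftmost x actually covered by the glyph
        final double right;   // rightmost x actually covered by the glyph

        Glyph(String text, double originX, double left, double right) {
            this.text = text;
            this.originX = originX;
            this.left = left;
            this.right = right;
        }
    }

    // Origin-based rule: pick the base whose range [left, right)
    // contains the diacritic's origin.
    static Glyph baseByOrigin(Glyph diacritic, List<Glyph> bases) {
        for (Glyph base : bases) {
            if (base.left <= diacritic.originX && diacritic.originX < base.right) {
                return base;
            }
        }
        return null;
    }

    // Extent-based rule: pick the base whose range overlaps most with
    // the range in which the diacritic is drawn.
    static Glyph baseByDrawnExtent(Glyph diacritic, List<Glyph> bases) {
        Glyph best = null;
        double bestOverlap = 0;
        for (Glyph base : bases) {
            double overlap = Math.min(diacritic.right, base.right)
                    - Math.max(diacritic.left, base.left);
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                best = base;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed layout: sa covers [0, 10), na covers [10, 20).
        Glyph sa = new Glyph("\u0938", 0, 0, 10);
        Glyph na = new Glyph("\u0928", 10, 10, 20);
        // The vowel sign u has its origin at x = 10 (inside na's range)
        // but is drawn to the left of its origin, under sa.
        Glyph u = new Glyph("\u0941", 10, 3, 9);

        List<Glyph> bases = Arrays.asList(sa, na);
        System.out.println("by origin: " + baseByOrigin(u, bases).text);
        System.out.println("by extent: " + baseByDrawnExtent(u, bases).text);
    }
}
```

With these assumed coordinates the origin-based rule selects na while the extent-based rule selects sa, which mirrors the misassociation described in the quote. Whether an overlap-based rule is the right fix for PDFBox is a separate question; the sketch only shows why the origin alone can be misleading.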

Also see SA-U-NA.png, which contains screenshots of the glyph coordinate ranges.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
