Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2017/10/24 18:42:00 UTC

[jira] [Closed] (PDFBOX-3975) ExtractText converts some diacritics to combining forms that don't get combined

     [ https://issues.apache.org/jira/browse/PDFBOX-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-3975.
-----------------------------------
    Resolution: Won't Fix

> ExtractText converts some diacritics to combining forms that don't get combined
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3975
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3975
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>            Reporter: Matthew Self
>            Priority: Minor
>              Labels: diacritics
>         Attachments: PDF_32000_2008-p23-reduced1.pdf, PDF_32000_2008-p23-reduced2.pdf
>
>
> When I use ExtractText on the file http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf, there is an issue with the "^" character on page 15.
> The extracted text is "special characters ( * ! & } ̂  % and so on ) . )".
> Note that the extracted "^" character is U+0302 (COMBINING CIRCUMFLEX ACCENT) when it ought to be plain old U+005E (CIRCUMFLEX ACCENT).
> I believe that what is happening is the original U+005E character is being converted to U+0302 by the DIACRITICS map in TextPosition.java:
>         map.put(0x005e, "\u0302");
> This is probably because the character slightly overlaps the preceding space character. However, a combining diacritic cannot be combined with a space, so the extracted text contains the combining character instead of the original.
> One solution would be to tighten up the detection of overlaps so that combineDiacritic() is not called in this instance.
> Another (perhaps more robust) solution would be to verify in combineDiacritic() that the call to Normalizer.normalize() actually does combine the combining form of the diacritic with the previous character.  If the result of calling Normalizer.normalize() has more than one character in it, then the diacritic must not have been combined with the previous character.  In that case, the diacritic should not be merged.
> The goal would be for the extracted text to never contain combining characters that failed to combine.
> P.S. Thank you for PDFBox, it is a great library!
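
The second suggestion in the quoted report (checking that Normalizer.normalize() really composes the diacritic with the previous character) could be sketched roughly as below. This is only an illustration and not PDFBox code: the class DiacriticCheck and the methods combines() and merge() are hypothetical names, and the sketch assumes the previous text is a single base character. Only java.text.Normalizer is a real API here.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: illustrates the check proposed
// in the report under the assumption of a single preceding base character.
public class DiacriticCheck {

    /**
     * Returns true if base + combiningMark collapses to exactly one
     * code point under NFC, i.e. the mark really combined.
     */
    static boolean combines(String base, String combiningMark) {
        String composed = Normalizer.normalize(base + combiningMark, Normalizer.Form.NFC);
        return composed.codePointCount(0, composed.length()) == 1;
    }

    /**
     * Merges a diacritic into the previous character when possible.
     * Otherwise keeps the original (spacing) character untouched.
     */
    static String merge(String previous, String original, String combiningMark) {
        if (combines(previous, combiningMark)) {
            return Normalizer.normalize(previous + combiningMark, Normalizer.Form.NFC);
        }
        // The mark did not compose: fall back to the original character.
        return previous + original;
    }

    public static void main(String[] args) {
        // 'e' followed by COMBINING CIRCUMFLEX ACCENT composes to U+00EA.
        String composed = merge("e", "^", "\u0302");
        System.out.println(Integer.toHexString(composed.codePointAt(0)));

        // A space has no precomposed form with the circumflex, so the
        // plain U+005E is kept after the space.
        String kept = merge(" ", "^", "\u0302");
        System.out.println(Integer.toHexString(kept.codePointAt(1)));
    }
}
```

With this check, the "^" after a space from the report would stay U+005E, while a circumflex over a letter such as "e" would still be merged into its precomposed form. A real change would also need to handle previous text longer than one character, which this sketch deliberately leaves out.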


