Posted to dev@pdfbox.apache.org by "Matthew Self (JIRA)" <ji...@apache.org> on 2017/10/23 01:31:00 UTC
[jira] [Comment Edited] (PDFBOX-3975) ExtractText converts some diacritics to combining forms that don't get combined

    [ https://issues.apache.org/jira/browse/PDFBOX-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214498#comment-16214498 ] 

Matthew Self edited comment on PDFBOX-3975 at 10/23/17 1:30 AM:
----------------------------------------------------------------

Looking more closely at the code, I see that mergeDiacritic() isn't actually merging the base character and the diacritic into their NFC form (a single character), but rather leaving them in NFD form (the base character followed by a combining diacritic).

For example, I have a PDF document that contains the name "Krkošek".  In the Tj, the "š" consists of "s" followed by U+02C7 (CARON), which will be displayed as two characters in a text editor.  The output of ExtractText is "s" followed by U+030C (COMBINING CARON).  This is valid UTF-8 and will display correctly in a text editor, but it is in NFD form rather than NFC form.  The desired output would be the single character U+0161 (LATIN SMALL LETTER S WITH CARON), which is the canonically equivalent string in NFC form.
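
To make the NFD/NFC distinction concrete, here is a small stand-alone sketch using only java.text.Normalizer from the JDK.  The class name NfcDemo and the helper toNfc() are illustrative names of mine, not anything in PDFBox.

```java
import java.text.Normalizer;

public class NfcDemo {
    // Illustrative helper (not a PDFBox method): normalize a string to NFC.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "s" followed by U+030C COMBINING CARON: the NFD-style output described above.
        String decomposed = "s\u030C";
        String composed = toNfc(decomposed);

        // NFC composes the pair into the single code point U+0161.
        System.out.println(composed.equals("\u0161"));

        // "s" followed by the spacing U+02C7 CARON is left as-is by NFC,
        // because U+02C7 is not a combining mark.
        String spacing = "s\u02C7";
        System.out.println(toNfc(spacing).equals(spacing));
    }
}
```

Both println calls should print true: only the combining form participates in composition, which is why the stand-alone diacritic has to be mapped to its combining form before normalizing.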

My suggestion would be to rework this code so that instead of just converting the diacritics from stand-alone form to combining form, it also calls Normalizer.normalize() with Normalizer.Form.NFC to combine the base character and the diacritic.  If this results in a single character, then the output is in the desired NFC form.  If this results in no change to the string, then mergeDiacritic() should not merge the characters (even though they appear to overlap) and should leave the diacritic character in its original (stand-alone) form.
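
A rough sketch of that idea, outside of PDFBox, might look like the following.  Everything here is hypothetical: the class DiacriticMergeSketch, the method tryMerge(), and the two-entry map are stand-ins I made up for illustration (only the U+005E to U+0302 entry is taken from the DIACRITICS map quoted in the issue), and the sketch assumes the base text is a single character.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public class DiacriticMergeSketch {
    // Hypothetical, tiny stand-in for a stand-alone -> combining diacritic map.
    private static final Map<Integer, String> COMBINING = new HashMap<>();
    static {
        COMBINING.put(0x005E, "\u0302"); // CIRCUMFLEX ACCENT -> COMBINING CIRCUMFLEX ACCENT
        COMBINING.put(0x02C7, "\u030C"); // CARON -> COMBINING CARON
    }

    /**
     * Returns the composed (NFC) character when the diacritic really combines
     * with the base, or null when the caller should leave both untouched.
     */
    static String tryMerge(String base, String diacritic) {
        String combining = COMBINING.get(diacritic.codePointAt(0));
        if (combining == null) {
            return null; // not a diacritic this sketch knows about
        }
        String nfc = Normalizer.normalize(base + combining, Normalizer.Form.NFC);
        // Treat the merge as successful only if NFC yields exactly one code point.
        if (nfc.codePointCount(0, nfc.length()) == 1) {
            return nfc;
        }
        return null; // nothing composed: keep the stand-alone diacritic
    }

    public static void main(String[] args) {
        System.out.println("\u0161".equals(tryMerge("s", "\u02C7"))); // caron composes with "s"
        System.out.println(tryMerge(" ", "^") == null);               // "^" after a space does not
    }
}
```

With this check, a "^" that merely overlaps a space would stay U+005E, while "s" plus a caron would come out as U+0161.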

This would fix both issues (unwanted conversion of U+005E to U+0302 and failure to produce the NFC form U+0161).


was (Author: mself):
Looking more closely at the code, I see that mergeDiacritic() isn't actually merging the base character and the diacritic into their NFC form (a single character), but rather leaving them in NFD form (the base character followed by a combining diacritic).

For example, I have a PDF document that contains the name "Krkošek".  In the Tj, the "š" consists of "s" followed by U+02C7 (CARON), which will be displayed as two characters in a text editor.  The output of ExtractText is "s" followed by U+030C (COMBINING CARON).  This is valid UTF-8 and will display correctly in a text editor, but it is in NFD form rather than NFC form.  The desired output would be the single character U+0161 (LATIN SMALL LETTER S WITH CARON), which is the canonically equivalent string in NFC form.

My suggestion would be to rework this code so that instead of just converting the diacritics from stand-alone form to combining form, it also calls Normalizer.normalize() with Normalizer.Form.NFC to combine the base character and the diacritic.  If this results in a single character, then the output is in the desired NFC form.  If this results in no change to the string, then mergeDiacritic() should not merge the characters (even though they appear to overlap) and should leave the diacritic character in its original (stand-alone) form.

This would fix both issues (unwanted conversion of U+005E to U+0302 and failure to produce the NFC form U+0161).

If you agree with this approach, I can work on a patchset and run the regression tests.

> ExtractText converts some diacritics to combining forms that don't get combined
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3975
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3975
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>            Reporter: Matthew Self
>
> When I use ExtractText on the file http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf, there is an issue with the "^" character on page 15.
> The extracted text is "special characters ( * ! & } ̂  % and so on ) . )".
> Note that the extracted "^" character is U+0302 (COMBINING CIRCUMFLEX ACCENT) when it ought to be plain old U+005E (CIRCUMFLEX ACCENT).
> I believe that what is happening is the original U+005E character is being converted to U+0302 by the DIACRITICS map in TextPosition.java:
>         map.put(0x005e, "\u0302");
> This is probably because the character slightly overlaps the preceding space character.  But this combining diacritic can't be combined with the space character, so the extracted text contains the combining character instead of the original.
> One solution would be to tighten up the detection of overlaps so that combineDiacritic() is not called in this instance.
> Another (perhaps more robust) solution would be to verify in combineDiacritic() that the call to Normalizer.normalize() actually does combine the combining form of the diacritic with the previous character.  If the result of calling Normalizer.normalize() has more than one character in it, then the diacritic must not have been combined with the previous character.  In that case, the diacritic should not be merged.
> The goal would be for the extracted text to never contain combining characters that failed to combine.
> P.S.  Thank you for the great library of PDFBox!


