Posted to dev@pdfbox.apache.org by "Robert Baruch (JIRA)" <ji...@apache.org> on 2008/08/20 02:36:44 UTC

[jira] Created: (PDFBOX-371) Soft Hyphen character not mapped to hyphen in WinAnsiEncoding (and suggested fix)

Soft Hyphen character not mapped to hyphen in WinAnsiEncoding (and suggested fix)
---------------------------------------------------------------------------------

                 Key: PDFBOX-371
                 URL: https://issues.apache.org/jira/browse/PDFBOX-371
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.7.3
         Environment: Java 1.5, OSX 10.5
            Reporter: Robert Baruch
            Priority: Minor


When running text extraction on a PDF file that contains the soft hyphen character in the WinAnsiEncoding (that is, octal 0255, or 0xAD), the text extractor incorrectly maps it to a space, when it should be a hyphen. As the PDF Reference 1.7 says in note 5 of table D.1:

'The hyphen character is also encoded as 255 in WinAnsiEncoding. The meaning of this duplicate code is "soft hyphen," but it is typographically the same as hyphen.'

The reason that a soft hyphen is typographically the same as hyphen is that a soft hyphen indicates that a hyphen MAY be placed here if necessary (i.e. breaking a word across lines). Since the soft hyphen should only be put, by the PDF producer, at the end of a line to break a word, it stands to reason that the option to place a hyphen must be taken.

I think I've traced the substitution to Encoding.getName: because the codeToName mapping has no entry for this code in WinAnsiEncoding, getName returns "space" by default.

The fix is not as simple as adding an addCharacterEncoding( 0255, COSName.getPDFName("hyphen")) call to WinAnsiEncoding, because that will set both the codeToName mapping AND the nameToCode mapping, which will overwrite the 055 nameToCode mapping.

Adding this line:

codeToName.add( new Integer(0255), COSName.getPDFName("hyphen"));

to the end of the WinAnsiEncoding constructor seems to fix the issue.
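
The difference between the two approaches can be sketched with a simplified stand-in for the encoding tables. This is a hypothetical model, not the PDFBox source: SketchEncoding, SketchWinAnsi and the Mode enum are invented names, glyph names are plain Strings instead of COSName objects, and it assumes codeToName is a java.util.Map (in which case the insert call would be put rather than add).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for an encoding with two lookup tables.
class SketchEncoding {
    protected final Map<Integer, String> codeToName = new HashMap<>();
    protected final Map<String, Integer> nameToCode = new HashMap<>();

    // Registers the pair in both directions, as the report describes.
    protected void addCharacterEncoding(int code, String name) {
        codeToName.put(code, name);
        nameToCode.put(name, code);
    }

    // Falls back to "space" for unmapped codes, the behaviour the report observed.
    String getName(int code) {
        String name = codeToName.get(code);
        return name != null ? name : "space";
    }

    // Returns null when the name has no code.
    Integer getCode(String name) {
        return nameToCode.get(name);
    }
}

class SketchWinAnsi extends SketchEncoding {
    enum Mode { UNPATCHED, NAIVE, SUGGESTED }

    SketchWinAnsi(Mode mode) {
        addCharacterEncoding(055, "hyphen"); // regular hyphen, octal 055
        if (mode == Mode.NAIVE) {
            // Also rewrites nameToCode, so "hyphen" now encodes as 0255.
            addCharacterEncoding(0255, "hyphen");
        } else if (mode == Mode.SUGGESTED) {
            // Only the decoding direction changes; "hyphen" still encodes as 055.
            codeToName.put(0255, "hyphen");
        }
    }
}

class SoftHyphenMappingDemo {
    public static void main(String[] args) {
        for (SketchWinAnsi.Mode mode : SketchWinAnsi.Mode.values()) {
            SketchWinAnsi enc = new SketchWinAnsi(mode);
            System.out.println(mode + ": getName(0255)=" + enc.getName(0255)
                    + ", getCode(\"hyphen\")=0" + Integer.toOctalString(enc.getCode("hyphen")));
        }
    }
}
```

In this model only the SUGGESTED variant decodes 0255 as "hyphen" while keeping 055 as the code that "hyphen" encodes to.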

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PDFBOX-371) Soft Hyphen character not mapped to hyphen in WinAnsiEncoding (and suggested fix)

Posted by "Navendu Garg (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764710#action_12764710 ] 

Navendu Garg commented on PDFBOX-371:
-------------------------------------

Hi,

Just wanted to bring this issue to the forefront. I am not sure that replacing soft hyphens with regular hyphens is a good solution. However, I am not aware of any Unicode character that acts as an invisible character and could be substituted for the soft hyphen, so that the text would appear without any visible character while the character count stays accurate.
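
One candidate that may be worth checking is U+00AD SOFT HYPHEN itself, which Unicode classifies as a format character. The sketch below is only an illustration of that idea, not anything PDFBox does: SoftHyphenUnicodeDemo and stripSoftHyphens are hypothetical names, and whether a given viewer actually hides the character is an assumption that would need testing.

```java
// Hypothetical illustration: keep U+00AD in the extracted text so the
// character count is preserved, and strip it only where it must not show.
class SoftHyphenUnicodeDemo {
    static final char SOFT_HYPHEN = '\u00AD';

    // Returns a copy of the text with every soft hyphen removed.
    static String stripSoftHyphens(String text) {
        return text.replace(String.valueOf(SOFT_HYPHEN), "");
    }

    public static void main(String[] args) {
        String extracted = "hy" + SOFT_HYPHEN + "phen";
        // The soft hyphen still counts as one char in the string.
        System.out.println(extracted.length());
        // Java reports U+00AD as a FORMAT (Cf) character.
        System.out.println(Character.getType(SOFT_HYPHEN) == Character.FORMAT);
        System.out.println(stripSoftHyphens(extracted));
    }
}
```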

Navendu




[jira] Updated: (PDFBOX-371) Soft Hyphen character not mapped to hyphen in WinAnsiEncoding (and suggested fix)

Posted by "Robert Baruch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Baruch updated PDFBOX-371:
---------------------------------

    Description: 
When running text extraction on a PDF file that contains the soft hyphen character in the WinAnsiEncoding (that is, 0255), the text extractor incorrectly maps this as a space, when it should be a hyphen. As the PDF Reference 1.7 says in note 5 of table D.1:

'The hyphen character is also encoded as 255 in WinAnsiEncoding. The meaning of this duplicate code is "soft hyphen," but it is typographically the same as hyphen.'

The reason that a soft hyphen is typographically the same as hyphen is that a soft hyphen indicates that a hyphen MAY be placed here if necessary (i.e. breaking a word across lines). Since the soft hyphen should only be put, by the PDF producer, at the end of a line to break a word, it stands to reason that the option to place a hyphen must be taken.

I think I've traced the reason for the substitution to Encoding.getName, where because there is no mapping in the codeToName mapping for this code in WinAnsiEncoding, by default it returns "space".

The fix is not as simple as adding an addCharacterEncoding( 0255, COSName.getPDFName("hyphen")) to WinAnsiEncoding, because that will set both the codeToName mapping AND the nameToCode mapping, which will overwrite the 055 nameToCode mapping.

Adding this line:

codeToName.add( new Integer(0255), COSName.getPDFName("hyphen"));

to the end of the WinAnsiEncoding constructor seems to fix the issue.

  was:
When running text extraction on a PDF file that contains the soft hyphen character in the WinAnsiEncoding (that is, 0255), the text extractor incorrectly maps this as a space, when it should be a hyphen. As the PDF Reference 1.7 says in note 5 of table D.1:

'The hyphen character is also encoded as 255 in WinAnsiEncoding. The meaning of this duplicate code is "soft hyphen," but it is typographically the same as hyphen.'

The reason that a soft hyphen is typographically the same as hyphen is that a soft hyphen indicates that a hyphen MAY be placed here if necessary (i.e. breaking a word across lines). Since the soft hyphen should only be put, by the PDF producer, at the end of a line to break a word, it stands to reason that the option to place a hyphen must be taken.

I think I've traced the reason for the substitution to Encoding.getName, where because there is no mapping in the codeToName mapping for this code in WinAnsiEncoding, by default it returns "space".

The fix is not as simple as adding an addCharacterEncoding( 0255, COSName.getPDFName("hyphen")) to WinAnsiEncoding, because that will set both the codeToName mapping AND the nameToMap encoding, which will overwrite the 055 nameToCode mapping.

Adding this line:

codeToName.add( new Integer(0255), COSName.getPDFName("hyphen"));

to the end of the WinAnsiEncoding constructor seems to fix the issue.


