Posted to dev@pdfbox.apache.org by "Timo Boehme (JIRA)" <ji...@apache.org> on 2009/02/23 11:13:02 UTC

[jira] Created: (PDFBOX-433) parse Unicode glyph names

parse Unicode glyph names
-------------------------

                 Key: PDFBOX-433
                 URL: https://issues.apache.org/jira/browse/PDFBOX-433
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing, Text extraction
    Affects Versions: 0.8.0-incubator
            Reporter: Timo Boehme
            Priority: Minor


Adobe has specified (http://www.adobe.com/devnet/opentype/archives/glyph.html) how glyph names should be constructed so that they can easily be converted to Unicode. PDFBox currently does not handle suffixes (NAME.SUFFIX) or Unicode names (uniXXXX). I have therefore attached an updated method getCharacter( COSName name ) for the class org.apache.pdfbox.encoding.Encoding.
It first strips off any suffix and then, if the name is not found in the name table, tests for names starting with 'uni'.

Timo
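
The suffix rule described above can be sketched on plain strings. This is only an illustration, not the attached patch: the class name GlyphNameSuffix and the sample names a.sc and uni0041.alt are made up for the example. A dot in the first position (as in .notdef) is treated as part of the name, using the same "> 0" test as the patched method.

```java
final class GlyphNameSuffix {

    /**
     * Returns the part of a glyph name before the first '.'.
     * A leading dot (index 0) is not treated as a suffix separator.
     */
    static String stripSuffix(String glyphName) {
        int dot = glyphName.indexOf('.');
        return dot > 0 ? glyphName.substring(0, dot) : glyphName;
    }

    public static void main(String[] args) {
        System.out.println(stripSuffix("a.sc"));        // a
        System.out.println(stripSuffix("uni0041.alt")); // uni0041
        System.out.println(stripSuffix(".notdef"));     // .notdef
    }
}
```

Only the text before the first dot is kept, so a name with several dots loses everything after the first one.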

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (PDFBOX-433) parse Unicode glyph names

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier resolved PDFBOX-433.
----------------------------------

    Resolution: Fixed

Confirmed that the patch works, using the supplied file.

Sending        Encoding.java
Transmitting file data .
Committed revision 747425.


[jira] Updated: (PDFBOX-433) parse Unicode glyph names

Posted by "Timo Boehme (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme updated PDFBOX-433:
-------------------------------

    Description: 
Adobe has specified (http://www.adobe.com/devnet/opentype/archives/glyph.html) how glyph names should be constructed so that they can easily be converted to Unicode. PDFBox currently does not handle suffixes (NAME.SUFFIX) or Unicode names (uniXXXX). I have therefore attached an updated method getCharacter( COSName name ) for the class org.apache.pdfbox.encoding.Encoding.
It first strips off any suffix and then, if the name is not found in the name table, tests for names starting with 'uni'.

Timo

    /**
     * This will get the character from the name.
     *
     * @param name The name of the character.
     *
     * @return The printable character for the code.
     */
    public static String getCharacter( COSName name )
    {
        // needs java.util.logging.Level and java.util.logging.Logger
        COSName baseName = name;
        String nameStr = baseName.getName();

        // test if we have a suffix and if so remove it
        // (a dot at position 0, as in ".notdef", is not a suffix separator)
        int dotPos = nameStr.indexOf( '.' );
        if( dotPos > 0 )
        {
            nameStr = nameStr.substring( 0, dotPos );
            baseName = COSName.getPDFName( nameStr );
        }

        String character = (String)NAME_TO_CHARACTER.get( baseName );
        if( character == null )
        {
            // test for Unicode name
            // (uniXXXX... - the number of hex digits must be a multiple of four;
            //  each group of four represents one hexadecimal Unicode code point)
            if( nameStr.startsWith( "uni" ) )
            {
                StringBuffer uniStr = new StringBuffer();
                Logger log = Logger.getLogger( Encoding.class.getName() );

                for( int chPos = 3; chPos + 4 <= nameStr.length(); chPos += 4 )
                {
                    try
                    {
                        int characterCode = Integer.parseInt( nameStr.substring( chPos, chPos + 4 ), 16 );

                        // surrogate code points (D800-DFFF) are not allowed
                        if( ( characterCode > 0xD7FF ) && ( characterCode < 0xE000 ) )
                        {
                            log.log( Level.WARNING,
                                     "Unicode character name with not allowed code area: " + nameStr );
                        }
                        else
                        {
                            uniStr.append( (char)characterCode );
                        }
                    }
                    catch( NumberFormatException nfe )
                    {
                        log.log( Level.WARNING,
                                 "Not a number in Unicode character name: " + nameStr );
                    }
                }
                character = uniStr.toString();
            }
            else
            {
                // unknown name: fall back to the (suffix-free) name itself
                character = nameStr;
            }
        }
        return character;
    }
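
To try the uniXXXX branch without a PDFBox checkout, the loop can be lifted into a small standalone class. This is a sketch under assumptions, not project code: the class UniGlyphNameDemo, the method decodeUniName, the shortened log messages, and the choice to return null for names that do not start with "uni" are all hypothetical. The table lookup that precedes this branch in getCharacter is left out.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class UniGlyphNameDemo {

    private static final Logger LOG = Logger.getLogger(UniGlyphNameDemo.class.getName());

    /**
     * Decodes "uni" followed by groups of four hex digits.
     * Returns null when the name does not start with "uni".
     */
    static String decodeUniName(String nameStr) {
        if (!nameStr.startsWith("uni")) {
            return null;
        }
        StringBuilder out = new StringBuilder();
        // Walk the name in steps of four hex digits; a trailing partial group is ignored.
        for (int pos = 3; pos + 4 <= nameStr.length(); pos += 4) {
            try {
                int code = Integer.parseInt(nameStr.substring(pos, pos + 4), 16);
                if (code > 0xD7FF && code < 0xE000) {
                    // Surrogate range D800-DFFF: skip the group and warn.
                    LOG.log(Level.WARNING, "surrogate code point in glyph name: " + nameStr);
                } else {
                    out.append((char) code);
                }
            } catch (NumberFormatException nfe) {
                LOG.log(Level.WARNING, "not a hex number in glyph name: " + nameStr);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeUniName("uni0049"));     // I
        System.out.println(decodeUniName("uni00410042")); // AB
    }
}
```

Because the prefix test is only startsWith( "uni" ), a name such as "union" would decode to an empty string here. In getCharacter the table lookup runs first, so names that are present in NAME_TO_CHARACTER never reach this branch.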



[jira] Commented: (PDFBOX-433) parse Unicode glyph names

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676104#action_12676104 ] 

Brian Carrier commented on PDFBOX-433:
--------------------------------------

Do you have a PDF file with these glyph names so that we can verify that the integration works? 


[jira] Commented: (PDFBOX-433) parse Unicode glyph names

Posted by "Timo Boehme (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676318#action_12676318 ] 

Timo Boehme commented on PDFBOX-433:
------------------------------------

Yes. If you search Google for 'uni0049 pdf' you get a link to http://shongane.ie.u-ryukyu.ac.jp/viewvc/y06/e065763/info3/final/pic/uml.pdf?revision=1.1&view=markup&sortby=date&sortdir=down.
There you can download the PDF file (short link: http://shongane.ie.u-ryukyu.ac.jp/viewvc/y06/e065763/info3/final/pic/uml.pdf?revision=1.1 ).
This file has uniXXXX names with a suffix (e.g. uni30A2.926).
Without my patch you won't get any usable characters from it.
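
As a worked trace of the example name uni30A2.926: dropping the suffix leaves uni30A2, and the four hex digits 30A2 give U+30A2 (KATAKANA LETTER A). The snippet below walks through exactly those two steps. It is a hypothetical helper for this one case (class Uni30A2Trace, method decodeSingle) and assumes a name of the form "uni" plus exactly four hex digits, with an optional suffix.

```java
final class Uni30A2Trace {

    /** Traces one uniXXXX name; assumes exactly four hex digits follow "uni". */
    static String decodeSingle(String glyphName) {
        int dot = glyphName.indexOf('.');
        // Step 1: drop the suffix, "uni30A2.926" -> "uni30A2"
        String base = dot > 0 ? glyphName.substring(0, dot) : glyphName;
        // Step 2: parse the four hex digits after "uni", "30A2" -> 0x30A2
        int code = Integer.parseInt(base.substring(3, 7), 16);
        return String.valueOf((char) code);
    }

    public static void main(String[] args) {
        String text = decodeSingle("uni30A2.926");
        // prints U+30A2
        System.out.println("U+" + Integer.toHexString(text.charAt(0)).toUpperCase());
    }
}
```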

