Posted to dev@pdfbox.apache.org by "Timo Boehme (JIRA)" <ji...@apache.org> on 2009/02/23 11:13:02 UTC
[jira] Created: (PDFBOX-433) parse Unicode glyph names
parse Unicode glyph names
-------------------------
Key: PDFBOX-433
URL: https://issues.apache.org/jira/browse/PDFBOX-433
Project: PDFBox
Issue Type: Improvement
Components: Parsing, Text extraction
Affects Versions: 0.8.0-incubator
Reporter: Timo Boehme
Priority: Minor
Adobe has specified (http://www.adobe.com/devnet/opentype/archives/glyph.html) how glyph names should be constructed so that they can easily be converted to Unicode. What is currently missing in PDFBox is the handling of suffixes (NAME.SUFFIX) and Unicode names (uniXXXX). I have therefore attached an updated method getCharacter( COSName name ) for class org.apache.pdfbox.encoding.Encoding.
It first strips off any suffix and then, if the base name is not a known glyph name, tests for names starting with 'uni'.
Timo
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PDFBOX-433) parse Unicode glyph names
Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian Carrier resolved PDFBOX-433.
----------------------------------
Resolution: Fixed
Confirmed that the patch worked using the file supplied.
Sending Encoding.java
Transmitting file data .
Committed revision 747425.
[jira] Updated: (PDFBOX-433) parse Unicode glyph names
Posted by "Timo Boehme (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Timo Boehme updated PDFBOX-433:
-------------------------------
Description:
Adobe has specified (http://www.adobe.com/devnet/opentype/archives/glyph.html) how glyph names should be constructed so that they can easily be converted to Unicode. What is currently missing in PDFBox is the handling of suffixes (NAME.SUFFIX) and Unicode names (uniXXXX). I have therefore attached an updated method getCharacter( COSName name ) for class org.apache.pdfbox.encoding.Encoding.
It first strips off any suffix and then, if the base name is not a known glyph name, tests for names starting with 'uni'.
Timo
// Uses java.util.logging.Logger and java.util.logging.Level.
/**
 * This will get the character from the name.
 *
 * @param name The name of the character.
 *
 * @return The printable character for the code.
 */
public static String getCharacter( COSName name )
{
    COSName baseName = name;
    String nameStr = baseName.getName();
    // test if we have a suffix and if so remove it
    if ( nameStr.indexOf('.') > 0 ) {
        nameStr = nameStr.substring( 0, nameStr.indexOf('.') );
        baseName = COSName.getPDFName( nameStr );
    }
    String character = (String)NAME_TO_CHARACTER.get( baseName );
    if( character == null )
    {
        // test for Unicode name
        // (uniXXXX... - the hex digits after 'uni' come in groups of four,
        // each group representing one hexadecimal Unicode code point)
        if ( nameStr.startsWith( "uni" ) )
        {
            StringBuffer uniStr = new StringBuffer();
            for ( int chPos = 3; chPos + 4 <= nameStr.length(); chPos += 4 ) {
                try {
                    int characterCode = Integer.parseInt( nameStr.substring( chPos, chPos + 4), 16 );
                    // surrogate code units (D800-DFFF) are not allowed here
                    if ( ( characterCode > 0xD7FF ) && ( characterCode < 0xE000 ) )
                        Logger.getLogger(Encoding.class.getName()).log( Level.WARNING,
                            "Unicode character name with not allowed code area: " +
                            nameStr );
                    else
                        uniStr.append( (char) characterCode );
                } catch (NumberFormatException nfe) {
                    Logger.getLogger(Encoding.class.getName()).log( Level.WARNING,
                        "Not a number in Unicode character name: " +
                        nameStr );
                }
            }
            character = uniStr.toString();
        }
        else
            // neither a known glyph name nor a uni name: fall back to the name itself
            character = nameStr;
    }
    return character;
}
was:
Adobe has specified (http://www.adobe.com/devnet/opentype/archives/glyph.html) how glyph names should be constructed so that they can easily be converted to Unicode. What is currently missing in PDFBox is the handling of suffixes (NAME.SUFFIX) and Unicode names (uniXXXX). I have therefore attached an updated method getCharacter( COSName name ) for class org.apache.pdfbox.encoding.Encoding.
It first strips off any suffix and then, if the base name is not a known glyph name, tests for names starting with 'uni'.
Timo
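For readers who want to try the logic of the getCharacter method without a PDFBox checkout, here is a minimal standalone sketch of the same steps (strip the suffix, look up the base name, then decode a uni name in groups of four hex digits). The class name GlyphNameSketch, the method toUnicode and the two-entry KNOWN_NAMES map are hypothetical stand-ins invented for this illustration. They are not PDFBox API, and the real NAME_TO_CHARACTER table is far larger. Unlike the patch, the sketch silently skips bad groups instead of logging a warning.

```java
import java.util.HashMap;
import java.util.Map;

class GlyphNameSketch {
    // Hypothetical stand-in for PDFBox's NAME_TO_CHARACTER table (assumption:
    // only two sample entries, keyed by plain strings instead of COSName).
    private static final Map<String, String> KNOWN_NAMES = new HashMap<>();
    static {
        KNOWN_NAMES.put("space", " ");
        KNOWN_NAMES.put("Adieresis", "\u00C4");
    }

    static String toUnicode(String glyphName) {
        // Step 1: drop everything from the first '.' on (NAME.SUFFIX).
        String base = glyphName;
        int dot = base.indexOf('.');
        if (dot > 0) {
            base = base.substring(0, dot);
        }
        // Step 2: known glyph names win.
        String known = KNOWN_NAMES.get(base);
        if (known != null) {
            return known;
        }
        // Step 3: anything that is not a uni name falls back to the name itself.
        if (!base.startsWith("uni")) {
            return base;
        }
        // Step 4: decode the hex digits after "uni" four at a time.
        StringBuilder out = new StringBuilder();
        for (int pos = 3; pos + 4 <= base.length(); pos += 4) {
            try {
                int code = Integer.parseInt(base.substring(pos, pos + 4), 16);
                // Skip signed input and the surrogate range D800-DFFF.
                if (code >= 0 && (code < 0xD800 || code > 0xDFFF)) {
                    out.append((char) code);
                }
            } catch (NumberFormatException e) {
                // Not four hex digits: skip this group (the patch logs a warning here).
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("uni00410042.alt")); // prints AB
        System.out.println(toUnicode("space.wide").length()); // prints 1
    }
}
```

The extra check for a negative value is an addition of this sketch: Integer.parseInt accepts a leading sign, so a group such as "-041" would otherwise be cast to an unrelated char.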
[jira] Commented: (PDFBOX-433) parse Unicode glyph names
Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676104#action_12676104 ]
Brian Carrier commented on PDFBOX-433:
--------------------------------------
Do you have a PDF file with these glyph names so that we can verify that the integration works?
[jira] Commented: (PDFBOX-433) parse Unicode glyph names
Posted by "Timo Boehme (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676318#action_12676318 ]
Timo Boehme commented on PDFBOX-433:
------------------------------------
Yes. Searching Google for 'uni0049 pdf' turns up a link to http://shongane.ie.u-ryukyu.ac.jp/viewvc/y06/e065763/info3/final/pic/uml.pdf?revision=1.1&view=markup&sortby=date&sortdir=down.
There you can download the PDF file (short link: http://shongane.ie.u-ryukyu.ac.jp/viewvc/y06/e065763/info3/final/pic/uml.pdf?revision=1.1 ).
This file has uniXXXX names with a suffix (e.g. uni30A2.926).
Without my patch, you won't get any usable characters from it.
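As a worked example for the glyph name quoted above, the following sketch decodes uni30A2.926 by hand: the suffix .926 is dropped, and the four hex digits 30A2 are read as a single code point, U+30A2 (KATAKANA LETTER A). The class name Uni30A2Example and the method decode are hypothetical names chosen for this illustration. The code assumes a well-formed name with exactly one four-digit group and does no error handling.

```java
class Uni30A2Example {
    // Decodes a name of the form uniXXXX or uniXXXX.SUFFIX (assumption:
    // exactly one group of four valid hex digits follows "uni").
    static String decode(String glyphName) {
        int dot = glyphName.indexOf('.');
        String base = dot > 0 ? glyphName.substring(0, dot) : glyphName;
        // For "uni30A2.926", base is now "uni30A2".
        int codePoint = Integer.parseInt(base.substring(3, 7), 16);
        return String.valueOf((char) codePoint);
    }

    public static void main(String[] args) {
        String s = decode("uni30A2.926");
        System.out.printf("U+%04X%n", (int) s.charAt(0)); // prints U+30A2
    }
}
```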