You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Antoni Mylka (JIRA)" <ji...@apache.org> on 2010/12/13 19:48:04 UTC

[jira] Created: (PDFBOX-920) PDFStreamEngine.processEncodedText fails on UTF-16 text

PDFStreamEngine.processEncodedText fails on UTF-16 text
-------------------------------------------------------

                 Key: PDFBOX-920
                 URL: https://issues.apache.org/jira/browse/PDFBOX-920
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 1.3.1
            Reporter: Antoni Mylka


I have a PDF document which yields gibberish text. When I debug it, I get to the PDFStreamEngine.processEncodedText. The method gets a following byte array:

[0, 47, 0, 82, 0, 82, 0, 78, 0, 3, 0, 68, 0, 87, 0, 3, 0, 87, 0, 75, 0, 72, 0, 3, 0, -64, 0, 85, 0, 86, 0, 87, 0, 3, 0, 83, 0, 76, 0, 70, 0, 87, 0, 88, 0, 85, 0, 72, 0, 3, 0, 68, 0, 69, 0, 82, 0, 89, 0, 72, 0, 17, 0, 3]

This looks to me like some UTF16 text, but the codes seem different than what you'd normally expect. I don't understand the encoding. In 1.2.1 this yielded the correct output though ("Look at the picture above"). In the 1.3.1 and the current trunk this is converted to garbage. The culprit is here:

codeLength = 1;
String c = font.encode( string, i, codeLength );
if( c == null && i+1<string.length)
{
	//maybe a multibyte encoding
	codeLength++;
	c = font.encode( string, i, codeLength );
}

So the code first tries to 'encode' a single byte as a character, and then tries two bytes, three bytes etc. First it starts with a 00 byte. In 1.2.1 the PDFont.encode would return null. The program would then try with two bytes getting a correct character on the second attempt.

In the current trunk the font.encode method returns a space " " when 00 is passed. This is clearly wrong, because afterwards the entire string is parsed incorrectly. I tried to debug further and it seems to me that the problem is in the Encoding class, in the getName method. It looks like this:

public String getName( int code ) throws IOException
{
	String name = codeToName.get( code );
	if( name == null )
	{
		//lets be forgiving for now
		name = "space";
	}
	return name;
}

The crucial bit is the "let's be forgiving for now". If a code is unknown in the encoding, a space is returned. In my case this completely breaks the parsing of a file. 

What was the rationale behind this behavior? Removing it fixed my problem and didn't break anything. All unit tests of pdfbox pass. The regression tests of my applications (based on the pdf extraction code from the Aperture Framework) also pass. The "forgiving" part has been added in PDFBOX-626, but the issue description doesn't name any reasons for that. If the "forgiveness" is there for a good reason, I'd be grateful for advice how to deal with the problem. Otherwise please remove it.

Unfortunately I can't share the problem file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PDFBOX-920) PDFStreamEngine.processEncodedText fails on UTF-16 text

Posted by "Antoni Mylka (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka updated PDFBOX-920:
--------------------------------

    Component/s: Text extraction

This is a text extraction issue. Added an appropriate "component" marking.

> PDFStreamEngine.processEncodedText fails on UTF-16 text
> -------------------------------------------------------
>
>                 Key: PDFBOX-920
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-920
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.3.1
>            Reporter: Antoni Mylka
>         Attachments: nullcharactername.patch
>
>
> I have a PDF document which yields gibberish text. When I debug it, I get to the PDFStreamEngine.processEncodedText. The method gets a following byte array:
> [0, 47, 0, 82, 0, 82, 0, 78, 0, 3, 0, 68, 0, 87, 0, 3, 0, 87, 0, 75, 0, 72, 0, 3, 0, -64, 0, 85, 0, 86, 0, 87, 0, 3, 0, 83, 0, 76, 0, 70, 0, 87, 0, 88, 0, 85, 0, 72, 0, 3, 0, 68, 0, 69, 0, 82, 0, 89, 0, 72, 0, 17, 0, 3]
> This looks to me like some UTF16 text, but the codes seem different than what you'd normally expect. I don't understand the encoding. In 1.2.1 this yielded the correct output though ("Look at the picture above"). In the 1.3.1 and the current trunk this is converted to garbage. The culprit is here:
> codeLength = 1;
> String c = font.encode( string, i, codeLength );
> if( c == null && i+1<string.length)
> {
> 	//maybe a multibyte encoding
> 	codeLength++;
> 	c = font.encode( string, i, codeLength );
> }
> So the code first tries to 'encode' a single byte as a character, and then tries two bytes, three bytes etc. First it starts with a 00 byte. In 1.2.1 the PDFont.encode would return null. The program would then try with two bytes getting a correct character on the second attempt.
> In the current trunk the font.encode method returns a space " " when 00 is passed. This is clearly wrong, because afterwards the entire string is parsed incorrectly. I tried to debug further and it seems to me that the problem is in the Encoding class, in the getName method. It looks like this:
> public String getName( int code ) throws IOException
> {
> 	String name = codeToName.get( code );
> 	if( name == null )
> 	{
> 		//lets be forgiving for now
> 		name = "space";
> 	}
> 	return name;
> }
> The crucial bit is the "let's be forgiving for now". If a code is unknown in the encoding, a space is returned. In my case this completely breaks the parsing of a file. 
> What was the rationale behind this behavior? Removing it fixed my problem and didn't break anything. All unit tests of pdfbox pass. The regression tests of my applications (based on the pdf extraction code from the Aperture Framework) also pass. The "forgiving" part has been added in PDFBOX-626, but the issue description doesn't name any reasons for that. If the "forgiveness" is there for a good reason, I'd be grateful for advice how to deal with the problem. Otherwise please remove it.
> Unfortunately I can't share the problem file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (PDFBOX-920) PDFStreamEngine.processEncodedText fails on UTF-16 text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-920.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.5.0
         Assignee: Andreas Lehmkühler

I agree with Antoni. I applied the patch as proposed with revision 1056730.

Thanks for the contribution!

> PDFStreamEngine.processEncodedText fails on UTF-16 text
> -------------------------------------------------------
>
>                 Key: PDFBOX-920
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-920
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.3.1
>            Reporter: Antoni Mylka
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.5.0
>
>         Attachments: nullcharactername.patch
>
>
> I have a PDF document which yields gibberish text. When I debug it, I get to the PDFStreamEngine.processEncodedText. The method gets a following byte array:
> [0, 47, 0, 82, 0, 82, 0, 78, 0, 3, 0, 68, 0, 87, 0, 3, 0, 87, 0, 75, 0, 72, 0, 3, 0, -64, 0, 85, 0, 86, 0, 87, 0, 3, 0, 83, 0, 76, 0, 70, 0, 87, 0, 88, 0, 85, 0, 72, 0, 3, 0, 68, 0, 69, 0, 82, 0, 89, 0, 72, 0, 17, 0, 3]
> This looks to me like some UTF16 text, but the codes seem different than what you'd normally expect. I don't understand the encoding. In 1.2.1 this yielded the correct output though ("Look at the picture above"). In the 1.3.1 and the current trunk this is converted to garbage. The culprit is here:
> codeLength = 1;
> String c = font.encode( string, i, codeLength );
> if( c == null && i+1<string.length)
> {
> 	//maybe a multibyte encoding
> 	codeLength++;
> 	c = font.encode( string, i, codeLength );
> }
> So the code first tries to 'encode' a single byte as a character, and then tries two bytes, three bytes etc. First it starts with a 00 byte. In 1.2.1 the PDFont.encode would return null. The program would then try with two bytes getting a correct character on the second attempt.
> In the current trunk the font.encode method returns a space " " when 00 is passed. This is clearly wrong, because afterwards the entire string is parsed incorrectly. I tried to debug further and it seems to me that the problem is in the Encoding class, in the getName method. It looks like this:
> public String getName( int code ) throws IOException
> {
> 	String name = codeToName.get( code );
> 	if( name == null )
> 	{
> 		//lets be forgiving for now
> 		name = "space";
> 	}
> 	return name;
> }
> The crucial bit is the "let's be forgiving for now". If a code is unknown in the encoding, a space is returned. In my case this completely breaks the parsing of a file. 
> What was the rationale behind this behavior? Removing it fixed my problem and didn't break anything. All unit tests of pdfbox pass. The regression tests of my applications (based on the pdf extraction code from the Aperture Framework) also pass. The "forgiving" part has been added in PDFBOX-626, but the issue description doesn't name any reasons for that. If the "forgiveness" is there for a good reason, I'd be grateful for advice how to deal with the problem. Otherwise please remove it.
> Unfortunately I can't share the problem file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PDFBOX-920) PDFStreamEngine.processEncodedText fails on UTF-16 text

Posted by "Antoni Mylka (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka updated PDFBOX-920:
--------------------------------

    Attachment: nullcharactername.patch

A patch which fixes the problem for me.

> PDFStreamEngine.processEncodedText fails on UTF-16 text
> -------------------------------------------------------
>
>                 Key: PDFBOX-920
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-920
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.3.1
>            Reporter: Antoni Mylka
>         Attachments: nullcharactername.patch
>
>
> I have a PDF document which yields gibberish text. When I debug it, I get to the PDFStreamEngine.processEncodedText. The method gets a following byte array:
> [0, 47, 0, 82, 0, 82, 0, 78, 0, 3, 0, 68, 0, 87, 0, 3, 0, 87, 0, 75, 0, 72, 0, 3, 0, -64, 0, 85, 0, 86, 0, 87, 0, 3, 0, 83, 0, 76, 0, 70, 0, 87, 0, 88, 0, 85, 0, 72, 0, 3, 0, 68, 0, 69, 0, 82, 0, 89, 0, 72, 0, 17, 0, 3]
> This looks to me like some UTF16 text, but the codes seem different than what you'd normally expect. I don't understand the encoding. In 1.2.1 this yielded the correct output though ("Look at the picture above"). In the 1.3.1 and the current trunk this is converted to garbage. The culprit is here:
> codeLength = 1;
> String c = font.encode( string, i, codeLength );
> if( c == null && i+1<string.length)
> {
> 	//maybe a multibyte encoding
> 	codeLength++;
> 	c = font.encode( string, i, codeLength );
> }
> So the code first tries to 'encode' a single byte as a character, and then tries two bytes, three bytes etc. First it starts with a 00 byte. In 1.2.1 the PDFont.encode would return null. The program would then try with two bytes getting a correct character on the second attempt.
> In the current trunk the font.encode method returns a space " " when 00 is passed. This is clearly wrong, because afterwards the entire string is parsed incorrectly. I tried to debug further and it seems to me that the problem is in the Encoding class, in the getName method. It looks like this:
> public String getName( int code ) throws IOException
> {
> 	String name = codeToName.get( code );
> 	if( name == null )
> 	{
> 		//lets be forgiving for now
> 		name = "space";
> 	}
> 	return name;
> }
> The crucial bit is the "let's be forgiving for now". If a code is unknown in the encoding, a space is returned. In my case this completely breaks the parsing of a file. 
> What was the rationale behind this behavior? Removing it fixed my problem and didn't break anything. All unit tests of pdfbox pass. The regression tests of my applications (based on the pdf extraction code from the Aperture Framework) also pass. The "forgiving" part has been added in PDFBOX-626, but the issue description doesn't name any reasons for that. If the "forgiveness" is there for a good reason, I'd be grateful for advice how to deal with the problem. Otherwise please remove it.
> Unfortunately I can't share the problem file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.