Posted to dev@pdfbox.apache.org by "Brian Carrier (JIRA)" <ji...@apache.org> on 2008/09/17 22:23:44 UTC
[jira] Created: (PDFBOX-373) (null) printed when characters cannot be decoded during text extraction
(null) printed when characters cannot be decoded during text extraction
-----------------------------------------------------------------------
Key: PDFBOX-373
URL: https://issues.apache.org/jira/browse/PDFBOX-373
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 0.8.0-incubator
Reporter: Brian Carrier
Fix For: 0.8.0-incubator
We have some PDF files where the TO_UNICODE map is corrupt and PDFBox cannot extract the text. font.encode() returns null and PDFStreamEngine.showString() adds the null to the result, which is then printed as "(null)".
Here is a patch (against the trunk) that replaces the null with "?".
--- PDFStreamEngine.java 2008-09-17 16:09:13.529318500 -0400
+++ PDFStreamEngine-new.java 2008-09-17 16:12:51.617318500 -0400
@@ -422,6 +422,11 @@
}
}
+ // Replace a null entry with "?" so it is not printed as "(null)"
+ if (c == null)
+ {
+ c = "?";
+ }
totalStringWidth += width;
stringResult.append( c );
}
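For readers who want to see the effect in isolation, here is a minimal, self-contained sketch of the guard the patch adds. It does not reproduce PDFStreamEngine; the class name, method names, and the String array standing in for per-character decode results are all hypothetical. It relies only on the standard Java behaviour that appending a null String to a StringBuffer appends the literal text "null".

```java
public class NullGlyphSketch {
    // Placeholder used for undecodable characters; "?" matches the patch above.
    static final String PLACEHOLDER = "?";

    // Hypothetical stand-in for the loop before the patch:
    // each element plays the role of one decoded character, null = decode failed.
    static String joinUnguarded(String[] decoded) {
        StringBuffer result = new StringBuffer();
        for (String c : decoded) {
            // StringBuffer.append(String) appends the text "null" for a null argument
            result.append(c);
        }
        return result.toString();
    }

    // Same loop with the guard from the patch applied before appending.
    static String joinGuarded(String[] decoded) {
        StringBuffer result = new StringBuffer();
        for (String c : decoded) {
            if (c == null) {
                c = PLACEHOLDER;
            }
            result.append(c);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Assumed input: the third character could not be decoded.
        String[] decoded = { "P", "D", null, "F" };
        System.out.println(joinUnguarded(decoded));
        System.out.println(joinGuarded(decoded));
    }
}
```

With the guard in place, each undecodable character contributes exactly one "?" to the extracted text, so the output length still tracks the number of characters shown.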
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PDFBOX-373) (null) printed when characters cannot be decoded during text extraction
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved PDFBOX-373.
----------------------------------
Resolution: Fixed
Assignee: Jukka Zitting
Good point, thanks!
I committed a slightly modified version of your patch (I merged the if statement with the preceding one) in revision 703273.