Posted to dev@pdfbox.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2010/08/09 15:04:15 UTC
[jira] Commented: (PDFBOX-789) Error by text extraction
[ https://issues.apache.org/jira/browse/PDFBOX-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896528#action_12896528 ]
Jukka Zitting commented on PDFBOX-789:
--------------------------------------
The problem seems to be related to the large COSStream on page 134. I can avoid the issue easily enough with the following patch, but it would be better to find the root cause instead of relying on a workaround like this.
Index: pdfbox/src/main/java/org/apache/pdfbox/cos/COSString.java
===================================================================
--- pdfbox/src/main/java/org/apache/pdfbox/cos/COSString.java (revision 982911)
+++ pdfbox/src/main/java/org/apache/pdfbox/cos/COSString.java (working copy)
@@ -191,7 +191,11 @@
}
catch( NumberFormatException e )
{
- throw new IOException( "Error: Expected hex number, actual='" + hexChars + "'" );
+ retval.append( '?' );
}
}
return retval;
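
For readers without the surrounding source, here is a minimal standalone sketch of what the workaround amounts to: decode a hex string two digits at a time and substitute '?' for any pair that does not parse, instead of aborting with an IOException. The class and method names (LenientHexSketch, decodeHex) are hypothetical and are not part of PDFBox; the loop only approximates the code around the patched catch block and is not a copy of COSString.

```java
// Hypothetical illustration only; not the actual PDFBox COSString code.
final class LenientHexSketch {

    /**
     * Decodes a string of hex digits into characters, one character per
     * pair of digits. A pair that is not valid hex becomes '?' instead of
     * raising an error, mirroring the workaround in the patch.
     */
    static String decodeHex(String hex) {
        StringBuilder retval = new StringBuilder();
        // The PDF specification treats a missing final digit as 0.
        if (hex.length() % 2 != 0) {
            hex = hex + "0";
        }
        for (int i = 0; i < hex.length(); i += 2) {
            String hexChars = hex.substring(i, i + 2);
            try {
                retval.append((char) Integer.parseInt(hexChars, 16));
            } catch (NumberFormatException e) {
                // Lenient path: keep going and mark the bad pair.
                retval.append('?');
            }
        }
        return retval.toString();
    }
}
```

With this sketch, decodeHex("48656C6C6F") yields "Hello", while decodeHex("48ZZ6F") yields "H?o" rather than failing. The trade-off is the same as in the patch: extraction continues, but the malformed input is silently masked, which is why finding the root cause is still preferable.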
> Error by text extraction
> ------------------------
>
> Key: PDFBOX-789
> URL: https://issues.apache.org/jira/browse/PDFBOX-789
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.2.1
> Environment: Windows XP
> Reporter: Slavomir Varchula
> Fix For: 1.3.0
>
> Attachments: pdf_euba.pdf, Skuska.java
>
>
> Hello,
> I tried to extract text from a PDF, and the extraction ended with an error. Attached are the PDF, the source file, and the stack trace.