Posted to dev@pdfbox.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2010/08/09 15:04:15 UTC

[jira] Commented: (PDFBOX-789) Error by text extraction

    [ https://issues.apache.org/jira/browse/PDFBOX-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896528#action_12896528 ] 

Jukka Zitting commented on PDFBOX-789:
--------------------------------------

The problem seems to be related to the large COSStream on page 134. I can avoid the issue easily enough with the following patch, but it would be better to find the root cause instead of relying on a workaround like this.

Index: pdfbox/src/main/java/org/apache/pdfbox/cos/COSString.java
===================================================================
--- pdfbox/src/main/java/org/apache/pdfbox/cos/COSString.java	(revision 982911)
+++ pdfbox/src/main/java/org/apache/pdfbox/cos/COSString.java	(working copy)
@@ -191,7 +191,11 @@
             }
             catch( NumberFormatException e )
             {
-                throw new IOException( "Error: Expected hex number, actual='" + hexChars + "'" );
+                retval.append( '?' );
             }
         }
         return retval;
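
For illustration, the effect of the workaround can be sketched outside of PDFBox as a small standalone decoder. This is not the COSString code itself: the class name LenientHexDecoder and its methods are hypothetical, and the sketch only assumes the behaviour visible in the patch above, namely that a pair of characters that is not valid hex yields a '?' instead of an IOException. Padding an odd trailing digit with 0 follows the PDF specification's rule for hex strings.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical standalone sketch; not part of PDFBox.
public class LenientHexDecoder {

    /** Returns the value of an ASCII hex digit, or -1 if c is not one. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /**
     * Decodes the body of a PDF hex string (without the angle brackets).
     * Invalid pairs are replaced with '?' rather than aborting the parse.
     */
    public static byte[] decode(String hex) {
        // Whitespace is allowed inside PDF hex strings and is ignored.
        String cleaned = hex.replaceAll("\\s+", "");
        // An odd number of digits means the last digit is assumed to be 0.
        if (cleaned.length() % 2 == 1) {
            cleaned += "0";
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < cleaned.length(); i += 2) {
            int hi = hexValue(cleaned.charAt(i));
            int lo = hexValue(cleaned.charAt(i + 1));
            if (hi < 0 || lo < 0) {
                out.write('?');          // workaround: substitute, do not throw
            } else {
                out.write((hi << 4) | lo);
            }
        }
        return out.toByteArray();
    }
}
```

As with the patch, this only hides the symptom: a '?' in the extracted text means the parser was handed data that is not a hex string, which is why the root cause is still worth finding.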


> Error by text extraction
> ------------------------
>
>                 Key: PDFBOX-789
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-789
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>         Environment: Windows XP
>            Reporter: Slavomir Varchula
>             Fix For: 1.3.0
>
>         Attachments: pdf_euba.pdf, Skuska.java
>
>
> Hello,
> I tried to extract text from a PDF and the extraction ended with an error. Attached are the PDF, the source file, and the stack trace.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.