Posted to dev@pdfbox.apache.org by "DURGA DEEP (JIRA)" <ji...@apache.org> on 2008/08/26 21:18:46 UTC

[jira] Created: (PDFBOX-372) java.io.IOException: Error: expected hex character and not :32

java.io.IOException: Error: expected hex character and not  :32
---------------------------------------------------------------

                 Key: PDFBOX-372
                 URL: https://issues.apache.org/jira/browse/PDFBOX-372
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.7.3
         Environment: Solaris OS JDK 6
            Reporter: DURGA DEEP
             Fix For: 0.7.3



Unable to parse the following PDF Attachment. 

java.io.IOException: Error: expected hex character and not  :32
        at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:283)
        at org.fontbox.cmap.CMapParser.parse(CMapParser.java:105)
        at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
        at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
        at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
        at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (PDFBOX-372) java.io.IOException: Error: expected hex character and not :32

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier resolved PDFBOX-372.
----------------------------------

    Resolution: Cannot Reproduce

Resolving based on Justin's comment. 


[jira] Commented: (PDFBOX-372) java.io.IOException: Error: expected hex character and not :32

Posted by "Justin LeFebvre (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696232#action_12696232 ] 

Justin LeFebvre commented on PDFBOX-372:
----------------------------------------

I ran this through the trunk version of PDFBox and had no issues extracting the text. I believe that the changes Brian and I made to the Parser fixed this issue.
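
For readers wondering about the message itself: 32 is the character code of a space, so the exception suggests the CMap parser met a space while reading a hex string. The PDF specification says white space inside a hexadecimal string is ignored, and an odd number of digits is padded with a trailing 0. Below is a minimal sketch of a decoder that tolerates this. It is not the actual FontBox/PDFBox change; the class LenientHexString and its decode method are hypothetical names used only for illustration.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of PDFBox or FontBox.
public class LenientHexString {

    // Returns the value of a hex digit, or -1 if c is not one.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Decodes the text between '<' and '>' of a hex string,
    // skipping the PDF white-space characters instead of failing on them.
    static byte[] decode(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // high nibble waiting for its partner, or -1
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f' || c == 0) {
                continue; // white space is ignored inside hex strings
            }
            int digit = hexValue(c);
            if (digit < 0) {
                throw new IllegalArgumentException(
                    "expected hex character and not " + c + ":" + (int) c);
            }
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        if (high >= 0) {
            out.write(high << 4); // odd digit count: the last digit is assumed to be 0
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = decode("00 41");
        System.out.println(bytes.length); // prints 2
    }
}
```

A stricter parser that rejects the space would fail on input like "00 41", which matches the symptom reported here, though without the original CMap stream that remains a guess.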


[jira] Updated: (PDFBOX-372) java.io.IOException: Error: expected hex character and not :32

Posted by "DURGA DEEP (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DURGA DEEP updated PDFBOX-372:
------------------------------

    Attachment: Webmail02.pdf

Trying to extract the contents gives me the following error.

        String contents = "";
        /** PDFTextStripper Object. **/
        PDFTextStripper pdftextstrip = null;
        /** A new PDFParser Instance. **/
        PDFParser pdfp = null;
        /** PDDocument Object. **/
        PDDocument pdfDocument = null;
        /** The document metadata. **/
        PDDocumentInformation pdfDocumentInfo = null;

        try {

           pdfp = new PDFParser(isr);
           // pdfp.parse() is not thread safe; ensure only
           // one PDFConverter is calling it at a time
           // otherwise chance of one thread getting stuck at
           // org.pdfbox.cos.COSNumber.<clinit>(COSNumber.java:49)
           synchronized (PDFConverter.class) {
                 // This will parse the stream and create the PDF document.
                 pdfp.parse();
           }
           pdfDocument = pdfp.getPDDocument();
           pdfDocumentInfo = pdfDocument.getDocumentInformation();
           pdftextstrip = new PDFTextStripper();
           contents = pdftextstrip.getText(pdfDocument);

           try {
               // convert first page to image object.
               PDPage firstPage
                  = (PDPage)
                    pdfDocument.getDocumentCatalog().getAllPages().get(0);
               image = firstPage.convertToImage();
           } catch (Exception ex) {
                if (LOGGER.isLoggable(Level.WARNING)) {
                    String msg = "Unable to convert PDF to image: ";
                    LOGGER.log(Level.WARNING, msg + ex);
                }
           }
        } catch (IOException ioe) {
            String msg = "Error parsing the inputstream";
            if (LOGGER.isLoggable(Level.FINEST)) {
                LOGGER.log(Level.FINEST, msg+ioe.getMessage());
            }
            ioe.printStackTrace();
            throw new IssException(msg,
                       IssException.Reason.INDEX_DOCUMENT_FAILURE);
        } finally {
            PDFont.clearResources();
            try {
                if (null != pdfDocument) {
                    pdfDocument.close();
                }
            } catch (IOException ioe) {
                String msg = "Unable to close pdfDocument";
                if (LOGGER.isLoggable(Level.FINEST)) {
                    LOGGER.log(Level.FINEST, msg+ioe.getMessage());
                }
            } finally {
                try {
                    if (isr != null) {
                        isr.close();
                    }
                } catch (IOException ioe) {
                    String msg = "Unable to close the Input Stream";
                    if (LOGGER.isLoggable(Level.FINEST)) {
                        LOGGER.log(Level.FINEST, msg+ioe.getMessage());
                    }
                }
            }
        }
        return contents;


[jira] Updated: (PDFBOX-372) java.io.IOException: Error: expected hex character and not :32

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated PDFBOX-372:
---------------------------------

    Fix Version/s:     (was: 0.7.3)
