Posted to dev@pdfbox.apache.org by "DURGA DEEP (JIRA)" <ji...@apache.org> on 2008/08/26 21:18:46 UTC
[jira] Created: (PDFBOX-372) java.io.IOException: Error: expected hex character and not :32
java.io.IOException: Error: expected hex character and not :32
---------------------------------------------------------------
Key: PDFBOX-372
URL: https://issues.apache.org/jira/browse/PDFBOX-372
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 0.7.3
Environment: Solaris OS JDK 6
Reporter: DURGA DEEP
Fix For: 0.7.3
Unable to parse the attached PDF. Text extraction fails with the following exception:
java.io.IOException: Error: expected hex character and not :32
    at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:283)
    at org.fontbox.cmap.CMapParser.parse(CMapParser.java:105)
    at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
    at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
    at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
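The trailing 32 in the message is most likely the decimal value of the byte the parser rejected, which would make it an ASCII space inside a <...> hex string of the font's CMap. In PDF and PostScript syntax, white space inside a hex string is ignored, so a tolerant reader can skip it. The sketch below illustrates that idea only. HexStringSketch and parseHexString are hypothetical names, and this is not the fontbox CMapParser code or the change that was made on trunk.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical helper for illustration; not part of fontbox or PDFBox.
final class HexStringSketch {

    /** Decodes a hex string token such as "<0041 0042>", skipping white space. */
    static byte[] parseHexString(String token) throws IOException {
        int end = token.length() - 1;
        if (end < 1 || token.charAt(0) != '<' || token.charAt(end) != '>') {
            throw new IOException("Error: expected a hex string enclosed in < and >");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pending = -1; // high nibble waiting for its low nibble, or -1 if none
        for (int i = 1; i < end; i++) {
            char c = token.charAt(i);
            if (isWhiteSpace(c)) {
                continue; // tolerate spaces and line breaks between digits
            }
            int digit = hexValue(c);
            if (digit < 0) {
                // Message format modeled on the one in the report above.
                throw new IOException(
                        "Error: expected hex character and not " + c + ":" + (int) c);
            }
            if (pending < 0) {
                pending = digit;
            } else {
                out.write((pending << 4) | digit);
                pending = -1;
            }
        }
        if (pending >= 0) {
            out.write(pending << 4); // odd digit count: the last digit is padded with 0
        }
        return out.toByteArray();
    }

    // White-space characters as defined by the PDF syntax: NUL, TAB, LF, FF, CR, SPACE.
    private static boolean isWhiteSpace(char c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // The space between the two codes is skipped instead of raising an error.
        byte[] decoded = parseHexString("<0041 0042>");
        System.out.println(decoded.length + " bytes decoded");
    }
}
```

A stricter parser that treats the space as an error would fail on such input in the way the trace shows, whereas the tolerant version only rejects characters that are neither hex digits nor white space.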
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PDFBOX-372) java.io.IOException: Error: expected hex character and not :32
Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian Carrier resolved PDFBOX-372.
----------------------------------
Resolution: Cannot Reproduce
Resolving based on Justin's comment.
[jira] Commented: (PDFBOX-372) java.io.IOException: Error: expected hex character and not :32
Posted by "Justin LeFebvre (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696232#action_12696232 ]
Justin LeFebvre commented on PDFBOX-372:
----------------------------------------
I ran this through the trunk version of PDFBox and had no issues extracting the text. I believe the changes Brian and I made to the parser fixed this issue.
[jira] Updated: (PDFBOX-372) java.io.IOException: Error: expected hex character and not :32
Posted by "DURGA DEEP (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
DURGA DEEP updated PDFBOX-372:
------------------------------
Attachment: Webmail02.pdf
Extracting the contents with the code below gives me the error reported in this issue.
String contents = "";
/** PDFTextStripper Object. **/
PDFTextStripper pdftextstrip = null;
/** A new PDFParser Instance. **/
PDFParser pdfp = null;
/** PDDocument Object. **/
PDDocument pdfDocument = null;
/** The document metadata. **/
PDDocumentInformation pdfDocumentInfo = null;
try {
    pdfp = new PDFParser(isr);
    // pdfp.parse() is not thread safe; ensure only
    // one PDFConverter is calling it at a time
    // otherwise chance of one thread getting stuck at
    // org.pdfbox.cos.COSNumber.<clinit>(COSNumber.java:49)
    synchronized (PDFConverter.class) {
        // This will parse the stream and create the PDF document.
        pdfp.parse();
    }
    pdfDocument = pdfp.getPDDocument();
    pdfDocumentInfo = pdfDocument.getDocumentInformation();
    pdftextstrip = new PDFTextStripper();
    contents = pdftextstrip.getText(pdfDocument);
    try {
        // convert first page to image object.
        PDPage firstPage = (PDPage)
            pdfDocument.getDocumentCatalog().getAllPages().get(0);
        image = firstPage.convertToImage();
    } catch (Exception ex) {
        if (LOGGER.isLoggable(Level.WARNING)) {
            String msg = "Unable to convert PDF to image: ";
            LOGGER.log(Level.WARNING, msg + ex);
        }
    }
} catch (IOException ioe) {
    String msg = "Error parsing the inputstream";
    if (LOGGER.isLoggable(Level.FINEST)) {
        LOGGER.log(Level.FINEST, msg + ioe.getMessage());
    }
    ioe.printStackTrace();
    throw new IssException(msg,
        IssException.Reason.INDEX_DOCUMENT_FAILURE);
} finally {
    PDFont.clearResources();
    try {
        if (null != pdfDocument) {
            pdfDocument.close();
        }
    } catch (IOException ioe) {
        String msg = "Unable to close pdfDocument";
        if (LOGGER.isLoggable(Level.FINEST)) {
            LOGGER.log(Level.FINEST, msg + ioe.getMessage());
        }
    } finally {
        try {
            if (isr != null) {
                isr.close();
            }
        } catch (IOException ioe) {
            String msg = "Unable to close the Input Stream";
            if (LOGGER.isLoggable(Level.FINEST)) {
                LOGGER.log(Level.FINEST, msg + ioe.getMessage());
            }
        }
    }
}
return contents;
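The nested finally blocks above close the document first and the input stream second, logging close failures instead of propagating them. As a rough sketch, the same cleanup could be expressed with a small helper. CleanupSketch and closeQuietly are hypothetical names, and plain java.io.Closeable objects stand in for the document and the stream here; whether the PDFBox 0.7.3 classes can be passed in directly is an assumption that would need checking.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for illustration; not part of PDFBox.
final class CleanupSketch {
    private static final Logger LOGGER =
            Logger.getLogger(CleanupSketch.class.getName());

    /**
     * Closes the resource if it is non-null. A failure is logged with its
     * cause rather than thrown, so later cleanup steps still run.
     * Returns true if nothing went wrong.
     */
    static boolean closeQuietly(Closeable resource, String description) {
        if (resource == null) {
            return true;
        }
        try {
            resource.close();
            return true;
        } catch (IOException ioe) {
            // Passing the exception keeps the stack trace in the log record.
            LOGGER.log(Level.FINEST, "Unable to close " + description, ioe);
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the document and the input stream in the code above.
        Closeable document = () -> { throw new IOException("simulated failure"); };
        Closeable stream = () -> System.out.println("stream closed");

        // The stream is still closed even though closing the document fails.
        closeQuietly(document, "pdfDocument");
        closeQuietly(stream, "the input stream");
    }
}
```

With such a helper, the finally block would reduce to two calls in the same order as the original, without nesting a second try/finally inside the first.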
[jira] Updated: (PDFBOX-372) java.io.IOException: Error: expected hex character and not :32
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting updated PDFBOX-372:
---------------------------------
Fix Version/s: (was: 0.7.3)