Posted to dev@pdfbox.apache.org by "Dylan Vaughn (JIRA)" <ji...@apache.org> on 2010/08/10 18:47:16 UTC

[jira] Updated: (PDFBOX-790) Text extraction from PDF generated from MS Word fails

     [ https://issues.apache.org/jira/browse/PDFBOX-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dylan Vaughn updated PDFBOX-790:
--------------------------------

    Attachment: Document2.v1.20100630.pdf

> Text extraction from PDF generated from MS Word fails
> -----------------------------------------------------
>
>                 Key: PDFBOX-790
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-790
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.2.1
>            Reporter: Dylan Vaughn
>         Attachments: Document2.v1.20100630.pdf
>
>
> Extracting text from the attached PDF with PDFBox 1.2.1 produces the following error:
> dylan@dylan-laptop:~/desktop/pdfbox$ java org.apache.pdfbox.ExtractText -console Document2.v1.20100630.pdf 
> Jul 12, 2010 9:00:31 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> WARNING: java.io.IOException: Error: expected hex character and not :32
> java.io.IOException: Error: expected hex character and not :32
> at org.apache.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:336)
> at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:139)
> at org.apache.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:556)
> at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:390)
> at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:386)
> at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:61)
> at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:567)
> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:250)
> at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:208)
> at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:378)
> at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:302)
> at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:258)
> at org.apache.pdfbox.ExtractText.main(ExtractText.java:236)
> Jul 12, 2010 9:00:31 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> WARNING: java.io.IOException: Error: expected hex character and not :32
> java.io.IOException: Error: expected hex character and not :32
> at org.apache.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:336)
> at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:139)
> at org.apache.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:556)
> at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:390)
> at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:386)
> at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
> at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:567)
> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:250)
> at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:208)
> at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:378)
> at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:302)
> at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:258)
> at org.apache.pdfbox.ExtractText.main(ExtractText.java:236)
> Jul 12, 2010 9:00:31 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
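
For readers of the trace above: the ":32" at the end of the message looks like the byte value of the offending character, and 32 is an ASCII space. One plausible reading, which is an assumption and not confirmed in this report, is that the embedded CMap contains a hex string with whitespace between the digits (for example "<00 41>"). The PDF syntax allows whitespace inside hex strings, so a strict reader rejects input that a tolerant one accepts. The sketch below is NOT the FontBox CMapParser code; the class HexTokenSketch and the method readHexToken are hypothetical names used only to illustrate the difference.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from FontBox or PDFBox.
class HexTokenSketch {

    // White-space characters as defined by the PDF syntax: NUL, TAB, LF, FF, CR, SPACE.
    public static boolean isPdfWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // Reads the body of a hex string up to the closing '>'.
    // The opening '<' is assumed to have been consumed by the caller.
    // With skipWhitespace == false, a space triggers an error similar to the one reported.
    public static byte[] readHexToken(InputStream in, boolean skipWhitespace) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, or -1 if none
        int b;
        while ((b = in.read()) != -1 && b != '>') {
            if (skipWhitespace && isPdfWhitespace(b)) {
                continue; // tolerant mode: ignore white space between digits
            }
            int digit = Character.digit(b, 16);
            if (digit < 0) {
                throw new IOException(
                    "Error: expected hex character and not " + (char) b + ":" + b);
            }
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        if (b == -1) {
            throw new IOException("Error: hex string not terminated by '>'");
        }
        if (high >= 0) {
            out.write(high << 4); // odd number of digits: last digit is padded with 0
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = "00 41>".getBytes(StandardCharsets.US_ASCII);

        // Tolerant mode accepts the embedded space and yields two bytes.
        byte[] ok = readHexToken(new ByteArrayInputStream(body), true);
        System.out.println("tolerant: " + ok.length + " bytes");

        // Strict mode fails on the space (byte value 32).
        try {
            readHexToken(new ByteArrayInputStream(body), false);
        } catch (IOException e) {
            System.out.println("strict: " + e.getMessage());
        }
    }
}
```

Whether the attached document actually contains such a hex string would need to be checked against the CMap stream in Document2.v1.20100630.pdf.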

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.