Posted to dev@pdfbox.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2010/12/15 22:56:02 UTC

[jira] Resolved: (PDFBOX-267) CMap parse fails during text extract

     [ https://issues.apache.org/jira/browse/PDFBOX-267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved PDFBOX-267.
----------------------------------

    Resolution: Incomplete

Test document not available.

> CMap parse fails during text extract
> ------------------------------------
>
>                 Key: PDFBOX-267
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-267
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>            Priority: Minor
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1702313
> Originally submitted by matthillsdon on 2007-04-17 09:21.
> Unfortunately I cannot supply the PDF file.  Any suggestion appreciated.
> Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
>         at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:220)
>         at org.fontbox.cmap.CMapParser.parse(CMapParser.java:79)
>         at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
>         at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
>         at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>         at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>         at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>         at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>         at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>         at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>         at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>         at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>         at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
> ...
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1702313&file_id=226802
> ExtractFonts.java (text/java), 1721 bytes
> A simple program to extract fonts and CMap streams
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> Sorry for the delay.  Updated extract output at
> http://www.hillsdon.net/CMapDocument3.pdf
> Stack trace for text extract as before:
> Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
>         at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:269)
>         at org.fontbox.cmap.CMapParser.parse(CMapParser.java:117)
> ...
> Thanks, Matt.
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> Hi Matt,
> any update?
> Ben
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> OK, I looked at it some more. Please get the latest nightly build and try running ExtractText on your original PDF again.  If it doesn't work, run ExtractFonts again (using the nightly build) and post the results.
> The issue is that there is some extra data at the end of the CMap stream, and tonight I happened to fix a parsing issue involving extra data at the end of the stream for a different user.  I don't know if this is the same issue, but I'd rather have you try the nightly build than have me chasing a ghost.
> Ben
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> Output with the decryption here
> http://www.hillsdon.net/CMapDocument2.pdf
> Thanks.
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> Shoot, I think your document was encrypted.  It needs to be decrypted for the extraction to work; I should have had that as part of the program.  Can you take the attached program, add the following lines after the PDDocument.load call
> if( doc.isEncrypted() )
> {
>     doc.decrypt( "" );
> }
> and resend the CMapDocument.pdf
> Thanks,
> Ben
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> Result too large to attach.  Please see
> http://www.hillsdon.net/CMapDocument.pdf
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> Attached is a simple Java program that will create a new pseudo-PDF document containing just the font information.  Please run it on the problem PDF and upload the resulting CMapDocument.pdf.
> It is a simple command-line program. First compile it, then run it like this:
> java ExtractFonts my.pdf
> Let me know if you have any questions getting it running.
> Ben
> File Added: ExtractFonts.java
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> No change unfortunately - with FontBox-0.2.0-dev-20070424 the stack trace is identical.
> Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
>         at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:269)
> ...
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> I just updated the CMapParser with a fix for a bug reported at
> https://sourceforge.net/forum/message.php?msg_id=4269559
> Please get tonight's FontBox build and give it a try:
> http://www.fontbox.org/fontbox
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> Hi Ben, thanks for the quick response.
> Using the nightly build [1] the stack trace is the same except for line numbers:
> Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
>         at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:269)
>         at org.fontbox.cmap.CMapParser.parse(CMapParser.java:117)
>         at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:509)
>         at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:380)
>         at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>         at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>         at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>         at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>         at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>         at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>         at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>         at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>         at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
> ...
> Extracting the fonts sounds ideal.
> [1] http://www.pdfbox.org/dist/PDFBox-0.7.4-dev-20070418.zip
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> Hi Matt,
> Can you try one thing for me first: upgrade to the latest nightly build of PDFBox ( http://www.pdfbox.org/dist/ ) and see if this is still an issue.  There have been some changes to the CMapParser.
> If it is still an issue, I think we can write a simple program to extract just the fonts from your PDF, and that should be enough for me to fix the bug.
> Ben
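
One comment in the thread above attributes the failure to extra data at the end of the CMap stream. The following is a minimal, self-contained sketch of that idea only: a toy parser that stops at the closing ">>" of a dictionary and ignores whatever follows. It is not FontBox's CMapParser. The class name TrailingDataDemo, both method names, and the whitespace-separated toy syntax are assumptions made for illustration, and the sketch throws an unchecked exception where the stack traces above show an IOException.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not the FontBox implementation. */
final class TrailingDataDemo {

    /** Splits the input into whitespace-separated tokens (toy syntax, assumed). */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /**
     * Parses "<< /Key value ... >>" into an ordered map.
     * Tokens after the closing ">>" are treated as trailing data and ignored.
     */
    static Map<String, String> parseDictionary(String input) {
        List<String> tokens = tokenize(input);
        if (tokens.isEmpty() || !tokens.get(0).equals("<<")) {
            throw new IllegalArgumentException("Error: expected the start of a dictionary.");
        }
        Map<String, String> entries = new LinkedHashMap<>();
        int i = 1;
        while (i < tokens.size() && !tokens.get(i).equals(">>")) {
            String key = tokens.get(i);
            // A key must be a name ("/Key") and must be followed by a value token.
            boolean hasValue = i + 1 < tokens.size() && !tokens.get(i + 1).equals(">>");
            if (!key.startsWith("/") || !hasValue) {
                throw new IllegalArgumentException("Error: expected the end of a dictionary.");
            }
            entries.put(key.substring(1), tokens.get(i + 1));
            i += 2;
        }
        if (i >= tokens.size()) {
            // Ran out of tokens without ever seeing ">>".
            throw new IllegalArgumentException("Error: expected the end of a dictionary.");
        }
        // Anything after index i is trailing data; it is deliberately not inspected.
        return entries;
    }
}
```

Under these assumptions, TrailingDataDemo.parseDictionary("<< /CMapName /Foo /WMode 0 >> stray bytes") returns the two entries and skips "stray bytes", while an input with no closing ">>" is still rejected. Whether the real stream in this issue failed for that reason cannot be confirmed, since the test document was never available.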

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.