Posted to dev@pdfbox.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2009/10/21 12:19:59 UTC
[jira] Updated: (PDFBOX-269) ExtractText errors

     [ https://issues.apache.org/jira/browse/PDFBOX-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated PDFBOX-269:
---------------------------------

         Reporter: Jukka Zitting
    Fix Version/s: 0.8.0-incubator

> ExtractText errors
> ------------------
>
>                 Key: PDFBOX-269
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-269
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.8.0-incubator
>
>         Attachments: endstream_missing_fix.diff
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1706491
> Originally submitted by wrwessel on 2007-04-24 04:31.
> I wrote a batch file to convert over 500 PowerPoint files to PDF (using DocumentConverter.py and OpenOffice); the batch file then uses ExtractText.exe to extract the text.  Most of these files converted fine, but for 4 files ExtractText could not extract any text and threw various error messages.  I have attached one of these as a sample.  I am using version 0.7.4 from 19/5/07 and see the same problem with the 0.7.3 release.  It is easy enough for me to convert the last 4 by hand, but this might be a bug you can fix.
> Many thanks for the ExtractText program; it saved me a lot of time compared with converting these by hand.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1706491&file_id=226382
> Sample.zip (application/x-zip-compressed), 216343 bytes
> [comment on SourceForge]
> Originally sent by benlitchfield.
> I've looked at the attached PDF. Technically, I believe the root issue is that OpenOffice is not writing the PDF correctly.  I have submitted the issue to the OpenOffice developers; it can be monitored at http://www.openoffice.org/issues/show_bug.cgi?id=76879
> The issue is that the PDF is sometimes missing the 'endstream' keyword, which PDFBox looks for to tell it that the stream is done.
> My rule of thumb is that if Acrobat can open it, then so should PDFBox, so this is still a 'bug' in PDFBox.  Fixing this is possible but not straightforward, so it may be a little while before it is complete.
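
To illustrate the problem described in the comment above, here is a minimal sketch of a lenient way to find the end of a stream. It is not the PDFBox code and not the contents of endstream_missing_fix.diff; the function names `find_stream_end` and `extract_stream_body` are hypothetical, and the fallback rule (treat 'endobj' as the end of the stream when 'endstream' is missing) is an assumption about one plausible recovery strategy. A real parser would also consult the /Length entry and would have to cope with binary stream data that happens to contain these keywords.

```python
def find_stream_end(data: bytes, start: int) -> int:
    """Return the index just past the last byte of the stream body.

    Assumed recovery rule (hypothetical): use 'endstream' if it appears
    before the object is closed; otherwise fall back to 'endobj', or to
    the end of the data if neither keyword is present.
    """
    endobj = data.find(b"endobj", start)
    endstream = data.find(b"endstream", start)

    # The object closes at 'endobj' (or at end of data if it is missing).
    limit = endobj if endobj != -1 else len(data)

    # Only trust an 'endstream' that belongs to this object.
    end = endstream if endstream != -1 and endstream < limit else limit

    # The end-of-line marker before the keyword is not part of the body.
    if end > start and data[end - 1:end] == b"\n":
        end -= 1
    if end > start and data[end - 1:end] == b"\r":
        end -= 1
    return end


def extract_stream_body(data: bytes) -> bytes:
    """Extract the body of the first stream found in `data`."""
    marker = data.find(b"stream")
    if marker == -1:
        raise ValueError("no 'stream' keyword found")
    start = marker + len(b"stream")

    # Skip the end-of-line marker that follows the 'stream' keyword.
    if data[start:start + 2] == b"\r\n":
        start += 2
    elif data[start:start + 1] == b"\n":
        start += 1

    return data[start:find_stream_end(data, start)]


if __name__ == "__main__":
    # A stream whose 'endstream' keyword is missing, as in the report.
    broken = b"1 0 obj\n<< /Length 5 >>\nstream\nhello\nendobj\n"
    print(extract_stream_body(broken))
```

The comparison against the position of 'endobj' matters when a broken object is followed by a well-formed one: without it, the scan would run on to the next object's 'endstream' and swallow everything in between.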

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.