Posted to dev@pdfbox.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2009/10/21 12:19:59 UTC
[jira] Updated: (PDFBOX-269) ExtractText errors
[ https://issues.apache.org/jira/browse/PDFBOX-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting updated PDFBOX-269:
---------------------------------
Reporter: Jukka Zitting
Fix Version/s: 0.8.0-incubator
> ExtractText errors
> ------------------
>
> Key: PDFBOX-269
> URL: https://issues.apache.org/jira/browse/PDFBOX-269
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Reporter: Jukka Zitting
> Priority: Minor
> Fix For: 0.8.0-incubator
>
> Attachments: endstream_missing_fix.diff
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1706491
> Originally submitted by wrwessel on 2007-04-24 04:31.
> I wrote a batch file to convert over 500 PowerPoint files to PDF (using DocumentConverter.py and OpenOffice); the batch file then uses ExtractText.exe to extract the text. Most of these files converted fine, but for 4 files ExtractText could not get any text and threw various error messages. I have attached one of these as a sample. I am using version 0.7.4 from 19/5/07 and see the same problem with the 0.7.3 release. It is easy enough for me to convert the last 4 by hand, but this might be a bug you can fix.
> Many thanks for the ExtractText program; it saved me a long time converting these by hand.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1706491&file_id=226382
> Sample.zip (application/x-zip-compressed), 216343 bytes
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES
> user_id=601708
> Originator: NO
> I've looked at the attached PDF. Technically, I believe the root issue is that OpenOffice is not writing the PDF correctly. I have reported the issue to the OpenOffice developers; it can be monitored at http://www.openoffice.org/issues/show_bug.cgi?id=76879
> The issue is that the PDF is sometimes missing the 'endstream' keyword, which PDFBox looks for to tell it that the stream is done.
> My rule of thumb is that if Acrobat can open it, then so should PDFBox, so this is still a 'bug' with PDFBox. Fixing this is possible but is not straightforward, so it may be a little bit before it is complete.
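
To illustrate the kind of leniency discussed above, here is a minimal sketch of one possible fallback: if 'endstream' is not found before the object closes, treat the 'endobj' keyword as the end of the stream data. This is not the attached endstream_missing_fix.diff and not PDFBox's actual parser code; the class and method names (LenientStreamEnd, findStreamEnd, indexOf) are hypothetical. It also assumes the keywords do not occur by accident inside binary stream data, so a real parser would likely trust a valid /Length entry first and trim any end-of-line bytes before the keyword.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of PDFBox).
public class LenientStreamEnd {

    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] ENDOBJ =
            "endobj".getBytes(StandardCharsets.ISO_8859_1);

    /** Index of the first occurrence of needle in data at or after from, or -1. */
    public static int indexOf(byte[] data, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /**
     * Returns the offset where the stream data ends, given the offset where
     * it starts. Prefers 'endstream'; if that keyword is missing, or only
     * appears after the next 'endobj' (so it belongs to a later object),
     * falls back to the offset of 'endobj'. Returns -1 if neither is found.
     */
    public static int findStreamEnd(byte[] data, int start) {
        int endStream = indexOf(data, ENDSTREAM, start);
        int endObj = indexOf(data, ENDOBJ, start);
        if (endStream >= 0 && (endObj < 0 || endStream < endObj)) {
            return endStream;
        }
        return endObj;
    }

    public static void main(String[] args) {
        // A stream object that is missing its 'endstream' keyword.
        byte[] broken = "1 0 obj\n<< /Length 5 >>\nstream\nHello\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] marker = "stream\n".getBytes(StandardCharsets.ISO_8859_1);
        int start = indexOf(broken, marker, 0) + marker.length;
        int end = findStreamEnd(broken, start);
        System.out.println("stream data spans bytes " + start + " to " + end);
    }
}
```

With this approach a well-formed object and one missing 'endstream' yield the same end offset for the stream data, which is the behaviour the "if Acrobat can open it" rule of thumb asks for.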