Posted to dev@pdfbox.apache.org by "Nicholas Cottrell (JIRA)" <ji...@apache.org> on 2010/02/14 21:26:27 UTC
[jira] Updated: (PDFBOX-620) Text extract fails on some PDF files but not others...
[ https://issues.apache.org/jira/browse/PDFBOX-620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas Cottrell updated PDFBOX-620:
-------------------------------------
Attachment: pdf620-fails.pdf
This file generates strange errors in extraction
> Text extract fails on some PDF files but not others...
> ------------------------------------------------------
>
> Key: PDFBOX-620
> URL: https://issues.apache.org/jira/browse/PDFBOX-620
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.7.3, 0.8.0-incubator
> Environment: Tried in Java 5 and 6
> Reporter: Nicholas Cottrell
> Attachments: pdf620-fails.pdf, pdf620-works.pdf
>
>
> I am having the same problem with 0.7.3, 0.7.4-dev and 0.8.0. In 0.7.3 I get text with nulls, e.g.:
> "Dermoapo made 'interactive updates' a key part onullits stratenull nullr launnull chinnulla new skincare rannull in a competitive market. nulle resultnullIncreased sales nullr pharmacies that used the updates."
> In 0.8.0 the same passage appears as:
> "Dermoapo made 'interactive updates' a key part o?its strate? ?r laun? chin?a new skincare ran? in a competitive market. ?e result?Increased sales ?r pharmacies that used the updates."
> Maybe this is a font problem? Or encoding? I debugged the code in PDFTextStripper, and these characters already appear in the charactersByArticle field, even before normalization.
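
To make the symptom above easier to spot in bulk, here is a minimal sketch of a check that could be run over already-extracted text. It is not part of PDFBox: the class name ExtractionCheck and the method suspiciousMarkers are hypothetical, and the two patterns are only heuristics based on the samples quoted in this report (a literal "null" glued to a letter, and a "?" directly followed by a letter). It will miss some markers (for example a "?" followed by a space) and may flag legitimate words such as "nullify".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class: flags text that looks like it
// contains placeholders for characters the extractor could not map.
public class ExtractionCheck {
    // Literal "null" glued to a letter on either side, e.g. "onullits" or "nullr".
    private static final Pattern EMBEDDED_NULL =
            Pattern.compile("(?<=\\p{L})null|null(?=\\p{L})");

    // "?" directly followed by a letter, e.g. "o?its" or "?r".
    private static final Pattern EMBEDDED_QUESTION =
            Pattern.compile("\\?(?=\\p{L})");

    // Counts non-overlapping matches of the pattern in the text.
    static int count(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    // Heuristic score: 0 means nothing suspicious was found.
    public static int suspiciousMarkers(String text) {
        return count(EMBEDDED_NULL, text) + count(EMBEDDED_QUESTION, text);
    }

    public static void main(String[] args) {
        System.out.println(suspiciousMarkers("a key part onullits stratenull nullr"));
        System.out.println(suspiciousMarkers("a key part o?its strate? ?r"));
        System.out.println(suspiciousMarkers("Is this null? Yes."));
    }
}
```

A non-zero score only suggests that a file deserves a closer look; it does not say whether the cause is the font, the encoding, or something else.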
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.