Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2010/03/10 19:39:27 UTC
[jira] Resolved: (PDFBOX-55) Invalid character while extracting text from a chinese pdf
[ https://issues.apache.org/jira/browse/PDFBOX-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-55.
--------------------------------------
Resolution: Fixed
Fix Version/s: 1.1.0
After resolving PDFBOX-654 the extraction works like a charm.
> Invalid character while extracting text from a chinese pdf
> ----------------------------------------------------------
>
> Key: PDFBOX-55
> URL: https://issues.apache.org/jira/browse/PDFBOX-55
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Fix For: 1.1.0
>
> Attachments: PDFBOX55-ENUtxt.pdf
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1185058
> Originally submitted by seblaunay on 2005-04-18 01:59.
> First, thanks for this wonderful API.
> I have a problem extracting text from a PDF document
> provided with Adobe Acrobat Reader: ENUtxt.pdf.
> The PDF contains text in Chinese fonts which cannot
> be extracted.
> But it also contains this text (as extracted with xpdf or
> Acrobat Reader):
> ---------------------------------------
> Lorem ipsum dolor
> ad minim
> ---------------------------------------
> The problem is that PDFTextStripper.writeText writes
> something like this to my Writer:
> ---------------------------------------
> -PSFNJQTVNEPMPS
> BENJOJNWFSOJBNôH
> ---------------------------------------
> And between these valid characters there are these
> invalid characters:
> 0x0, 0x1, 0x5, 0x6, 0x18.
> Because I write the content of the document to XML via SAX,
> the resulting XML is not valid: it contains
> invalid characters...
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1185058&file_id=130664
> ENUtxt.pdf (application/pdf), 7582 bytes
> The pdf used
> [comment on SourceForge]
> Originally sent by seblaunay.
> Document to test added.
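For anyone who hits the same XML validity problem on an older release, the
control characters listed in the report (0x0, 0x1, 0x5, 0x6, 0x18) can be
dropped before the text reaches the SAX handler. The following is only a
minimal sketch of that workaround, not the fix that was applied to PDFBox.
It assumes XML 1.0 output, and the class and method names (XmlCharFilter,
isValidXmlChar, stripInvalidXmlChars) are hypothetical and not part of the
PDFBox API.

```java
// Hypothetical helper, not part of PDFBox: removes code points that the
// XML 1.0 "Char" production does not allow.
final class XmlCharFilter {

    private XmlCharFilter() {
    }

    /** Returns true if the code point may appear in an XML 1.0 document. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the text with all invalid XML 1.0 characters removed. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Iterate by code point so surrogate pairs are kept together;
            // an unpaired surrogate falls outside the valid ranges and is dropped.
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The extracted text would be passed through stripInvalidXmlChars before being
handed to the XML writer. This only keeps the output well-formed; it does not
repair the wrongly decoded letters shown in the report.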