Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/06/17 21:55:08 UTC

[jira] [Closed] (PDFBOX-322) Don't want gibberish character when extracting text.

     [ https://issues.apache.org/jira/browse/PDFBOX-322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-322.
------------------------------

    Resolution: Won't Fix

There's no way to know what is and isn't gibberish, so this can't really be done. The good news is that text extraction is much more reliable than when this issue was created.
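
As a partial, caller-side workaround, extracted text can be post-filtered for characters that are almost never meaningful. The sketch below is only an illustration: it does not use any PDFBox API, the class and method names (GibberishFilter, stripSuspectCharacters, isSuspect) are hypothetical, and the choice of "suspect" categories is an assumption. It cannot recognize gibberish made of ordinary letters, which is the general case described above.

```java
public final class GibberishFilter {

    private GibberishFilter() {}

    /**
     * Returns a copy of the extracted text without code points that are
     * very unlikely to be meaningful (heuristic, see isSuspect).
     */
    public static String stripSuspectCharacters(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        extracted.codePoints()
                 .filter(cp -> !isSuspect(cp))
                 .forEach(out::appendCodePoint);
        return out.toString();
    }

    /**
     * Assumption: control characters (other than common whitespace),
     * private-use, unassigned and unpaired-surrogate code points, and
     * U+FFFD (the replacement character) are treated as suspect.
     */
    static boolean isSuspect(int cp) {
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false; // keep ordinary line breaks and tabs
        }
        if (cp == 0xFFFD) {
            return true;
        }
        switch (Character.getType(cp)) {
            case Character.CONTROL:
            case Character.PRIVATE_USE:
            case Character.UNASSIGNED:
            case Character.SURROGATE:
                return true;
            default:
                return false;
        }
    }
}
```

A caller would pass in whatever string the text extraction returned, for example GibberishFilter.stripSuspectCharacters(text). Wrongly mapped glyphs that come out as normal letters or digits pass through unchanged.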

> Don't want gibberish character when extracting text.
> ----------------------------------------------------
>
>                 Key: PDFBOX-322
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-322
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Priority: Minor
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1827099
> Originally submitted by ibuzz on 2007-11-06 11:53.
> Hi,
> I'm not sure if it's a bug, but I don't really like getting gibberish characters when extracting text using a custom encoding. It's cool when debugging, but in real-world situations we don't necessarily want to extract and see them.
> It would be cool if the method "font.encode( string, i, codeLength );" had an option to return nothing (if the font uses a custom encoding).
> Another way would be to give us a flag that makes it easy to identify a custom encoding for a font (in PDFont?).
> Thanks!


