You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2013/03/13 18:20:13 UTC
[jira] [Closed] (PDFBOX-82) Strange encoding in PDF-file

     [ https://issues.apache.org/jira/browse/PDFBOX-82?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler closed PDFBOX-82.
------------------------------------

    Resolution: Not A Problem
      Assignee: Andreas Lehmkühler

The pdf uses a non human readable encoding, even acrobat reader can't extract the text.

Set to closed.
                
> Strange encoding in PDF-file
> ----------------------------
>
>                 Key: PDFBOX-82
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-82
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Assignee: Andreas Lehmkühler
>         Attachments: PDFBOX82-strangeEncoding.pdf
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1261693
> Originally submitted by mai00 on 2005-08-16 23:16.
> I've got a PDF-file with a strange encoding. When I use
> ExtractText it gives me lines of 
> MT73MT110MT102MT111MT45MT66.... 
> I tried to use other extracting tools like pdftohtml
> and it could extract the text in plain English.
> Maybe I'm just doing something wrong with the encoding
> but I tried different things and could't extract the
> right text.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1261693&file_id=145929
> strangeEncoding.pdf (application/pdf), 155806 bytes
> PDF-File with strange encoding
> [comment on SourceForge]
> Originally sent by mai00.
> Logged In: YES 
> user_id=1310361
> Hello,
> I have found more files which are strange encoded but e.g.
> can be extracted by Adobe. I though it might have to do with
> the Font. All of these files don't provide a BaseFont and I
> think it might be because maybe this files were created on
> Unix systems. Maybe if somehow the font can be provided or
> someone can find out which font is used one can find out how
> to decode the text.
> [comment on SourceForge]
> Originally sent by salchow.
> Logged In: YES 
> user_id=1316887
> I have got exactly the same problem. I registred a bug (ID 
> 1250097) and asked a question in the forum at the begining 
> of August under the title : "Text Extraction unsuccessful with 
> PDFBox". Unfortunately, I haven't got any solution yet. 
> If you find a way to solve this problem, i am interested in ...
> Thanks
> Salchow

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira