Posted to dev@pdfbox.apache.org by reinhard schwab <re...@aon.at> on 2010/09/08 00:42:09 UTC

Re: text extraction

Andreas Lehmkühler wrote:
> Hi,
>
>
> Sent: Sat, 04 Sep 2010 From: reinhard schwab<re...@aon.at>
>
>   
>> extracted text with
>>
>> PDDocument doc = PDDocument.load(new URL(
>>     "http://people.ischool.berkeley.edu/~hearst/irbook/print/chap10.pdf"));
>> PDFTextStripper stripper = new PDFTextStripper();
>> stripper.writeText(doc, new OutputStreamWriter(System.out));
>>
>> looks like this
>>
>> ¡ ¢¤£¦¥¨§ª© ­®©°¯±¢²§ª³ ´¶µ¸·¹¢º© » ¥¼µ½§?·?¥??¼´²Â
>>  "!$#&%ª')(+* ,-%ª.?/0%?132"%?45.?6
>> ,-.7'84:97!;.7'< "!>=?.ª!>'?*ª1A@B.C4®*
>> ACM Press
>> New York
>> Addison-Wesley
>> D)EGFIH J>KMLON8P$QRH ESPUT?V?WYXZE>TR[\PUQ]L_^`E>ababE>cedgfUahX;ijija
>>     
> The PDF mentioned uses Type 3 fonts for most of the text. This font type consists of glyphs for every single letter and doesn't have any encoding. In most cases this kind of text content can't be extracted; even Acrobat Reader won't do it (try selecting some of the text and copying and pasting it into a text editor: the text will be scrambled).
>
> BR
> Andreas Lehmkühler
>
>   
hi,
so what is pdfbox doing now with such fonts?
when i try to extract some text from a pdf file, i expect to get
readable text.
i use pdfbox through the tika api.
the code is:

        // parser, contentType, content and metadata come from the surrounding code (not shown)
        if ("application/pdf".equals(contentType)) {
            parser = new PDFParser();
        }
        InputStream responseBody = new ByteArrayInputStream(content);

        // collect the body text, allowing up to 10,000,000 characters
        ContentHandler textHandler = new BodyContentHandler(10000000);
        ParseContext pc = new ParseContext();
        try {
            parser.parse(responseBody, textHandler, metadata, pc);
        } catch (Exception e) {
            e.printStackTrace();
        }

should the PDFParser in Tika catch this, or should pdfbox catch this, or
should my application interfacing Tika catch this?
i now have to check the text returned by Tika for such unreadable text
because i index it with lucene etc.
can pdfbox tell that it can't extract the text in this situation?
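
to make clear what kind of check i mean on the application side, here is a
minimal sketch. it is only a heuristic and not anything pdfbox or tika
provide: the class ReadableTextCheck, its methods and the 0.5 threshold are
all made up for illustration. it counts how many whitespace-separated tokens
look like words and rejects text where too few do. scrambled output that
happens to consist only of letters would still slip through.

```java
class ReadableTextCheck {

    // hypothetical cut-off, not a tuned value: below this share of
    // word-like tokens the text is treated as unreadable
    static final double MIN_WORD_RATIO = 0.5;

    // a token counts as word-like if, after trimming ASCII punctuation from
    // both ends, it is non-empty and contains only letters, '-' or '\''
    static boolean isWordLike(String token) {
        String core = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
        if (core.isEmpty()) {
            return false;
        }
        for (int i = 0; i < core.length(); i++) {
            char c = core.charAt(i);
            if (!Character.isLetter(c) && c != '-' && c != '\'') {
                return false;
            }
        }
        return true;
    }

    // share of whitespace-separated tokens that are word-like (0.0 for empty text)
    static double wordRatio(String text) {
        int total = 0;
        int wordLike = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            if (isWordLike(token)) {
                wordLike++;
            }
        }
        return total == 0 ? 0.0 : (double) wordLike / total;
    }

    static boolean looksReadable(String text) {
        return wordRatio(text) >= MIN_WORD_RATIO;
    }

    public static void main(String[] args) {
        // one readable and one scrambled sample taken from the output above
        System.out.println(looksReadable("ACM Press New York Addison-Wesley"));
        System.out.println(looksReadable("D)EGFIH J>KMLON8P$QRH"));
    }
}
```

in my case the check would run on textHandler.toString() after
parser.parse(...) returns, before the text is handed to lucene.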

is there no chance to translate or map these glyphs back into characters?

best regards
reinhard