Posted to dev@tika.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2011/09/19 19:11:10 UTC

[jira] [Commented] (TIKA-722) Arabic PDF doesn't extract correctly

    [ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107988#comment-13107988 ] 

Uwe Schindler commented on TIKA-722:
------------------------------------

I don't think there is much we can do. Some PDF files (especially those created with LaTeX via dvips -> pdf; pdflatex mostly works fine) use internal, dynamically compressed fonts whose glyphs sit at completely different code positions. This often happens when the PDF creator uses antique software or fonts that only know 256 code points (pre-Unicode era). In that case, the font file contains only the glyphs actually present in the text, packed into the code points available.

Text cannot be recovered from those PDFs; full-text extraction does not even work with Acrobat Reader. They are still valid PDF files, though, as they are intended to be printed out. This is like a PDF file that contains only a background TIFF image instead of text: the text cannot be extracted.
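
To illustrate the font re-encoding described above, here is a minimal Python sketch. It is a toy model, not how PDFBox or Tika work internally: all function names are hypothetical, the "font" is just a dict, and real PDFs are considerably more involved. It only shows why the stored codes are meaningless without the code-to-character mapping.

```python
def build_subset_encoding(text):
    """Assign each distinct character a compact code (1, 2, 3, ...)
    in order of first use, as a subsetting font embedder might.
    Hypothetical helper for illustration only."""
    encoding = {}
    for ch in text:
        if ch not in encoding:
            encoding[ch] = len(encoding) + 1
    if len(encoding) > 255:
        raise ValueError("more distinct glyphs than single-byte codes")
    return encoding


def encode(text, encoding):
    """What ends up in the page content: one byte per glyph code."""
    return bytes(encoding[ch] for ch in text)


def naive_extract(codes):
    """An extractor with no mapping can only guess a standard
    single-byte encoding (Latin-1 is assumed here), so it gets gibberish."""
    return codes.decode("latin-1")


def extract_with_map(codes, to_unicode):
    """With a code -> character table, the text is recoverable."""
    return "".join(to_unicode[c] for c in codes)


original = "سلام عليكم"
encoding = build_subset_encoding(original)
codes = encode(original, encoding)

# The reverse table plays the role of the mapping the problem PDF lacks.
to_unicode = {code: ch for ch, code in encoding.items()}

print(repr(naive_extract(codes)))
print(extract_with_map(codes, to_unicode))
```

The renderer never needs the reverse table, because it only draws the glyph stored at each code. That is why such a file prints correctly while extraction and copy/paste yield gibberish.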

> Arabic PDF doesn't extract correctly
> ------------------------------------
>
>                 Key: TIKA-722
>                 URL: https://issues.apache.org/jira/browse/TIKA-722
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: 000279.pdf
>
>
> I have a PDF w/ Arabic font that Tika fails to extract (gets all
> gibberish).
> Looks like the PDF does not include the separate Unicode text metadata
> (hmm: would Tika extract that if it were present?), and copy/paste out
> of the PDF also produces gibberish.
> To fix this I think we'd somehow have to know the mapping for the
> font (this particular font is AXTManal)?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira