Posted to dev@tika.apache.org by "Robert Muir (Commented) (JIRA)" <ji...@apache.org> on 2011/10/03 19:07:33 UTC

[jira] [Commented] (TIKA-722) Arabic PDF doesn't extract correctly

    [ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119403#comment-13119403 ] 

Robert Muir commented on TIKA-722:
----------------------------------

Actually, in this case the original TTF font (AxtManal) is buggy.
The font uses glyph codes with a Unicode mapping (1-1 to their Unicode chars), but the glyph names are WRONG.

So Arabic glyphs in this font have misleading names like 'circumflex', which left
whatever produced this PDF really confused when it embedded the font. You can see this if you open the original TTF
in FontForge; it gives tons of warnings like:

'The glyph named circumflex is mapped to U+F0F6 But its name indicates it should be mapped to U+02C6'

It's not possible to open the embedded font in the PDF; it claims it's corrupted :)
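
To make the name/mapping mismatch concrete, here is a minimal Python sketch of the kind of consistency check that produces a warning like the one quoted above. It is only an illustration, not FontForge's or Tika's actual code: the function names, the tiny glyph-name table (a three-entry excerpt in the style of the Adobe Glyph List) and the sample cmap are all assumptions made up for the example.

```python
# Hypothetical, tiny excerpt of a glyph-name table (Adobe Glyph List style):
# glyph name -> the code point that name conventionally stands for.
AGL_SUBSET = {
    "circumflex": 0x02C6,
    "A": 0x0041,
    "space": 0x0020,
}


def find_name_mismatches(cmap, glyph_names=AGL_SUBSET):
    """Return (name, mapped, expected) for glyphs whose name disagrees with the cmap.

    `cmap` maps a code point to the glyph name the font assigns to it.
    Glyph names missing from `glyph_names` are skipped, since nothing
    can be said about them.
    """
    mismatches = []
    for codepoint, name in sorted(cmap.items()):
        expected = glyph_names.get(name)
        if expected is not None and expected != codepoint:
            mismatches.append((name, codepoint, expected))
    return mismatches


def format_warning(name, mapped, expected):
    """Render one mismatch in the style of the warning quoted above."""
    return (f"The glyph named {name} is mapped to U+{mapped:04X} "
            f"but its name indicates it should be mapped to U+{expected:04X}")


# Made-up cmap: one sane entry, plus one glyph that sits at a private-use
# code point while carrying an unrelated Latin glyph name.
sample_cmap = {0x0041: "A", 0xF0F6: "circumflex"}

for mismatch in find_name_mismatches(sample_cmap):
    print(format_warning(*mismatch))
```

Any tool that trusts the glyph names over the code points would treat such a glyph as a circumflex rather than the Arabic character it actually draws, which is consistent with the gibberish seen on extraction.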

                
> Arabic PDF doesn't extract correctly
> ------------------------------------
>
>                 Key: TIKA-722
>                 URL: https://issues.apache.org/jira/browse/TIKA-722
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: 000279.pdf, JUFO96.PDF, metadata.png
>
>
> I have a PDF w/ Arabic font that Tika fails to extract (gets all
> gibberish).
> Looks like the PDF does not include the separate Unicode text metadata
> (hmm: would Tika extract that if it were present?), and copy/paste out
> of the PDF also produces gibberish.
> To fix this I think we'd somehow have to know the mapping for the
> font (this particular font is AXTManal)?
