Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/09/19 19:01:10 UTC

[jira] [Created] (TIKA-722) Arabic PDF doesn't extract correctly

Arabic PDF doesn't extract correctly
------------------------------------

                 Key: TIKA-722
                 URL: https://issues.apache.org/jira/browse/TIKA-722
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
            Priority: Minor


I have a PDF w/ Arabic font that Tika fails to extract (gets all
gibberish).

Looks like the PDF does not include the separate Unicode text metadata
(hmm: would Tika extract that if it were present?), and copy/paste out
of the PDF also produces gibberish.

To fix this I think we'd somehow have to know the mapping for the
font (this particular font is AXTManal)?


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (TIKA-722) Arabic PDF doesn't extract correctly

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated TIKA-722:
-------------------------------

    Attachment: metadata.png

I checked this file: that's exactly the type of file I am talking about. Here is the metadata, attached as a screenshot: Power Macintosh in 1999 with Acrobat Distiller 3.0, which embedded only subsets of the fonts. At that time, the Macintosh did not even know Unicode...



[jira] [Commented] (TIKA-722) Arabic PDF doesn't extract correctly

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107988#comment-13107988 ] 

Uwe Schindler commented on TIKA-722:
------------------------------------

I don't think there is much we can do. Some PDF files (especially those created by, e.g., LaTeX via dvips -> pdf; pdflatex mostly works fine) use internal, dynamically compressed fonts that have their glyphs at totally different places. This is often done when the PDF creator uses antique software/fonts that only know 256 code points (pre-Unicode times). In that case, the font file contains only the glyphs actually present in the text, compressed into the available code points.

Those PDFs are unparseable, and full-text extraction does not even work with Acrobat Reader. But they are still valid PDF files, as they are intended to be printed out. This is like a PDF file containing only a background TIFF image instead of text: the text cannot be extracted.
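
To make the subsetting problem concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how Acrobat Distiller or Tika actually work: the function names, the choice to start codes at 0x21, and the sample string are all hypothetical. It assigns each distinct character a compact code in order of first use, the way a subset font might, and shows that the bytes are meaningless without the code-to-character table.

```python
def build_subset_encoding(text):
    """Assign each distinct character a compact code in order of first use.

    Hypothetical scheme: codes start at 0x21 so they land on printable
    ASCII positions, mimicking a font limited to 256 code points.
    """
    char_to_code = {}
    for ch in text:
        if ch not in char_to_code:
            char_to_code[ch] = 0x21 + len(char_to_code)
    return char_to_code


def encode(text, char_to_code):
    """Produce the byte string a content stream would carry for this text."""
    return bytes(char_to_code[ch] for ch in text)


def naive_extract(data):
    """What an extractor sees when no code-to-Unicode table is available."""
    return data.decode("latin-1")


def extract_with_map(data, char_to_code):
    """Recover the text when the code-to-character table is known."""
    code_to_char = {code: ch for ch, code in char_to_code.items()}
    return "".join(code_to_char[b] for b in data)


if __name__ == "__main__":
    original = "مرحبا بالعالم"  # sample Arabic text, chosen for illustration
    mapping = build_subset_encoding(original)
    data = encode(original, mapping)
    print(naive_extract(data))              # unrelated ASCII punctuation
    print(extract_with_map(data, mapping))  # the original text again
```

The page still renders correctly because the embedded subset font draws the right glyph for each code; only the path from code back to character is lost.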


[jira] [Updated] (TIKA-722) Arabic PDF doesn't extract correctly

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated TIKA-722:
-------------------------------

    Attachment: JUFO96.PDF

Here is a non-Persian example (which is actually a very, very early write-up of mine, back in 1996, from my personal archive - don't read it). If you try to copy/paste text out of it you will see the same problem. It's also Acrobat Distiller 3.0 with font subsets.


[jira] [Commented] (TIKA-722) Arabic PDF doesn't extract correctly

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108031#comment-13108031 ] 

Michael McCandless commented on TIKA-722:
-----------------------------------------

Thanks Uwe; it sounds like there's not much we can do for such old PDFs.


[jira] [Commented] (TIKA-722) Arabic PDF doesn't extract correctly

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119403#comment-13119403 ] 

Robert Muir commented on TIKA-722:
----------------------------------

Actually, in this case the original TTF font (AxtManal) is buggy.
The font uses glyph codes with a Unicode mapping (1-1 to their Unicode chars), but the glyph names are WRONG.

So Arabic glyphs in this font have misleading names like 'circumflex', which really confused
whatever produced this PDF when it embedded the font. You can see this if you open the original TTF
in FontForge; it gives tons of warnings:

'The glyph named circumflex is mapped to U+F0F6 But its name indicates it should be mapped to U+02C6'

It's not possible to open the embedded font in the PDF; it claims it's corrupted :)
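
The kind of consistency check behind that warning can be sketched in a few lines of Python. This is a hypothetical illustration, not FontForge's implementation: the table below is a tiny hand-picked excerpt of standard glyph-name assignments, and `find_name_mismatches` and the sample font data are made up for the example. Only the circumflex / U+F0F6 / U+02C6 case comes from the warning quoted above.

```python
# Tiny excerpt of standard glyph-name-to-Unicode assignments (illustrative,
# not the full list a real tool would consult).
EXPECTED_BY_NAME = {
    "space": 0x0020,
    "A": 0x0041,
    "circumflex": 0x02C6,
}


def find_name_mismatches(font_cmap, expected=EXPECTED_BY_NAME):
    """Report glyphs whose name disagrees with the code point they map to.

    font_cmap maps glyph name -> code point the font actually assigns.
    Names missing from the expected table are skipped, not flagged.
    """
    warnings = []
    for name, actual in sorted(font_cmap.items()):
        wanted = expected.get(name)
        if wanted is not None and wanted != actual:
            warnings.append(
                f"The glyph named {name} is mapped to U+{actual:04X} "
                f"but its name indicates it should be mapped to U+{wanted:04X}"
            )
    return warnings


if __name__ == "__main__":
    # Hypothetical font data; the circumflex entry mirrors the warning above.
    sample_font = {"space": 0x0020, "circumflex": 0xF0F6}
    for line in find_name_mismatches(sample_font):
        print(line)
```

A text extractor that has no explicit code-to-Unicode table may fall back on glyph names to guess the characters, which is one plausible way misleading names like these turn into gibberish.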

                

[jira] [Updated] (TIKA-722) Arabic PDF doesn't extract correctly

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-722:
------------------------------------

    Attachment: 000279.pdf


[jira] [Resolved] (TIKA-722) Arabic PDF doesn't extract correctly

Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-722.
-------------------------------------

    Resolution: Won't Fix

OK resolving as Won't Fix.

I don't see how Tika can recover when the font itself is buggy... though it is tantalizing that the glyph IDs for this font are in fact Unicode code points.

I just hope there are not too many buggy fonts out there!
                