Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/01/10 14:29:00 UTC

[jira] [Comment Edited] (TIKA-3634) Failed to Parser Apple related files

    [ https://issues.apache.org/jira/browse/TIKA-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472053#comment-17472053 ] 

Tim Allison edited comment on TIKA-3634 at 1/10/22, 2:28 PM:
-------------------------------------------------------------

Thank you for submitting the bug and sharing triggering files.

A couple of items unrelated to the problem:
 * AppleSingleFileParser does not handle iWork files.  It is for a completely unrelated file format: [https://en.wikipedia.org/wiki/AppleSingle_and_AppleDouble_formats]
 * You shouldn't need to add tika-parser-zip-commons or tika-parser-apple-module.  These should be included in tika-parsers-standard-package.  If they're not, that's a serious problem; please open a separate ticket.

I regret I'm still not clear on what we need to fix.

With Tika 1.28, I get {{application/vnd.apple.unknown.13}} for the *.numbers and *.pages files, and {{application/vnd.apple.keynote.13}} for the *.key file.  No attachments or text are extracted from any of them.

With Tika 2.2.1, I get {{application/vnd.apple.unknown.13}} for all three (*.pages, *.key, and *.numbers files), but then the PackageParser parses all embedded files that Tika supports.

What is the desired behavior?

As you've pointed out, we don't have a parser for these formats, and writing one would be non-trivial. :(

My guess is that you want the same detection as in 1.28, but with the parsing of all component files?
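
To make "the parsing of all component files" concrete: the attached iWork files appear to be zip-based packages (an assumption, though consistent with the package parser handling them in 2.2.1), so their component files can be listed with the JDK alone. The following is a minimal sketch, not Tika's implementation; the class name ListPackageEntries and the method listEntries are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: lists the component files inside a zip-based package.
public class ListPackageEntries {

    /** Returns the names of all non-directory entries, in archive order. */
    public static List<String> listEntries(Path pkg) throws IOException {
        List<String> names = new ArrayList<>();
        // ZipFile throws a ZipException if the file is not a valid zip archive.
        try (ZipFile zip = new ZipFile(pkg.toFile())) {
            for (ZipEntry entry : Collections.list(zip.entries())) {
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Each argument is treated as the path of one package file.
        for (String arg : args) {
            System.out.println(arg);
            for (String name : listEntries(Paths.get(arg))) {
                System.out.println("  " + name);
            }
        }
    }
}
```

Assuming Java 11 or later, this can be run directly against one of the attachments, e.g. java ListPackageEntries.java brochure.pages. Each name it prints is a component file that a package parser could hand to whichever parser supports that component's type.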



> Failed to Parser Apple related files
> ------------------------------------
>
>                 Key: TIKA-3634
>                 URL: https://issues.apache.org/jira/browse/TIKA-3634
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.2.1
>            Reporter: Tika User
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: brochure.pages, keynotecreated.key, mortgagecalculator.numbers
>
>
> Unable to parse '.numbers', '.key', and '.pages' files using the class below in the XML file (org.apache.tika.parser.apple.AppleSingleFileParser).
> Getting unknown mimetype: application/vnd.apple.unknown.13
> Using all these modules:
> tika-core, tika-parsers-standard-package, tika-parser-microsoft-module, tika-parser-sqlite3-package, tika-parser-scientific-module, tika-parser-zip-commons, tika-parser-apple-module



--
This message was sent by Atlassian Jira
(v8.20.1#820001)