Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/02/05 14:13:39 UTC

[jira] [Comment Edited] (TIKA-1854) Include the storage class ID of documents embedded in MS Office documents

    [ https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134083#comment-15134083 ] 

Tim Allison edited comment on TIKA-1854 at 2/5/16 1:13 PM:
-----------------------------------------------------------

Will commit shortly.  Thank you for the patch and test case!

First, out of curiosity, how will you make use of the storage class ids?  

bq. By the way, the Content-Type of the embedded document IS already available, but this only works for some popular formats (e.g. embedded MS Office documents).

I'm not sure exactly what you mean by "is already available".  Do you mean that the MSOffice document often carries a metadata item for an attachment that identifies the attachment's mime type, and that we therefore shouldn't bother running Tika's detection code on the embedded file?

bq. Is there a way for clients to configure the Content-Type detection for more exotic formats?

Do you mean generally for Tika or specifically within MSOffice docs based on the internally stored metadata around an attachment?  If generally for Tika, see e.g. [our documentation|https://tika.apache.org/1.11/parser_guide.html] or [this Stack Overflow question|http://stackoverflow.com/questions/30895761/how-to-add-new-mime-type-to-apache-tika].  If you mean specifically within MSOffice, it would be great if you could submit another patch for the more exotic formats; but no, I don't think there is currently a way to configure that within MSOffice docs. :)




was (Author: tallison@mitre.org):
Will commit shortly.  Thank you for the patch and test case!

First, out of curiosity/ignorance, how will you make use of the storage class ids?  What do they actually mean?


bq. By the way, the Content-Type of the embedded document IS already available, but this only works for some popular formats (e.g. embedded MS Office documents).

I'm not sure what exactly you mean by "is already available"?  Do you mean that the MSOffice document often has a metadata item around an attachment which identifies the attachment's mime type and we therefore shouldn't bother running Tika's detection code on an embedded file?

bq. Is there a way for clients to configure the Content-Type detection for more exotic formats?

Do you mean generally for Tika or specifically within MSOffice docs based on the internally stored metadata around an attachment?  If generally for Tika, see e.g. [our documentation|https://tika.apache.org/1.11/parser_guide.html] or [so|http://stackoverflow.com/questions/30895761/how-to-add-new-mime-type-to-apache-tika].  If you mean specifically within MSOffice, it would be great if you could submit another patch for the more exotic formats; but no, I don't think there is currently a way to configure that within MSOffice docs. :)



> Include the storage class ID of documents embedded in MS Office documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1854
>                 URL: https://issues.apache.org/jira/browse/TIKA-1854
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Daniel Bonniot de Ruisselet
>            Assignee: Tim Allison
>         Attachments: class-id.patch
>
>
> When processing embedded documents using an EmbeddedDocumentExtractor, the storage class ID of the embedded document would be useful metadata to have, but it's currently missing.
> I'll promptly attach a patch implementing and testing this new feature.
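
For readers unfamiliar with storage class IDs: in an OLE2 compound file, each storage carries a 16-byte CLSID that identifies the application or object type that created it. The sketch below is only an illustration of how such an ID is usually rendered as text; it is not the attached patch and not Tika or POI code. The class name ClassIdFormatter and its format method are hypothetical. It assumes the usual on-disk GUID layout, in which the first three fields are little-endian and the last eight bytes are stored in order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for illustration only; not part of Tika or POI. */
final class ClassIdFormatter {

    /** Renders a 16-byte on-disk CLSID in the braced GUID text form. */
    static String format(byte[] raw) {
        if (raw == null || raw.length != 16) {
            throw new IllegalArgumentException("a CLSID is exactly 16 bytes");
        }
        // The first three GUID fields are stored little-endian.
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        int data1 = buf.getInt();      // 4 bytes
        short data2 = buf.getShort();  // 2 bytes
        short data3 = buf.getShort();  // 2 bytes

        StringBuilder sb = new StringBuilder(38);
        sb.append('{');
        sb.append(String.format("%08X-%04X-%04X-",
                data1, data2 & 0xFFFF, data3 & 0xFFFF));
        // The remaining eight bytes are written in storage order,
        // with a dash after the first two of them.
        for (int i = 8; i < 16; i++) {
            sb.append(String.format("%02X", raw[i] & 0xFF));
            if (i == 9) {
                sb.append('-');
            }
        }
        sb.append('}');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample bytes chosen for illustration; this value is commonly
        // associated with embedded Word 97-2003 documents.
        byte[] sample = {
            0x06, 0x09, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00,
            (byte) 0xC0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x46
        };
        System.out.println(format(sample));
    }
}
```

A client that receives such an ID as metadata could compare the formatted string against a table of known class IDs to recognise embedded object types that byte-level detection does not cover; which metadata key carries the ID depends on the committed patch.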



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)