Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2014/10/15 13:44:38 UTC

[jira] [Comment Edited] (CONNECTORS-1074) Replace ExtensionMimeMap with new Tika().detect(filename)

    [ https://issues.apache.org/jira/browse/CONNECTORS-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172266#comment-14172266 ] 

Karl Wright edited comment on CONNECTORS-1074 at 10/15/14 11:44 AM:
--------------------------------------------------------------------

Hi Abe-san,

Replacing ExtensionMimeMap everywhere it is used with a Tika.detect(String filename) call, as you propose, would require all the Tika jars and their dependencies to be included in every war file, rather than just once (in connector-lib).  This is because the root class loader would need access to them.  It may be possible to include only one Tika jar and still have the detect(String filename) method work, but you would need to experiment to find out.
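
For comparison, a lookup keyed only on the file extension needs nothing beyond the JDK, so it raises no class loader question.  The following is a minimal sketch under that assumption; the class name ExtensionLookupSketch, its table entries, and the fallback type are hypothetical and do not reflect the actual ExtensionMimeMap API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an extension-based lookup; not ManifoldCF's ExtensionMimeMap.
class ExtensionLookupSketch {
  private static final String UNKNOWN = "application/octet-stream";
  private static final Map<String, String> TABLE = new HashMap<>();
  static {
    // Illustrative entries only; a real table would be far larger.
    TABLE.put("txt", "text/plain");
    TABLE.put("html", "text/html");
    TABLE.put("pdf", "application/pdf");
  }

  // Returns the mapped type, or the fallback when the extension is missing or unknown.
  static String guess(String fileName) {
    int dot = fileName.lastIndexOf('.');
    if (dot < 0 || dot == fileName.length() - 1) {
      return UNKNOWN;
    }
    String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    return TABLE.getOrDefault(ext, UNKNOWN);
  }

  public static void main(String[] args) {
    System.out.println(guess("report.PDF"));
    System.out.println(guess("README"));
  }
}
```

The trade-off is coverage: a table like this only knows the extensions someone has entered, which is the limitation the Tika proposal is trying to remove.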

In the Tika connector itself, the RepositoryDocument.setMimeType() method is supposed to describe the binary stream that you get from RepositoryDocument.getBinaryStream().  Since the output of Tika is always characters, which the Tika transformer converts to utf-8 bytes, the content type should always be "text/plain;charset=utf-8".

If you want to modify the Tika connector to report what the *original* mime type was in some other metadata field, that is fine with me, but you should not call setMimeType() because it will break things.
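
To make the convention concrete, here is a minimal sketch of what I mean.  DocSketch is a hypothetical stand-in for RepositoryDocument, cut down to a few accessors, and the field name "original_mime_type" is an assumption chosen for illustration, not an existing ManifoldCF field.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for RepositoryDocument, reduced to what the sketch needs.
class DocSketch {
  private String mimeType;
  private InputStream binary;
  private final Map<String, String> fields = new HashMap<>();

  void setMimeType(String mimeType) { this.mimeType = mimeType; }
  String getMimeType() { return mimeType; }
  void setBinary(InputStream binary) { this.binary = binary; }
  InputStream getBinaryStream() { return binary; }
  void addField(String name, String value) { fields.put(name, value); }
  String getField(String name) { return fields.get(name); }
}

class ExtractedTextSketch {
  // Assumed field name; any name that does not collide with existing metadata would do.
  static final String ORIGINAL_MIME_FIELD = "original_mime_type";

  // Builds the document a text-extracting transformer would hand downstream.
  static DocSketch describe(String extractedText, String originalMimeType) {
    DocSketch doc = new DocSketch();
    byte[] utf8 = extractedText.getBytes(StandardCharsets.UTF_8);
    doc.setBinary(new ByteArrayInputStream(utf8));
    // The mime type describes the stream set above, which is now UTF-8 text.
    doc.setMimeType("text/plain;charset=utf-8");
    // The source format travels separately, as ordinary metadata.
    doc.addField(ORIGINAL_MIME_FIELD, originalMimeType);
    return doc;
  }

  public static void main(String[] args) {
    DocSketch doc = describe("hello", "application/pdf");
    System.out.println(doc.getMimeType());
    System.out.println(doc.getField(ORIGINAL_MIME_FIELD));
  }
}
```

The point of the split is that anything reading the stream can trust the mime type, while anything that cares about the source format can still find it in the metadata.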



was (Author: kwright@metacarta.com):
Hi Abe-san,

the RepositoryDocument.setMimeType() method is supposed to describe the binary stream that you get from RepositoryDocument.getBinaryStream().  Since the output of Tika is always characters, which the Tika transformer converts to utf-8 bytes, the content type should always be "text/plain;charset=utf-8".

If you want to modify the Tika connector to report what the *original* mime type was in some other metadata field, that is fine with me, but you should not call setMimeType() because it will break things.


> Replace ExtensionMimeMap with new Tika().detect(filename)
> ---------------------------------------------------------
>
>                 Key: CONNECTORS-1074
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1074
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework core
>            Reporter: Shinichiro Abe
>             Fix For: ManifoldCF 2.0
>
>
> It would be nice if we could support many more MIME types, since ManifoldCF is already using Tika.
> {noformat}
>  new Tika().detect(fileName);
> {noformat}
> returns the MIME type as a String. We could then pass it to RepositoryDocument#setMimeType(mimeType) in each connector.
> Tika reference:
> [javadoc|http://tika.apache.org/1.6/api/org/apache/tika/Tika.html]
> [test code|http://svn.apache.org/viewvc/tika/tags/1.6-rc2/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java?view=markup]


