Posted to dev@manifoldcf.apache.org by "Mingchun Zhao (JIRA)" <ji...@apache.org> on 2014/10/25 19:10:34 UTC
[jira] [Comment Edited] (CONNECTORS-1079) the parsing in TikaExtractor always return empty result
[ https://issues.apache.org/jira/browse/CONNECTORS-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184165#comment-14184165 ]
Mingchun Zhao edited comment on CONNECTORS-1079 at 10/25/14 5:10 PM:
---------------------------------------------------------------------
Hi Karl,
Thank you for your help. I've tried your fix.
Unfortunately, the symptom still occurs even when tika-core.jar is present in both the lib and connector-lib directories.
It looks like having the same jar in both places causes a jar conflict.
I tried to fix it with a ClassLoader but eventually gave up, because that made things more confusing.
Could you please confirm my suggestion below:
1. Remove tika-core.jar from the lib directory (does build.xml need to be modified?)
2. Call Tika().detect directly to get the MIME type instead of calling ExtensionMimeMap.mapToMimeType.
The related connectors are as follows (4 files):
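One way to check whether the two copies of the jar really collide is to print where a class was loaded from and by which class loader. The following is only a minimal sketch: JarOrigin is a hypothetical helper, not part of ManifoldCF, and the default class name org.apache.tika.Tika is assumed to be the facade class shipped in tika-core.

```java
import java.net.URL;
import java.security.CodeSource;

/** Hypothetical diagnostic helper; not part of ManifoldCF. */
final class JarOrigin {

  /** Describes where a class was loaded from and by which class loader. */
  static String describe(Class<?> cls) {
    // The code source is null for classes loaded by the bootstrap class loader.
    CodeSource source = cls.getProtectionDomain().getCodeSource();
    URL location = (source == null) ? null : source.getLocation();
    ClassLoader loader = cls.getClassLoader();
    return cls.getName()
        + " loaded from " + (location == null ? "<unknown location>" : location.toString())
        + " by " + (loader == null ? "bootstrap class loader" : loader.toString());
  }

  public static void main(String[] args) {
    // Class names to inspect; the default is an assumption about tika-core.
    String[] names = args.length > 0 ? args : new String[] {"org.apache.tika.Tika"};
    for (String name : names) {
      try {
        System.out.println(describe(Class.forName(name)));
      } catch (ClassNotFoundException e) {
        System.out.println(name + " is not visible to this class loader");
      }
    }
  }
}
```

If describe is called on the same Tika class once from framework code and once from inside a connector, and the two results name different jar locations or different class loaders, that would be consistent with the conflict described above.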
connectors/filesystem/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/filesystem/FileConnector.java
connectors/hdfs/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/hdfs/HDFSRepositoryConnector.java
connectors/jcifs/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/sharedrive/SharedDriveConnector.java
connectors/sharepoint/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/sharepoint/SharePointRepository.java
3. Delete the then-unused ExtensionMimeMap class, which contains just one method that calls Tika().detect to get the MIME type.
framework/core/src/main/java/org/apache/manifoldcf/core/extmimemap/ExtensionMimeMap.java
Thanks.
> the parsing in TikaExtractor always return empty result
> -------------------------------------------------------
>
> Key: CONNECTORS-1079
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1079
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.0
> Reporter: Mingchun Zhao
> Assignee: Karl Wright
> Fix For: ManifoldCF 1.8, ManifoldCF 2.0
>
>
> When I use the latest trunk source (2.0) to try the Tika content extractor, it does not return the expected results.
> Looking at it with debugging tools, I found that the parser of the Tika content extractor does not return any data.
> I tried moving lib/tika-core-1.6.jar into connector-lib/,
> and then the Tika content extractor returned data as expected.
> My configurations are as below:
> ==
> Transformation:
> Type: Tika content extractor
> Output:
> Type: Solr (Use extract update handler=false)
> Repository:
> type: Web
> Job:
> 1.type: repository
> 2.type: transformation
> 3.type: output
> ==
> Maybe it is related to CONNECTORS-1074(?).
> It looks like the location of tika-core-1.6.jar affects the result of TikaExtractor.
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)