Posted to commits@oodt.apache.org by "Rishi Verma (JIRA)" <ji...@apache.org> on 2015/05/19 02:26:00 UTC

[jira] [Created] (OODT-848) AutoDetectProductCrawler's mimeExtractorRepo argument overridden by Tika

Rishi Verma created OODT-848:
--------------------------------

Summary: AutoDetectProductCrawler's mimeExtractorRepo argument overridden by Tika
Key: OODT-848
URL: https://issues.apache.org/jira/browse/OODT-848
Project: OODT
Issue Type: Bug
Components: crawler, metadata container
Affects Versions: 0.8.1
Reporter: Rishi Verma
Assignee: Rishi Verma
Fix For: 0.9

AutoDetectProductCrawler [1] cannot use custom extractors specified via the mimeExtractorRepo argument when they rely on common file glob patterns. In other words, if the user has a custom "mime-extractor-map.xml" that references a custom "mime-types.xml", and together they map specific glob patterns to specific extractors, that mapping is overridden by Tika's default glob mappings whenever Tika finds a match internally. As a result, for many basic file types, such as text files, AutoDetectProductCrawler identifies the mime type as "text/plain" no matter what mime type the user has specified in their own mime-types.xml. This is a problem if one has multiple extractors that need to distinguish between different kinds of text/plain files.

This problem appeared when I upgraded from OODT 0.7 to 0.8.1: OODT 0.7 used Tika 0.8, whereas OODT 0.8.1 uses Tika 1.7.

Recreating the problem:
1. Make a custom extractor that handles a file of type text/plain
2. In your mime-extractor-map.xml, add a mime type for your custom extractor
3. In your mime-types.xml, add a glob pattern that matches your file name and assign it to the mime type from (2)
4. Run crawler_launcher using AutoDetectProductCrawler, and you'll find that your text file will NOT match your extractor in OODT v0.8.1
i.e. OODT will tell you:
WARNING: No extractor specs specified for /your/text/file
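
To illustrate why step 4 ends in that warning, here is a minimal, self-contained Java sketch of a mime-type-to-extractor lookup. It is not OODT code: the class ExtractorLookupSketch, the mime type "text/x-my-report" and the extractor name "com.example.MyReportExtractor" are all hypothetical, and the map merely stands in for what mime-extractor-map.xml would declare. The point is only that an extractor registered under a custom mime type is never found once detection reports "text/plain" instead.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ExtractorLookupSketch {

    // Hypothetical stand-in for mime-extractor-map.xml: mime type -> extractor class names.
    static final Map<String, List<String>> EXTRACTOR_MAP = new HashMap<>();
    static {
        EXTRACTOR_MAP.put("text/x-my-report",
                Collections.singletonList("com.example.MyReportExtractor"));
    }

    // Returns the extractors registered for the detected mime type, or an empty list.
    static List<String> extractorSpecsFor(String detectedMimeType) {
        return EXTRACTOR_MAP.getOrDefault(detectedMimeType, Collections.emptyList());
    }

    // Mimics the outcome seen in step 4: a warning when nothing is registered.
    static String describe(String file, String detectedMimeType) {
        List<String> specs = extractorSpecsFor(detectedMimeType);
        if (specs.isEmpty()) {
            return "WARNING: No extractor specs specified for " + file;
        }
        return file + " -> " + specs;
    }

    public static void main(String[] args) {
        // Detection honours the custom mime type: the extractor is found.
        System.out.println(describe("/your/text/file", "text/x-my-report"));
        // Detection falls back to text/plain: no extractor is registered for it.
        System.out.println(describe("/your/text/file", "text/plain"));
    }
}
```

In this sketch the lookup itself behaves correctly; the failure comes entirely from the mime type handed to it.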

Tracing the flow of the problem:
1. AutoDetectProductCrawler calls "passesPreconditions" method
2. AutoDetectProductCrawler#passesPreconditions calls MimeExtractorRepo#getExtractorSpecsForFile [2]
3. MimeExtractorRepo#getExtractorSpecsForFile calls MimeTypeUtils#getMimeType [3]
4. MimeTypeUtils#getMimeType calls Tika#detect, where MimeTypeUtils's constructor has loaded a Tika instance using DefaultDetector [4]
5. DefaultDetector#getDefaultDetectors [4] specifies that the user-provided mime-types.xml file must take LAST precedence. Thus, Tika's default, internal mime-type mappings will override mime-types.xml.
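
The precedence effect described in step 5 can be modelled with a small Java sketch. This is a deliberately simplified model, not Tika's implementation: it assumes that detectors are consulted in order and that the first match wins, which may differ from how Tika actually combines detector results. SuffixDetector, the "_report.txt" suffix and the mime type "text/x-my-report" are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DetectorPrecedenceSketch {

    /** Toy detector mapping file-name suffixes to mime types (hypothetical, not a Tika class). */
    static final class SuffixDetector {
        private final Map<String, String> suffixToType = new LinkedHashMap<>();

        SuffixDetector map(String suffix, String mimeType) {
            suffixToType.put(suffix, mimeType);
            return this;
        }

        /** Returns the mapped mime type, or null when no suffix matches. */
        String detect(String fileName) {
            for (Map.Entry<String, String> entry : suffixToType.entrySet()) {
                if (fileName.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
            return null;
        }
    }

    // Stand-in for the library's built-in glob mappings.
    static SuffixDetector builtIn() {
        return new SuffixDetector().map(".txt", "text/plain");
    }

    // Stand-in for the user's mime-types.xml.
    static SuffixDetector userDefined() {
        return new SuffixDetector().map("_report.txt", "text/x-my-report");
    }

    /** Assumed combination rule: the first detector in the list that matches wins. */
    static String detect(List<SuffixDetector> ordered, String fileName) {
        for (SuffixDetector detector : ordered) {
            String type = detector.detect(fileName);
            if (type != null) {
                return type;
            }
        }
        return "application/octet-stream";
    }

    // User mappings consulted last, as described in step 5.
    static String detectUserLast(String fileName) {
        return detect(Arrays.asList(builtIn(), userDefined()), fileName);
    }

    // User mappings consulted first, the ordering the reporter expects.
    static String detectUserFirst(String fileName) {
        return detect(Arrays.asList(userDefined(), builtIn()), fileName);
    }

    public static void main(String[] args) {
        System.out.println(detectUserLast("daily_report.txt"));
        System.out.println(detectUserFirst("daily_report.txt"));
    }
}
```

Under this model, putting the user mappings last lets the generic ".txt" rule claim the file before the user's more specific pattern is ever consulted, which matches the behaviour reported above.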

--
[1] https://github.com/apache/oodt/blob/trunk/crawler/src/main/java/org/apache/oodt/cas/crawl/AutoDetectProductCrawler.java
[2] https://github.com/apache/oodt/blob/trunk/crawler/src/main/java/org/apache/oodt/cas/crawl/typedetection/MimeExtractorRepo.java
[3] https://github.com/apache/oodt/blob/trunk/metadata/src/main/java/org/apache/oodt/cas/metadata/util/MimeTypeUtils.java
[4] https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/detect/DefaultDetector.java
