Posted to oak-issues@jackrabbit.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2015/05/20 17:44:59 UTC
[jira] [Created] (OAK-2895) Provide config option to exclude certain mimeTypes from getting indexed
Chetan Mehrotra created OAK-2895:
------------------------------------
Summary: Provide config option to exclude certain mimeTypes from getting indexed
Key: OAK-2895
URL: https://issues.apache.org/jira/browse/OAK-2895
Project: Jackrabbit Oak
Issue Type: Improvement
Components: lucene
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
Priority: Minor
Fix For: 1.3.0, 1.2.3, 1.0.15
Currently the recommended way to exclude certain types of files from getting indexed is to map them to {{EmptyParser}} in the Tika config. However, looking at how Tika works, even if the mimetype is provided as part of the metadata, the Tika Detector tries to determine the mimetype by actually reading some bytes from the InputStream [1] before looking it up in the passed Metadata. This causes unnecessary IO when a large number of binaries are excluded.
To avoid this IO, we should expose a multi-valued config property that takes a list of mimetypes to be excluded from indexing. If the mimeType provided as part of the JCR data is in that excluded list, then the call to Tika should be avoided.
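
A minimal sketch of the proposed check, in plain Java with no Oak or Tika dependency. It is not the actual Oak implementation: the class name MimeTypeExclusionSketch, its methods, and the use of a Supplier to stand in for the Tika call are all hypothetical, and the normalization of the JCR-provided mimeType (dropping parameters such as a charset, ignoring case) is an assumption about how the lookup might be done.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical helper: decides from the JCR-provided mimeType alone whether
// text extraction should be skipped, so the binary stream is never opened.
public class MimeTypeExclusionSketch {
    private final Set<String> excluded = new HashSet<>();

    // excludedMimeTypes stands in for the proposed multi-valued config property.
    public MimeTypeExclusionSketch(String... excludedMimeTypes) {
        for (String type : excludedMimeTypes) {
            if (type != null) {
                excluded.add(normalize(type));
            }
        }
    }

    // Assumption: compare only the base type, case-insensitively,
    // e.g. "Application/ZIP; charset=binary" becomes "application/zip".
    static String normalize(String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public boolean isExcluded(String jcrMimeType) {
        return jcrMimeType != null && excluded.contains(normalize(jcrMimeType));
    }

    // The parser supplier represents the expensive call into Tika. It is
    // invoked only when the mimeType is not in the excluded list.
    public String extractText(String jcrMimeType, Supplier<String> parser) {
        if (isExcluded(jcrMimeType)) {
            return "";
        }
        return parser.get();
    }
}
```

With such a helper, a caller would pass the mimeType stored with the JCR data together with a lambda that wraps the real parse call. For an excluded type the lambda is never run, so no bytes are read from the binary.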
[1] https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L446
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)