Posted to dev@manifoldcf.apache.org by "Shinichiro Abe (JIRA)" <ji...@apache.org> on 2014/06/24 16:58:25 UTC

[jira] [Created] (CONNECTORS-984) Give Tika's metadata some hints

Shinichiro Abe created CONNECTORS-984:
-----------------------------------------

             Summary: Give Tika's metadata some hints
                 Key: CONNECTORS-984
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-984
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Amazon CloudSearch output connector
    Affects Versions: ManifoldCF 1.7
            Reporter: Shinichiro Abe
             Fix For: ManifoldCF 1.7


Component: Tika connector

Currently, the trunk code does not set any data in Tika's metadata object.
We probably need to give the metadata some hints so that Tika can detect the document type and extract its content:
* resourceName
* ContentType
* stream size
* charset (new feature)
* password handling (new feature)
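
As a rough illustration of the hints listed above, here is a minimal sketch that collects them before parsing. It uses a plain java.util.Map as a stand-in for Tika's metadata object so that it runs without Tika on the classpath. The class name, the method name, and the key strings are assumptions made for illustration (the keys are modelled on the names Tika is believed to use); this is not the connector's actual code. Password handling is left out, since it would probably be supplied through the parse context rather than through the metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers the hints a connector could pass to Tika.
// A Map stands in for Tika's metadata object; the key strings are assumptions.
class TikaHintsSketch {
    static final String RESOURCE_NAME = "resourceName";
    static final String CONTENT_TYPE = "Content-Type";
    static final String CONTENT_LENGTH = "Content-Length";
    static final String CONTENT_ENCODING = "Content-Encoding";

    // Only hints that are actually known are added; unknown values are skipped
    // so that the parser is not misled by empty or bogus entries.
    static Map<String, String> buildHints(String fileName, String mimeType,
                                          long length, String charset) {
        Map<String, String> hints = new LinkedHashMap<>();
        if (fileName != null) {
            hints.put(RESOURCE_NAME, fileName);
        }
        if (mimeType != null) {
            hints.put(CONTENT_TYPE, mimeType);
        }
        if (length >= 0) {
            hints.put(CONTENT_LENGTH, Long.toString(length));
        }
        if (charset != null) {
            hints.put(CONTENT_ENCODING, charset);
        }
        return hints;
    }

    public static void main(String[] args) {
        // Example: a PDF whose charset is unknown, so no encoding hint is set.
        System.out.println(buildHints("report.pdf", "application/pdf", 1024L, null));
    }
}
```

With real Tika, each entry would presumably be copied into the metadata object handed to the parser.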

Also, when a TikaException (e.g. a parsing error in PDFBox or POI) is thrown, we need to decide whether to ignore it and still index the document. Solr Cell has an 'ignoreTikaException' parameter for this. When a TikaException is thrown and the parameter is true, only the metadata is indexed; when it is false, Solr responds with a server error and the document is not indexed.
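
The behaviour described above could look roughly like the following sketch. It is self-contained and does not use Tika: ParseFailure stands in for TikaException, Extractor for the parser call, and every class, method, and flag name here is hypothetical. It only illustrates the "metadata only when ignoring, error otherwise" decision; it is neither Solr Cell's implementation nor a proposal for the exact ManifoldCF API.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch of an "ignore parse exception" switch.
class IgnoreParseFailureSketch {
    // Stand-in for TikaException.
    static class ParseFailure extends Exception {
        ParseFailure(String message) {
            super(message);
        }
    }

    // Stand-in for the call that extracts body text from raw content.
    interface Extractor {
        String extract(byte[] content) throws ParseFailure;
    }

    // What would be sent to the index: metadata plus (possibly empty) body.
    static final class IndexedDoc {
        final Map<String, String> metadata;
        final String body;

        IndexedDoc(Map<String, String> metadata, String body) {
            this.metadata = metadata;
            this.body = body;
        }
    }

    // If extraction fails and ignoreParseFailure is true, keep the metadata
    // and index an empty body; otherwise rethrow so the document is rejected.
    static IndexedDoc process(byte[] content, Map<String, String> metadata,
                              Extractor extractor, boolean ignoreParseFailure)
            throws ParseFailure {
        try {
            return new IndexedDoc(metadata, extractor.extract(content));
        } catch (ParseFailure e) {
            if (ignoreParseFailure) {
                return new IndexedDoc(metadata, "");
            }
            throw e;
        }
    }

    public static void main(String[] args) throws ParseFailure {
        Extractor broken = content -> {
            throw new ParseFailure("simulated parsing error");
        };
        Map<String, String> meta = Collections.singletonMap("resourceName", "bad.pdf");
        IndexedDoc doc = process(new byte[0], meta, broken, true);
        System.out.println("metadata=" + doc.metadata + ", body length=" + doc.body.length());
    }
}
```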

Reference (Solr Cell):
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ExtractingDocumentLoader.java?view=markup#l142



--
This message was sent by Atlassian JIRA
(v6.2#6252)