Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2014/06/24 17:30:26 UTC

[jira] [Commented] (CONNECTORS-984) Give Tika's metadata some hints

    [ https://issues.apache.org/jira/browse/CONNECTORS-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042254#comment-14042254 ] 

Karl Wright commented on CONNECTORS-984:
----------------------------------------

This all sounds very reasonable.  Abe-san, do you want to propose a patch?

> Give Tika's metadata some hints
> -------------------------------
>
>                 Key: CONNECTORS-984
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-984
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Amazon CloudSearch output connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Shinichiro Abe
>             Fix For: ManifoldCF 1.7
>
>
> Component: Tika connector
> Currently, the trunk code does not set any data in Tika's Metadata object.
> We likely need to give the metadata some hints so that Tika can detect the type of a document and extract from it:
> * resourceName
> * ContentType
> * stream size
> * charset (new feature)
> * password handling (new feature)
> Also, when a TikaException is thrown (e.g. a parsing error in PDFBox/POI), we need to decide whether or not to ignore the document being parsed. Solr Cell has an 'ignoreTikaException' parameter for this: when a TikaException is thrown and the parameter is true, only the metadata is indexed; when it is false, Solr responds with a server error and the document is not indexed.
> Reference (Solr Cell):
> http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ExtractingDocumentLoader.java?view=markup#l142
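
The description above could be sketched roughly as follows. This is only an illustration, not the patch: it uses plain Java so it runs without Tika on the classpath, and the names TikaHintsSketch, ParseFailure, Extractor, Result, buildHints and process are all hypothetical. The map keys are assumed to match the strings behind Tika's Metadata constants (RESOURCE_NAME_KEY, CONTENT_TYPE, CONTENT_LENGTH, CONTENT_ENCODING). Password handling is left out; in Tika that would presumably go through a PasswordProvider placed in the ParseContext rather than through the metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from ManifoldCF or Tika.
final class TikaHintsSketch {

    /** Stand-in for TikaException so the sketch compiles without Tika. */
    static class ParseFailure extends Exception {
        ParseFailure(String message) { super(message); }
    }

    /** Hypothetical parser hook; a real connector would invoke Tika here. */
    interface Extractor {
        String extract(Map<String, String> hints) throws ParseFailure;
    }

    /** Metadata plus extracted body; body is null when a failure was ignored. */
    static final class Result {
        final Map<String, String> metadata;
        final String body;
        Result(Map<String, String> metadata, String body) {
            this.metadata = metadata;
            this.body = body;
        }
    }

    /** Collects the hints from the issue, skipping values that are unknown. */
    static Map<String, String> buildHints(String resourceName, String contentType,
                                          long streamSize, String charset) {
        Map<String, String> hints = new LinkedHashMap<>();
        // Key strings are assumed to mirror Tika's Metadata constants.
        if (resourceName != null) hints.put("resourceName", resourceName);
        if (contentType != null) hints.put("Content-Type", contentType);
        if (streamSize >= 0) hints.put("Content-Length", Long.toString(streamSize));
        if (charset != null) hints.put("Content-Encoding", charset);
        return hints;
    }

    /**
     * Mimics the behaviour described for Solr Cell's 'ignoreTikaException':
     * on failure, either keep the metadata only or propagate the error.
     */
    static Result process(Map<String, String> hints, Extractor extractor,
                          boolean ignoreParseFailure) throws ParseFailure {
        try {
            return new Result(hints, extractor.extract(hints));
        } catch (ParseFailure e) {
            if (ignoreParseFailure) {
                return new Result(hints, null); // index metadata only
            }
            throw e; // caller decides to reject the document
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> hints =
            buildHints("report.pdf", "application/pdf", 1024L, null);
        // Simulate a parser that always fails, with failures ignored.
        Result r = process(hints, h -> { throw new ParseFailure("corrupt PDF"); }, true);
        System.out.println("metadata=" + r.metadata + ", body=" + r.body);
    }
}
```

Whether the ignore flag should be a per-job setting or a global one is left open here; the sketch just passes it as a parameter.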



--
This message was sent by Atlassian JIRA
(v6.2#6252)