Posted to dev@manifoldcf.apache.org by "Richard Nichols (JIRA)" <ji...@apache.org> on 2013/05/29 00:26:22 UTC
[jira] [Commented] (CONNECTORS-690) ElasticSearch connector does not put in a mappings statement but perhaps should

    [ https://issues.apache.org/jira/browse/CONNECTORS-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668739#comment-13668739 ] 

Richard Nichols commented on CONNECTORS-690:
--------------------------------------------

I've been reading the connector code and have been testing scenarios against ElasticSearch 0.90, and as far as I can tell, the connector as written should not be indexing documents correctly *by itself*.  By default, ElasticSearch creates a schema based on the fields found in the JSON code sent to it as part of an index operation.  The JSON code being sent to ElasticSearch by the current connector to index a document is something like:

{
  "field1" : "value1",
  "field2" : "value2",
  "document" : "<ACL Values>",
  "share" : "<ACL Values>",
  "type" : "attachment",
  "_content_type" : "<MIME type>",
  "_name" : "<fileName>",
  "file" : "<base64 encoded data>"
}

This JSON code causes ElasticSearch to create a 'flat' schema in which the attachment (or other data) is simply stored as a base64 string, which ElasticSearch can't process.  Likewise, the "type", "_content_type", and "_name" fields are all simply stored as strings, and in no way affect the processing of the document.

If a mapping statement such as the following is sent to ElasticSearch *prior* to any indexing by MCF, document processing will *seem* to work, but there's still a problem:

{
  "attachment" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "fields" : {
          "title" : { "store" : "yes" },
          "keywords" : { "store" : "yes" },
          "author" : { "store" : "yes" },
          "content_type" : { "store" : "yes" },
          "name" : { "store" : "yes" },
          "date" : { "store" : "yes" },
          "file" : { "term_vector" : "with_positions_offsets", "store" : "yes" }
        }
      }
    }
  }
}

In this case, the document is processed, but the "type", "_name", and "_content_type" fields are still created by MCF (and stored as simple strings) and aren't really used at all by the attachments plugin.  (This means that MCF is not really specifying the content type as the connector code seems to be expecting.)
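
For anyone who wants to try this, here is a minimal Python sketch of how the mapping above could be installed before MCF indexes anything. The host, the index name "mcf_index", and the helper name build_mapping_request are assumptions made for illustration only; the PUT /{index}/{type}/_mapping form is, as far as I know, what ElasticSearch 0.90 expects. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

# Mapping for the "attachment" type; the sub-fields mirror the mapping shown above.
ATTACHMENT_MAPPING = {
    "attachment": {
        "properties": {
            "file": {
                "type": "attachment",
                "fields": {
                    "title": {"store": "yes"},
                    "keywords": {"store": "yes"},
                    "author": {"store": "yes"},
                    "content_type": {"store": "yes"},
                    "name": {"store": "yes"},
                    "date": {"store": "yes"},
                    "file": {"term_vector": "with_positions_offsets", "store": "yes"},
                },
            }
        }
    }
}


def build_mapping_request(base_url, index, doc_type="attachment"):
    """Build (but do not send) a PUT request that installs the mapping."""
    url = "{}/{}/{}/_mapping".format(base_url.rstrip("/"), index, doc_type)
    body = json.dumps(ATTACHMENT_MAPPING).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host and index name (assumptions, not taken from the connector).
request = build_mapping_request("http://localhost:9200", "mcf_index")
# urllib.request.urlopen(request) would actually send it to a running node.
```

The request has to reach ElasticSearch before the first document is indexed; otherwise the 'flat' schema will already have been created dynamically.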

To completely control the attachments plugin, the JSON code to index a document would need to look like the following:

{
  "field1" : "value1",
  "field2" : "value2",
  "document" : "<ACL Values>",
  "share" : "<ACL Values>",
  "file" : {
    "_content_type" : "<MIME type>",
    "_name" : "<fileName>",
    "content" : "<base64 encoded data>"
  }
}
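
As a rough illustration of the nested form above, the following Python sketch assembles such a body. The function name build_index_body and its parameters are hypothetical, and this is not the connector's actual (Java) code; it only shows where each value would end up.

```python
import base64
import json


def build_index_body(metadata, document_acl, share_acl, mime_type, file_name, raw_bytes):
    """Return the JSON text for one document, with the file nested under "file"."""
    body = dict(metadata)  # ordinary metadata fields stay at the top level
    body["document"] = document_acl
    body["share"] = share_acl
    body["file"] = {
        "_content_type": mime_type,  # read by the attachment type
        "_name": file_name,
        "content": base64.b64encode(raw_bytes).decode("ascii"),
    }
    return json.dumps(body)


# Example values are placeholders, not real MCF output.
example = build_index_body(
    {"field1": "value1", "field2": "value2"},
    "<ACL Values>",
    "<ACL Values>",
    "text/plain",
    "hello.txt",
    b"hello",
)
```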

One question:  Should any mapping behaviour be determined by a configuration parameter within the connector?  I'm not sure how this connector is working for others, but if it is, we probably don't want to break things.  (For example, the current connector will work if the user's ElasticSearch implementation doesn't have the Mapper Attachment plugin installed.  I'm not sure that this would be true if these changes are made.)
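
To make the question concrete, here is a small Python sketch of what such a switch might do. The flag name use_attachment_mapping and the function shape_document are hypothetical and do not exist in the connector today; dropping the top-level "type" field in the nested case is likewise an assumption based on the nested JSON shown earlier.

```python
def shape_document(flat_doc, use_attachment_mapping):
    """Return the document as sent today, or reshaped for the attachment mapping."""
    if not use_attachment_mapping:
        # Current behaviour: send the flat document unchanged.
        return dict(flat_doc)
    moved = ("type", "_content_type", "_name", "file")
    shaped = {key: value for key, value in flat_doc.items() if key not in moved}
    shaped["file"] = {
        "_content_type": flat_doc.get("_content_type"),
        "_name": flat_doc.get("_name"),
        "content": flat_doc.get("file"),  # base64 data moves to the "content" subfield
    }
    return shaped


flat = {
    "field1": "value1",
    "type": "attachment",
    "_content_type": "text/plain",
    "_name": "hello.txt",
    "file": "aGVsbG8=",
}
legacy = shape_document(flat, use_attachment_mapping=False)
nested = shape_document(flat, use_attachment_mapping=True)
```

With the flag off, existing installations without the Mapper Attachment plugin would keep receiving exactly what they receive now.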
                
> ElasticSearch connector does not put in a mappings statement but perhaps should
> -------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-690
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-690
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector
>    Affects Versions: ManifoldCF 1.2
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.3
>
>
> According to http://www.elasticsearch.org/guide/reference/mapping/attachment-type/ it seems that the connector should use a mapping command to set the ‘file’ property with a type of ‘attachment’, with “_content_type” and “_name” fields as subfields of the ‘file’ property.  Also, through testing I found that if you want the ‘date’, ‘title’, ‘author’, and ‘keywords’ fields extracted from the document and saved, they need to be listed in the mapping too.   (Unfortunately, using a mapping changes the JSON code for adding the document to the index.  Instead of sending the base64 encoded file attached to the ‘file’ field, it’s attached to the ‘contents’ subfield.)
