Posted to user@manifoldcf.apache.org by "Nichols, Richard" <Ri...@tellabs.com> on 2013/05/20 17:37:10 UTC

Attachment processing with ElasticSearch Connector to ElasticSearch 0.90

Hi,

I'm using ManifoldCF 1.2 with ElasticSearch 0.90.  I'm trying to index PDF files via the "Windows Shares" repository connector.  I have the elasticsearch-mapper-attachments plugin installed in ElasticSearch.

When I run the job on an empty index, a 'flat' schema is created:
{
  "pdf_docs_flat_schema" : {
    "pdf_docs" : {
      "properties" : {
        "_content_type" : {
          "type" : "string"
        },
        "_name" : {
          "type" : "string"
        },
        "allow_token_document" : {
          "type" : "string"
        },
        "allow_token_share" : {
          "type" : "string"
        },
        "deny_token_document" : {
          "type" : "string"
        },
        "deny_token_share" : {
          "type" : "string"
        },
        "file" : {
          "type" : "string"
        },
        "lastModified" : {
          "type" : "string"
        },
        "type" : {
          "type" : "string"
        }
      }
    }
  }
}

Notice that the _content_type, _name, file, and type fields are all properties of type "string".  As far as I can tell, the 'type' of "attachment" sent with the indexed file is just treated as a normal piece of metadata, and the 'file' field (which is sent as a base64 encoded string) is never processed as an attachment.
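
The check described above can be scripted.  What follows is only a minimal sketch: it assumes the mapping JSON has the shape shown above (trimmed here to a few fields), and the helper names field_types and attachment_fields are hypothetical, not part of ManifoldCF or ElasticSearch.

```python
import json

# Trimmed copy of the mapping shown above for the "pdf_docs" type.
MAPPING_JSON = """
{
  "pdf_docs_flat_schema": {
    "pdf_docs": {
      "properties": {
        "_content_type": {"type": "string"},
        "_name": {"type": "string"},
        "file": {"type": "string"},
        "lastModified": {"type": "string"},
        "type": {"type": "string"}
      }
    }
  }
}
"""


def field_types(mapping, index, doc_type):
    """Return {field name: mapped type} for one document type."""
    properties = mapping[index][doc_type]["properties"]
    # Fields without an explicit "type" are reported as "object".
    return {name: spec.get("type", "object") for name, spec in properties.items()}


def attachment_fields(types):
    """Return the sorted names of fields mapped with type 'attachment'."""
    return sorted(name for name, mapped in types.items() if mapped == "attachment")


types = field_types(json.loads(MAPPING_JSON), "pdf_docs_flat_schema", "pdf_docs")
print(types["file"])             # how the 'file' field ended up being mapped
print(attachment_fields(types))  # empty when nothing is mapped as an attachment
```

If the list of attachment fields comes back empty, the base64 payload has been stored as an ordinary string and was never handed to the attachment plugin.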

According to http://www.elasticsearch.org/guide/reference/mapping/attachment-type/ it seems that the connector should use a mapping command to set the 'file' property with a type of 'attachment', with "_content_type" and "_name" fields as subfields of the 'file' property.  Also, through testing I found that if you want the 'date', 'title', 'author', and 'keywords' fields extracted from the document and saved, they need to be listed in the mapping too.   (Unfortunately, using a mapping changes the JSON code for adding the document to the index.  Instead of sending the base64 encoded file attached to the 'file' field, it's attached to the 'contents' subfield.)
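
To make the two JSON shapes concrete, here is a rough sketch that only builds the request bodies and sends nothing to a server.  Several things in it are assumptions rather than facts from this thread: the function names are hypothetical, the "store": "yes" option on the extracted fields is a guess at how to have them saved, and the payload subfield name is taken from the description above, so it should be checked against the plugin documentation for the installed version.

```python
import base64
import json

# Assumption: subfield name as described above; verify against the
# mapper-attachments documentation before relying on it.
CONTENT_SUBFIELD = "contents"


def attachment_mapping(doc_type, field="file",
                       extracted=("date", "title", "author", "keywords")):
    """Build a mapping body that declares `field` as type 'attachment'."""
    # The extracted text itself is indexed under the attachment field's own name.
    fields = {field: {"type": "string"}}
    for name in extracted:
        # Assumption: "store": "yes" is what makes the extracted value retrievable.
        fields[name] = {"store": "yes"}
    return {doc_type: {"properties": {field: {"type": "attachment",
                                              "fields": fields}}}}


def attachment_document(raw_bytes, name, content_type, field="file"):
    """Build a document body with the base64 payload nested under `field`."""
    return {field: {
        "_content_type": content_type,
        "_name": name,
        CONTENT_SUBFIELD: base64.b64encode(raw_bytes).decode("ascii"),
    }}


mapping = attachment_mapping("pdf_docs")
document = attachment_document(b"%PDF-1.4 example bytes", "example.pdf",
                               "application/pdf")
print(json.dumps(mapping, indent=2))
print(json.dumps(document, indent=2))
```

The point of the sketch is the difference in nesting: with the mapping in place, "_content_type" and "_name" move under 'file' next to the encoded payload, instead of sitting at the top level of the document as they do in the flat schema.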

Am I missing something obvious here?  All I want is my documents properly indexed.
Is this something for the 'dev' mailing list instead?

Thanks,
Rick


============================================================
The information contained in this message may be privileged
and confidential and protected from disclosure. If the reader
of this message is not the intended recipient, or an employee
or agent responsible for delivering this message to the
intended recipient, you are hereby notified that any reproduction,
dissemination or distribution of this communication is strictly
prohibited. If you have received this communication in error,
please notify us immediately by replying to the message and
deleting it from your computer. Thank you. Tellabs
============================================================

RE: Attachment processing with ElasticSearch Connector to ElasticSearch 0.90

Posted by "Nichols, Richard" <Ri...@tellabs.com>.
Thanks.  I've been looking at the source and have some ideas percolating for improvements.  However, I need to do some more due diligence to make sure they don't break things (like using the JDBC repository connector).  I'm going to be out of the office until Tuesday of next week, so I don't think it makes sense for you to go out of your way to fix anything until I have time to work on it.  (I'm sure there are other issues that have more pressing needs.  :) )

Like you said, others are using this connector.  Maybe I'll just figure out I'm missing something simple.

Rick


Re: Attachment processing with ElasticSearch Connector to ElasticSearch 0.90

Posted by Karl Wright <da...@gmail.com>.
Hi Rick,

I looked over the indexing code for the ElasticSearch connector and found
no mapping statement.  The history of the ES connector is that it was
contributed to us a while back, and while I've been fixing and adding to
it, I don't have the full vision of the best way a connector should be
constructed.  I've therefore opened a ticket, CONNECTORS-690.  What I'd like
is to see your example of the ideal JSON we should be producing
for indexing.  Please comment directly on the ticket at
https://issues.apache.org/jira/browse/CONNECTORS-690 and include an example
of the way you'd like to see it; I can create a branch and we can
experiment if you like, probably starting Wednesday evening.

Bear in mind, however, that we have people successfully using this
connector, so it is quite likely that there are other ways to accomplish
the same thing, although I am not certain that folks were looking at the
same features you are.

Thanks,
Karl




