Posted to user@manifoldcf.apache.org by Tony Edgin <te...@gmail.com> on 2013/02/11 20:38:12 UTC

Determining document model passed to search engine

I'm sure this is documented somewhere, and I apologize in advance for not
being able to find it.

How do I determine the model or schema of the document passed to the search
engine by a given job?

For instance, I'm running a job that crawls a directory on my local file
system and passes it to Elastic Search.  Interrogating Elastic Search, I
can determine that the document has three fields, "file", "type" and "uri",
all strings.  How would I have known that in advance?

Thanks for any help.
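
One way to do the "interrogating" step after the fact is to fetch the
index's mapping from Elasticsearch's _mapping endpoint and list the field
names and types.  The sketch below only covers the listing part and works
on the "properties" section of such a response; the nesting around that
section differs between Elasticsearch versions, and the sample data is a
made-up stand-in that mirrors the three string fields described above.

```python
def field_types(properties):
    """Flatten an Elasticsearch-style "properties" mapping into {field: type}.

    Nested objects are reported with dotted names, e.g. "parent.child".
    """
    fields = {}
    for name, spec in properties.items():
        if "properties" in spec:
            # Object field: recurse and prefix the child names.
            for child, child_type in field_types(spec["properties"]).items():
                fields[f"{name}.{child}"] = child_type
        else:
            fields[name] = spec.get("type", "object")
    return fields


# Hypothetical "properties" section, shaped like the fields reported above.
sample_properties = {
    "file": {"type": "string"},
    "type": {"type": "string"},
    "uri": {"type": "string"},
}

for field, field_type in sorted(field_types(sample_properties).items()):
    print(field, field_type)
```

This only shows what has already been indexed; it does not predict what a
connector will send before the job runs.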

Re: Determining document model passed to search engine

Posted by Karl Wright <da...@gmail.com>.
The Elasticsearch output connector always base64-encodes the content,
whichever repository connector the document came from.  I gather that
is standard for ElasticSearch.

Karl
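
To check what actually ended up in the "file" field, the base64 value can
be decoded again on the client side.  A minimal sketch, assuming the
original content was text in a known charset (UTF-8 here); the helper name
and the sample value are made up for illustration.

```python
import base64


def decode_file_field(encoded, charset="utf-8"):
    """Decode a base64 "file" value back into text.

    Bytes that are not valid in the given charset are replaced rather than
    raising, since the real charset of a crawled page may differ.
    """
    raw = base64.b64decode(encoded)
    return raw.decode(charset, errors="replace")


# Hypothetical stored value: a tiny HTML page, base64-encoded.
sample = base64.b64encode(b"<html><body>Hello</body></html>").decode("ascii")

print(decode_file_field(sample))
```

For binary documents (PDF, images, and so on) the decoded bytes should be
written out as-is instead of being decoded to text.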

On Mon, Feb 11, 2013 at 4:00 PM, Tony Edgin <te...@gmail.com> wrote:
> Thanks again.
>
> I just ran an example setup to better understand what you said.
>
> As you said, the web page URL gets set to the _id field.
> The metadata that is sent to Elastic Search is as follows:
>
>       header-Content-Type: "text/html; charset=UTF-8"
>       header-Content-Length: "3278"
>       header-Keep-Alive: "timeout=5, max=100"
>       header-Server: "Apache/2.2"
>       header-Connection: "Keep-Alive"
>       type: "attachment"
>       file: ...
>
> The file field looks to be base64 encoded.  Is this always the case, or is
> this unique to web repo + elastic search?
>
> This must be the web page. I'm guessing the header-Content-Type field holds
> the document type, and not the type field.

Re: Determining document model passed to search engine

Posted by Tony Edgin <te...@gmail.com>.
Thanks again.

I just ran an example setup to better understand what you said.

As you said, the web page URL gets set to the _id field.
The metadata that is sent to Elastic Search is as follows:

      header-Content-Type: "text/html; charset=UTF-8"
      header-Content-Length: "3278"
      header-Keep-Alive: "timeout=5, max=100"
      header-Server: "Apache/2.2"
      header-Connection: "Keep-Alive"
      type: "attachment"
      file: ...

The file field looks to be base64 encoded.  Is this always the case, or is
this unique to web repo + elastic search?

This must be the web page. I'm guessing the header-Content-Type field holds
the document type, and not the type field.
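
If the header-Content-Type field is indeed where the document type ends
up, the MIME type and charset can be pulled out of it with the standard
library.  A small sketch; the dictionary below just copies two of the
fields listed above, and the helper name is made up.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header value into (MIME type, charset).

    The charset is lower-cased, or None if the header carries no charset.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


# Two of the fields from the indexed document shown above.
doc = {
    "header-Content-Type": "text/html; charset=UTF-8",
    "type": "attachment",
}

mime_type, charset = parse_content_type(doc["header-Content-Type"])
print(mime_type, charset)
```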





On Mon, Feb 11, 2013 at 1:17 PM, Karl Wright <da...@gmail.com> wrote:

> What emerges from the web connector is the following:
>
> -       metadata, which you define on the web connector’s “Metadata” tab,
> that are named however you want;
> -       forced acls, which get added to the document based on what you
> select on the “Security” tab;
> -       the document’s content type;
> -       the document’s url;
> -       the document itself.
>
> What the elastic search connector does is:
> -       Map the document’s url to ElasticSearch’s document id field (which I
> guess shows up in Elastic Search as the ‘uri’ field)
> -       Output all the metadata directly to ElasticSearch using the name
> provided by the repository connector
> -       Set the file value to “” (which seems wrong, since that could be
> helpful if available - let me know if you think a fix for this would
> be useful)
> -       NONE of the rest of the document fields (content type, acls, etc)
> are communicated to Elastic Search at all right now, except for the
> document itself.
>
> Karl

Re: Determining document model passed to search engine

Posted by Karl Wright <da...@gmail.com>.
What emerges from the web connector is the following:

-	metadata, which you define on the web connector’s “Metadata” tab,
that are named however you want;
-	forced acls, which get added to the document based on what you
select on the “Security” tab;
-	the document’s content type;
-	the document’s url;
-	the document itself.

What the elastic search connector does is:
-	Map the document’s url to ElasticSearch’s document id field (which I
guess shows up in Elastic Search as the ‘uri’ field)
-	Output all the metadata directly to ElasticSearch using the name
provided by the repository connector
-	Set the file value to “” (which seems wrong, since that could be
helpful if available - let me know if you think a fix for this would
be useful)
-	NONE of the rest of the document fields (content type, acls, etc)
are communicated to Elastic Search at all right now, except for the
document itself.

Karl
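
The mapping described above can be pictured roughly as follows.  This is
an illustration only, not the connector's actual code: the function name
is made up, and the "type": "attachment" and base64 "file" values follow
what was later observed in the index rather than anything stated in the
list above.

```python
import base64


def build_indexed_document(url, metadata, content):
    """Return (document id, fields) roughly as described in this thread.

    - the URL becomes the document id;
    - metadata is passed through under the names the repository
      connector supplied;
    - the document itself is carried base64-encoded in "file".
    Content type and ACLs are deliberately left out, as described above.
    """
    fields = dict(metadata)
    # Note: metadata named "type" or "file" would be overwritten here.
    fields["type"] = "attachment"
    fields["file"] = base64.b64encode(content).decode("ascii")
    return url, fields


# Hypothetical input values for illustration.
doc_id, fields = build_indexed_document(
    "http://example.com/index.html",
    {"header-Server": "Apache/2.2"},
    b"<html></html>",
)
print(doc_id)
print(sorted(fields))
```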


On Mon, Feb 11, 2013 at 2:55 PM, Tony Edgin <te...@gmail.com> wrote:
> Thanks for the speedy response!
>
> I eventually want to index the contents of our local website with Elastic
> Search.
>
> I would use the Web repository connector with the no authority connector and
> the Elasticsearch output connector.  Would you mind letting me know the
> names and meanings of the metadata that gets passed to Elastic Search?
>
> Thanks again.

Re: Determining document model passed to search engine

Posted by Tony Edgin <te...@gmail.com>.
Thanks for the speedy response!

I eventually want to index the contents of our local website with Elastic
Search.

I would use the Web repository connector with the no authority connector
and the Elasticsearch output connector.  Would you mind letting me know the
names and meanings of the metadata that gets passed to Elastic Search?

Thanks again.

On Mon, Feb 11, 2013 at 12:45 PM, Karl Wright <da...@gmail.com> wrote:

> So let me get this clear - you are looking to find out what the
> names/meanings are of the metadata that gets passed to the output
> connector, for a given repository connection?
>
> If this is what you are looking for, I'm afraid that while at one
> point the end-user documentation described this pretty accurately, it
> is now significantly out of date.  While it's not terribly hard to
> compile this information from source code etc., the work definitely
> needs to be repeated by somebody.
>
> If you want to ask this question about a specific connector, I can
> certainly try to answer it, though.  If you want to contribute either
> the information or a documentation patch, this would be great too.
>
> Karl

Re: Determining document model passed to search engine

Posted by Karl Wright <da...@gmail.com>.
So let me get this clear - you are looking to find out what the
names/meanings are of the metadata that gets passed to the output
connector, for a given repository connection?

If this is what you are looking for, I'm afraid that while at one
point the end-user documentation described this pretty accurately, it
is now significantly out of date.  While it's not terribly hard to
compile this information from source code etc., the work definitely
needs to be repeated by somebody.

If you want to ask this question about a specific connector, I can
certainly try to answer it, though.  If you want to contribute either
the information or a documentation patch, this would be great too.

Karl

On Mon, Feb 11, 2013 at 2:38 PM, Tony Edgin <te...@gmail.com> wrote:
> I'm sure this is documented somewhere, and I apologize in advance for not
> being able to find it.
>
> How do I determine the model or schema of the document passed to the search
> engine by a given job?
>
> For instance, I'm running a job that crawls a directory on my local file
> system and passes it to Elastic Search.  Interrogating Elastic Search, I can
> determine that the document has three fields, "file", "type" and "uri", all
> strings.  How would I have known that in advance?
>
> Thanks for any help.