Posted to dev@ponymail.apache.org by sebb <se...@gmail.com> on 2016/11/06 14:18:45 UTC

Index not_analysed for fields used as ids?

Fields such as message-id are stored as text strings, but they are
only really intended to be used as ids. They don't contain independent
text parts.

From what I have understood so far from reading the ES docs, such
fields should be tagged as

"index": "not_analyzed"

AIUI this reduces the analysis overhead and storage requirements, and
also makes it harder to find fields with
This probably applies to other fields in "mbox":

mid
possibly in-reply-to
also references

And of course the auto-created fields such as attachments

Likewise the doc types currently missing from setup.py:

notifications
account
mailinglists

These are internal use only so are not intended for searching.

Or have I got this completely wrong?
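
As a rough illustration of the tagging described above (a sketch only: the
helper name and the overall layout are assumptions, not the contents of
setup.py, and the syntax is the pre-5.0 "string" form):

```python
import json

# Hypothetical helper; the name is illustrative and not taken from setup.py.
def id_field():
    """Mapping for a string field that is indexed whole, without analysis."""
    return {"type": "string", "index": "not_analyzed"}

# Field names are the ones listed above; the surrounding layout is assumed.
mbox_mapping = {
    "mbox": {
        "properties": {
            "message-id": id_field(),
            "mid": id_field(),
            "in-reply-to": id_field(),
            "references": id_field(),
        }
    }
}

print(json.dumps(mbox_mapping, indent=2, sort_keys=True))
```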

Re: Index not_analysed for fields used as ids?

Posted by sebb <se...@gmail.com>.
On 7 November 2016 at 12:18, Daniel Gruno <hu...@apache.org> wrote:
> On 11/07/2016 01:02 PM, Shane Curcuru wrote:
>> sebb wrote on 11/6/16 10:10 PM:
>>> On 7 November 2016 at 02:21, John D. Ament <jo...@apache.org> wrote:
>> ...snip...
>>>> Although the interesting thing, I just tried searching by message ID, and
>>>> that doesn't seem to work on the ASF version out there -
>>>> https://lists.apache.org/list.html?dev@joshua.apache.org:lte=1M:%3C20161101000041.15874.17104@johns-mbp-2.home%3E
>>>
>>> message-id is flagged as not_analysed: maybe that excludes it from _all
>> ...snip...
>>
>> Wait, Message-ID isn't searchable?  That actually seems to be a common
>> thing to search for in some cases.
>>
>> Is there a specific reason we don't allow searching by that?

It is indexed, but not analyzed.

This means one cannot readily search by its component parts, because
the id is treated as a whole.

I think this field type is called a keyword in ES 5.0.

>> If we're not indexing it, we should clearly document how users can
>> construct the URL to find their desired message id directly - someplace
>> findable in the UI.
>>
>> - Shane
>>
> you can't search in the search box for message ID IIRC, but you can find
> it easily by just going to thread.html/<insert-id-here>

That works for John's example above:

https://lists.apache.org/thread.html/<20...@johns-mbp-2.home>

However it might be useful to provide a search form which allowed
direct input of msg id and other such fields.

At present search only works from one of the mailing lists.
There is a standalone page (search.html) but it's not linked from the
main navigation (possibly because it is incomplete).
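
Until such a form exists, building the direct link by hand could look
something like this (a sketch; the function name is made up, and I am
assuming the angle brackets should be percent-encoded while '@' is left
alone, as in John's list.html URL):

```python
from urllib.parse import quote

# Base URL as used in this thread; other installations will differ.
THREAD_BASE = "https://lists.apache.org/thread.html/"

def thread_url(message_id, base=THREAD_BASE):
    """Return a direct thread.html link for a Message-ID.

    Assumption: '<' and '>' are percent-encoded and '@' is kept as-is,
    matching the list.html URL quoted earlier in the thread.
    """
    return base + quote(message_id, safe="@")

print(thread_url("<20161101000041.15874.17104@johns-mbp-2.home>"))
```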

Re: Index not_analysed for fields used as ids?

Posted by Daniel Gruno <hu...@apache.org>.
On 11/07/2016 01:02 PM, Shane Curcuru wrote:
> sebb wrote on 11/6/16 10:10 PM:
>> On 7 November 2016 at 02:21, John D. Ament <jo...@apache.org> wrote:
> ...snip...
>>> Although the interesting thing, I just tried searching by message ID, and
>>> that doesn't seem to work on the ASF version out there -
>>> https://lists.apache.org/list.html?dev@joshua.apache.org:lte=1M:%3C20161101000041.15874.17104@johns-mbp-2.home%3E
>>
>> message-id is flagged as not_analysed: maybe that excludes it from _all
> ...snip...
> 
> Wait, Message-ID isn't searchable?  That actually seems to be a common
> thing to search for in some cases.
> 
> Is there a specific reason we don't allow searching by that?
> 
> If we're not indexing it, we should clearly document how users can
> construct the URL to find their desired message id directly - someplace
> findable in the UI.
> 
> - Shane
> 
You can't search for a message ID in the search box, IIRC, but you can
find it easily by going to thread.html/<insert-id-here>



Re: Index not_analysed for fields used as ids?

Posted by Shane Curcuru <as...@shanecurcuru.org>.
sebb wrote on 11/6/16 10:10 PM:
> On 7 November 2016 at 02:21, John D. Ament <jo...@apache.org> wrote:
...snip...
>> Although the interesting thing, I just tried searching by message ID, and
>> that doesn't seem to work on the ASF version out there -
>> https://lists.apache.org/list.html?dev@joshua.apache.org:lte=1M:%3C20161101000041.15874.17104@johns-mbp-2.home%3E
> 
> message-id is flagged as not_analysed: maybe that excludes it from _all
...snip...

Wait, Message-ID isn't searchable?  That actually seems to be a common
thing to search for in some cases.

Is there a specific reason we don't allow searching by that?

If we're not indexing it, we should clearly document how users can
construct the URL to find their desired message id directly - someplace
findable in the UI.

- Shane


Re: Index not_analysed for fields used as ids?

Posted by sebb <se...@gmail.com>.
On 7 November 2016 at 02:21, John D. Ament <jo...@apache.org> wrote:
> On Sun, Nov 6, 2016 at 9:03 PM sebb <se...@gmail.com> wrote:
>
>> On 7 November 2016 at 01:36, John D. Ament <jo...@apache.org> wrote:
>> > On Sun, Nov 6, 2016 at 8:22 PM sebb <se...@gmail.com> wrote:
>> >
>> >> On 6 November 2016 at 14:37, John D. Ament <jo...@gmail.com>
>> wrote:
>> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <hu...@apache.org>
>> >> wrote:
>> >> >
>> >> >> On 11/06/2016 03:18 PM, sebb wrote:
>> >> >> > Fields such as message-id are stored as text strings, but they are
>> >> >> > only really intended to be used as ids. They don't contain
>> independent
>> >> >> > text parts.
>> >> >> >
>> >> >> > From what I have understood so far from reading the ES docs, such
>> >> >> > fields should be tagged as
>> >> >> >
>> >> >> > "index": "not_analyzed"
>> >> >> >
>> >> >> > AIUI this reduces the analysis overhead and storage requirements,
>> and
>> >> >> > also makes it harder to find fields with
>> >> >> > This probably applies to other fields in "mbox":
>> >> >> >
>> >> >> > mid
>> >> >> > possibly in-reply-to
>> >> >> > also references
>> >> >> >
>> >> >> > And of course the auto-created fields such as attachments
>> >> >> >
>> >> >> > Likewise the doc types currently missing from setup.py:
>> >> >> >
>> >> >> > notifications
>> >> >> > account
>> >> >> > mailinglists
>> >> >> >
>> >> >> > These are internal use only so are not intended for searching.
>> >> >> >
>> >> >> > Or have I got this completely wrong?
>> >> >> >
>> >> >>
>> >> >> message-id is set to not be analyzed, by the setup script (it's in
>> the
>> >> >> mappings it sends to ES when creating the index). mid and in-reply-to
>> >> >> should probably also be not analyzed, although mid is really a copy
>> of
>> >> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
>> >> >> list_raw), neither is the raw from address
>> >> >>
>> >> >
>> >> > So I notice the query process is an arbitrary full text query, which
>> runs
>> >> > against _all.
>> >> >
>> >>
>> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
>> >>
>> >> Huh?
>> >>
>> >> The query starts:
>> >>
>> >> local url = config.es_url .. doc .. "/_search?q="..query
>> >>
>> >> where
>> >>
>> >> es_url = "http://localhost:9200/ponymail/"
>> >>
>> >> and
>> >>
>> >> doc = "mbox" by default.
>> >>
>> >> Where does the _all come in?
>> >>
>> >
>> > When you do a query string query in elastic search (reference:
>> >
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
>> )
>> > the default field unless specified is "_all".  I can't find anything in
>> the
>> > pony code that changes this field.  As a result, its going to search _all
>> > by default.
>> >
>>
>> Sorry, I thought you were referring to the _all doc type.
>>
>> But I'm not sure what this has to do with my original e-mail about
>> which fields should be indexed, and which should not.
>>
>
> Everything actually.

I assume you mean everything should *not* be indexed?
That will surely depend on whether there are any specific field searches,
e.g. Subject and From are shown as separate fields in the Advanced search.

> https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html

In which case we should disable the _all field for all but the mbox mapping.

Most of those will not have many documents, apart from mbox_source,
and that does not have many text fields.
So maybe it won't make much difference.
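
For what it's worth, switching _all off per doc type might look roughly
like this (a sketch: the helper is hypothetical, the type list is just the
names mentioned in this thread, and it only applies to ES releases that
still have an _all field):

```python
import json

# Doc types named in this thread other than mbox; the list is illustrative.
INTERNAL_TYPES = ["mbox_source", "notifications", "account", "mailinglists"]

def disable_all(doc_types):
    """Return per-type mapping fragments with the _all field switched off."""
    return {name: {"_all": {"enabled": False}} for name in doc_types}

mappings = disable_all(INTERNAL_TYPES)
print(json.dumps(mappings, indent=2))
```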

> Basically, the mappings we use are moot on the individual fields (except
> for the epoch field) since all searches are performed against the _all
> field's value, which is just a big lob of everything smushed together.

Since epoch is double (why is it not long?), not a string, it's not
analysed anyway.

> Although the interesting thing, I just tried searching by message ID, and
> that doesn't seem to work on the ASF version out there -
> https://lists.apache.org/list.html?dev@joshua.apache.org:lte=1M:%3C20161101000041.15874.17104@johns-mbp-2.home%3E

message-id is flagged as not_analysed: maybe that excludes it from _all

> John
>
>
>>
>> >>
>> >> > unless
>> >> > I need to dig into it a bit further to see if there's something
>> building
>> >> up
>> >> > query a bit different.
>> >> >
>> >> > So... that means most of these mappings are moot.
>> >>
>>

Re: Index not_analysed for fields used as ids?

Posted by sebb <se...@gmail.com>.
On 7 November 2016 at 02:21, John D. Ament <jo...@apache.org> wrote:
> On Sun, Nov 6, 2016 at 9:03 PM sebb <se...@gmail.com> wrote:
>
>> On 7 November 2016 at 01:36, John D. Ament <jo...@apache.org> wrote:
>> > On Sun, Nov 6, 2016 at 8:22 PM sebb <se...@gmail.com> wrote:
>> >
>> >> On 6 November 2016 at 14:37, John D. Ament <jo...@gmail.com>
>> wrote:
>> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <hu...@apache.org>
>> >> wrote:
>> >> >
>> >> >> On 11/06/2016 03:18 PM, sebb wrote:
>> >> >> > Fields such as message-id are stored as text strings, but they are
>> >> >> > only really intended to be used as ids. They don't contain
>> independent
>> >> >> > text parts.
>> >> >> >
>> >> >> > From what I have understood so far from reading the ES docs, such
>> >> >> > fields should be tagged as
>> >> >> >
>> >> >> > "index": "not_analyzed"
>> >> >> >
>> >> >> > AIUI this reduces the analysis overhead and storage requirements,
>> and
>> >> >> > also makes it harder to find fields with
>> >> >> > This probably applies to other fields in "mbox":
>> >> >> >
>> >> >> > mid
>> >> >> > possibly in-reply-to
>> >> >> > also references
>> >> >> >
>> >> >> > And of course the auto-created fields such as attachments
>> >> >> >
>> >> >> > Likewise the doc types currently missing from setup.py:
>> >> >> >
>> >> >> > notifications
>> >> >> > account
>> >> >> > mailinglists
>> >> >> >
>> >> >> > These are internal use only so are not intended for searching.
>> >> >> >
>> >> >> > Or have I got this completely wrong?
>> >> >> >
>> >> >>
>> >> >> message-id is set to not be analyzed, by the setup script (it's in
>> the
>> >> >> mappings it sends to ES when creating the index). mid and in-reply-to
>> >> >> should probably also be not analyzed, although mid is really a copy
>> of
>> >> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
>> >> >> list_raw), neither is the raw from address
>> >> >>
>> >> >
>> >> > So I notice the query process is an arbitrary full text query, which
>> runs
>> >> > against _all.
>> >> >
>> >>
>> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
>> >>
>> >> Huh?
>> >>
>> >> The query starts:
>> >>
>> >> local url = config.es_url .. doc .. "/_search?q="..query
>> >>
>> >> where
>> >>
>> >> es_url = "http://localhost:9200/ponymail/"
>> >>
>> >> and
>> >>
>> >> doc = "mbox" by default.
>> >>
>> >> Where does the _all come in?
>> >>
>> >
>> > When you do a query string query in elastic search (reference:
>> >
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
>> )
>> > the default field unless specified is "_all".  I can't find anything in
>> the
>> > pony code that changes this field.  As a result, its going to search _all
>> > by default.
>> >
>>
>> Sorry, I thought you were referring to the _all doc type.
>>
>> But I'm not sure what this has to do with my original e-mail about
>> which fields should be indexed, and which should not.
>>
>
> Everything actually.
> https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html
>
> Basically, the mappings we use are moot on the individual fields (except
> for the epoch field) since all searches are performed against the _all
> field's value, which is just a big lob of everything smushed together.
>
> Although the interesting thing, I just tried searching by message ID, and
> that doesn't seem to work on the ASF version out there -
> https://lists.apache.org/list.html?dev@joshua.apache.org:lte=1M:%3C20161101000041.15874.17104@johns-mbp-2.home%3E

That's because the search term is converted into a generic query, i.e.
it only searches from, subject and body (see my reply about stats.lua).

> John
>
>
>>
>> >>
>> >> > unless
>> >> > I need to dig into it a bit further to see if there's something
>> building
>> >> up
>> >> > query a bit different.
>> >> >
>> >> > So... that means most of these mappings are moot.
>> >>
>>

Re: Index not_analysed for fields used as ids?

Posted by "John D. Ament" <jo...@apache.org>.
On Sun, Nov 6, 2016 at 9:03 PM sebb <se...@gmail.com> wrote:

> On 7 November 2016 at 01:36, John D. Ament <jo...@apache.org> wrote:
> > On Sun, Nov 6, 2016 at 8:22 PM sebb <se...@gmail.com> wrote:
> >
> >> On 6 November 2016 at 14:37, John D. Ament <jo...@gmail.com>
> wrote:
> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <hu...@apache.org>
> >> wrote:
> >> >
> >> >> On 11/06/2016 03:18 PM, sebb wrote:
> >> >> > Fields such as message-id are stored as text strings, but they are
> >> >> > only really intended to be used as ids. They don't contain
> independent
> >> >> > text parts.
> >> >> >
> >> >> > From what I have understood so far from reading the ES docs, such
> >> >> > fields should be tagged as
> >> >> >
> >> >> > "index": "not_analyzed"
> >> >> >
> >> >> > AIUI this reduces the analysis overhead and storage requirements,
> and
> >> >> > also makes it harder to find fields with
> >> >> > This probably applies to other fields in "mbox":
> >> >> >
> >> >> > mid
> >> >> > possibly in-reply-to
> >> >> > also references
> >> >> >
> >> >> > And of course the auto-created fields such as attachments
> >> >> >
> >> >> > Likewise the doc types currently missing from setup.py:
> >> >> >
> >> >> > notifications
> >> >> > account
> >> >> > mailinglists
> >> >> >
> >> >> > These are internal use only so are not intended for searching.
> >> >> >
> >> >> > Or have I got this completely wrong?
> >> >> >
> >> >>
> >> >> message-id is set to not be analyzed, by the setup script (it's in
> the
> >> >> mappings it sends to ES when creating the index). mid and in-reply-to
> >> >> should probably also be not analyzed, although mid is really a copy
> of
> >> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
> >> >> list_raw), neither is the raw from address
> >> >>
> >> >
> >> > So I notice the query process is an arbitrary full text query, which
> runs
> >> > against _all.
> >> >
> >>
> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
> >>
> >> Huh?
> >>
> >> The query starts:
> >>
> >> local url = config.es_url .. doc .. "/_search?q="..query
> >>
> >> where
> >>
> >> es_url = "http://localhost:9200/ponymail/"
> >>
> >> and
> >>
> >> doc = "mbox" by default.
> >>
> >> Where does the _all come in?
> >>
> >
> > When you do a query string query in elastic search (reference:
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
> )
> > the default field unless specified is "_all".  I can't find anything in
> the
> > pony code that changes this field.  As a result, its going to search _all
> > by default.
> >
>
> Sorry, I thought you were referring to the _all doc type.
>
> But I'm not sure what this has to do with my original e-mail about
> which fields should be indexed, and which should not.
>

Everything actually.
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html

Basically, the mappings we use are moot on the individual fields (except
for the epoch field) since all searches are performed against the _all
field's value, which is just a big lob of everything smushed together.
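
To illustrate the default-field point (a sketch mirroring the URL built in
elastic.lua; the function name is hypothetical, and "df" is, as far as I
can tell from the ES docs, the URI search parameter that overrides the
default field):

```python
from urllib.parse import urlencode

def search_url(es_url, doc, query, default_field=None):
    """Build a URI search URL; without df, ES falls back to its default field."""
    params = {"q": query}
    if default_field is not None:
        params["df"] = default_field  # explicit default field for the query
    return es_url + doc + "/_search?" + urlencode(params)

# No df: the query string is matched against the default field.
print(search_url("http://localhost:9200/ponymail/", "mbox", "hello"))
# With df: the same query is matched against "subject" instead.
print(search_url("http://localhost:9200/ponymail/", "mbox", "hello", "subject"))
```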

Although the interesting thing, I just tried searching by message ID, and
that doesn't seem to work on the ASF version out there -
https://lists.apache.org/list.html?dev@joshua.apache.org:lte=1M:%3C20161101000041.15874.17104@johns-mbp-2.home%3E

John


>
> >>
> >> > unless
> >> > I need to dig into it a bit further to see if there's something
> building
> >> up
> >> > query a bit different.
> >> >
> >> > So... that means most of these mappings are moot.
> >>
>

Re: Index not_analysed for fields used as ids?

Posted by sebb <se...@gmail.com>.
On 7 November 2016 at 01:36, John D. Ament <jo...@apache.org> wrote:
> On Sun, Nov 6, 2016 at 8:22 PM sebb <se...@gmail.com> wrote:
>
>> On 6 November 2016 at 14:37, John D. Ament <jo...@gmail.com> wrote:
>> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <hu...@apache.org>
>> wrote:
>> >
>> >> On 11/06/2016 03:18 PM, sebb wrote:
>> >> > Fields such as message-id are stored as text strings, but they are
>> >> > only really intended to be used as ids. They don't contain independent
>> >> > text parts.
>> >> >
>> >> > From what I have understood so far from reading the ES docs, such
>> >> > fields should be tagged as
>> >> >
>> >> > "index": "not_analyzed"
>> >> >
>> >> > AIUI this reduces the analysis overhead and storage requirements, and
>> >> > also makes it harder to find fields with
>> >> > This probably applies to other fields in "mbox":
>> >> >
>> >> > mid
>> >> > possibly in-reply-to
>> >> > also references
>> >> >
>> >> > And of course the auto-created fields such as attachments
>> >> >
>> >> > Likewise the doc types currently missing from setup.py:
>> >> >
>> >> > notifications
>> >> > account
>> >> > mailinglists
>> >> >
>> >> > These are internal use only so are not intended for searching.
>> >> >
>> >> > Or have I got this completely wrong?
>> >> >
>> >>
>> >> message-id is set to not be analyzed, by the setup script (it's in the
>> >> mappings it sends to ES when creating the index). mid and in-reply-to
>> >> should probably also be not analyzed, although mid is really a copy of
>> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
>> >> list_raw), neither is the raw from address
>> >>
>> >
>> > So I notice the query process is an arbitrary full text query, which runs
>> > against _all.
>> >
>> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
>>
>> Huh?
>>
>> The query starts:
>>
>> local url = config.es_url .. doc .. "/_search?q="..query
>>
>> where
>>
>> es_url = "http://localhost:9200/ponymail/"
>>
>> and
>>
>> doc = "mbox" by default.
>>
>> Where does the _all come in?
>>
>
> When you do a query string query in elastic search (reference:
> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html)
> the default field unless specified is "_all".  I can't find anything in the
> pony code that changes this field.  As a result, its going to search _all
> by default.
>

Sorry, I thought you were referring to the _all doc type.

But I'm not sure what this has to do with my original e-mail about
which fields should be indexed, and which should not.

>>
>> > unless
>> > I need to dig into it a bit further to see if there's something building
>> up
>> > query a bit different.
>> >
>> > So... that means most of these mappings are moot.
>>

Re: Index not_analysed for fields used as ids?

Posted by sebb <se...@gmail.com>.
On 7 November 2016 at 15:07, John D. Ament <jo...@apache.org> wrote:
> On Mon, Nov 7, 2016 at 9:54 AM sebb <se...@gmail.com> wrote:
>
>> On 7 November 2016 at 14:36, John D. Ament <jo...@gmail.com> wrote:
>> > On Mon, Nov 7, 2016 at 9:23 AM sebb <se...@gmail.com> wrote:
>> >
>> >> On 7 November 2016 at 01:36, John D. Ament <jo...@apache.org>
>> wrote:
>> >> > On Sun, Nov 6, 2016 at 8:22 PM sebb <se...@gmail.com> wrote:
>> >> >
>> >> >> On 6 November 2016 at 14:37, John D. Ament <jo...@gmail.com>
>> >> wrote:
>> >> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <hu...@apache.org>
>> >> >> wrote:
>> >> >> >
>> >> >> >> On 11/06/2016 03:18 PM, sebb wrote:
>> >> >> >> > Fields such as message-id are stored as text strings, but they
>> are
>> >> >> >> > only really intended to be used as ids. They don't contain
>> >> independent
>> >> >> >> > text parts.
>> >> >> >> >
>> >> >> >> > From what I have understood so far from reading the ES docs,
>> such
>> >> >> >> > fields should be tagged as
>> >> >> >> >
>> >> >> >> > "index": "not_analyzed"
>> >> >> >> >
>> >> >> >> > AIUI this reduces the analysis overhead and storage
>> requirements,
>> >> and
>> >> >> >> > also makes it harder to find fields with
>> >> >> >> > This probably applies to other fields in "mbox":
>> >> >> >> >
>> >> >> >> > mid
>> >> >> >> > possibly in-reply-to
>> >> >> >> > also references
>> >> >> >> >
>> >> >> >> > And of course the auto-created fields such as attachments
>> >> >> >> >
>> >> >> >> > Likewise the doc types currently missing from setup.py:
>> >> >> >> >
>> >> >> >> > notifications
>> >> >> >> > account
>> >> >> >> > mailinglists
>> >> >> >> >
>> >> >> >> > These are internal use only so are not intended for searching.
>> >> >> >> >
>> >> >> >> > Or have I got this completely wrong?
>> >> >> >> >
>> >> >> >>
>> >> >> >> message-id is set to not be analyzed, by the setup script (it's in
>> >> the
>> >> >> >> mappings it sends to ES when creating the index). mid and
>> in-reply-to
>> >> >> >> should probably also be not analyzed, although mid is really a
>> copy
>> >> of
>> >> >> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
>> >> >> >> list_raw), neither is the raw from address
>> >> >> >>
>> >> >> >
>> >> >> > So I notice the query process is an arbitrary full text query,
>> which
>> >> runs
>> >> >> > against _all.
>> >> >> >
>> >> >>
>> >>
>> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
>> >> >>
>> >> >> Huh?
>> >> >>
>> >> >> The query starts:
>> >> >>
>> >> >> local url = config.es_url .. doc .. "/_search?q="..query
>> >> >>
>> >> >> where
>> >> >>
>> >> >> es_url = "http://localhost:9200/ponymail/"
>> >> >>
>> >> >> and
>> >> >>
>> >> >> doc = "mbox" by default.
>> >> >>
>> >> >> Where does the _all come in?
>> >> >>
>> >> >
>> >> > When you do a query string query in elastic search (reference:
>> >> >
>> >>
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
>> >> )
>> >> > the default field unless specified is "_all".  I can't find anything
>> in
>> >> the
>> >> > pony code that changes this field.  As a result, its going to search
>> _all
>> >> > by default.
>> >>
>> >> stats.lua changes the generic query into:
>> >>
>> >> "query_string": {
>> >>   "default_field": "subject",
>> >>   "query": "(from:\"QUERY\") OR (subject:\"QUERY\") OR (body:\"QUERY\")"
>> >> }
>> >>
>> >> Which does not use the _all field AFAICT
>> >>
>> >
>> > Ok, this is what I was looking for ( but couldn't find ).  But to
>> reiterate
>> > my notes from above - this means that the only mappings that matter are
>> > these fields.  Other field mappings don't matter.
>> >
>>
>> Surely all the text fields 'matter' - i.e. need to have a mapping?
>> Otherwise the default is to analyse them.
>>
>>
> Not based on the query in use.  The only three fields being searched are
> "from", "subject" and "body" - so only their mappings matter when doing
> search.

AFAICT they aren't the only fields that are searched by the code.

They are the only ones used by the search function, but internally
the code also searches the mbox type on at least the following fields:

message-id
mid
in-reply-to

It also searches notifications on:
recipient, seen
and mailinglists on:
name

(search *.lua for 'elastic.find')
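
An exact-value lookup on one of those fields could be expressed as a term
query, for example (a sketch only; this is not the code in elastic.lua,
and the function name is made up):

```python
import json

def exact_match_query(field, value):
    """Build a term query, which compares the value to the indexed term as-is."""
    return {"query": {"term": {field: value}}}

body = exact_match_query(
    "message-id", "<20161101000041.15874.17104@johns-mbp-2.home>"
)
print(json.dumps(body))
```

This only behaves as an exact match when the field is not analysed;
against an analysed field the whole id would be compared with individual
tokens and would normally find nothing.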

> One of the concepts behind ES is that your model your index based on the
> queries you want to execute.  There's two points of view on that, only
> store the things that are relevant, or make everything relevant.

In this case, ES is also being used as a general-purpose database
(account, notifications, mailinglists).
These are in the same index, so there is no one set of queries that
applies to all doc types.

So I suspect neither point of view is completely appropriate here.

>
>> It's just a question of whether a field is used for searching, and if
>> so, what type(s) of searches are done.
>>
>> It looks like from/subject/body need to support word matching, so need
>> to be analysed.
>>
>
> We may want to consider things like partial match as well - fuzziness
> ranking, ngrams, etc.

Well yes, but does that affect the mapping choice?

>
>>
>> However message id and many other fields need only support keyword
>> matching.
>> So these only need to be indexed.
>>
>
> Yes and no.  ES 5 introduced the concept of an enum type which may be what
> message-id should be pointing to.

I cannot find a reference to an 'enum' type.
Do you mean 'keyword'? [1]

> Email message IDs include some of the
> stop characters in there "-" which need to be treated specially in queries.

Surely stop characters only apply to fields that are analyzed?
That is why such fields need to be set up as not_analyzed (or keyword in 5.0).
This allows searching by exact value; no need to use a special query.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/5.0/keyword.html
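
As a sketch of how the old string mappings would translate to 5.0 (the
function name is hypothetical; only the analyzed and not_analyzed cases
are handled, anything else is rejected rather than guessed at):

```python
def to_es5_field(legacy):
    """Translate a pre-5.0 field mapping to its 5.x counterpart."""
    if legacy.get("type") != "string":
        return dict(legacy)  # non-string fields (e.g. double) are unchanged
    index = legacy.get("index", "analyzed")
    if index == "not_analyzed":
        return {"type": "keyword"}  # exact-value field
    if index == "analyzed":
        return {"type": "text"}  # full-text field
    raise ValueError("unsupported index setting: %r" % index)

print(to_es5_field({"type": "string", "index": "not_analyzed"}))
print(to_es5_field({"type": "string"}))
```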

>
>>
>> >>
>> >> >
>> >> >>
>> >> >> > unless
>> >> >> > I need to dig into it a bit further to see if there's something
>> >> building
>> >> >> up
>> >> >> > query a bit different.
>> >> >> >
>> >> >> > So... that means most of these mappings are moot.
>> >> >>
>> >>
>>

Re: Index not_analysed for fields used as ids?

Posted by "John D. Ament" <jo...@apache.org>.
On Mon, Nov 7, 2016 at 9:54 AM sebb <se...@gmail.com> wrote:

> On 7 November 2016 at 14:36, John D. Ament <jo...@gmail.com> wrote:
> > On Mon, Nov 7, 2016 at 9:23 AM sebb <se...@gmail.com> wrote:
> >
> >> On 7 November 2016 at 01:36, John D. Ament <jo...@apache.org>
> wrote:
> >> > On Sun, Nov 6, 2016 at 8:22 PM sebb <se...@gmail.com> wrote:
> >> >
> >> >> On 6 November 2016 at 14:37, John D. Ament <jo...@gmail.com>
> >> wrote:
> >> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <hu...@apache.org>
> >> >> wrote:
> >> >> >
> >> >> >> On 11/06/2016 03:18 PM, sebb wrote:
> >> >> >> > Fields such as message-id are stored as text strings, but they
> are
> >> >> >> > only really intended to be used as ids. They don't contain
> >> independent
> >> >> >> > text parts.
> >> >> >> >
> >> >> >> > From what I have understood so far from reading the ES docs,
> such
> >> >> >> > fields should be tagged as
> >> >> >> >
> >> >> >> > "index": "not_analyzed"
> >> >> >> >
> >> >> >> > AIUI this reduces the analysis overhead and storage
> requirements,
> >> and
> >> >> >> > also makes it harder to find fields with
> >> >> >> > This probably applies to other fields in "mbox":
> >> >> >> >
> >> >> >> > mid
> >> >> >> > possibly in-reply-to
> >> >> >> > also references
> >> >> >> >
> >> >> >> > And of course the auto-created fields such as attachments
> >> >> >> >
> >> >> >> > Likewise the doc types currently missing from setup.py:
> >> >> >> >
> >> >> >> > notifications
> >> >> >> > account
> >> >> >> > mailinglists
> >> >> >> >
> >> >> >> > These are internal use only so are not intended for searching.
> >> >> >> >
> >> >> >> > Or have I got this completely wrong?
> >> >> >> >
> >> >> >>
> >> >> >> message-id is set to not be analyzed, by the setup script (it's in
> >> the
> >> >> >> mappings it sends to ES when creating the index). mid and
> in-reply-to
> >> >> >> should probably also be not analyzed, although mid is really a
> copy
> >> of
> >> >> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
> >> >> >> list_raw), neither is the raw from address
> >> >> >>
> >> >> >
> >> >> > So I notice the query process is an arbitrary full text query,
> which
> >> runs
> >> >> > against _all.
> >> >> >
> >> >>
> >>
> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
> >> >>
> >> >> Huh?
> >> >>
> >> >> The query starts:
> >> >>
> >> >> local url = config.es_url .. doc .. "/_search?q="..query
> >> >>
> >> >> where
> >> >>
> >> >> es_url = "http://localhost:9200/ponymail/"
> >> >>
> >> >> and
> >> >>
> >> >> doc = "mbox" by default.
> >> >>
> >> >> Where does the _all come in?
> >> >>
> >> >
> >> > When you do a query string query in elastic search (reference:
> >> >
> >>
> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
> >> )
> >> > the default field unless specified is "_all".  I can't find anything
> in
> >> the
> >> > pony code that changes this field.  As a result, its going to search
> _all
> >> > by default.
> >>
> >> stats.lua changes the generic query into:
> >>
> >> "query_string": {
> >>   "default_field": "subject",
> >>   "query": "(from:\"QUERY\") OR (subject:\"QUERY\") OR (body:\"QUERY\")"
> >> }
> >>
> >> Which does not use the _all field AFAICT
> >>
> >
> > Ok, this is what I was looking for ( but couldn't find ).  But to
> reiterate
> > my notes from above - this means that the only mappings that matter are
> > these fields.  Other field mappings don't matter.
> >
>
> Surely all the text fields 'matter' - i.e. need to have a mapping?
> Otherwise the default is to analyse them.
>
>
Not based on the query in use.  The only three fields being searched are
"from", "subject" and "body" - so only their mappings matter when doing
search.
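
For reference, the query_string body quoted above can be reproduced along
these lines (a sketch, not the stats.lua code; the function name and the
escaping of quotes and backslashes are my own assumptions):

```python
import json

# The three fields named in the stats.lua query quoted above.
SEARCH_FIELDS = ("from", "subject", "body")

def generic_query(user_input):
    """Build a query_string body that ORs a phrase match over each field."""
    # Assumption: escape backslashes and double quotes so the input
    # cannot terminate the quoted phrase early.
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    clauses = ['(%s:"%s")' % (field, escaped) for field in SEARCH_FIELDS]
    return {
        "query_string": {
            "default_field": "subject",
            "query": " OR ".join(clauses),
        }
    }

print(json.dumps(generic_query("not_analyzed"), indent=2))
```

Only the mappings of the fields named in SEARCH_FIELDS affect what this
query matches.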

One of the concepts behind ES is that you model your index based on the
queries you want to execute.  There are two points of view on that: only
store the things that are relevant, or make everything relevant.


> It's just a question of whether a field is used for searching, and if
> so, what type(s) of searches are done.
>
> It looks like from/subject/body need to support word matching, so need
> to be analysed.
>

We may want to consider things like partial match as well - fuzziness
ranking, ngrams, etc.


>
> However message id and many other fields need only support keyword
> matching.
> So these only need to be indexed.
>

Yes and no.  ES 5 introduced the concept of an enum type which may be what
message-id should be pointing to.  Email message IDs include some of the
stop characters in there "-" which need to be treated specially in queries.


>
> >>
> >> >
> >> >>
> >> >> > unless
> >> >> > I need to dig into it a bit further to see if there's something
> >> building
> >> >> up
> >> >> > query a bit different.
> >> >> >
> >> >> > So... that means most of these mappings are moot.
> >> >>
> >>
>

Re: Index not_analysed for fields used as ids?

Posted by sebb <se...@gmail.com>.
On 7 November 2016 at 14:36, John D. Ament <jo...@gmail.com> wrote:
> On Mon, Nov 7, 2016 at 9:23 AM sebb <se...@gmail.com> wrote:
>
>> On 7 November 2016 at 01:36, John D. Ament <jo...@apache.org> wrote:
>> > On Sun, Nov 6, 2016 at 8:22 PM sebb <se...@gmail.com> wrote:
>> >
>> >> On 6 November 2016 at 14:37, John D. Ament <jo...@gmail.com>
>> wrote:
>> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <hu...@apache.org>
>> >> wrote:
>> >> >
>> >> >> On 11/06/2016 03:18 PM, sebb wrote:
>> >> >> > Fields such as message-id are stored as text strings, but they are
>> >> >> > only really intended to be used as ids. They don't contain
>> independent
>> >> >> > text parts.
>> >> >> >
>> >> >> > From what I have understood so far from reading the ES docs, such
>> >> >> > fields should be tagged as
>> >> >> >
>> >> >> > "index": "not_analyzed"
>> >> >> >
>> >> >> > AIUI this reduces the analysis overhead and storage requirements,
>> and
>> >> >> > also makes it harder to find fields with
>> >> >> > This probably applies to other fields in "mbox":
>> >> >> >
>> >> >> > mid
>> >> >> > possibly in-reply-to
>> >> >> > also references
>> >> >> >
>> >> >> > And of course the auto-created fields such as attachments
>> >> >> >
>> >> >> > Likewise the doc types currently missing from setup.py:
>> >> >> >
>> >> >> > notifications
>> >> >> > account
>> >> >> > mailinglists
>> >> >> >
>> >> >> > These are internal use only so are not intended for searching.
>> >> >> >
>> >> >> > Or have I got this completely wrong?
>> >> >> >
>> >> >>
>> >> >> message-id is set to not be analyzed, by the setup script (it's in
>> the
>> >> >> mappings it sends to ES when creating the index). mid and in-reply-to
>> >> >> should probably also be not analyzed, although mid is really a copy
>> of
>> >> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
>> >> >> list_raw), neither is the raw from address
>> >> >>
>> >> >
>> >> > So I notice the query process is an arbitrary full text query, which
>> runs
>> >> > against _all.
>> >> >
>> >>
>> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
>> >>
>> >> Huh?
>> >>
>> >> The query starts:
>> >>
>> >> local url = config.es_url .. doc .. "/_search?q="..query
>> >>
>> >> where
>> >>
>> >> es_url = "http://localhost:9200/ponymail/"
>> >>
>> >> and
>> >>
>> >> doc = "mbox" by default.
>> >>
>> >> Where does the _all come in?
>> >>
>> >
>> > When you do a query string query in elastic search (reference:
>> >
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
>> )
>> > the default field unless specified is "_all".  I can't find anything in
>> the
>> > pony code that changes this field.  As a result, its going to search _all
>> > by default.
>>
>> stats.lua changes the generic query into:
>>
>> "query_string": {
>>   "default_field": "subject",
>>   "query": "(from:\"QUERY\") OR (subject:\"QUERY\") OR (body:\"QUERY\")"
>> }
>>
>> Which does not use the _all field AFAICT
>>
>
> Ok, this is what I was looking for ( but couldn't find ).  But to reiterate
> my notes from above - this means that the only mappings that matter are
> these fields.  Other field mappings don't matter.
>

Surely all the text fields 'matter' - i.e. need to have a mapping?
Otherwise the default is to analyse them.

It's just a question of whether a field is used for searching, and if
so, what type(s) of searches are done.

It looks like from/subject/body need to support word matching, so need
to be analysed.

However message id and many other fields need only support keyword matching.
So these only need to be indexed.
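
A minimal sketch of the distinction above, assuming the ES 2.x string-type
syntax. The field lists come from this thread, not from setup.py, and the
helper name build_mbox_properties is hypothetical.

```python
import json

# Fields that need word matching, so they are left analysed (the ES 2.x default).
WORD_FIELDS = ["from", "subject", "body"]
# Fields used only as ids, so they are indexed as a single unanalysed term.
ID_FIELDS = ["message-id", "mid", "in-reply-to", "references"]

def build_mbox_properties():
    """Return a "properties" dict for the mbox doc type (illustrative only)."""
    props = {}
    for name in WORD_FIELDS:
        props[name] = {"type": "string"}
    for name in ID_FIELDS:
        props[name] = {"type": "string", "index": "not_analyzed"}
    return props

if __name__ == "__main__":
    print(json.dumps({"mbox": {"properties": build_mbox_properties()}}, indent=2))
```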

>>
>> >
>> >>
>> >> > unless
>> >> > I need to dig into it a bit further to see if there's something
>> building
>> >> up
>> >> > query a bit different.
>> >> >
>> >> > So... that means most of these mappings are moot.
>> >>
>>

Re: Index not_analysed for fields used as ids?

Posted by "John D. Ament" <jo...@gmail.com>.
On Mon, Nov 7, 2016 at 9:23 AM sebb <se...@gmail.com> wrote:

> On 7 November 2016 at 01:36, John D. Ament <jo...@apache.org> wrote:
> > On Sun, Nov 6, 2016 at 8:22 PM sebb <se...@gmail.com> wrote:
> >
> >> On 6 November 2016 at 14:37, John D. Ament <jo...@gmail.com>
> wrote:
> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <hu...@apache.org>
> >> wrote:
> >> >
> >> >> On 11/06/2016 03:18 PM, sebb wrote:
> >> >> > Fields such as message-id are stored as text strings, but they are
> >> >> > only really intended to be used as ids. They don't contain
> independent
> >> >> > text parts.
> >> >> >
> >> >> > From what I have understood so far from reading the ES docs, such
> >> >> > fields should be tagged as
> >> >> >
> >> >> > "index": "not_analyzed"
> >> >> >
> >> >> > AIUI this reduces the analysis overhead and storage requirements,
> and
> >> >> > also makes it harder to find fields with
> >> >> > This probably applies to other fields in "mbox":
> >> >> >
> >> >> > mid
> >> >> > possibly in-reply-to
> >> >> > also references
> >> >> >
> >> >> > And of course the auto-created fields such as attachments
> >> >> >
> >> >> > Likewise the doc types currently missing from setup.py:
> >> >> >
> >> >> > notifications
> >> >> > account
> >> >> > mailinglists
> >> >> >
> >> >> > These are internal use only so are not intended for searching.
> >> >> >
> >> >> > Or have I got this completely wrong?
> >> >> >
> >> >>
> >> >> message-id is set to not be analyzed, by the setup script (it's in
> the
> >> >> mappings it sends to ES when creating the index). mid and in-reply-to
> >> >> should probably also be not analyzed, although mid is really a copy
> of
> >> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
> >> >> list_raw), neither is the raw from address
> >> >>
> >> >
> >> > So I notice the query process is an arbitrary full text query, which
> runs
> >> > against _all.
> >> >
> >>
> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
> >>
> >> Huh?
> >>
> >> The query starts:
> >>
> >> local url = config.es_url .. doc .. "/_search?q="..query
> >>
> >> where
> >>
> >> es_url = "http://localhost:9200/ponymail/"
> >>
> >> and
> >>
> >> doc = "mbox" by default.
> >>
> >> Where does the _all come in?
> >>
> >
> > When you do a query string query in elastic search (reference:
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
> )
> > the default field unless specified is "_all".  I can't find anything in
> the
> > pony code that changes this field.  As a result, its going to search _all
> > by default.
>
> stats.lua changes the generic query into:
>
> "query_string": {
>   "default_field": "subject",
>   "query": "(from:\"QUERY\") OR (subject:\"QUERY\") OR (body:\"QUERY\")"
> }
>
> Which does not use the _all field AFAICT
>

Ok, this is what I was looking for (but couldn't find).  But to reiterate
my notes from above - this means that the only mappings that matter are
those for these fields.  Other field mappings don't matter.


>
> >
> >>
> >> > unless
> >> > I need to dig into it a bit further to see if there's something
> building
> >> up
> >> > query a bit different.
> >> >
> >> > So... that means most of these mappings are moot.
> >>
>

Re: Index not_analysed for fields used as ids?

Posted by sebb <se...@gmail.com>.
On 7 November 2016 at 01:36, John D. Ament <jo...@apache.org> wrote:
> On Sun, Nov 6, 2016 at 8:22 PM sebb <se...@gmail.com> wrote:
>
>> On 6 November 2016 at 14:37, John D. Ament <jo...@gmail.com> wrote:
>> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <hu...@apache.org>
>> wrote:
>> >
>> >> On 11/06/2016 03:18 PM, sebb wrote:
>> >> > Fields such as message-id are stored as text strings, but they are
>> >> > only really intended to be used as ids. They don't contain independent
>> >> > text parts.
>> >> >
>> >> > From what I have understood so far from reading the ES docs, such
>> >> > fields should be tagged as
>> >> >
>> >> > "index": "not_analyzed"
>> >> >
>> >> > AIUI this reduces the analysis overhead and storage requirements, and
>> >> > also makes it harder to find fields with
>> >> > This probably applies to other fields in "mbox":
>> >> >
>> >> > mid
>> >> > possibly in-reply-to
>> >> > also references
>> >> >
>> >> > And of course the auto-created fields such as attachments
>> >> >
>> >> > Likewise the doc types currently missing from setup.py:
>> >> >
>> >> > notifications
>> >> > account
>> >> > mailinglists
>> >> >
>> >> > These are internal use only so are not intended for searching.
>> >> >
>> >> > Or have I got this completely wrong?
>> >> >
>> >>
>> >> message-id is set to not be analyzed, by the setup script (it's in the
>> >> mappings it sends to ES when creating the index). mid and in-reply-to
>> >> should probably also be not analyzed, although mid is really a copy of
>> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
>> >> list_raw), neither is the raw from address
>> >>
>> >
>> > So I notice the query process is an arbitrary full text query, which runs
>> > against _all.
>> >
>> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
>>
>> Huh?
>>
>> The query starts:
>>
>> local url = config.es_url .. doc .. "/_search?q="..query
>>
>> where
>>
>> es_url = "http://localhost:9200/ponymail/"
>>
>> and
>>
>> doc = "mbox" by default.
>>
>> Where does the _all come in?
>>
>
> When you do a query string query in elastic search (reference:
> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html)
> the default field unless specified is "_all".  I can't find anything in the
> pony code that changes this field.  As a result, its going to search _all
> by default.

stats.lua changes the generic query into:

"query_string": {
  "default_field": "subject",
  "query": "(from:\"QUERY\") OR (subject:\"QUERY\") OR (body:\"QUERY\")"
}

Which does not use the _all field AFAICT
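
For reference, the same expansion sketched in Python. This is not the
stats.lua code; build_query is a hypothetical name, and the quoting of the
user's text is an assumption about what a safe version would do.

```python
import json

def build_query(term, fields=("from", "subject", "body")):
    """Expand one search term into an explicit per-field query_string."""
    # Escape backslashes and double quotes so the term stays inside the phrase.
    quoted = term.replace("\\", "\\\\").replace('"', '\\"')
    clauses = ['(%s:"%s")' % (field, quoted) for field in fields]
    return {
        "query_string": {
            "default_field": "subject",
            "query": " OR ".join(clauses),
        }
    }

if __name__ == "__main__":
    print(json.dumps(build_query("QUERY"), indent=2))
```

Because every clause names its field, the default field (and hence _all) is
never consulted.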

>
>>
>> > unless
>> > I need to dig into it a bit further to see if there's something building
>> up
>> > query a bit different.
>> >
>> > So... that means most of these mappings are moot.
>>

Re: Index not_analysed for fields used as ids?

Posted by "John D. Ament" <jo...@apache.org>.
On Sun, Nov 6, 2016 at 8:22 PM sebb <se...@gmail.com> wrote:

> On 6 November 2016 at 14:37, John D. Ament <jo...@gmail.com> wrote:
> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <hu...@apache.org>
> wrote:
> >
> >> On 11/06/2016 03:18 PM, sebb wrote:
> >> > Fields such as message-id are stored as text strings, but they are
> >> > only really intended to be used as ids. They don't contain independent
> >> > text parts.
> >> >
> >> > From what I have understood so far from reading the ES docs, such
> >> > fields should be tagged as
> >> >
> >> > "index": "not_analyzed"
> >> >
> >> > AIUI this reduces the analysis overhead and storage requirements, and
> >> > also makes it harder to find fields with
> >> > This probably applies to other fields in "mbox":
> >> >
> >> > mid
> >> > possibly in-reply-to
> >> > also references
> >> >
> >> > And of course the auto-created fields such as attachments
> >> >
> >> > Likewise the doc types currently missing from setup.py:
> >> >
> >> > notifications
> >> > account
> >> > mailinglists
> >> >
> >> > These are internal use only so are not intended for searching.
> >> >
> >> > Or have I got this completely wrong?
> >> >
> >>
> >> message-id is set to not be analyzed, by the setup script (it's in the
> >> mappings it sends to ES when creating the index). mid and in-reply-to
> >> should probably also be not analyzed, although mid is really a copy of
> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
> >> list_raw), neither is the raw from address
> >>
> >
> > So I notice the query process is an arbitrary full text query, which runs
> > against _all.
> >
> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
>
> Huh?
>
> The query starts:
>
> local url = config.es_url .. doc .. "/_search?q="..query
>
> where
>
> es_url = "http://localhost:9200/ponymail/"
>
> and
>
> doc = "mbox" by default.
>
> Where does the _all come in?
>

When you do a query string query in Elasticsearch (reference:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html)
the default field, unless specified, is "_all".  I can't find anything in the
pony code that changes this field.  As a result, it's going to search _all
by default.
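
To make that concrete, a small sketch of the URI search form quoted above.
The URL pieces are the ones from elastic.lua as quoted in this thread;
search_url is a hypothetical helper, and df is the URI-search parameter that
names the default field.

```python
from urllib.parse import quote

ES_URL = "http://localhost:9200/ponymail/"

def search_url(query, doc="mbox", default_field=None):
    """Build a URI search URL; without df, unprefixed terms go to _all."""
    url = ES_URL + doc + "/_search?q=" + quote(query, safe="")
    if default_field is not None:
        # df overrides the default field for terms with no field prefix.
        url += "&df=" + quote(default_field, safe="")
    return url

if __name__ == "__main__":
    print(search_url("hello world"))
    print(search_url("hello world", default_field="subject"))
```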


>
> > unless
> > I need to dig into it a bit further to see if there's something building
> up
> > query a bit different.
> >
> > So... that means most of these mappings are moot.
>

Re: Index not_analysed for fields used as ids?

Posted by sebb <se...@gmail.com>.
On 6 November 2016 at 14:37, John D. Ament <jo...@gmail.com> wrote:
> On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <hu...@apache.org> wrote:
>
>> On 11/06/2016 03:18 PM, sebb wrote:
>> > Fields such as message-id are stored as text strings, but they are
>> > only really intended to be used as ids. They don't contain independent
>> > text parts.
>> >
>> > From what I have understood so far from reading the ES docs, such
>> > fields should be tagged as
>> >
>> > "index": "not_analyzed"
>> >
>> > AIUI this reduces the analysis overhead and storage requirements, and
>> > also makes it harder to find fields with
>> > This probably applies to other fields in "mbox":
>> >
>> > mid
>> > possibly in-reply-to
>> > also references
>> >
>> > And of course the auto-created fields such as attachments
>> >
>> > Likewise the doc types currently missing from setup.py:
>> >
>> > notifications
>> > account
>> > mailinglists
>> >
>> > These are internal use only so are not intended for searching.
>> >
>> > Or have I got this completely wrong?
>> >
>>
>> message-id is set to not be analyzed, by the setup script (it's in the
>> mappings it sends to ES when creating the index). mid and in-reply-to
>> should probably also be not analyzed, although mid is really a copy of
>> the doc ID, IIRC. the list ID is also not analyzed by default (as
>> list_raw), neither is the raw from address
>>
>
> So I notice the query process is an arbitrary full text query, which runs
> against _all.
> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44

Huh?

The query starts:

local url = config.es_url .. doc .. "/_search?q="..query

where

es_url = "http://localhost:9200/ponymail/"

and

doc = "mbox" by default.

Where does the _all come in?

> unless
> I need to dig into it a bit further to see if there's something building up
> query a bit different.
>
> So... that means most of these mappings are moot.

Re: Index not_analysed for fields used as ids?

Posted by "John D. Ament" <jo...@gmail.com>.
On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <hu...@apache.org> wrote:

> On 11/06/2016 03:18 PM, sebb wrote:
> > Fields such as message-id are stored as text strings, but they are
> > only really intended to be used as ids. They don't contain independent
> > text parts.
> >
> > From what I have understood so far from reading the ES docs, such
> > fields should be tagged as
> >
> > "index": "not_analyzed"
> >
> > AIUI this reduces the analysis overhead and storage requirements, and
> > also makes it harder to find fields with
> > This probably applies to other fields in "mbox":
> >
> > mid
> > possibly in-reply-to
> > also references
> >
> > And of course the auto-created fields such as attachments
> >
> > Likewise the doc types currently missing from setup.py:
> >
> > notifications
> > account
> > mailinglists
> >
> > These are internal use only so are not intended for searching.
> >
> > Or have I got this completely wrong?
> >
>
> message-id is set to not be analyzed, by the setup script (it's in the
> mappings it sends to ES when creating the index). mid and in-reply-to
> should probably also be not analyzed, although mid is really a copy of
> the doc ID, IIRC. the list ID is also not analyzed by default (as
> list_raw), neither is the raw from address
>

So I notice the query process is an arbitrary full text query, which runs
against _all
(https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44),
unless I need to dig into it a bit further to see if there's something
building up the query a bit differently.

So... that means most of these mappings are moot.

Re: Index not_analysed for fields used as ids?

Posted by sebb <se...@gmail.com>.
On 6 November 2016 at 14:26, Daniel Gruno <hu...@apache.org> wrote:
> On 11/06/2016 03:18 PM, sebb wrote:
>> Fields such as message-id are stored as text strings, but they are
>> only really intended to be used as ids. They don't contain independent
>> text parts.
>>
>> From what I have understood so far from reading the ES docs, such
>> fields should be tagged as
>>
>> "index": "not_analyzed"
>>
>> AIUI this reduces the analysis overhead and storage requirements, and
>> also makes it harder to find fields with
>> This probably applies to other fields in "mbox":
>>
>> mid
>> possibly in-reply-to
>> also references
>>
>> And of course the auto-created fields such as attachments
>>
>> Likewise the doc types currently missing from setup.py:
>>
>> notifications
>> account
>> mailinglists
>>
>> These are internal use only so are not intended for searching.
>>
>> Or have I got this completely wrong?
>>
>
> message-id is set to not be analyzed, by the setup script (it's in the
> mappings it sends to ES when creating the index).

Yes, I know, that was why I mentioned it, but my email was not at all clear.

> mid and in-reply-to
> should probably also be not analyzed, although mid is really a copy of
> the doc ID, IIRC.

> the list ID is also not analyzed by default (as
> list_raw), neither is the raw from address

Yes, I noticed those raw fields.
However I'm not sure why one would want to analyse the LID, so why is
there a list field as well as list_raw?

Since 'from' may contain free text as well as the email address it
makes sense to analyse it; I'm not sure why one needs from_raw as
well, unless one needs to match against the whole field.
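
A rough illustration of the difference, using a crude stand-in for an
analyser (lower-case, split on non-alphanumerics). This is not the real ES
standard analyzer, and the address is made up.

```python
import re

def crude_analyse(value):
    """Very rough approximation of analysis: lower-case word-like tokens."""
    return [tok for tok in re.split(r"[^0-9a-z]+", value.lower()) if tok]

value = "Jane Doe <jane@example.org>"

# Roughly what an analysed "from" field would make searchable.
analysed_terms = crude_analyse(value)
# A not_analyzed "from_raw" field holds the whole value as one term.
raw_terms = [value]

if __name__ == "__main__":
    print(analysed_terms)
    print(raw_terms)
```

So a search for "doe" can match the analysed field, while the raw field only
matches the complete string, which is what exact matching or grouping by
sender would need.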

Re: Index not_analysed for fields used as ids?

Posted by Daniel Gruno <hu...@apache.org>.
On 11/06/2016 03:18 PM, sebb wrote:
> Fields such as message-id are stored as text strings, but they are
> only really intended to be used as ids. They don't contain independent
> text parts.
> 
> From what I have understood so far from reading the ES docs, such
> fields should be tagged as
> 
> "index": "not_analyzed"
> 
> AIUI this reduces the analysis overhead and storage requirements, and
> also makes it harder to find fields with
> This probably applies to other fields in "mbox":
> 
> mid
> possibly in-reply-to
> also references
> 
> And of course the auto-created fields such as attachments
> 
> Likewise the doc types currently missing from setup.py:
> 
> notifications
> account
> mailinglists
> 
> These are internal use only so are not intended for searching.
> 
> Or have I got this completely wrong?
> 

message-id is set to not be analyzed, by the setup script (it's in the
mappings it sends to ES when creating the index). mid and in-reply-to
should probably also be not analyzed, although mid is really a copy of
the doc ID, IIRC. The list ID is also not analyzed by default (as
list_raw), and neither is the raw from address.