You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@ponymail.apache.org by sebb <se...@gmail.com> on 2021/11/14 11:19:22 UTC

Re: [incubator-ponymail-foal] 01/02: Add a separate header for short bodies for stats.py

-1

I think this commit should be reverted.

I see no purpose for the body_short header.

It wastes space in the database.
It increases the data sent from database to server code.

What if you want to change the length?
Are you going to update the entire database each time the length is changed?

Seems to me the sensible way to do this is in the status.lua plugin.
This can then pick up a config item for the length.

Sebb

On Sun, 17 Oct 2021 at 21:32, sebb <se...@gmail.com> wrote:
>
> On Sun, 17 Oct 2021 at 16:43, <hu...@apache.org> wrote:
>>
>> This is an automated email from the ASF dual-hosted git repository.
>>
>> humbedooh pushed a commit to branch master
>> in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
>>
>> commit 2dff9351d119ddee5c5e0171991c54b1911f05b1
>> Author: Daniel Gruno <hu...@apache.org>
>> AuthorDate: Sun Oct 17 17:41:51 2021 +0200
>>
>>     Add a separate header for short bodies for stats.py
>> ---
>>  tools/archiver.py   | 5 ++++-
>>  tools/mappings.yaml | 2 ++
>>  2 files changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/archiver.py b/tools/archiver.py
>> index 1783d0e..dc12e39 100755
>> --- a/tools/archiver.py
>> +++ b/tools/archiver.py
>> @@ -584,6 +584,8 @@ class Archiver(object):  # N.B. Also used by import-mbox.py
>>              ghash = hashlib.md5(mailaddr.encode("utf-8")).hexdigest()
>>
>>              notes.append(["ARCHIVE: Email archived as %s at %u" % (document_id, time.time())])
>> +            body_unflowed = body.unflow() if body else ""
>> +            body_shortened = body_unflowed[:210]  # 210 so that we can tell if > 200.
>>
>
> -1
>
> What's so special about 200 and 210?
>
> These numbers should be constants (with suitable docn) or possibly configuration items.
>
> The only bare numbers I would expect to see in code are 0 and 1 (or -1).
>
>>              output_json = {
>>                  "from_raw": msg_metadata["from"],
>> @@ -603,7 +605,8 @@ class Archiver(object):  # N.B. Also used by import-mbox.py
>>                  "private": private,
>>                  "references": msg_metadata["references"],
>>                  "in-reply-to": irt,
>> -                "body": body.unflow() if body else "",
>> +                "body": body_unflowed,
>> +                "body_short": body_shortened,
>>                  "html_source_only": body and body.html_as_source or False,
>>                  "attachments": attachments,
>>                  "forum": (lid or "").strip("<>").replace(".", "@", 1),
>> diff --git a/tools/mappings.yaml b/tools/mappings.yaml
>> index 4bb4978..6ad72d3 100644
>> --- a/tools/mappings.yaml
>> +++ b/tools/mappings.yaml
>> @@ -55,6 +55,8 @@ mbox:
>>            type: long
>>      body:
>>        type: text
>> +    body_short:
>> +      type: text
>>      cc:
>>        type: text
>>      date:

Re: [incubator-ponymail-foal] 01/02: Add a separate header for short bodies for stats.py

Posted by sebb <se...@gmail.com>.
On Sun, 14 Nov 2021 at 17:32, sebb <se...@gmail.com> wrote:
>
> On Sun, 14 Nov 2021 at 16:22, Daniel Gruno <hu...@apache.org> wrote:
> >
> > On 14/11/2021 12.19, sebb wrote:
> > > -1
> > >
> > > I think this commit should be reverted.
> > >
> > > I see no purpose for the body_short header.
> >
> > It is absolutely needed. It is what makes it possible to search large
> > lists with very large emails and not have everything crash around you.
>
> I see now.
>
> I was confused by the use of doc['body'] in messages.query and
> overlooked the fact that it does not download the body attribute.
>
> However, I'm not sure that BODY_MAXLEN serves any purpose.
>
> body_short could include the '...' marker.
>
> > When you only need the first bit of the text, there is no need to grab
> > several megabytes of email body, especially not if you have thousands of
> > emails you need to parse. In a production environment I have access to,
> > this caused a significant speedup and went from grabbing 2GB of data per
> > search to only 50MB - search time and overall backend response time was
> > an order of magnitude faster after this had been implemented. It is a
> > very vital component of a responsive service.
> >
> > The only other way would be to use scripting inside the ES query, which
> > is both dangerous and requires additional setup.
> >

On further reflection, I don't think the size of the body field is the
fundamental problem.

As I see it, the main problem is the number of emails that are
returned in a single response.

As the design is currently, even 200 bytes could become a problem if
there are enough emails in the search result.
And even if the server can cope, it can send a lot of data to the
client, which might be a low-memory phone.

At present the design just does not scale well, which is why there has
had to be this work-round.

I think the design needs to be reworked to use a batched response.
It would then not matter if the entire body were returned to
stats.lua, and that could shorten the data for display.

> > As for the length, 200 is the standard used throughout the interface.
> > Yes, you could change it, and yes you might need to reindex if so, but
> > who is realistically going to change it?
> >
>
> > >
> > > It wastes space in the database.
> > > It increases the data sent from database to server code.
> > >
> > > What if you want to change the length?
> > > Are you going to update the entire database each time the length is changed?
> > >
> > > Seems to me the sensible way to do this is in the status.lua plugin.
> > > This can then pick up a config item for the length.
> > >
> > > Sebb
> > >
> > > On Sun, 17 Oct 2021 at 21:32, sebb <se...@gmail.com> wrote:
> > >>
> > >> On Sun, 17 Oct 2021 at 16:43, <hu...@apache.org> wrote:
> > >>>
> > >>> This is an automated email from the ASF dual-hosted git repository.
> > >>>
> > >>> humbedooh pushed a commit to branch master
> > >>> in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
> > >>>
> > >>> commit 2dff9351d119ddee5c5e0171991c54b1911f05b1
> > >>> Author: Daniel Gruno <hu...@apache.org>
> > >>> AuthorDate: Sun Oct 17 17:41:51 2021 +0200
> > >>>
> > >>>      Add a separate header for short bodies for stats.py
> > >>> ---
> > >>>   tools/archiver.py   | 5 ++++-
> > >>>   tools/mappings.yaml | 2 ++
> > >>>   2 files changed, 6 insertions(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/tools/archiver.py b/tools/archiver.py
> > >>> index 1783d0e..dc12e39 100755
> > >>> --- a/tools/archiver.py
> > >>> +++ b/tools/archiver.py
> > >>> @@ -584,6 +584,8 @@ class Archiver(object):  # N.B. Also used by import-mbox.py
> > >>>               ghash = hashlib.md5(mailaddr.encode("utf-8")).hexdigest()
> > >>>
> > >>>               notes.append(["ARCHIVE: Email archived as %s at %u" % (document_id, time.time())])
> > >>> +            body_unflowed = body.unflow() if body else ""
> > >>> +            body_shortened = body_unflowed[:210]  # 210 so that we can tell if > 200.
> > >>>
> > >>
> > >> -1
> > >>
> > >> What's so special about 200 and 210?
> > >>
> > >> These numbers should be constants (with suitable docn) or possibly configuration items.
> > >>
> > >> The only bare numbers I would expect to see in code are 0 and 1 (or -1).
> > >>
> > >>>               output_json = {
> > >>>                   "from_raw": msg_metadata["from"],
> > >>> @@ -603,7 +605,8 @@ class Archiver(object):  # N.B. Also used by import-mbox.py
> > >>>                   "private": private,
> > >>>                   "references": msg_metadata["references"],
> > >>>                   "in-reply-to": irt,
> > >>> -                "body": body.unflow() if body else "",
> > >>> +                "body": body_unflowed,
> > >>> +                "body_short": body_shortened,
> > >>>                   "html_source_only": body and body.html_as_source or False,
> > >>>                   "attachments": attachments,
> > >>>                   "forum": (lid or "").strip("<>").replace(".", "@", 1),
> > >>> diff --git a/tools/mappings.yaml b/tools/mappings.yaml
> > >>> index 4bb4978..6ad72d3 100644
> > >>> --- a/tools/mappings.yaml
> > >>> +++ b/tools/mappings.yaml
> > >>> @@ -55,6 +55,8 @@ mbox:
> > >>>             type: long
> > >>>       body:
> > >>>         type: text
> > >>> +    body_short:
> > >>> +      type: text
> > >>>       cc:
> > >>>         type: text
> > >>>       date:
> >

Re: [incubator-ponymail-foal] 01/02: Add a separate header for short bodies for stats.py

Posted by sebb <se...@gmail.com>.
On Sun, 14 Nov 2021 at 16:22, Daniel Gruno <hu...@apache.org> wrote:
>
> On 14/11/2021 12.19, sebb wrote:
> > -1
> >
> > I think this commit should be reverted.
> >
> > I see no purpose for the body_short header.
>
> It is absolutely needed. It is what makes it possible to search large
> lists with very large emails and not have everything crash around you.

I see now.

I was confused by the use of doc['body'] in messages.query and
overlooked the fact that it does not download the body attribute.

However, I'm not sure that BODY_MAXLEN serves any purpose.

body_short could include the '...' marker.

> When you only need the first bit of the text, there is no need to grab
> several megabytes of email body, especially not if you have thousands of
> emails you need to parse. In a production environment I have access to,
> this caused a significant speedup and went from grabbing 2GB of data per
> search to only 50MB - search time and overall backend response time was
> an order of magnitude faster after this had been implemented. It is a
> very vital component of a responsive service.
>
> The only other way would be to use scripting inside the ES query, which
> is both dangerous and requires additional setup.
>
> As for the length, 200 is the standard used throughout the interface.
> Yes, you could change it, and yes you might need to reindex if so, but
> who is realistically going to change it?
>

> >
> > It wastes space in the database.
> > It increases the data sent from database to server code.
> >
> > What if you want to change the length?
> > Are you going to update the entire database each time the length is changed?
> >
> > Seems to me the sensible way to do this is in the status.lua plugin.
> > This can then pick up a config item for the length.
> >
> > Sebb
> >
> > On Sun, 17 Oct 2021 at 21:32, sebb <se...@gmail.com> wrote:
> >>
> >> On Sun, 17 Oct 2021 at 16:43, <hu...@apache.org> wrote:
> >>>
> >>> This is an automated email from the ASF dual-hosted git repository.
> >>>
> >>> humbedooh pushed a commit to branch master
> >>> in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
> >>>
> >>> commit 2dff9351d119ddee5c5e0171991c54b1911f05b1
> >>> Author: Daniel Gruno <hu...@apache.org>
> >>> AuthorDate: Sun Oct 17 17:41:51 2021 +0200
> >>>
> >>>      Add a separate header for short bodies for stats.py
> >>> ---
> >>>   tools/archiver.py   | 5 ++++-
> >>>   tools/mappings.yaml | 2 ++
> >>>   2 files changed, 6 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/tools/archiver.py b/tools/archiver.py
> >>> index 1783d0e..dc12e39 100755
> >>> --- a/tools/archiver.py
> >>> +++ b/tools/archiver.py
> >>> @@ -584,6 +584,8 @@ class Archiver(object):  # N.B. Also used by import-mbox.py
> >>>               ghash = hashlib.md5(mailaddr.encode("utf-8")).hexdigest()
> >>>
> >>>               notes.append(["ARCHIVE: Email archived as %s at %u" % (document_id, time.time())])
> >>> +            body_unflowed = body.unflow() if body else ""
> >>> +            body_shortened = body_unflowed[:210]  # 210 so that we can tell if > 200.
> >>>
> >>
> >> -1
> >>
> >> What's so special about 200 and 210?
> >>
> >> These numbers should be constants (with suitable docn) or possibly configuration items.
> >>
> >> The only bare numbers I would expect to see in code are 0 and 1 (or -1).
> >>
> >>>               output_json = {
> >>>                   "from_raw": msg_metadata["from"],
> >>> @@ -603,7 +605,8 @@ class Archiver(object):  # N.B. Also used by import-mbox.py
> >>>                   "private": private,
> >>>                   "references": msg_metadata["references"],
> >>>                   "in-reply-to": irt,
> >>> -                "body": body.unflow() if body else "",
> >>> +                "body": body_unflowed,
> >>> +                "body_short": body_shortened,
> >>>                   "html_source_only": body and body.html_as_source or False,
> >>>                   "attachments": attachments,
> >>>                   "forum": (lid or "").strip("<>").replace(".", "@", 1),
> >>> diff --git a/tools/mappings.yaml b/tools/mappings.yaml
> >>> index 4bb4978..6ad72d3 100644
> >>> --- a/tools/mappings.yaml
> >>> +++ b/tools/mappings.yaml
> >>> @@ -55,6 +55,8 @@ mbox:
> >>>             type: long
> >>>       body:
> >>>         type: text
> >>> +    body_short:
> >>> +      type: text
> >>>       cc:
> >>>         type: text
> >>>       date:
>

Re: [incubator-ponymail-foal] 01/02: Add a separate header for short bodies for stats.py

Posted by Daniel Gruno <hu...@apache.org>.
On 14/11/2021 12.19, sebb wrote:
> -1
> 
> I think this commit should be reverted.
> 
> I see no purpose for the body_short header.

It is absolutely needed. It is what makes it possible to search large 
lists with very large emails and not have everything crash around you.

When you only need the first bit of the text, there is no need to grab 
several megabytes of email body, especially not if you have thousands of 
emails you need to parse. In a production environment I have access to, 
this caused a significant speedup and went from grabbing 2GB of data per 
search to only 50MB - search time and overall backend response time was 
an order of magnitude faster after this had been implemented. It is a 
very vital component of a responsive service.

The only other way would be to use scripting inside the ES query, which 
is both dangerous and requires additional setup.

As for the length, 200 is the standard used throughout the interface. 
Yes, you could change it, and yes you might need to reindex if so, but 
who is realistically going to change it?

> 
> It wastes space in the database.
> It increases the data sent from database to server code.
> 
> What if you want to change the length?
> Are you going to update the entire database each time the length is changed?
> 
> Seems to me the sensible way to do this is in the status.lua plugin.
> This can then pick up a config item for the length.
> 
> Sebb
> 
> On Sun, 17 Oct 2021 at 21:32, sebb <se...@gmail.com> wrote:
>>
>> On Sun, 17 Oct 2021 at 16:43, <hu...@apache.org> wrote:
>>>
>>> This is an automated email from the ASF dual-hosted git repository.
>>>
>>> humbedooh pushed a commit to branch master
>>> in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
>>>
>>> commit 2dff9351d119ddee5c5e0171991c54b1911f05b1
>>> Author: Daniel Gruno <hu...@apache.org>
>>> AuthorDate: Sun Oct 17 17:41:51 2021 +0200
>>>
>>>      Add a separate header for short bodies for stats.py
>>> ---
>>>   tools/archiver.py   | 5 ++++-
>>>   tools/mappings.yaml | 2 ++
>>>   2 files changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/tools/archiver.py b/tools/archiver.py
>>> index 1783d0e..dc12e39 100755
>>> --- a/tools/archiver.py
>>> +++ b/tools/archiver.py
>>> @@ -584,6 +584,8 @@ class Archiver(object):  # N.B. Also used by import-mbox.py
>>>               ghash = hashlib.md5(mailaddr.encode("utf-8")).hexdigest()
>>>
>>>               notes.append(["ARCHIVE: Email archived as %s at %u" % (document_id, time.time())])
>>> +            body_unflowed = body.unflow() if body else ""
>>> +            body_shortened = body_unflowed[:210]  # 210 so that we can tell if > 200.
>>>
>>
>> -1
>>
>> What's so special about 200 and 210?
>>
>> These numbers should be constants (with suitable docn) or possibly configuration items.
>>
>> The only bare numbers I would expect to see in code are 0 and 1 (or -1).
>>
>>>               output_json = {
>>>                   "from_raw": msg_metadata["from"],
>>> @@ -603,7 +605,8 @@ class Archiver(object):  # N.B. Also used by import-mbox.py
>>>                   "private": private,
>>>                   "references": msg_metadata["references"],
>>>                   "in-reply-to": irt,
>>> -                "body": body.unflow() if body else "",
>>> +                "body": body_unflowed,
>>> +                "body_short": body_shortened,
>>>                   "html_source_only": body and body.html_as_source or False,
>>>                   "attachments": attachments,
>>>                   "forum": (lid or "").strip("<>").replace(".", "@", 1),
>>> diff --git a/tools/mappings.yaml b/tools/mappings.yaml
>>> index 4bb4978..6ad72d3 100644
>>> --- a/tools/mappings.yaml
>>> +++ b/tools/mappings.yaml
>>> @@ -55,6 +55,8 @@ mbox:
>>>             type: long
>>>       body:
>>>         type: text
>>> +    body_short:
>>> +      type: text
>>>       cc:
>>>         type: text
>>>       date: