
Posted to dev@ponymail.apache.org by GitBox <gi...@apache.org> on 2021/05/05 12:26:43 UTC

[GitHub] [incubator-ponymail-foal] sbp opened a new issue #17: Enumerating all mailing lists is intensive in both CPU and network

sbp opened a new issue #17:
URL: https://github.com/apache/incubator-ponymail-foal/issues/17


   In addition to the list discovery issue identified in PR #16, there is a further problem: later on, the query code uses `size=0` to keep `sum_other_doc_count` from being greater than zero, which the Elasticsearch documentation strongly recommends against:
   
   > It is possible to not limit the number of terms that are returned by setting `size` to `0`. Don’t use this on high-cardinality fields as this will kill both your CPU since terms need to be return sorted, and your network.
   
   This means that the query is likely to be very expensive on databases containing hundreds of thousands of messages, and `background.py` runs it once every couple of minutes or so. However, `size=0` is necessary in order to enumerate all mailing lists accurately.
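
To illustrate the trade-off, here is a minimal Python sketch that requests a bounded number of terms and uses `sum_other_doc_count` to detect whether any lists were left out. The field name `list_raw`, the aggregation name `lists`, the default cap of 10000, and all function names are assumptions made for this example and are not taken from the Foal code.

```python
def build_list_enumeration_query(field="list_raw", max_terms=10000):
    """Build a search body with a terms aggregation over the list field.

    The field name and the cap are hypothetical, chosen for illustration.
    """
    return {
        # Top-level size: return no search hits, only the aggregation.
        # This is unrelated to the terms-level `size` discussed in the issue.
        "size": 0,
        "aggs": {
            "lists": {"terms": {"field": field, "size": max_terms}},
        },
    }


def is_enumeration_complete(response):
    """Return True when no documents fell outside the returned buckets."""
    return response["aggregations"]["lists"]["sum_other_doc_count"] == 0


def list_counts(response):
    """Map each returned list ID to its message count."""
    buckets = response["aggregations"]["lists"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}
```

With a bounded `size`, a non-zero `sum_other_doc_count` means the enumeration is incomplete, which is the inaccuracy that `size=0` was meant to avoid.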
   
   The underlying issue here is that Elasticsearch is not designed for accurate queries of this nature over extremely large datasets. It may therefore be necessary to add an extra index for mailing lists, which would be updated whenever `archiver.py` receives another message.
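
As a rough sketch of what such an extra index could look like, the following Python builds an idempotent upsert for one list, using the standard Elasticsearch Update API body shape (`doc` plus `doc_as_upsert`). The document layout, the `list_raw` field name, the hashed document ID, and the function names are all hypothetical; nothing here is taken from `archiver.py`.

```python
import hashlib


def list_doc_id(list_id):
    """Derive a stable document ID from a list ID (hypothetical scheme)."""
    return hashlib.sha256(list_id.encode("utf-8")).hexdigest()


def build_list_upsert(list_id):
    """Return (doc_id, body) for an Update API call against a list index.

    Because the ID is derived from the list ID, archiving many messages
    for the same list keeps rewriting one document instead of adding more.
    """
    body = {
        "doc": {"list_raw": list_id},
        "doc_as_upsert": True,  # create the document if it does not exist yet
    }
    return list_doc_id(list_id), body
```

Enumerating lists would then be a plain scan of a small index with one document per list, rather than an aggregation over every message.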
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-ponymail-foal] Humbedooh commented on issue #17: Enumerating all mailing lists is intensive in both CPU and network

Posted by GitBox <gi...@apache.org>.

Humbedooh commented on issue #17:
URL: https://github.com/apache/incubator-ponymail-foal/issues/17#issuecomment-832697449


   I do like the idea of keeping a "set" of found lists. It would also need to record whether each list contains private messages.
   Having said that, I don't believe the query is that expensive. I say this from experience with 30 million emails in a database: it should take a few seconds on decent hardware.
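
A minimal in-memory sketch of such a "set" with a per-list private flag might look like this. The entry layout (`count`, `has_private`) and the function name are assumptions for illustration only.

```python
def record_message(registry, list_id, private):
    """Record one archived message in a dict keyed by list ID.

    `has_private` is sticky: once any private message has been seen on a
    list, the flag stays True for that list.
    """
    entry = registry.setdefault(list_id, {"count": 0, "has_private": False})
    entry["count"] += 1
    entry["has_private"] = entry["has_private"] or bool(private)
    return entry
```

Note that a sticky flag like this would not be cleared if the private messages were later removed; handling that case would need a recount.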
   
   If `archiver.py` gets a new index to play with, that index should also be set up in `migrate.py` and `setup.py`.

