Posted to dev@ponymail.apache.org by GitBox <gi...@apache.org> on 2021/06/09 10:03:28 UTC

[GitHub] [incubator-ponymail-foal] sbp opened a new pull request #43: Add thread info capabilities to the archiver

sbp opened a new pull request #43:
URL: https://github.com/apache/incubator-ponymail-foal/pull/43


   This PR enables extra information about threads to be stored in elastic at archival time, rather than at UI time. In other words, instead of getting information about threads dynamically when people view a Foal-powered website, information about threads is added to a message as soon as it is archived.
   
   This mode is optional, and is enabled by setting `archiver.threadinfo` to `true` in the configuration file.
   
   This PR includes a script called `rethread.py` which can add the information to an existing database. This means that an email archive which does not currently have archival-time thread properties can be updated as a whole before changing how the archiver works.
   
   One drawback to this approach is that messages received out of order may confuse the archiver. The `rethread.py` script can be used to fix such problems, but it scans and modifies the entire database to do so. It may be possible to make the script fix problems more selectively and carefully, but there is probably no getting around doing a full rescan.
   
   The properties added to archived documents are `forum`, `previous`, `thread`, and `top`. The `forum` property is the email address of the mailing list, included as a handy alternative to `list_raw`. The `previous` property points to the mbox ID of either the parent message in the thread or, if the message is already the top of the thread, the top of the previous thread. The `thread` property points to the top of the current thread, which will be the same mbox ID as the current message if the current message is itself the top of the current thread. And the `top` property is a boolean indicating whether or not a message is the top of the current thread.
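As a rough illustration of the four properties described above, the following minimal sketch computes them for messages archived in order. It is not the code from this PR: `ThreadInfo`, `add_thread_info`, the in-memory `archive` and `last_top` dictionaries, and the address `dev@example.org` are all hypothetical names chosen for the example, and a real archiver would read and write these values in elastic instead.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ThreadInfo:
    forum: str               # email address of the mailing list
    previous: Optional[str]  # parent mbox ID, or the top of the previous thread
    thread: str              # mbox ID of the top of the current thread
    top: bool                # True if this message is the top of its thread


def add_thread_info(
    archive: Dict[str, ThreadInfo],
    last_top: Dict[str, str],
    mid: str,
    forum: str,
    parent: Optional[str] = None,
) -> ThreadInfo:
    """Compute and store the thread properties for one newly archived message.

    Hypothetical helper: `archive` maps mbox ID to ThreadInfo, and `last_top`
    maps each forum to the mbox ID of its most recent thread top.
    """
    parent_info = archive.get(parent) if parent is not None else None
    if parent_info is not None:
        # A reply: point at the parent and inherit the parent's thread top.
        info = ThreadInfo(forum, parent, parent_info.thread, False)
    else:
        # No known parent: the message starts a thread. Assumption of this
        # sketch: a parent that is not archived yet is treated as absent,
        # which is how out-of-order delivery could produce wrong values.
        info = ThreadInfo(forum, last_top.get(forum), mid, True)
        last_top[forum] = mid
    archive[mid] = info
    return info


archive: Dict[str, ThreadInfo] = {}
last_top: Dict[str, str] = {}
forum = "dev@example.org"
arrivals = [("a", None), ("b", "a"), ("c", "b"), ("d", None), ("e", "d"), ("f", "e")]
for mid, parent in arrivals:
    add_thread_info(archive, last_top, mid, forum, parent)
```

In this sketch the very first thread top on a list has no previous thread, so its `previous` value is `None`; how the actual archiver represents that case is not stated in this description.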
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail-foal] Humbedooh commented on pull request #43: Add thread info capabilities to the archiver

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #43:
URL: https://github.com/apache/incubator-ponymail-foal/pull/43#issuecomment-857857480


   > Would it not be sufficient to have a mapping between Message-Id and database-id?
   > This would not require scanning the entire database (except to index existing emails), and should eliminate most of the subsequent searching. It should be much cheaper than a couple of searches.
   
   Where would/should that mapping be stored?





[GitHub] [incubator-ponymail-foal] Humbedooh merged pull request #43: Add thread info capabilities to the archiver

Posted by GitBox <gi...@apache.org>.
Humbedooh merged pull request #43:
URL: https://github.com/apache/incubator-ponymail-foal/pull/43


   


To unsubscribe, e-mail: dev-unsubscribe@ponymail.apache.org



[GitHub] [incubator-ponymail-foal] sbp commented on pull request #43: Add thread info capabilities to the archiver

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #43:
URL: https://github.com/apache/incubator-ponymail-foal/pull/43#issuecomment-860530520


   The code is already being used. This is an upstream contribution.
   
   The reason for the implementation is that it made many inefficient query patterns efficient. If the PR is updated to use a separate archiver module and a separate table, does that not provide enough separation to assuage concerns about additional useful processing?





[GitHub] [incubator-ponymail-foal] Humbedooh commented on pull request #43: Add thread info capabilities to the archiver

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #43:
URL: https://github.com/apache/incubator-ponymail-foal/pull/43#issuecomment-857622093


   I am assuming that "previous" is either the "original" starting point of a thread, or an estimate of the thread that chronologically speaking preceded this particular email, such that if you click on whatever previous button we'd set up, you'd go further back into either this discussion - if it's a reply in a thread - or whatever discussion preceded this?





[GitHub] [incubator-ponymail-foal] sbp commented on pull request #43: Add thread info capabilities to the archiver

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #43:
URL: https://github.com/apache/incubator-ponymail-foal/pull/43#issuecomment-860531625


   To be clear, the separate archiver module would not be imported by the present module. It would be an entirely separate module that would have its own hook entry in the Mailman3 configuration. It could have its own stdin mode for non-Mailman3 configurations. And it would still be optional.





[GitHub] [incubator-ponymail-foal] sebbASF commented on pull request #43: Add thread info capabilities to the archiver

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #43:
URL: https://github.com/apache/incubator-ponymail-foal/pull/43#issuecomment-860563018


   I am only concerned here about the effect on the archiver process.
   If it is run as a separate process that should be fine.





[GitHub] [incubator-ponymail-foal] sebbASF commented on pull request #43: Add thread info capabilities to the archiver

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #43:
URL: https://github.com/apache/incubator-ponymail-foal/pull/43#issuecomment-857853965


   As to the suggested code, if it is to be implemented, it needs more documentation and some unit tests.





[GitHub] [incubator-ponymail-foal] sebbASF commented on pull request #43: Add thread info capabilities to the archiver

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #43:
URL: https://github.com/apache/incubator-ponymail-foal/pull/43#issuecomment-857852973


   I'm concerned that the code adds a couple of searches to the archiver, which is the most critical part of the system.
   
   Would it not be sufficient to have a mapping between Message-Id and database-id?
   This would not require scanning the entire database (except to index existing emails), and should eliminate most of the subsequent searching.  It should be much cheaper than a couple of searches.
   
   Also, if the database is missing some early emails that are part of a thread that has already been indexed, does the archiver cope? And/Or would the rethread.py script need to be run again?
   
   





[GitHub] [incubator-ponymail-foal] sbp commented on pull request #43: Add thread info capabilities to the archiver

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #43:
URL: https://github.com/apache/incubator-ponymail-foal/pull/43#issuecomment-859439603


   > I'm concerned that the code adds a couple of searches to the archiver, which is the most critical part of the system.
   
   Another possibility is that the properties added by this code could be added to a new separate table, e.g. `{dbname}_threads`. To avoid impacting archiver performance the code could even be added in a separate archiver module, with its own archiver hook in Mailman3 or separate stdin workflow. But note that the code is entirely optional, and has to be configured to be enabled. By default the code is not enabled.
   
   As for a mapping from `message-id` to database ID, elastic can already make such queries on the existing code. Indeed this query is already done in the code in this PR, in the `get_by_message_id` function in `archiver.py`. It's not clear why a table for this would be a good substitute for any of the other work that the code in this PR does, for example in parsing `in-reply-to` and `references` headers, nor even whether it would optimise `get_by_message_id` very much.





[GitHub] [incubator-ponymail-foal] sbp commented on pull request #43: Add thread info capabilities to the archiver

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #43:
URL: https://github.com/apache/incubator-ponymail-foal/pull/43#issuecomment-857650779


   Yep, exactly. If `a -> b` means that `b` is a reply to `a` then if you have two threads:
   
   ```
   8 Jun  a -> b -> c
   9 Jun  d -> e -> f
   ```
   
   Then the `previous` field of `f` is `e`, and the `previous` field of `e` is `d`. But the `previous` field of `d` is `a`.
   
   Because all messages have a `thread` field too, this can indeed be used to easily obtain the previous thread of any message. For example, the `previous` field of `f` is `e`, but the `thread` field is `d` because that's the top of its thread. Therefore to get the previous thread of `f` you'd use `f.thread` to get `d` and then `d.previous` to get `a`.
   
   Similarly, to e.g. get all replies to `e` you can just issue an elastic query for all messages whose `previous` field is set to `e`. Or to get all messages in a thread, just query for all messages whose `thread` field is the same as the top of the thread.
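The lookups described above can be sketched against an in-memory copy of the two example threads. This is only an illustration under stated assumptions: the `mid` key and the helpers `replies_to`, `thread_messages`, `previous_thread_top`, and `term_query` are hypothetical names, not functions from this PR. The `term_query` helper builds a standard Elasticsearch term query body, which assumes the fields are indexed so that exact matching works.

```python
from typing import Dict, List, Optional

# The two example threads: a -> b -> c (8 Jun) and d -> e -> f (9 Jun).
messages: List[Dict] = [
    {"mid": "a", "previous": None, "thread": "a", "top": True},
    {"mid": "b", "previous": "a", "thread": "a", "top": False},
    {"mid": "c", "previous": "b", "thread": "a", "top": False},
    {"mid": "d", "previous": "a", "thread": "d", "top": True},
    {"mid": "e", "previous": "d", "thread": "d", "top": False},
    {"mid": "f", "previous": "e", "thread": "d", "top": False},
]
by_id = {m["mid"]: m for m in messages}


def replies_to(mid: str) -> List[str]:
    """All direct replies to a message.

    Thread tops are excluded because, by the definitions above, the
    `previous` field of a top points at the top of the previous thread
    (for example `d.previous` is `a`, yet `d` is not a reply to `a`).
    """
    return [m["mid"] for m in messages if m["previous"] == mid and not m["top"]]


def thread_messages(top_mid: str) -> List[str]:
    """All messages whose `thread` field is the given thread top."""
    return [m["mid"] for m in messages if m["thread"] == top_mid]


def previous_thread_top(mid: str) -> Optional[str]:
    """Top of the thread preceding the one this message belongs to."""
    top = by_id[mid]["thread"]
    return by_id[top]["previous"]


def term_query(field: str, value: str) -> Dict:
    """Body of an exact-match Elasticsearch term query on one field."""
    return {"query": {"term": {field: value}}}
```

For instance, `replies_to("e")` selects `f`, `thread_messages("d")` selects `d`, `e`, and `f`, and `previous_thread_top("f")` follows `f.thread` to `d` and then `d.previous` to `a`. The equivalent elastic lookups would use bodies such as `term_query("previous", "e")` or `term_query("thread", "d")`.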
   





[GitHub] [incubator-ponymail-foal] sebbASF commented on pull request #43: Add thread info capabilities to the archiver

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #43:
URL: https://github.com/apache/incubator-ponymail-foal/pull/43#issuecomment-860226636


   There's no point providing it as an option if it is never going to be used, so I don't think it is wise to encumber the archiver code with additional processing that may cause issues.

