Posted to dev@ponymail.apache.org by Alex Harui <ah...@adobe.com> on 2016/06/10 00:04:15 UTC

Archive Editing Tool

Hi,

I'm interested in developing a tool to edit mailing list archives.  My
current idea is pretty simple: develop some code that can scan the
archives within a date range for a subject line, then replace a sequence
of characters with x's, and save it in the archives.  I'm new to how
archives are going to work in PonyMail so please be patient with me.  I
think I've learned that the archives PonyMail will store are not in mbox
format, but rather, as documents in ElasticSearch.  Is it one email per
document?
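
To make the idea concrete, here is a minimal sketch of the scan-and-replace step. It assumes each archived email has already been fetched as a Python dict; the keys 'subject', 'date' and 'body' and both function names are hypothetical, not PonyMail's actual document layout or API.

```python
from datetime import datetime

def redact(text, secret):
    """Replace every occurrence of `secret` with x's of the same length."""
    return text.replace(secret, "x" * len(secret))

def redact_archive(docs, subject, start, end, secret):
    """Redact `secret` in the body of matching documents, in place.

    `docs` is a list of dicts with the hypothetical keys 'subject',
    'date' (a datetime) and 'body'.  A document matches when its date
    lies in [start, end] and its subject contains `subject`.
    Returns the number of documents that were changed.
    """
    changed = 0
    for doc in docs:
        if not (start <= doc["date"] <= end):
            continue
        if subject not in doc["subject"]:
            continue
        new_body = redact(doc["body"], secret)
        if new_body != doc["body"]:
            doc["body"] = new_body
            changed += 1
    return changed
```

Fetching the documents and saving them back to the archive are left out on purpose, since those parts depend on how the archive is actually stored.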

I saw one suggestion to export the archive to mbox format, then (I assume)
modify the mbox file and re-import it.  That
seems a bit tricky if the thread is still active, or is that not a problem
with PonyMail?  Or would that perform better than going after the REST API?

I am also assuming that eventually all existing ASF archives will be
imported into PonyMail and thus I don't need to worry about a version that
modifies the current mbox files.  Let me know if that assumption is
questionable.

I was also warned that modifying an archived email might change its
permalink.  Can someone confirm?

And the higher-level topic for discussion is: would replacing with x's not
be sufficient for corner cases where we might want to edit the archives?
I'm assuming we can actually edit the subject and recipients' email addresses.

Thanks
-Alex


Re: Archive Editing Tool

Posted by Sam Ruby <ru...@apache.org>.
On 2016-06-09 20:04 (-0400), Alex Harui <ah...@adobe.com> wrote: 
> Hi,
> 
> I'm interested in developing a tool to edit mailing list archives.  My
> current idea is pretty simple: develop some code that can scan the
> archives within a date range for a subject line, then replace a sequence
> of characters with x's, and save it in the archives.  I'm new to how
> archives are going to work in PonyMail so please be patient with me.  I
> think I've learned that the archives PonyMail will store are not in mbox
> format, but rather, as documents in ElasticSearch.  Is it one email per
> document?

It is indeed.

> I saw one suggestion to export the archive to mbox format and I assume do
> the modification to the mbox file and then re-import the mbox file.  That
> seems a bit tricky if the thread is still active, or is that not a problem
> with PonyMail?  Or would that perform better than going after the REST API?

I think our wires crossed at some point.  I was responding to a different point (about mbox going away).  mbox is a long-established file format (the IETF describes it in RFC 4155) supported by bazillions of tools, and most programming languages have libraries which support this format.

So, mbox is not going away, but it is not likely to be relevant to the question of modifying archives stored in ElasticSearch.
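
As one illustration of that library support, Python ships a `mailbox` module in its standard library. The sketch below writes a few raw messages to an mbox file and reads the subjects back; the helper names `write_mbox` and `subjects_in_mbox` are made up for this example.

```python
import mailbox

def write_mbox(path, raw_messages):
    """Append raw RFC 5322 message strings to the mbox file at `path`."""
    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        for raw in raw_messages:
            box.add(raw)
        box.flush()
    finally:
        box.close()

def subjects_in_mbox(path):
    """Return the Subject header of every message in the mbox file."""
    box = mailbox.mbox(path)
    try:
        return [msg["Subject"] for msg in box]
    finally:
        box.close()
```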

> I am also assuming that eventually all existing ASF archives will be
> imported into PonyMail and thus I don't need to worry about a version that
> modifies the current mbox files.  Let me know if that assumption is
> questionable.

Reasonable assumption.

> I was also warned that modifying an archived email might change its
> permalink.  Can someone confirm?

Here's the code that generates ids used for indexing messages in ES:

https://github.com/apache/incubator-ponymail/blob/master/tools/archiver.py#L299

I guess it might be possible for a utility that updates a message to reuse the original message id, but it isn't clear to me whether that would cause breakage someplace else.
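
A small sketch of why the permalink question matters. The id scheme here is purely hypothetical (a hash over the list id and body) and is not a copy of what archiver.py does; it only shows that any id derived from message content changes when the content is edited, unless the update keeps the stored id.

```python
import hashlib

def content_id(list_id, body):
    """Hypothetical id scheme: hash of the list id plus the message body."""
    data = (list_id + "\n" + body).encode("utf-8")
    return hashlib.sha224(data).hexdigest() + "@" + list_id

def update_document(doc, new_body, keep_id=True):
    """Return an edited copy of `doc` (a dict with 'list', 'body', 'id').

    With keep_id=True the stored id, and so any link built from it, is
    left alone.  With keep_id=False the id is recomputed from the new body.
    """
    updated = dict(doc, body=new_body)
    if not keep_id:
        updated["id"] = content_id(doc["list"], new_body)
    return updated
```

Whether keeping the old id is safe in the real archive is exactly the open question above; the sketch does not answer it.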

> And the higher-level topic for discussion is: would replacing with x's not
> be sufficient for corner cases where we might want to edit the archives?
> I'm assuming we can actually edit subject and recipients email addresses.

As long as the replacement can be parsed as an email by software that understands mail formats, I think all is fair.  Given the complexity of mail formats, probably the way to do that is to use an existing library to parse an email, modify the result, and then use the library again to serialize the message.

To see an example of the underlying mail format, click on a "View Source" button on lists.apache.org.  It looks deceptively simple, but once you deal with non-ASCII characters, attachments, HTML-formatted email, ... it gets complicated pretty quickly.
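
The parse, modify, serialize approach could look roughly like this with Python's standard `email` package. This is a sketch, not a finished tool: the function name `redact_message` is made up, it only touches the Subject header and text/plain parts, and it leaves HTML parts and binary attachments alone.

```python
from email import message_from_bytes, policy

def redact_message(raw, secret):
    """Parse raw message bytes, mask `secret`, and return new message bytes."""
    msg = message_from_bytes(raw, policy=policy.default)
    mask = "x" * len(secret)

    # Let the library handle header decoding and re-encoding.
    subject = msg["Subject"]
    if subject is not None and secret in subject:
        msg.replace_header("Subject", subject.replace(secret, mask))

    # walk() visits the message itself and every nested MIME part.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            text = part.get_content()
            if secret in text:
                # set_content() replaces the payload and its Content-* headers.
                part.set_content(text.replace(secret, mask))

    return msg.as_bytes()
```

Because the library re-serializes the message, the output may differ from the input in details such as the declared charset or transfer encoding, even where no text was masked.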

> Thanks
> -Alex

- Sam Ruby

Re: Archive Editing Tool

Posted by "John D. Ament" <jo...@apache.org>.
Hi Alex,

You may be interested in this issue:
https://github.com/apache/incubator-ponymail/issues/59

AFAIK, all ASF archives are in ponymail, but that's more of an infra
question.

John
