Posted to dev@ponymail.apache.org by sebb <se...@gmail.com> on 2020/09/10 11:25:35 UTC

Migrated records should be identified

Migration to Foal will be a huge job for some installations.

Whilst hopefully all snags will have been ironed out of any conversion
tool before it is deployed in earnest, it's possible that some edge
cases will cause issues, and will need subsequent adjustment.

To this end, I think it will be essential to know which records have
been migrated, and which version of the software was used to do so (as
well as the date).

It may be worth including version and timestamp info in the direct
archive and imports as well.

One possible application would be to back-fill attachments which were
originally ignored.

S.

Re: Migrated records should be identified

Posted by sebb <se...@gmail.com>.

On Thu, 10 Sep 2020 at 13:46, Daniel Gruno <hu...@apache.org> wrote:
>
> On 10/09/2020 14.44, sebb wrote:
> > On Thu, 10 Sep 2020 at 13:23, Daniel Gruno <hu...@apache.org> wrote:
> >>
> >> On 10/09/2020 14.15, sebb wrote:
> >>> On Thu, 10 Sep 2020 at 12:32, Daniel Gruno <hu...@apache.org> wrote:
> >>>>
> >>>> On 10/09/2020 13.25, sebb wrote:
> >>>>> Migration to Foal will be a huge job for some installations.
> >>>>>
> >>>>> Whilst hopefully all snags will have been ironed out of any conversion
> >>>>> tool before it is deployed in earnest, it's possible that some edge
> >>>>> cases will cause issues, and will need subsequent adjustment.
> >>>>
> >>>> Short of ironing out a standard for DKIM_ID, the migration tests I've
> >>>> done have gone relatively well. There were IIRC a few snags, most
> >>>> related to the ES 7.8.1 lib, but once I got migration started, it worked
> >>>> as intended and everything on the new ES server was compatible. If we
> >>>> could somehow get a migration test running on travis or such, that would
> >>>> be ideal - but that is quite tricky - we'd have to maybe dockerize two
> >>>> containers - one with old pony, one with foal, and then test migrating
> >>>> across and checking that each document is obtainable.
> >>>
> >>> What tests are planned for checking migration?
> >>>
> >>>>>
> >>>>> To this end, I think it will be essential to know which records have
> >>>>> been migrated, and which version of the software was used to do so (as
> >>>>> well as the date).
> >>>>>
> >>>>> It may be worth including version and timestamp info in the direct
> >>>>> archive and imports as well.
> >>>>
> >>>> Do you mean adding a key/value to the migrated doc with a migration
> >>>> note? That wouldn't be a bad idea, if nothing else, to keep score of
> >>>> what was migrated and what's new.
> >>>
> >>> Something like that.
> >>>
> >>> I think the data needs to be flexible and allow for multiple notes.
> >>> It won't always be sufficient to record the last change to the data.
> >>
> >> Yes, one wondrous thing about ES is a text field can be both text or an
> >> array of texts, so you can have one note or multiple notes, and it'll
> >> just work. I'm thinking of just having a "notes" field where we can put
> >> entries.
> >
> > Does that automatically append new entries, or does the user have to
> > amend the record to ensure previous entries are not lost?
>
> What I do right now is fetch the doc, ensure 'notes' is a list, then
> append new notes to it and save the entire doc.

i.e. care must be taken not to lose existing info.

> >
> > It would probably still be useful to have some fixed attributes such as
> > -archived-at
> > -imported-at
>
> That would be for archiver.py and import-mbox.py?

Yes, probably also need
-migrated-at
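
As a rough illustration of such fixed attributes, the sketch below stamps a
document (held as a Python dict) with one of the three timestamps together
with the software version. The "-version" companion fields and the stamp()
helper are assumptions made for this example, not an agreed schema.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Timestamp attributes suggested in this thread, mapped to hypothetical
# companion fields that record the software version.
VERSION_FIELD = {
    "archived-at": "archived-version",
    "imported-at": "imported-version",
    "migrated-at": "migrated-version",
}


def stamp(doc: Dict[str, Any], event: str, version: str,
          when: Optional[datetime] = None) -> Dict[str, Any]:
    """Return a copy of doc with the event timestamp and version recorded."""
    if event not in VERSION_FIELD:
        raise ValueError(f"unknown event: {event}")
    if when is None:
        when = datetime.now(timezone.utc)
    updated = dict(doc)  # shallow copy; the caller's dict is left untouched
    updated[event] = when.isoformat()
    updated[VERSION_FIELD[event]] = version
    return updated
```

Keeping these as separate fixed fields, rather than free-text notes, should
make it straightforward to query for everything migrated by a given version.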

> >
> >>>
> >>>>>
> >>>>> One possible application would be to back-fill attachments which were
> >>>>> originally ignored.
> >>>>
> >>>> This could be run as a background re-indexer perhaps? That grabs the
> >>>> source document, re-parses attachments, and if it contained more than
> >>>> originally thought, add them and update the email document.
> >>>
> >>> Yes, and marks the document somehow so it does not need to be scanned again.
> >>>
> >>> This is where the change context comes in.
> >>> If we knew which documents were created with which version of
> >>> software, it would be possible to know which ones did not need
> >>> processing.
> >>>
> >>>>>
> >>>>> S.
> >>>>>
> >>>>
> >>
>

Re: Migrated records should be identified

Posted by Daniel Gruno <hu...@apache.org>.

On 10/09/2020 14.44, sebb wrote:
> On Thu, 10 Sep 2020 at 13:23, Daniel Gruno <hu...@apache.org> wrote:
>>
>> On 10/09/2020 14.15, sebb wrote:
>>> On Thu, 10 Sep 2020 at 12:32, Daniel Gruno <hu...@apache.org> wrote:
>>>>
>>>> On 10/09/2020 13.25, sebb wrote:
>>>>> Migration to Foal will be a huge job for some installations.
>>>>>
>>>>> Whilst hopefully all snags will have been ironed out of any conversion
>>>>> tool before it is deployed in earnest, it's possible that some edge
>>>>> cases will cause issues, and will need subsequent adjustment.
>>>>
>>>> Short of ironing out a standard for DKIM_ID, the migration tests I've
>>>> done have gone relatively well. There were IIRC a few snags, most
>>>> related to the ES 7.8.1 lib, but once I got migration started, it worked
>>>> as intended and everything on the new ES server was compatible. If we
>>>> could somehow get a migration test running on travis or such, that would
>>>> be ideal - but that is quite tricky - we'd have to maybe dockerize two
>>>> containers - one with old pony, one with foal, and then test migrating
>>>> across and checking that each document is obtainable.
>>>
>>> What tests are planned for checking migration?
>>>
>>>>>
>>>>> To this end, I think it will be essential to know which records have
>>>>> been migrated, and which version of the software was used to do so (as
>>>>> well as the date).
>>>>>
>>>>> It may be worth including version and timestamp info in the direct
>>>>> archive and imports as well.
>>>>
>>>> Do you mean adding a key/value to the migrated doc with a migration
>>>> note? That wouldn't be a bad idea, if nothing else, to keep score of
>>>> what was migrated and what's new.
>>>
>>> Something like that.
>>>
>>> I think the data needs to be flexible and allow for multiple notes.
>>> It won't always be sufficient to record the last change to the data.
>>
>> Yes, one wondrous thing about ES is a text field can be both text or an
>> array of texts, so you can have one note or multiple notes, and it'll
>> just work. I'm thinking of just having a "notes" field where we can put
>> entries.
> 
> Does that automatically append new entries, or does the user have to
> amend the record to ensure previous entries are not lost?

What I do right now is fetch the doc, ensure 'notes' is a list, then 
append new notes to it and save the entire doc.
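
A minimal sketch of that fetch/append/save step, assuming the fetched
document is a plain Python dict. append_note() is a hypothetical helper, not
an existing Pony Mail function; fetching from and saving to ES is left to
the caller.

```python
from typing import Any, Dict, List


def append_note(doc: Dict[str, Any], note: str) -> Dict[str, Any]:
    """Return a copy of doc with note appended to its 'notes' list."""
    existing = doc.get("notes")
    if existing is None:
        notes: List[str] = []          # no notes recorded yet
    elif isinstance(existing, str):
        notes = [existing]             # a single note stored as plain text
    else:
        notes = list(existing)         # already a list; copy before changing
    notes.append(note)
    updated = dict(doc)                # shallow copy of the remaining fields
    updated["notes"] = notes
    return updated
```

Since the entire doc is saved back, two writers that fetch the same doc at
the same time could still overwrite each other's notes; the sketch does not
try to address that.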

> 
> It would probably still be useful to have some fixed attributes such as
> -archived-at
> -imported-at

That would be for archiver.py and import-mbox.py?

> 
>>>
>>>>>
>>>>> One possible application would be to back-fill attachments which were
>>>>> originally ignored.
>>>>
>>>> This could be run as a background re-indexer perhaps? That grabs the
>>>> source document, re-parses attachments, and if it contained more than
>>>> originally thought, add them and update the email document.
>>>
>>> Yes, and marks the document somehow so it does not need to be scanned again.
>>>
>>> This is where the change context comes in.
>>> If we knew which documents were created with which version of
>>> software, it would be possible to know which ones did not need
>>> processing.
>>>
>>>>>
>>>>> S.
>>>>>
>>>>
>>

Re: Migrated records should be identified

Posted by sebb <se...@gmail.com>.

On Thu, 10 Sep 2020 at 13:23, Daniel Gruno <hu...@apache.org> wrote:
>
> On 10/09/2020 14.15, sebb wrote:
> > On Thu, 10 Sep 2020 at 12:32, Daniel Gruno <hu...@apache.org> wrote:
> >>
> >> On 10/09/2020 13.25, sebb wrote:
> >>> Migration to Foal will be a huge job for some installations.
> >>>
> >>> Whilst hopefully all snags will have been ironed out of any conversion
> >>> tool before it is deployed in earnest, it's possible that some edge
> >>> cases will cause issues, and will need subsequent adjustment.
> >>
> >> Short of ironing out a standard for DKIM_ID, the migration tests I've
> >> done have gone relatively well. There were IIRC a few snags, most
> >> related to the ES 7.8.1 lib, but once I got migration started, it worked
> >> as intended and everything on the new ES server was compatible. If we
> >> could somehow get a migration test running on travis or such, that would
> >> be ideal - but that is quite tricky - we'd have to maybe dockerize two
> >> containers - one with old pony, one with foal, and then test migrating
> >> across and checking that each document is obtainable.
> >
> > What tests are planned for checking migration?
> >
> >>>
> >>> To this end, I think it will be essential to know which records have
> >>> been migrated, and which version of the software was used to do so (as
> >>> well as the date).
> >>>
> >>> It may be worth including version and timestamp info in the direct
> >>> archive and imports as well.
> >>
> >> Do you mean adding a key/value to the migrated doc with a migration
> >> note? That wouldn't be a bad idea, if nothing else, to keep score of
> >> what was migrated and what's new.
> >
> > Something like that.
> >
> > I think the data needs to be flexible and allow for multiple notes.
> > It won't always be sufficient to record the last change to the data.
>
> Yes, one wondrous thing about ES is a text field can be both text or an
> array of texts, so you can have one note or multiple notes, and it'll
> just work. I'm thinking of just having a "notes" field where we can put
> entries.

Does that automatically append new entries, or does the user have to
amend the record to ensure previous entries are not lost?

It would probably still be useful to have some fixed attributes such as
-archived-at
-imported-at

> >
> >>>
> >>> One possible application would be to back-fill attachments which were
> >>> originally ignored.
> >>
> >> This could be run as a background re-indexer perhaps? That grabs the
> >> source document, re-parses attachments, and if it contained more than
> >> originally thought, add them and update the email document.
> >
> > Yes, and marks the document somehow so it does not need to be scanned again.
> >
> > This is where the change context comes in.
> > If we knew which documents were created with which version of
> > software, it would be possible to know which ones did not need
> > processing.
> >
> >>>
> >>> S.
> >>>
> >>
>

Re: Migrated records should be identified

Posted by Daniel Gruno <hu...@apache.org>.

On 10/09/2020 14.15, sebb wrote:
> On Thu, 10 Sep 2020 at 12:32, Daniel Gruno <hu...@apache.org> wrote:
>>
>> On 10/09/2020 13.25, sebb wrote:
>>> Migration to Foal will be a huge job for some installations.
>>>
>>> Whilst hopefully all snags will have been ironed out of any conversion
>>> tool before it is deployed in earnest, it's possible that some edge
>>> cases will cause issues, and will need subsequent adjustment.
>>
>> Short of ironing out a standard for DKIM_ID, the migration tests I've
>> done have gone relatively well. There were IIRC a few snags, most
>> related to the ES 7.8.1 lib, but once I got migration started, it worked
>> as intended and everything on the new ES server was compatible. If we
>> could somehow get a migration test running on travis or such, that would
>> be ideal - but that is quite tricky - we'd have to maybe dockerize two
>> containers - one with old pony, one with foal, and then test migrating
>> across and checking that each document is obtainable.
> 
> What tests are planned for checking migration?
> 
>>>
>>> To this end, I think it will be essential to know which records have
>>> been migrated, and which version of the software was used to do so (as
>>> well as the date).
>>>
>>> It may be worth including version and timestamp info in the direct
>>> archive and imports as well.
>>
>> Do you mean adding a key/value to the migrated doc with a migration
>> note? That wouldn't be a bad idea, if nothing else, to keep score of
>> what was migrated and what's new.
> 
> Something like that.
> 
> I think the data needs to be flexible and allow for multiple notes.
> It won't always be sufficient to record the last change to the data.

Yes, one wondrous thing about ES is a text field can be both text or an 
array of texts, so you can have one note or multiple notes, and it'll 
just work. I'm thinking of just having a "notes" field where we can put 
entries.

> 
>>>
>>> One possible application would be to back-fill attachments which were
>>> originally ignored.
>>
>> This could be run as a background re-indexer perhaps? That grabs the
>> source document, re-parses attachments, and if it contained more than
>> originally thought, add them and update the email document.
> 
> Yes, and marks the document somehow so it does not need to be scanned again.
> 
> This is where the change context comes in.
> If we knew which documents were created with which version of
> software, it would be possible to know which ones did not need
> processing.
> 
>>>
>>> S.
>>>
>>

Re: Migrated records should be identified

Posted by sebb <se...@gmail.com>.

On Thu, 10 Sep 2020 at 12:32, Daniel Gruno <hu...@apache.org> wrote:
>
> On 10/09/2020 13.25, sebb wrote:
> > Migration to Foal will be a huge job for some installations.
> >
> > Whilst hopefully all snags will have been ironed out of any conversion
> > tool before it is deployed in earnest, it's possible that some edge
> > cases will cause issues, and will need subsequent adjustment.
>
> Short of ironing out a standard for DKIM_ID, the migration tests I've
> done have gone relatively well. There were IIRC a few snags, most
> related to the ES 7.8.1 lib, but once I got migration started, it worked
> as intended and everything on the new ES server was compatible. If we
> could somehow get a migration test running on travis or such, that would
> be ideal - but that is quite tricky - we'd have to maybe dockerize two
> containers - one with old pony, one with foal, and then test migrating
> across and checking that each document is obtainable.

What tests are planned for checking migration?

> >
> > To this end, I think it will be essential to know which records have
> > been migrated, and which version of the software was used to do so (as
> > well as the date).
> >
> > It may be worth including version and timestamp info in the direct
> > archive and imports as well.
>
> Do you mean adding a key/value to the migrated doc with a migration
> note? That wouldn't be a bad idea, if nothing else, to keep score of
> what was migrated and what's new.

Something like that.

I think the data needs to be flexible and allow for multiple notes.
It won't always be sufficient to record the last change to the data.

> >
> > One possible application would be to back-fill attachments which were
> > originally ignored.
>
> This could be run as a background re-indexer perhaps? That grabs the
> source document, re-parses attachments, and if it contained more than
> originally thought, add them and update the email document.

Yes, and marks the document somehow so it does not need to be scanned again.

This is where the change context comes in.
If we knew which documents were created with which version of
software, it would be possible to know which ones did not need
processing.
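
To make that concrete, here is a small sketch of such a check. It assumes
each document records the version that produced it in a field named
"migrated-version" (a hypothetical name) and that versions are dotted
numbers; neither has been agreed.

```python
from typing import Any, Dict, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '1.10.0' into (1, 10, 0)."""
    return tuple(int(part) for part in version.split("."))


def needs_reprocessing(doc: Dict[str, Any], fixed_in: str,
                       field: str = "migrated-version") -> bool:
    """True if doc was written by software older than fixed_in.

    Documents with no recorded version are treated as needing processing,
    since nothing is known about how they were created.
    """
    recorded = doc.get(field)
    if recorded is None:
        return True
    return parse_version(recorded) < parse_version(fixed_in)
```

Comparing tuples of integers rather than raw strings avoids "1.10.0"
sorting before "1.2.0".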

> >
> > S.
> >
>

Re: Migrated records should be identified

Posted by Daniel Gruno <hu...@apache.org>.

On 10/09/2020 13.25, sebb wrote:
> Migration to Foal will be a huge job for some installations.
> 
> Whilst hopefully all snags will have been ironed out of any conversion
> tool before it is deployed in earnest, it's possible that some edge
> cases will cause issues, and will need subsequent adjustment.

Short of ironing out a standard for DKIM_ID, the migration tests I've 
done have gone relatively well. There were IIRC a few snags, most 
related to the ES 7.8.1 lib, but once I got migration started, it worked 
as intended and everything on the new ES server was compatible. If we 
could somehow get a migration test running on travis or such, that would 
be ideal - but that is quite tricky - we'd have to maybe dockerize two 
containers - one with old pony, one with foal, and then test migrating 
across and checking that each document is obtainable.
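
The final "each document is obtainable" check could be sketched roughly as
below. find_missing() and the is_obtainable callback are hypothetical names
for this example; in a real test the callback would query the new ES server,
which is not shown here.

```python
from typing import Callable, Iterable, List


def find_missing(old_ids: Iterable[str],
                 is_obtainable: Callable[[str], bool]) -> List[str]:
    """Return the ids from the old installation that the new one cannot serve."""
    return [doc_id for doc_id in old_ids if not is_obtainable(doc_id)]
```

An empty result would mean every document id from old pony could be fetched
from foal.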

> 
> To this end, I think it will be essential to know which records have
> been migrated, and which version of the software was used to do so (as
> well as the date).
> 
> It may be worth including version and timestamp info in the direct
> archive and imports as well.

Do you mean adding a key/value to the migrated doc with a migration 
note? That wouldn't be a bad idea, if nothing else, to keep score of 
what was migrated and what's new.

> 
> One possible application would be to back-fill attachments which were
> originally ignored.

This could be run as a background re-indexer perhaps? That grabs the 
source document, re-parses attachments, and if it contained more than 
originally thought, add them and update the email document.
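
A rough sketch of the re-parse step using only the Python standard library.
It assumes the source is available as raw bytes and that the email document
lists its attachments under "attachments" with a "filename" key per entry;
those field names and both helper functions are assumptions for this
example, and only parts marked Content-Disposition: attachment are counted.

```python
from email import message_from_bytes, policy
from typing import Any, Dict, List


def attachment_names(raw: bytes) -> List[str]:
    """List the filenames of all attachment parts found in a raw message."""
    msg = message_from_bytes(raw, policy=policy.default)
    return [part.get_filename() or ""
            for part in msg.walk() if part.is_attachment()]


def missing_attachments(raw: bytes, doc: Dict[str, Any]) -> List[str]:
    """Return attachment filenames present in the source but not in doc."""
    known = {att.get("filename") for att in doc.get("attachments") or []}
    return [name for name in attachment_names(raw) if name not in known]
```

A re-indexer would then only need to update documents for which
missing_attachments() returns a non-empty list.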

> 
> S.
>