Posted to server-dev@james.apache.org by "René Cordier (Jira)" <se...@james.apache.org> on 2020/06/12 10:26:00 UTC

[jira] [Updated] (JAMES-3202) ReIndexing "filtering" for only outdated indexed data

     [ https://issues.apache.org/jira/browse/JAMES-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

René Cordier updated JAMES-3202:
--------------------------------
    Description: 
*Why?*

ReIndexing can be slow: it requires reading all messages in the DB and then triggering the full reIndexing, even when the document is not outdated.

All these document changes create a lot of deleted documents. Lucene "marks them as deleted", polluting the entire index until segment merging happens (yet another costly operation). The fewer updates we do, the better. Note that a partial update still leads to a full new document in Lucene; it only optimises bandwidth and avoids reads.

*Need specification*

As an admin, I want to run a reIndex.

We furthermore handle `RunningOptions`, which allow specifying the attempted message rate. See [https://github.com/linagora/james-project/pull/3394]

We still need, given a message, to get its search index representation (at least for its mutable data). From this we will be able to restrict reindexing to outdated or non-existing data, significantly speeding up the reindexing process on mostly valid indexes. The admin could then request this option via a query parameter (carried over in the running options).
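
A minimal sketch of the condition described above, assuming flags are the only mutable data being compared. `OutdatedIndexFilter` and `needsReindex` are hypothetical names used for illustration, not existing James classes, and flags are modelled as plain strings rather than the real `Flags` type:

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of James: decides whether a message needs reindexing.
final class OutdatedIndexFilter {
    private OutdatedIndexFilter() {}

    /**
     * @param indexedFlags flags currently held by the search index, empty when no document exists
     * @param storedFlags  flags read from the mailbox store (the source of truth)
     * @return true when the document is missing or its flags differ from the stored ones
     */
    static boolean needsReindex(Optional<Set<String>> indexedFlags, Set<String> storedFlags) {
        // Missing document: must be indexed. Present document: reindex only if flags differ.
        return indexedFlags.map(flags -> !flags.equals(storedFlags)).orElse(true);
    }

    public static void main(String[] args) {
        Set<String> stored = Set.of("\\Seen", "\\Flagged");
        System.out.println(needsReindex(Optional.empty(), stored));              // non-existing document
        System.out.println(needsReindex(Optional.of(Set.of("\\Seen")), stored)); // outdated document
        System.out.println(needsReindex(Optional.of(stored), stored));           // up-to-date document
    }
}
```

Only messages for which `needsReindex` returns true would be handed to the indexer, so a mostly valid index would see few writes.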

*MessageSearchIndex API changes*:
{code:java}
interface MessageSearchIndex {
   //...
   Mono<Flags> retrieveIndexedFlags(MailboxId mailboxId, MessageUid uid);
   //...
}
{code}
ElasticSearch will rely on the _GET_ verb (not search).

Unit tests will be written for this new method.

ReIndexing `RunningOptions` will then carry this option, which ReIndexerPerformer will need to take into account.

Sample webadmin API:
{code:bash}
curl -XPOST 'http://james:8000/mailboxes?action=reindex&filter=outdatedIndex'
{code}
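
As an illustration of how the `filter` query parameter from the sample call could be carried in the running options, here is a sketch under stated assumptions. `RunningOptionsSketch`, its `Mode` enum, the `messagesPerSecond` parameter name and the default rate of 50 are all hypothetical and do not describe the actual `RunningOptions` class:

```java
import java.util.Map;

// Hypothetical stand-in for RunningOptions; field and parameter names are assumptions.
final class RunningOptionsSketch {
    enum Mode { REINDEX_ALL, OUTDATED_INDEX_ONLY }

    static final int DEFAULT_MESSAGES_PER_SECOND = 50; // arbitrary default for this sketch

    final int messagesPerSecond;
    final Mode mode;

    RunningOptionsSketch(int messagesPerSecond, Mode mode) {
        if (messagesPerSecond <= 0) {
            throw new IllegalArgumentException("messagesPerSecond must be strictly positive");
        }
        this.messagesPerSecond = messagesPerSecond;
        this.mode = mode;
    }

    // Builds the options from already-decoded query parameters.
    static RunningOptionsSketch fromQueryParams(Map<String, String> params) {
        String rate = params.get("messagesPerSecond");
        int messagesPerSecond = rate == null ? DEFAULT_MESSAGES_PER_SECOND : Integer.parseInt(rate);
        return new RunningOptionsSketch(messagesPerSecond, parseMode(params.get("filter")));
    }

    // No filter means a full reindex; unknown values are rejected rather than ignored.
    private static Mode parseMode(String filter) {
        if (filter == null) {
            return Mode.REINDEX_ALL;
        }
        if (filter.equals("outdatedIndex")) {
            return Mode.OUTDATED_INDEX_ONLY;
        }
        throw new IllegalArgumentException("Unknown filter: " + filter);
    }

    public static void main(String[] args) {
        RunningOptionsSketch options =
            fromQueryParams(Map.of("action", "reindex", "filter", "outdatedIndex"));
        System.out.println(options.mode + " at " + options.messagesPerSecond + " msg/s");
    }
}
```

Rejecting unknown filter values keeps a typo in the query string from silently triggering a full reindex.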

  was:
*Why?*

ReIndexing is slow: it requires reading all messages in the DB and then triggering the full reIndexing, even when the document is not outdated.

All these document changes create a lot of deleted documents. Lucene "marks them as deleted", polluting the entire index until segment merging happens (yet another costly operation). The fewer updates we do, the better. Note that a partial update still leads to a full new document in Lucene; it only optimises bandwidth and avoids reads.

*Need specification*

As an admin, I want to run a reIndex. The current sequential reactive reindexing reaches a speed of `21 messages/second` on UPN, below the mentioned objective of `1,000 msg/s`.

We furthermore handle `RunningOptions`, which allow specifying the attempted message rate. See [https://github.com/linagora/james-project/pull/3394]

While this enables more parallelization, we doubt that UPN can keep up with the mentioned rate, given the current search index limitation.

We thus need, given a message, to get its search index representation (at least for its mutable data). From this we will be able to restrict reindexing to outdated or non-existing data, significantly speeding up the reindexing process on mostly valid indexes. The admin could then request this option via a query parameter (carried over in the running options).

*MessageSearchIndex API changes*:
{code:java}
interface MessageSearchIndex {
   //...
   Mono<Flags> retrieveIndexedFlags(MailboxId mailboxId, MessageUid uid);
   //...
}
{code}
ElasticSearch will rely on the _GET_ verb (not search).

Unit tests will be written for this new method.

ReIndexing `RunningOptions` will then carry this option, which ReIndexerPerformer will need to take into account.

Sample webadmin API:
{code:bash}
curl -XPOST 'http://james:8000/mailboxes?action=reindex&filter=outdatedIndex'
{code}


> ReIndexing "filtering" for only outdated indexed data
> -----------------------------------------------------
>
>                 Key: JAMES-3202
>                 URL: https://issues.apache.org/jira/browse/JAMES-3202
>             Project: James Server
>          Issue Type: Improvement
>            Reporter: René Cordier
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
