Posted to oak-dev@jackrabbit.apache.org by Thomas Mueller <mu...@adobe.com> on 2015/10/21 10:54:53 UTC

Reindexing problems

Hi,

If an index provider is (temporarily) not available, the MissingIndexProviderStrategy resets the index so it is re-indexed. This is a problem (OAK-2024, OAK-2203, OAK-2429, OAK-3325, OAK-3366, OAK-3505, OAK-3512, OAK-3513), because re-indexing is slow and runs as one transaction. It can also cause many threads to build the index concurrently. Currently, synchronous indexes are built in one "transaction", which is a performance problem anyway (for new indexes and reindexing). If an index is not available when running a query, traversal is used, which is also a problem.

What about:

* (a) Hardcode (not rely on the Whiteboard or OSGi) the known indexes: property, reference, nodeType, lucene, and counter. This is for both writing (IndexEditor) and reading (QueryIndex). That way, those indexes are always available, and we never get into a situation where they are temporarily not available.

* (b) Where we can't use hardcoding, use hard service references (Whiteboard / OSGi).

* (c) If we can't do that, block or fail commits if one of the configured indexes is not available, for example for the Solr index (if such an index is configured).
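To illustrate option (c), here is a minimal sketch of a pre-commit check that rejects a commit when a configured index type has no provider, instead of resetting the index for re-indexing. This is not Oak code: the class name, the exception, and the idea of representing index types as plain strings are all assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch of option (c); none of these names are Oak APIs. */
final class IndexAvailabilityCheck {

    /** Thrown to fail the commit rather than silently triggering a reindex. */
    static final class MissingIndexProviderException extends RuntimeException {
        MissingIndexProviderException(String message) {
            super(message);
        }
    }

    /**
     * @param configuredTypes index types referenced by index definitions, e.g. "property", "solr"
     * @param availableTypes  index types for which a provider is currently registered
     */
    static void checkBeforeCommit(Set<String> configuredTypes, Set<String> availableTypes) {
        // Collect every configured type that has no provider right now.
        Set<String> missing = new LinkedHashSet<>(configuredTypes);
        missing.removeAll(availableTypes);
        if (!missing.isEmpty()) {
            throw new MissingIndexProviderException(
                    "Commit rejected, no provider for index type(s): " + missing);
        }
    }
}
```

Blocking instead of failing would replace the exception with a wait until the provider shows up; which of the two is preferable is exactly the open question in (c).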

Additionally, for "synchronous" indexes (property index and so on), I would like to always create and reindex them asynchronously by default, and only once they are available switch to synchronous mode. I think (but I'm not sure) this is OAK-1456.

What do you think?

Regards,
Thomas

Re: Reindexing problems

Posted by Thomas Mueller <mu...@adobe.com>.

OK, I think we (kind of) agree on how to ensure important indexes are
available.

>>Additionally, for "synchronous" indexes (property index and so on), I
>>would like to always create and reindex them asynchronously by default,

OK, I see that large branches are a problem.

Instead of using branches, what about:

* First switch the index to "building in progress" so that _queries_ don't
  use it.

* Build the index in multiple commits:
  - Traverse the repository, and
  - as soon as you have 1000 index changes in memory, commit them.
  - Then continue to traverse, in a new transaction,
  - until the repository is fully traversed.
* Concurrent changes would update the index as normal.
* At the end of the "index creation traversal", switch the index to "ready".
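The steps above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the repository is modelled as an Iterable of paths, the index as an in-memory list, and each commit() call stands in for one small transaction. All names are hypothetical, not Oak APIs, and concurrent changes are not modelled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of building an index in multiple small commits. */
final class BatchedReindexSketch {

    enum IndexState { BUILDING, READY }

    /** Stand-in for the persisted index; each commit() models one transaction. */
    static final class Index {
        IndexState state = IndexState.READY;
        final List<String> entries = new ArrayList<>();
        int commitCount = 0;

        void commit(List<String> batch) {
            entries.addAll(batch);
            commitCount++;
        }
    }

    static void reindex(Iterable<String> repositoryPaths, Index index, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        // Mark the index as "building in progress" so queries do not use it.
        index.state = IndexState.BUILDING;
        List<String> pending = new ArrayList<>();
        for (String path : repositoryPaths) {
            pending.add(path);
            if (pending.size() >= batchSize) {
                index.commit(pending);
                // Continue the traversal in a new transaction.
                pending = new ArrayList<>();
            }
        }
        // Commit whatever is left once the traversal is complete.
        if (!pending.isEmpty()) {
            index.commit(pending);
        }
        index.state = IndexState.READY;
    }
}
```

With a batch size of 1000, no single commit carries more than 1000 index changes, which is the point of avoiding one large branch.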

Regards,
Thomas

Re: Reindexing problems

Posted by Chetan Mehrotra <ch...@gmail.com>.

> (a) Hardcode (not rely on the Whiteboard or OSGi) the known indexes

That would not work if the implementation makes use of OSGi features
like configuration or DI. For example, the Lucene implementation relies
on OSGi config and also uses OSGi to expose certain extension points.

> (b) Where we can't use hardcoding, use hard service references (Whiteboard / OSGi).

+1. That would be preferable. I think we can go for the approach taken
in OAK-3201, as depending on the setup even a custom implementation
might be required. So hard references alone would not help, and we would
need to make the component which registers the repository aware of all
its *required* dependencies.
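As a rough illustration of that idea (not the OAK-3201 implementation, which is not shown in this thread), the registering component could track which required providers are present and only register the repository once all of them are. The class and method names below are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical gate: the repository may be registered only when all required providers exist. */
final class RequiredProviderGate {

    private final Set<String> required;
    private final Set<String> present = new HashSet<>();

    RequiredProviderGate(Set<String> required) {
        // Defensive copy so later changes by the caller do not affect the gate.
        this.required = new HashSet<>(required);
    }

    /** Called when a provider of the given type becomes available. */
    synchronized void providerAdded(String type) {
        present.add(type);
    }

    /** Called when a provider of the given type goes away. */
    synchronized void providerRemoved(String type) {
        present.remove(type);
    }

    /** True only while every required provider type is present. */
    synchronized boolean repositoryMayBeRegistered() {
        return present.containsAll(required);
    }
}
```

In an OSGi setup the add/remove callbacks would presumably be driven by service tracking, and the list of required types would come from configuration so that custom implementations can be included.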

>  (c) If we can't do that, block or fail commits if one of the configured indexes is not available, for example for the Solr index (if such an index is configured).

+1. The current approach is problematic. A missing index provider is
more of a setup issue which can be addressed by the system admin, and
the repository should not try to handle that. So failing the commit
should be fine.

> Additionally, for "synchronous" indexes (property index and so on), I would like to always create and reindex them asynchronously by default,

That might be tricky for DocumentNodeStore: even if you build them
asynchronously, the final merge might be very expensive because it has
to deal with such a large branch commit. Also, if a critical index like
the uuid/reference index is not available, it would be better if the
system does not start; otherwise it would trigger a large traversal if
no index was present or the previous revision of the index is not usable
(due to some corruption).

Chetan Mehrotra

