Posted to oak-issues@jackrabbit.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2016/06/27 07:47:52 UTC

[jira] [Comment Edited] (OAK-4412) Lucene hybrid index

    [ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350537#comment-15350537 ] 

Chetan Mehrotra edited comment on OAK-4412 at 6/27/16 7:46 AM:
---------------------------------------------------------------

The approach proposed here is a bit different, so I will try to explain it again.

Oak provides MVCC storage, and currently Lucene indexes are updated on one of the cluster nodes via an asynchronous task. This task compares the NodeStates of the last indexed state and the current head state and indexes the changed content. Once done, the updated index state is committed. This change is then picked up by the other cluster nodes, which update their local copy of the index to the latest state.
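
To make the diff-based update concrete, here is a minimal sketch in plain Java. It is not Oak's API: the class {{AsyncIndexSketch}}, its methods, and the use of a path-to-text map in place of NodeStates and a Lucene index are all assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the async index update; none of these names come from Oak.
class AsyncIndexSketch {
    // Index content: path -> indexed text. Stands in for the persisted Lucene index.
    final Map<String, String> index = new HashMap<>();
    // Snapshot the index was last brought up to date with (the "last indexed state").
    Map<String, String> lastIndexed = new HashMap<>();

    // Paths whose content differs between two snapshots (added, modified or deleted).
    static Set<String> changedPaths(Map<String, String> before, Map<String, String> after) {
        Set<String> all = new HashSet<>(before.keySet());
        all.addAll(after.keySet());
        Set<String> changed = new TreeSet<>();
        for (String path : all) {
            if (!Objects.equals(before.get(path), after.get(path))) {
                changed.add(path);
            }
        }
        return changed;
    }

    // One run of the async task: index only the delta, then record the new checkpoint.
    Set<String> runOnce(Map<String, String> head) {
        Set<String> changed = changedPaths(lastIndexed, head);
        for (String path : changed) {
            String text = head.get(path);
            if (text == null) {
                index.remove(path);      // node was deleted since the last run
            } else {
                index.put(path, text);   // node was added or modified
            }
        }
        lastIndexed = new HashMap<>(head);
        return changed;
    }

    public static void main(String[] args) {
        AsyncIndexSketch task = new AsyncIndexSketch();
        Map<String, String> head = new HashMap<>();
        head.put("/content/a", "hello");
        System.out.println("indexed on first run: " + task.runOnce(head));
        head.put("/content/b", "world");
        System.out.println("indexed on second run: " + task.runOnce(head));
    }
}
```

Anything written after a run has finished stays invisible to queries until the next run, which is the lag discussed here.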

Due to the async nature of this task, the index state lags behind the latest repository state.

To address this lag, the hybrid index approach is proposed. Under this approach:

# A periodic task keeps the persisted Lucene index (L0) up to date as explained before. Let's say the index is up to date till revision R0-1, i.e. the index content matches the repository state at revision R0-1
# Now let's say that for a cluster node N1 the current head is R1-1. Here R1-1 > R0-1.
# On each cluster node we would have an observer which maintains a "local" index (L1) and indexes "only" the content changed between R0-1 and R1-1, i.e. the content changed between that node's current local head and the last indexed state for L0
# Any query would then be run against both L0 and L1 and the results would be joined
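
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: {{HybridIndexSketch}} and its method names are hypothetical, revisions are modelled as plain {{long}} values, and both indexes are simple maps instead of Lucene indexes. Purging L1 once L0 catches up follows the issue description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the hybrid index on a single cluster node.
class HybridIndexSketch {
    // An L1 entry remembers the revision at which the change was observed.
    static final class Entry {
        final long revision;
        final String text;
        Entry(long revision, String text) {
            this.revision = revision;
            this.text = text;
        }
    }

    // L0: persisted index, up to date till indexedRevision (R0-1 in the text).
    private final Map<String, String> persisted = new HashMap<>();
    private long indexedRevision = 0;
    // L1: node-local index holding only content changed after indexedRevision.
    private final Map<String, Entry> local = new HashMap<>();

    // Called by the (assumed) observer for every change seen on this node.
    void onCommit(long revision, String path, String text) {
        if (revision > indexedRevision) {
            local.put(path, new Entry(revision, text));
        }
    }

    // Called when the async task has committed a newer L0.
    void onPersistedIndexUpdated(long newIndexedRevision, Map<String, String> newContent) {
        persisted.clear();
        persisted.putAll(newContent);
        indexedRevision = newIndexedRevision;
        // Purge L1 entries that L0 now covers.
        local.values().removeIf(e -> e.revision <= indexedRevision);
    }

    // Run the query against both L0 and L1 and join the results (plain union;
    // suppressing stale L0 hits for nodes changed since R0-1 is left out here).
    Set<String> query(String term) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, String> e : persisted.entrySet()) {
            if (e.getValue().contains(term)) {
                hits.add(e.getKey());
            }
        }
        for (Map.Entry<String, Entry> e : local.entrySet()) {
            if (e.getValue().text.contains(term)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    int localSize() {
        return local.size();
    }

    public static void main(String[] args) {
        HybridIndexSketch node = new HybridIndexSketch();
        Map<String, String> l0 = new HashMap<>();
        l0.put("/content/old", "lucene index");
        node.onPersistedIndexUpdated(1, l0);
        node.onCommit(2, "/content/new", "lucene hybrid");
        System.out.println(node.query("lucene"));
    }
}
```

In this sketch a change committed on the node is visible to queries on that node right away through L1, while L0 keeps lagging until the next async run.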

Note that each cluster node would have its own copy of the "local" index, whose state might differ from that of other cluster nodes. The approach does not involve any replication of the in-memory index; instead it relies on the fact that each cluster node maintains its own copy of the local index for the respective delta between its current head state and the last indexed state.

With this setup, and with sticky sessions enabled, it should be possible to provide more accurate query results for changes happening on that cluster node.

So this is all theory, which still needs to be validated to see how it performs!


> Lucene hybrid index
> -------------------
>
>                 Key: OAK-4412
>                 URL: https://issues.apache.org/jira/browse/OAK-4412
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Tomek Rękawek
>            Assignee: Tomek Rękawek
>             Fix For: 1.6
>
>         Attachments: OAK-4412.patch
>
>
> When running Oak in a cluster, each write operation is expensive. After performing some stress tests with a geo-distributed Mongo cluster, we found that updating property indexes accounts for a large part of the overall traffic.
> The asynchronous index would be an answer here (as the index update won't be made in the client request thread), but AEM requires the updates to be visible immediately in order to work properly.
> The idea here is to enhance the existing asynchronous Lucene index with a synchronous, locally-stored counterpart that will persist only the data since the last Lucene background reindexing job.
> The new index can be stored in memory or (if necessary) in MMAPed local files. Once the "main" Lucene index has been updated, the local index will be purged.
> Queries will use a union of results from the {{lucene}} and {{lucene-memory}} indexes.
> The {{lucene-memory}} index, as a locally stored entity, will be updated using an observer, so it'll get both local and remote changes.
> The original idea was suggested by [~chetanm] in the discussion for OAK-4233.


