Posted to oak-issues@jackrabbit.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2016/09/06 11:48:20 UTC

[jira] [Commented] (OAK-4412) Lucene hybrid index

    [ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15467219#comment-15467219 ] 

Chetan Mehrotra commented on OAK-4412:
--------------------------------------


The planned feature work is now done and the [patch|^OAK-4412-v1.diff] is ready for review.

h3. Purpose

The hybrid index provides two indexing modes:

h4. nrt
In this mode, Lucene documents are created for each commit as part of the synchronous commit and are then added to a *local* index asynchronously. The IndexReader is refreshed with a _refresh interval_ of 1 sec.

h4. sync
In this mode the Lucene document is added to the index and the IndexReader is refreshed *immediately*. Functionally this is similar to a property index. This mode has lower performance than {{nrt}}.

This mode should be used where code expects changes made in a session to be reflected in query results immediately. For example, if a session sets _/a/b/@foo_ to _bar_ and, right after the session save, performs a query for 'bar' and expects _/a/b/@foo_ to be part of the result set, then this mode should be used.

Performance-wise, this mode is slower and slows down writes compared to {{nrt}}.

The indexes created by the hybrid index are local and hold only the index data between the last async index cycle and the most recent commit. Any search is performed via a MultiReader that combines a reader from the local index with one from the index built by async indexing.


h3. Usage

To enable either mode for an index, make the {{async}} property a multi-valued property with one of the following sets of values:

* {{async}} = [{{async}}, {{nrt}}] - Enables the NRT mode
* {{async}} = [{{async}}, {{sync}}] - Enables the sync mode
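
As an illustration of how the multi-valued {{async}} property could be interpreted, here is a minimal sketch. The names {{IndexingMode}} and {{AsyncPropertyParser}} are hypothetical and are not part of the patch. The precedence of {{sync}} over {{nrt}} when both are present is an assumption made only for this example.

```java
import java.util.List;

// Hypothetical illustration only; these names are not Oak API.
enum IndexingMode { ASYNC_ONLY, NRT, SYNC }

final class AsyncPropertyParser {
    private AsyncPropertyParser() {}

    // Maps the values of the multi-valued "async" property to an indexing mode.
    // Assumption: if both "sync" and "nrt" are present, "sync" wins.
    static IndexingMode modeOf(List<String> asyncValues) {
        if (asyncValues.contains("sync")) {
            return IndexingMode.SYNC;
        }
        if (asyncValues.contains("nrt")) {
            return IndexingMode.NRT;
        }
        // Only the plain async lane is configured: no hybrid behaviour.
        return IndexingMode.ASYNC_ONLY;
    }
}
```

An index definition with {{async}} = [{{async}}, {{nrt}}] would map to {{NRT}} in this sketch, while a plain {{async}} = [{{async}}] would keep the existing asynchronous behaviour.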

{{LuceneIndexProviderService}} provides some tuning configuration which can be modified as per setup requirements.


h4. Implementation Detail

Most of the new code lives under the {{org.apache.jackrabbit.oak.plugins.index.lucene.hybrid}} package. For any commit involving an index definition marked with {{nrt}} or {{sync}}, {{LuceneIndexEditorProvider}} would return a {{LuceneIndexEditor}} backed by {{LocalIndexWriterFactory}}. This factory would use {{LocalIndexWriter}}, which stores the prepared {{LuceneDoc}} in {{LuceneDocumentHolder}}. This holder instance is stored as part of {{CommitContext}} (which is stored in the {{CommitInfo}} associated with the commit).

Once the merge is done for that commit, the change is picked up by {{LocalIndexObserver}} (a sync observer). This observer would then look for {{LuceneDocumentHolder}} and, if found, would process the {{LuceneDoc}} instances stored in it:

* For documents belonging to {{nrt}} mode it would add the docs to {{DocumentQueue}}
* For documents belonging to {{sync}} mode it would directly write the document to the {{NRTIndex}} configured for that index

{{DocumentQueue}} asynchronously picks up the docs from the queue and writes them to the index.
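
The split between the two paths can be sketched roughly as below. This is a simplified, single-threaded model, not the code from the patch: {{Doc}} and {{LocalIndexingSketch}} are hypothetical names, the "index" is just a list of paths, and the queue is drained by an explicit call instead of a background thread. What happens when the queue is full is not described above; dropping the document (and leaving it to the next async cycle) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for a prepared Lucene document.
final class Doc {
    final String path;
    final boolean sync;

    Doc(String path, boolean sync) {
        this.path = path;
        this.sync = sync;
    }
}

// Simplified model of the observer step: sync docs are written directly,
// nrt docs go to a bounded queue that is drained later.
final class LocalIndexingSketch {
    private final BlockingQueue<Doc> queue;
    // Stands in for the local index; holds the paths that are searchable.
    final List<String> indexed = new ArrayList<>();

    LocalIndexingSketch(int queueSize) {
        this.queue = new ArrayBlockingQueue<>(queueSize);
    }

    // Called after a commit is merged.
    // Returns false if an nrt doc was dropped because the queue was full (assumption).
    boolean onCommit(Doc doc) {
        if (doc.sync) {
            indexed.add(doc.path);
            return true;
        }
        return queue.offer(doc);
    }

    // In the design described above this work happens asynchronously;
    // here it is triggered explicitly. Returns the number of docs written.
    int drainQueue() {
        List<Doc> batch = new ArrayList<>();
        queue.drainTo(batch);
        for (Doc d : batch) {
            indexed.add(d.path);
        }
        return batch.size();
    }
}
```

In this model a {{sync}} document is visible as soon as {{onCommit}} returns, while an {{nrt}} document becomes visible only after the queue has been drained.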

*NRTIndex*
On the indexing side each index (represented by {{IndexNode}}) has a matching {{NRTIndex}} which is constructed by {{NRTIndexFactory}}. Whenever a new {{IndexNode}} instance is created as a result of a change in the async index (via {{IndexTracker}}) the factory would create a new {{NRTIndex}} for it. It keeps a maximum of 2 instances of {{NRTIndex}} and closes and garbage collects older ones. So an {{NRTIndex}} would only have index data for the data indexed between 2 consecutive async indexing cycles.
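
The "keep at most 2 instances" behaviour can be sketched as follows. {{LocalIndexSketch}} and {{LocalIndexFactorySketch}} are hypothetical names used only for illustration. The real factory also deals with Lucene directories and cleanup on disk, which is left out here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for one generation of the local index.
final class LocalIndexSketch {
    final int generation;
    private boolean closed;

    LocalIndexSketch(int generation) {
        this.generation = generation;
    }

    void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

// Creates a new local index per async cycle and retains only the newest two.
final class LocalIndexFactorySketch {
    private static final int MAX_OPEN = 2;
    private final Deque<LocalIndexSketch> open = new ArrayDeque<>();
    private int nextGeneration;

    // Called whenever the async index changes and a new index node is opened.
    LocalIndexSketch createIndex() {
        LocalIndexSketch index = new LocalIndexSketch(nextGeneration++);
        open.addLast(index);
        // Close the oldest instances once more than MAX_OPEN are retained.
        while (open.size() > MAX_OPEN) {
            open.removeFirst().close();
        }
        return index;
    }

    int openCount() {
        return open.size();
    }
}
```

After three calls to {{createIndex()}} the first instance is closed and the last two remain open.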

{{NRTIndex}} provides access to {{IndexWriter}} which is used by {{DocumentQueue}} to write documents to it. It also creates {{IndexReader}} which is obtained from {{IndexWriter}} making use of [Lucene NRT Support|http://wiki.apache.org/lucene-java/NearRealtimeSearch]

{{NRTIndex}} also provides access to {{ReaderRefreshPolicy}} which determines how and when the reader should be refreshed. The policy instance is also made aware of the changes done to the index. For {{nrt}} indexes {{TimedRefreshPolicy}} is used, which by default refreshes the reader after a 1 sec delay. For {{sync}} indexes {{RefreshOnWritePolicy}} is used, which refreshes the reader after any write.
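
To make the difference between the two policies concrete, here is a minimal sketch. The interface and class names ({{RefreshPolicySketch}}, {{TimedRefreshSketch}}, {{RefreshOnWriteSketch}}) and their method names are hypothetical and do not mirror the signatures in the patch. The clock is injected so that the behaviour can be checked without sleeping.

```java
import java.util.function.LongSupplier;

// Hypothetical policy contract: told about writes, asked whether to reopen the reader.
interface RefreshPolicySketch {
    void updated();            // called after a write to the local index
    boolean shouldRefresh();   // called before a reader is handed to a query
}

// nrt style: refresh only if there were writes and the delay has elapsed.
final class TimedRefreshSketch implements RefreshPolicySketch {
    private final LongSupplier clockMillis;
    private final long deltaMillis;
    private boolean dirty;
    private long lastRefresh;

    TimedRefreshSketch(LongSupplier clockMillis, long deltaMillis) {
        this.clockMillis = clockMillis;
        this.deltaMillis = deltaMillis;
        this.lastRefresh = clockMillis.getAsLong();
    }

    @Override
    public synchronized void updated() {
        dirty = true;
    }

    @Override
    public synchronized boolean shouldRefresh() {
        long now = clockMillis.getAsLong();
        if (dirty && now - lastRefresh >= deltaMillis) {
            dirty = false;
            lastRefresh = now;
            return true;
        }
        return false;
    }
}

// sync style: refresh as soon as anything has been written.
final class RefreshOnWriteSketch implements RefreshPolicySketch {
    private boolean dirty;

    @Override
    public synchronized void updated() {
        dirty = true;
    }

    @Override
    public synchronized boolean shouldRefresh() {
        boolean result = dirty;
        dirty = false;
        return result;
    }
}
```

With a delay of 1000 ms the timed variant keeps serving the old reader until a second has passed since the last refresh, whereas the on-write variant asks for a refresh on the first read after any write.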

h4. Benchmark

A benchmark has been implemented in oak-run under {{HybridIndexTest}}. It creates multiple indexes (_numOfIndexes_ = 10) to simulate a system having multiple indexes defined and then creates nodes with the property {{foo}} set to a value from the enum _Status_. Each thread then creates nodes in breadth-first fashion (by default 5 child nodes per node, and then the same for each child node).

In addition there is a {{Searcher}} thread which queries for different values and a {{Mutator}} which modifies the values.

The following settings are supported:
* refreshDeltaMillis - Defaults to 1000 - Time delay in milliseconds between reader reopens for nrt
* asyncInterval - Defaults to 5 - Interval in seconds for the async indexer
* queueSize - Defaults to 1000 - Size of the queue used by {{DocumentQueue}}
* hybridIndexEnabled - Boolean flag. If set to true the hybrid index would be used, otherwise a property index would be used
* indexingMode - Defaults to nrt - [nrt/sync] - Which mode to use if hybridIndexEnabled is set
* useOakCodec - Boolean flag. If set to true {{oakCodec}} would be used to avoid compression, which slows down searches (OAK-1737)

{noformat}
java  -DhybridIndexEnabled=true -DindexingMode=nrt -jar oak-run*.jar benchmark --concurrency=5 HybridIndexTest Oak-Mongo-FDS Oak-Segment-Tar-FDS
{noformat}

_Results would be posted soon_

h4. Pending Feature Work

* Support for listening to external changes and updating the {{nrt}} indexes based on those changes
* JMX MBean around {{NRTIndexFactory}} to see the rate of change etc.


> Lucene hybrid index
> -------------------
>
>                 Key: OAK-4412
>                 URL: https://issues.apache.org/jira/browse/OAK-4412
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: lucene
>            Reporter: Tomek Rękawek
>            Assignee: Chetan Mehrotra
>             Fix For: 1.6
>
>         Attachments: OAK-4412-v1.diff, OAK-4412.patch
>
>
> When running Oak in a cluster, each write operation is expensive. After performing some stress-tests with a geo-distributed Mongo cluster, we've found out that updating property indexes is a large part of the overall traffic.
> The asynchronous index would be an answer here (as the index update won't be made in the client request thread), but the AEM requires the updates to be visible immediately in order to work properly.
> The idea here is to enhance the existing asynchronous Lucene index with a synchronous, locally-stored counterpart that will persist only the data since the last Lucene background reindexing job.
> The new index can be stored in memory or (if necessary) in MMAPed local files. Once the "main" Lucene index is being updated, the local index will be purged.
> Queries will use an union of results from the {{lucene}} and {{lucene-memory}} indexes.
> The {{lucene-memory}} index, as a local stored entity, will be updated using an observer, so it'll get both local and remote changes.
> The original idea has been suggested by [~chetanm] in the discussion for the OAK-4233.


