Posted to oak-commits@jackrabbit.apache.org by th...@apache.org on 2017/03/21 17:04:34 UTC
svn commit: r1788005 -
/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md
Author: thomasm
Date: Tue Mar 21 17:04:34 2017
New Revision: 1788005
URL: http://svn.apache.org/viewvc?rev=1788005&view=rev
Log:
OAK-5946 - Document indexing flow (review)
Modified:
jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md
Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md?rev=1788005&r1=1788004&r2=1788005&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md Tue Mar 21 17:04:34 2017
@@ -43,17 +43,20 @@
## <a name="overview"></a> Overview
-For queries to perform well Oak supports indexing content stored in repository. Indexing works
-on diff between the base NodeState and modified NodeState. Depending on how diff is performed and
-when the index content gets updated there are 3 types of indexing modes
+For queries to perform well, Oak supports indexing of content that is stored in the repository.
+Indexing works on comparing different versions of the node data
+(technically, `Diff` between the base `NodeState` and the modified `NodeState`).
+There are indexing modes that define
+how comparing is performed, and when the index content gets updated:
1. Synchronous Indexing
2. Asynchronous Indexing
-3. Near real time indexing
+3. Near Real Time (NRT) Indexing
-Indexing makes use of [Commit Editors](../architecture/nodestate.html#commit-editors). Some of the editors
-are `IndexEditor` which are responsible for updating index content based on changes in main content. Currently
-Oak has following in built `IndexEditor`s
+Indexing makes use of [Commit Editors](../architecture/nodestate.html#commit-editors).
+Some of the editors are of type `IndexEditor`, which are responsible for updating index content
+based on changes in main content.
+Currently, Oak has the following built-in editors:
1. PropertyIndexEditor
2. ReferenceEditor
@@ -62,21 +65,24 @@ Oak has following in built `IndexEditor`
### <a name="new-1.6"></a> New in 1.6
-* [Near Real Time Indexing](#nrt-indexing)
+* [Near Real Time (NRT) Indexing](#nrt-indexing)
* [Multiple Async indexers setup via OSGi config](#async-index-setup)
* [Isolating Corrupt Indexes](#corrupt-index-handling)
## <a name="indexing-flow"></a> Indexing Flow
-`IndexEditor` are invoked as part of commit or as part of asynchronous diff process. For both cases at some stage
-diff is performed between _before_ and _after_ state and passed to `IndexUpdate` which is responsible for invoking
-`IndexEditor` based on _discovered_ index definitions.
+The `IndexEditor` is invoked as part of a commit (`Session.save()`),
+or as part of the asynchronous "diff" process.
+For both cases, at some stage "diff" is performed between the _before_ and the _after_ state,
+and passed to `IndexUpdate`, which is responsible for invoking the `IndexEditor`
+based on the _discovered_ index definitions.
### <a name="index-defnitions"></a> Index Definitions
-Index definitions are nodes of type `oak:QueryIndexDefinition` which are stored under a special node named `oak:index`.
-As part of diff traversal at each level `IndexUpdate` would look for `oak:index` nodes. Below is the canonical index
-definition structure
+Index definitions are nodes of type `oak:QueryIndexDefinition`
+which are stored under a special node named `oak:index`.
+As part of diff traversal, at each level `IndexUpdate` looks for `oak:index` nodes.
+Below is the canonical index definition structure:
/oak:index/indexName
- jcr:primaryType = "oak:QueryIndexDefinition"
@@ -84,85 +90,100 @@ definition structure
- async (string) multiple
- reindex (boolean)
+The index definition nodes have the following properties:
-The index definitions nodes have following properties
-
-1. `type` - It determines the _type_ of index. Based on the `type` `IndexUpdate` would look for `IndexEditor` of given
- type from registered `IndexEditorProvider`. For out of the box Oak setup it can have one of the following value
- * `reference` - Configured with out of box setup
- * `counter` - Configured with out of box setup
+1. `type` - It determines the _type_ of index. Based on the `type`,
+ `IndexUpdate` looks for an `IndexEditor` of the given
+ type from the registered `IndexEditorProvider`.
+ For an out-of-the-box Oak setup, it can have one of the following values:
+ * `reference` - Configured with the out-of-the-box setup
+ * `counter` - Configured with the out-of-the-box setup
* `property`
* `lucene`
* `solr`
-2. `async` - It determines if the index is to be updated synchronously or asynchronously. It can have following values
- * `sync` - Also the default value. It indicates that index is meant to be updated as part of commit
+2. `async` - This determines if the index is to be updated synchronously or asynchronously.
+ It can have the following values:
+ * `sync` - The default value. It indicates that the index is meant to be updated as part of each commit.
* `nrt` - Indicates that index is a [near real time](#nrt-indexing) index.
- * `async` - Indicates that index is to be updated asynchronously. In such a case this value is used to determine
+ * `async` - Indicates that index is to be updated asynchronously.
+ In such a case, this value is used to determine
the [indexing lane](#indexing-lane)
* Any other value which ends in `async`.
-3. `reindex` - If set to `true` reindexing would be performed for that index. Post which the property would be removed.
+3. `reindex` - If set to `true`, reindexing is performed for that index.
+ After reindexing is done, the property value is set to `false`.
Refer to [reindexing](#reindexing) for more details.
-Based on above 2 properties `IndexUpdate` creates `IndexEditor` instances as it traverses the diff and registers them
-with itself passing on the callbacks for various changes
+Based on the above two properties, `IndexUpdate` creates `IndexEditor` instances
+as it traverses the "diff", and registers them with itself, passing on the callbacks for various changes.
#### <a name="oak-index-nodes"></a>oak:index node
-Indexing logic supports placing `oak:index` nodes at any path. Depending on the location such indexes would only index
-content which are present under those paths. So for e.g. if 'oak:index' is present at _'/content/oak:index'_ then indexes
-defined under that node would only index repository state present under _'/content'_
-
-Depending on type of index one can create these index definitions under root path ('/') or non root paths. Currently
-only `lucene` indexes support creating index definitions at non root paths. `property` indexes can only be created
-under root path i.e. under '/'
+Indexing logic supports placing `oak:index` nodes at any path.
+Depending on the location, such indexes only index content which is present under those paths.
+So for example, if 'oak:index' is present at _'/content/oak:index'_, then indexes
+defined under that node only index repository data present under _'/content'_.
+
+Depending on the type of the index, one can create these index definitions under the root path ('/'),
+or non-root paths.
+Currently only `lucene` indexes support creating index definitions at non-root paths.
+`property` indexes can only be created under the root path, that is, under '/'.
### <a name="sync-indexing"></a> Synchronous Indexing
-Under synchronous indexing the index content gets updates as part of commit itself. Changes to both index content
-and main content are done atomically in single commit.
+Under synchronous indexing, the index content gets updated as part of the commit itself.
+Changes to both the main content, as well as the index content, are done atomically in a single commit.
-This mode is currently supported by `property` and `reference` indexes
+This mode is currently supported by `property` and `reference` indexes.
### <a name="async-indexing"></a> Asynchronous Indexing
-Asynchronous Indexing (also referred as async indexing) is performed using periodic scheduled jobs. As part of setup
-Oak would schedule certain periodic jobs which would perform diff of the repository content and update the index content
-based on that diff.
-
-Each periodic job i.e. `AsyncIndexUpdate` is assigned to an [indexing lane](#indexing-lane) and is scheduled to run at
-certain interval. At time of execution the job would perform work
-
-1. Look for last indexed state via stored checkpoint data. If such a checkpoint exist then resolve the `NodeState` for
- that checkpoint. If no such state exist or no such checkpoint is present then it treats it as initial indexing case where
- base state is set to empty. This state is considered as `before` state
-2. Create a checkpoint for _current_ state and refer to this as `after` state
-3. Create an `IndexUpdate` instance bound to current _indexing lane_ and trigger a diff between the `before` and
- `after` state
-4. `IndexUpdate` would then pick up index definitions which are bound to current indexing lane and would create
- `IndexEditor` instances for them and pass them the diff callbacks
-5. The diff traverses in a depth first manner and at the end of diff the `IndexEditor` would do final changes for
- current indexing run. Depending on index implementation the index data can be either stored in NodeStore itself
- (e.g. lucene) or in any remote store (e.g. solr)
-6. `AsyncIndexUpdate` would then update the last indexed checkpoint to current checkpoint and do a commit.
-
-Such async indexes are _eventually consistent_ with the repository state and lag behind the latest repository state
-by some time. However the index content would be eventually consistent and never end up in wrong state with respect
+Asynchronous indexing (also referred to as async indexing) is performed using periodic scheduled jobs.
+As part of the setup, Oak schedules certain periodic jobs which perform
+diff of the repository content, and update the index content based on that diff.
+
+Each periodic `AsyncIndexUpdate` job is assigned to an [indexing lane](#indexing-lane),
+and is scheduled to run at a certain interval.
+At the time of execution, the job performs the following work:
+
+1. Look for the last indexed state via stored checkpoint data.
+ If such a checkpoint exists, then resolve the `NodeState` for that checkpoint.
+ If no such state exists, or no such checkpoint is present,
+ then it treats it as initial indexing, in which case the base state is empty.
+ This state is considered the `before` state.
+2. Create a checkpoint for the _current_ state and refer to this as the `after` state.
+3. Create an `IndexUpdate` instance bound to the current _indexing lane_,
+ and trigger a diff between the `before` and the `after` state.
+4. `IndexUpdate` will then pick up index definitions which are bound to the current indexing lane,
+ will create `IndexEditor` instances for them,
+ and pass them the diff callbacks.
+5. The diff traverses in a depth-first manner,
+ and at the end of diff, the `IndexEditor` will do final changes for the current indexing run.
+ Depending on the index implementation, the index data can be either stored in NodeStore itself
+ (for indexes of type `lucene` and `property`), or in any remote store (for type `solr`).
+6. `AsyncIndexUpdate` will then update the last indexed checkpoint to the current checkpoint
+ and do a commit.
+
+Such async indexes are _eventually consistent_ with the repository state,
+and lag behind the latest repository state by some time.
+However, the index content is eventually consistent, and never ends up in a wrong state with respect
to repository state.
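
The numbered steps above can be sketched as a toy model. This is only an illustration under stated assumptions: `ToyStore`, `run_async_cycle` and the dictionary-based index are hypothetical names invented here, not Oak APIs, and the real `AsyncIndexUpdate` works on `NodeState` trees rather than flat dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical stand-in for a NodeStore: path -> value, plus named checkpoints."""
    content: dict = field(default_factory=dict)
    checkpoints: dict = field(default_factory=dict)
    counter: int = 0

    def checkpoint(self):
        # Preserve a snapshot of the current state under a new name.
        self.counter += 1
        name = f"cp-{self.counter}"
        self.checkpoints[name] = dict(self.content)
        return name

    def retrieve(self, name):
        # Returns None if the checkpoint is unknown (e.g. it was lost).
        return self.checkpoints.get(name)


def run_async_cycle(store, index, last_checkpoint):
    """One indexing run; returns the checkpoint to remember for the next run."""
    # Step 1: resolve the 'before' state; fall back to empty (initial indexing).
    before = store.retrieve(last_checkpoint) if last_checkpoint else None
    if before is None:
        before = {}
    # Step 2: create a checkpoint for the current state; this is the 'after' state.
    current = store.checkpoint()
    after = store.retrieve(current)
    # Steps 3 to 5: diff 'before' against 'after' and apply the changes to the index.
    for path in before.keys() - after.keys():
        index.pop(path, None)
    for path, value in after.items():
        if before.get(path) != value:
            index[path] = value
    # Step 6: the caller stores the returned checkpoint as the last indexed state.
    return current
```

Because each run only processes the difference between two snapshots, the index lags behind the repository but converges on the state of the latest checkpoint.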
#### <a name="checkpoint"></a> Checkpoint
-Checkpoint is a mechanism whereby a client of NodeStore can request it to ensure that repository state at that time
-can be preserved and not garbage collected by revision garbage collection process. Later that state can be retrieved
-back from NodeStore by passing the checkpoint back. You can treat checkpoint like a named revision or a tag in git
-repo.
+A checkpoint is a mechanism, whereby a client of `NodeStore` can request Oak to ensure
+that the repository state (snapshot) at that time can be preserved, and not garbage collected
+by the revision garbage collection process.
+Later, that state can be retrieved from the NodeStore by passing the checkpoint.
+You can think of a checkpoint as a tag in a git repository, or as a named revision.
Async indexing makes use of checkpoint support to access older repository state.
#### <a name="indexing-lane"></a> Indexing Lane
-Indexing lane refers to a set of indexes which are to be indexed by given async indexer. Each index definition meant for
-async indexing defines an `async` property whose value is the name of indexing lane. For e.g. consider following 2 index
-definitions
+The term indexing lane refers to a set of indexes which are to be updated by a given async indexer.
+Each index definition meant for async indexing defines an `async` property,
+whose value is the name of the indexing lane.
+For example, consider the following two index definitions:
/oak:index/userIndex
- jcr:primaryType = "oak:QueryIndexDefinition"
@@ -172,116 +193,131 @@ definitions
- jcr:primaryType = "oak:QueryIndexDefinition"
- async = "fulltext-async"
-Here _userIndex_ is bound to "async" indexing lane while _assetIndex_ is bound to "fulltext-async" lane. Oak
-[setup](#async-index-setup) would configure 2 `AsyncIndexUpdate` jobs one for "async" and one for "fulltext-async".
-When job for "async" would run it would only process index definition where `async` value is `async` while when job
-for "fulltext-async" would run it would pick up index definitions where `async` value is `fulltext-async`.
-
-These jobs can be scheduled to run at different intervals and also on different cluster nodes. Each job would keep its
-own bookkeeping of checkpoint state and can be [paused and resumed](#async-index-mbean) separately.
-
-Prior to Oak 1.4 there was only one indexing lane `async`. In Oak 1.4 support was added to create 2 lanes `async` and
-`fulltext-async`. With 1.6 its possible to [create multiple lanes](#async-index-setup).
+Here, _userIndex_ is bound to the "async" indexing lane,
+while _assetIndex_ is bound to the "fulltext-async" lane.
+Oak [setup](#async-index-setup) configures two `AsyncIndexUpdate` jobs:
+one for "async", and one for "fulltext-async".
+When the job for "async" is run,
+it only processes index definitions where the `async` value is `async`,
+while when the job for "fulltext-async" is run,
+it only picks up index definitions where the `async` value is `fulltext-async`.
+
+These jobs can be scheduled to run at different intervals, and also on different cluster nodes.
+Each job keeps its own bookkeeping of checkpoint state,
+and can be [paused and resumed](#async-index-mbean) separately.
+
+Prior to Oak 1.4, there was only one indexing lane: `async`.
+In Oak 1.4, support was added to create two lanes: `async` and `fulltext-async`.
+With 1.6, it is possible to [create multiple lanes](#async-index-setup).
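
As a rough illustration of how a job picks the definitions for its lane, consider the following sketch. The function `definitions_for_lane` and the dictionary layout are hypothetical and only model the `async` property described above; they are not part of Oak.

```python
def definitions_for_lane(definitions, lane):
    """Return the paths of index definitions bound to the given lane.

    `definitions` maps an index path to its properties. The `async`
    property may be a single string or a list of strings (multi-valued).
    """
    selected = []
    for path, props in definitions.items():
        value = props.get("async")
        # Normalise a single string, a list, or a missing value to a list.
        values = [value] if isinstance(value, str) else list(value or [])
        if lane in values:
            selected.append(path)
    return sorted(selected)


definitions = {
    "/oak:index/userIndex": {"async": "async"},
    "/oak:index/assetIndex": {"async": "fulltext-async"},
}
```

With these definitions, a job for the "async" lane would select only `userIndex`, and a job for "fulltext-async" only `assetIndex`.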
#### <a name="cluster"></a> Clustered Setup
-In a clustered setup it needs to be ensured by the host application that async indexing jobs for specific lanes are to
-be run as singleton in the cluster. If `AsyncIndexUpdate` for same lane gets executed concurrently on different cluster
-nodes then it can lead to race conditions where old checkpoint gets lost leading to reindexing of the indexes.
+In a clustered setup, the host application needs to ensure that
+the async indexing jobs for a specific lane run as a singleton in the cluster.
+If `AsyncIndexUpdate` for the same lane gets executed concurrently on different cluster nodes,
+it can lead to race conditions, where an old checkpoint gets lost,
+leading to reindexing of the indexes.
-Refer to [clustering](../clustering.html#scheduled-jobs) for more details on how the host application should schedule
-such indexing jobs
+See also [clustering](../clustering.html#scheduled-jobs)
+for more details on how the host application should schedule such indexing jobs.
##### <a name="async-index-lease"></a> Indexing Lease
-`AsyncIndexUpdate` has an inbuilt lease logic to ensure that even if the jobs gets scheduled to run on different cluster
-nodes then also only one of them runs. This is done by keeping a lease property which gets periodically updated as
+`AsyncIndexUpdate` has built-in "lease" logic to ensure that
+even if the jobs get scheduled to run on different cluster nodes, only one of them runs.
+This is done by keeping a lease property, which gets periodically updated as
indexing progresses.
-An `AsyncIndexUpdate` run would skip indexing if current lease has not expired i.e. if the last
-update of lease was done long ago (default 15 mins) then it would be assumed that cluster node doing indexing is not
-available and some other node would take over.
-
-The lease logic can delay start of indexing if the system is not stopped cleanly. As of Oak 1.6 this does not affect
-non clustered setup like those based on SegmentNodeStore but only [affects DocumentNodeStore][OAK-5159] based setups
+An `AsyncIndexUpdate` run skips indexing if the current lease has not expired.
+If the last update of the lease was done long ago (default 15 mins),
+then it is assumed that the cluster node doing indexing is not available,
+and some other node will take over.
+
+The lease logic can delay the start of indexing if the system is not stopped cleanly.
+As of Oak 1.6, this does not affect non-clustered setups like those based on SegmentNodeStore,
+but only [affects DocumentNodeStore][OAK-5159] based setups.
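
The lease check can be pictured with a small sketch. It is a simplification under assumptions: `should_run` and its timestamp arguments are hypothetical, and the actual lease property and its handling in `AsyncIndexUpdate` may differ.

```python
LEASE_TIMEOUT_SECONDS = 15 * 60  # default of 15 minutes mentioned above


def should_run(lease_last_updated, now, timeout=LEASE_TIMEOUT_SECONDS):
    """Decide whether this node may start an indexing run.

    `lease_last_updated` is the time (in seconds) at which another node last
    refreshed the lease, or None if no lease is held.
    """
    if lease_last_updated is None:
        return True
    # A lease that was not refreshed for `timeout` seconds is treated as
    # expired: the node holding it is assumed to be unavailable.
    return now - lease_last_updated >= timeout
```

This also shows why an unclean shutdown can delay indexing: a stale lease left behind blocks new runs until the timeout has passed.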
#### <a name="async-index-lag"></a> Indexing Lag
-Async indexing jobs are by default configured to run at interval of 5 secs. Depending on the system load and diff size
-of content to be indexed the indexing may start lagging by longer time intervals. Due to this the indexing results would
-lag behind the repository state and may become stale i.e. new content added would show up in result after some time.
+Async indexing jobs are by default configured to run at an interval of 5 seconds.
+Depending on the system load and diff size of content to be indexed,
+the indexing may start lagging by a longer time interval.
+Due to this, the indexing results can lag behind the repository state,
+and may become stale, that is, new content added will show up in query results only after some time.
-`IndexStats` MBean keeps a time series and metrics stats for the indexing frequency. This can be used to track the
-indexing state
+The `IndexStats` MBean keeps a time series and metrics stats for the indexing frequency.
+This can be used to track the indexing state.
-[NRT Indexing](#nrt-indexing) introduced in Oak 1.6 would help in such situations and can keep the results more upto
-date
+[NRT Indexing](#nrt-indexing) introduced in Oak 1.6 helps in such situations,
+and can keep the results more up to date.
#### <a name="async-index-setup"></a> Setup
`@since Oak 1.6`
-Async indexers can be configure via OSGi config for `org.apache.jackrabbit.oak.plugins.index.AsyncIndexerService`
+Async indexers can be configured via the OSGi config for `org.apache.jackrabbit.oak.plugins.index.AsyncIndexerService`.
![Async Indexing Config](async-index-config.png)
-Different lanes can be configured by adding more rows of _Async Indexer Configs_. Prior to 1.6 the indexers were
-created programatically while constructing Oak.
+Different lanes can be configured by adding more rows of _Async Indexer Configs_.
+Prior to 1.6, the indexers were created programmatically while constructing Oak.
#### <a name="async-index-mbean"></a> Async Indexing MBean
-For each configured async indexer in the setup the indexer exposes a `IndexStatsMBean` which provides various
-stats around current indexing state.
+For each configured async indexer in the setup, the indexer exposes an `IndexStatsMBean`,
+which provides various stats around the current indexing state:
org.apache.jackrabbit.oak: async (IndexStats)
org.apache.jackrabbit.oak: fulltext-async (IndexStats)
It provide details like
-* FailingIndexStats - Stats around indexes which are [failing and marked as corrupt](#corrupt-index-handling)
-* LastIndexedTime - Time upto which repository state has been indexed
-* Status - running, done, failing etc
-* Failing - boolean flag indicating that indexing has been failing due to some issue. This can be monitored
- for detecting if indexer is healthy or not
-* ExecutionCount - Time series data around when number of execution for various time intervals
+* FailingIndexStats - Stats around indexes which are [failing and marked as corrupt](#corrupt-index-handling).
+* LastIndexedTime - Time up to which the repository state has been indexed.
+* Status - running, done, failing etc.
+* Failing - boolean flag indicating that indexing has been failing due to some issue.
+ This can be monitored for detecting if indexer is healthy or not.
+* ExecutionCount - Time series data around the number of runs for various time intervals.
Further it provides operations like
-* pause - Pauses the indexer
-* abortAndPause - Aborts any running indexing cycle and pauses the indexer. Invoke 'resume' once you are ready
- to resume indexing again
-* resume - Resume the indexing
+* pause - Pauses the indexer.
+* abortAndPause - Aborts any running indexing cycle and pauses the indexer.
+ Invoke 'resume' once you are ready to resume indexing again.
+* resume - Resume indexing.
#### <a name="corrupt-index-handling"></a> Isolating Corrupt Indexes
`Since 1.6`
-AsyncIndexerService would now mark any index which fails to update for 30 mins (configurable) as `corrupt` and
-ignore such indexes from further indexing.
+The `AsyncIndexerService` marks any index which fails to update for 30 mins
+(configurable) as `corrupt`, and excludes such indexes from further indexing.
-When any index is marked as corrupt following log entry would be made
+When any index is marked as corrupt, the following log entry is made:
- 2016-11-22 12:52:35,484 INFO NA [async-index-update-fulltext-async] o.a.j.o.p.i.AsyncIndexUpdate - Marking
- [/oak:index/lucene] as corrupt. The index is failing since Tue Nov 22 12:51:25 IST 2016 ,1 indexing cycles, failed
- 7 times, skipped 0 time
+ 2016-11-22 12:52:35,484 INFO NA [async-index-update-fulltext-async] o.a.j.o.p.i.AsyncIndexUpdate -
+ Marking [/oak:index/lucene] as corrupt. The index is failing since Tue Nov 22 12:51:25 IST 2016,
+ 1 indexing cycles, failed 7 times, skipped 0 time
-Post this when any new content gets indexed and any such corrupt index is skipped then following warn entry would be made
+Post this, when any new content gets indexed and any such corrupt index is skipped,
+the following warn entry is made:
- 2016-11-22 12:52:35,485 WARN NA [async-index-update-fulltext-async] o.a.j.o.p.index.IndexUpdate - Ignoring corrupt
- index [/oak:index/lucene] which has been marked as corrupt since [2016-11-22T12:51:25.492+05:30]. This index MUST be
- reindexed for indexing to work properly
+ 2016-11-22 12:52:35,485 WARN NA [async-index-update-fulltext-async] o.a.j.o.p.index.IndexUpdate -
+ Ignoring corrupt index [/oak:index/lucene] which has been marked as corrupt since
+ [2016-11-22T12:51:25.492+05:30]. This index MUST be reindexed for indexing to work properly
-This info would also be seen in MBean
+This info is also seen in the MBean
![Corrupt Index stats in IndexStatsMBean](corrupt-index-mbean.png)
-Later once the index is reindexed following log entry would be made
+Later, once the index is reindexed, the following log entry is made
- 2016-11-22 12:56:25,486 INFO NA [async-index-update-fulltext-async] o.a.j.o.p.index.IndexUpdate - Removing corrupt
- flag from index [/oak:index/lucene] which has been marked as corrupt since [corrupt = 2016-11-22T12:51:25.492+05:30]
+ 2016-11-22 12:56:25,486 INFO NA [async-index-update-fulltext-async] o.a.j.o.p.index.IndexUpdate -
+ Removing corrupt flag from index [/oak:index/lucene] which has been marked as corrupt since
+ [corrupt = 2016-11-22T12:51:25.492+05:30]
-This feature can be disabled by setting `failingIndexTimeoutSeconds` to 0 in AsyncIndexService config. Refer to
-[OAK-4939][OAK-4939] for more details
+This feature can be disabled by setting `failingIndexTimeoutSeconds` to 0 in the `AsyncIndexerService` config.
+See also [OAK-4939][OAK-4939] for more details.
### <a name="nrt-indexing"></a> Near Real Time Indexing
@@ -289,61 +325,66 @@ This feature can be disabled by setting
_This mode is only supported for `lucene` indexes_
-Lucene indexes perform well for evaluating complex queries and also have the benefit of being evaluated locally with
-copy-on-read support. However they are `async` index and depending on system load can lag behind the repository state.
-For cases where such lag (of order of minutes) is not acceptable one has to use `property` indexes. For such cases
-Oak 1.6 has [added support for near real time indexing][OAK-4412]
+Lucene indexes perform well for evaluating complex queries,
+and also have the benefit of being evaluated locally with copy-on-read support.
+However, they are `async`, and depending on system load can lag behind the repository state.
+For cases where such lag (in the order of minutes) is not acceptable,
+one has to use `property` indexes.
+For such cases, Oak 1.6 has [added support for near real time indexing][OAK-4412]
![NRT Index Flow](index-nrt.png)
-In this mode the indexing would happen in 2 modes and query would consult multiple indexes. The diagram above shows
-indexing flow with time. In above flow
+In this mode, the indexing happens in two modes, and a query will consult multiple indexes.
+The diagram above shows the indexing flow with time. In the above flow,
* T1, T3 and T5 - Time instances at which checkpoint is created
-* T2 and T4 - Time instance when async indexer run completed and indexes were updated
+* T2 and T4 - Time instances when async indexer runs completed and indexes were updated
* Persisted Index
- * v2 - Index version v2 which has repository state upto time T1 indexed
- * v3 - Index version v2 which has repository state upto time T3 indexed
+ * v2 - Index version v2, which has repository state up to time T1 indexed
+ * v3 - Index version v3, which has repository state up to time T3 indexed
* Local Index
- * NRT1 - Local index which repository state between time T2 and T4 indexed
- * NRT2 - Local index which repository state between time T4 and T6 indexed
+ * NRT1 - Local index, which has repository state between time T2 and T4 indexed
+ * NRT2 - Local index, which has repository state between time T4 and T6 indexed
-As repository state changes with time Async indexer would run and index state between last known checkpoint and
-current state when that run started. So when asyn run 1 completed the persisted index has repository state indexed
-upto time T3.
-
-Now without NRT index support if any query is performed between time T2 and T4 it would only see index result for
-repository state at time T1 as thats state which the persisted indexes have data for. Any change after that would not be
-seen untill next async indexing cycle complete (by time T4).
-
-With NRT indexing support indexing would happen at 2 places
-
-* Persisted Index - This is the index which is updated via async indexer run. This flow would remain same i.e. it
- would be periodically updated by the indexer run
-* Local Index - In addition to persisted index each cluster node would also maintain a local index. This index would
- only keep data between 2 async indexer run. Post each run the previous index would be discarded and a new index would
- be built (actually previous index is retained for one cycle)
+As the repository state changes with time, the Async indexer will run and index the
+state between last known checkpoint and current state when that run started.
+So when async run 1 completed, the persisted index has the repository state indexed up to time T3.
+
+Now without NRT index support, if any query is performed between time T2 and T4,
+it can only see index result for repository state at time T1,
+as that is the state for which the persisted indexes have data.
+Any change after that cannot be seen until the next async indexing cycle is complete (by time T4).
+
+With NRT indexing support, indexing will happen at two places:
+
+* Persisted Index - This is the index which is updated via the async indexer run.
+ This flow remains the same, that is, it will be periodically updated by the indexer run.
+* Local Index - In addition to persisted index, each cluster node will also maintain a local index.
+ This index only keeps data between two async indexer runs.
+ Post each run, the previous index is discarded, and a new index is built
+ (actually the previous index is retained for one cycle).
-Any query making use of such an index would make use of both indexes. With this new content added in repository
-after the last async index run would also show up quickly.
+Any query making use of such an index will automatically make use of both the persisted and the local indexes.
+With this, new content added in the repository after the last async index run will also show up quickly.
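
The way a query draws on both indexes can be sketched as follows. This is an assumed, simplified model: `query` and the term-to-paths dictionaries are hypothetical and do not reflect how the Lucene index readers are actually combined in Oak.

```python
def query(term, persisted_index, local_index):
    """Return all paths matching `term` from both indexes, without duplicates.

    Both indexes map a term to a list of matching paths. The persisted index
    covers the state up to the last async run; the local index covers the
    changes made since then.
    """
    hits = set(persisted_index.get(term, ()))
    hits |= set(local_index.get(term, ()))
    return sorted(hits)
```

Content that has not yet been picked up by the async indexer is then found through the local index, which is what makes recent changes visible quickly.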
#### <a name="nrt-indexing-modes"></a> NRT Indexing Modes
-NRT indexing can be enabled for any index by configuring the `async` property
+NRT (Near real time) indexing can be enabled for any index by configuring the `async` property:
/oak:index/assetIndex
- jcr:primaryType = "oak:QueryIndexDefinition"
- async = ['fulltext-async', 'nrt']
-Here `async` value has been set to a multi value property where
+Here, the `async` value has been set to a multi-valued property, with the
-* Indexing lane - Like `async` or `fulltext-async`
-* NRT Indexing Mode - `nrt` or `sync`
+* Indexing lane - For example `async` or `fulltext-async`,
+* NRT Indexing Mode - `nrt` or `sync`.
##### <a name="nrt-indexing-mode-nrt"></a> nrt
-In this mode the local index would be updated asynchronously on that cluster nodes post commit and the index reader
-would be refreshed after 1 sec. So any change done should should show up on that cluster node in 1-2 secs
+In this mode, the local index is updated asynchronously on that cluster node after each commit,
+and the index reader is refreshed each second.
+So any change done should show up on that cluster node within 1 to 2 seconds.
/oak:index/userIndex
- jcr:primaryType = "oak:QueryIndexDefinition"
@@ -351,73 +392,81 @@ would be refreshed after 1 sec. So any c
##### <a name="nrt-indexing-mode-sync"></a> sync
-In this mode the local index would be updated synchronously on that cluster nodes post commit and the index reader
-would be refreshed immediately. This mode performs slowly compared to the "nrt" mode
+In this mode, the local index is updated synchronously on that cluster node after each commit,
+and the index reader is refreshed immediately.
+This mode is slower than the "nrt" mode.
/oak:index/userIndex
- jcr:primaryType = "oak:QueryIndexDefinition"
- async = ['async', 'sync']
-For a single node setup (like with SegmentNodeStore) this mode effectively makes async lucene index perform same as
-synchronous property indexes. However 'nrt' mode performs better so using that would be preferable
-
+For a single node setup (for example with the `SegmentNodeStore`),
+this mode effectively makes an async lucene index perform the same as synchronous property indexes.
+However, the 'nrt' mode performs better, so using that is preferable.
+
#### <a name="nrt-indexing-cluster-setup"></a> Cluster Setup
-In cluster setup each cluster node would maintain its own local index for changes happening in that cluster node.
-In addition to that it would also index changes from other cluster node by relying on [Oak observation for external
-changes][OAK-4808]. This depends on how frequently external changes are delivered. Due to this even with NRT indexing
-changes from other cluster node would take some more time to reflect in query result compared to local changes.
+In a cluster setup, each cluster node maintains its own local index for changes happening on that cluster node.
+In addition to that, it also indexes changes from other cluster nodes by relying on
+[Oak observation for external changes][OAK-4808].
+This depends on how frequently external changes are delivered.
+Due to this, even with NRT indexing, changes from other cluster nodes will take some more time
+to be reflected in query results compared to local changes.
#### <a name="nrt-indexing-config"></a> Configuration
-NRT indexing expose few configuration options as part of [LuceneIndexProviderService](lucene.html#osgi-config)
+NRT indexing exposes a few configuration options as part of the [LuceneIndexProviderService](lucene.html#osgi-config):
+
+* `enableHybridIndexing` - Boolean property, defaults to `true`.
+ Can be set to `false` to disable the NRT indexing feature completely.
+* `hybridQueueSize` - The size of the in-memory queue used
+ to hold Lucene documents for indexing in the `nrt` mode.
+ The default size is 10000.
-* `enableHybridIndexing` - Boolean property defaults to `true`. Can be set to `false` to disable NRT indexing feature
- completely
-* `hybridQueueSize` - Size of in memory queue used to hold Lucene documents for indexing in `nrt` mode. Default size is
- 10000
-
## <a name="reindexing"></a> Reindexing
-Reindexing of existing indexes is required in following scenarios
+Reindexing of existing indexes is required in the following scenarios:
-* Incompatible change in index definition - For example adding properties to the index which is already
- present in repository
-* Corrupted Index - If the index is corrupt and `AsyncIndexUpdate` run fails with exception pointing to index being
- corrupt
+* Incompatible changes in the index definition -
+ For example, adding properties to the index when those properties
+ are already present in the repository.
+* Corrupted Index - If the index is corrupt and the `AsyncIndexUpdate` run fails
+ with an exception pointing to the index being corrupt.
-Reindexing does not resolve other problems, such that queries not returning data. For such cases, it is _not_
-recommended to reindex (also because this can be very slow and use a lot of temporary disk space).
+Reindexing does not resolve other problems, such as queries not returning data.
+For such cases, it is _not_ recommended to reindex (also because this can be very slow and use a lot of temporary disk space).
If queries don't return the right data, then possibly the index is [not yet up-to-date][OAK-5159],
-or the query is incorrect, or included/excluded path settings are wrong (for Lucene indexes). Instead of reindexing, it
-is suggested to first check the log file, modify the query so it uses a different index or traversal and run the query again.
+or the query is incorrect, or included/excluded path settings are wrong (for Lucene indexes).
+Instead of reindexing, it is suggested to first check the log file,
+modify the query so it uses a different index or traversal and run the query again.
One case where reindexing can help is if the query engine picks a very slow index for some queries because the counter index
[got out of sync after adding and removing lots of nodes many times (fixed in recent version)][OAK-4065].
For this case, it is recommended to verify the contents of the counter index first,
and upgrade Oak before reindexing.
-Also note that with Oak 1.6 for Lucene indexes changes in index definition are only effective
-[post reindexing](lucene.html#stored-index-definition)
+Also note that with Oak 1.6, for Lucene indexes, changes in the index definition are only effective
+[post reindexing](lucene.html#stored-index-definition).
-To reindex any index set the `reindex` flag to `true` in index definition
+To reindex any index, set the `reindex` flag to `true` in the index definition:
/oak:index/userIndex
- jcr:primaryType = "oak:QueryIndexDefinition"
- async = ['async']
- reindex = true
-Once changes are saved the index would be reindexed. For synchronous indexes the reindexing would be done
-as part of save (or commit) itself. While for asynchronous indexes are reindexed whenever the next async
-indexing cycle run happens. Once reindexing starts following log entries can be seen in the log
+Once changes are saved, the index is reindexed. For synchronous indexes,
+the reindexing is done as part of the save (or commit) itself.
+For asynchronous indexes, reindexing starts with the next async indexing cycle.
+Once reindexing starts, the following entries can be seen in the log:
[async-index-update-async] o.a.j.o.p.i.IndexUpdate Reindexing will be performed for following indexes: [/oak:index/userIndex]
[async-index-update-async] o.a.j.o.p.i.IndexUpdate Reindexing Traversed #100000 /home/user/admin
[async-index-update-async] o.a.j.o.p.i.AsyncIndexUpdate [async] Reindexing completed for indexes: [/oak:index/userIndex*(4407016)] in 30 min
-
-In both cases once reindexing is complete the `reindex` flag would be removed.
-For property index you can also make use of `PropertyIndexAsyncReindexMBean`. Refer to
-[reindeinxing property indexes](property-index.html#reindexing) section for more details on that
+In both cases, once reindexing is complete, the `reindex` flag is removed.
+
+For a property index, you can also make use of the `PropertyIndexAsyncReindexMBean`.
+See the [reindexing property indexes](property-index.html#reindexing) section for more details on that.
[OAK-5159]: https://issues.apache.org/jira/browse/OAK-5159