You are viewing a plain text version of this content. The canonical link for it is here.
Posted to oak-commits@jackrabbit.apache.org by ch...@apache.org on 2017/03/20 07:31:50 UTC
svn commit: r1787690 - in /jackrabbit/oak/trunk/oak-doc/src/site/markdown:
architecture/nodestate.md query/indexing.md
Author: chetanm
Date: Mon Mar 20 07:31:50 2017
New Revision: 1787690
URL: http://svn.apache.org/viewvc?rev=1787690&view=rev
Log:
OAK-5946 - Document indexing flow
Added:
jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md
Modified:
jackrabbit/oak/trunk/oak-doc/src/site/markdown/architecture/nodestate.md
Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/architecture/nodestate.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/architecture/nodestate.md?rev=1787690&r1=1787689&r2=1787690&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/architecture/nodestate.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/architecture/nodestate.md Mon Mar 20 07:31:50 2017
@@ -269,7 +269,7 @@ until all the hooks have had a chance to
final node state is then persisted as a new revision and made available to
other Oak clients.
-## Commit editors
+## <a name="commit-editors"/> Commit editors
In practice most commit hooks are interested in the content diff as returned
by the `compareAgainstBaseState` call mentioned above. This call can be
Added: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md?rev=1787690&view=auto
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md (added)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md Mon Mar 20 07:31:50 2017
@@ -0,0 +1,156 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+ -->
+
+# Indexing
+
+## <a name="overview"></a> Overview
+
+For queries to perform well Oak supports indexing content stored in repository. Indexing works
+on diff between the base NodeState and modified NodeState. Depending on how diff is performed and
+when the index content gets updated there are 3 types of indexing modes
+
+1. Synchronous Indexing
+2. Asynchronous Indexing
+3. Near real time indexing
+
+Indexing makes use of [Commit Editors](../architecture/nodestate.html#commit-editors). Some of the editors
+are `IndexEditor` which are responsible for updating index content based on changes in main content. Currently
+Oak has following in built `IndexEditor`s
+
+1. PropertyIndexEditor
+2. ReferenceEditor
+3. LuceneIndexEditor
+4. SolrIndexEditor
+
+## <a name="indexing-flow"></a> Indexing Flow
+
+`IndexEditor` are invoked as part of commit or as part of asynchronous diff process. For both cases at some stage
+diff is performed between _before_ and _after_ state and passed to `IndexUpdate` which is responsible for invoking
+`IndexEditor` based on _discovered_ index definitions.
+
+### <a name="index-defnitions"></a> Index Definitions
+
+Index definitions are nodes of type `oak:QueryIndexDefinition` which are stored under a special node named `oak:index`.
+As part of diff traversal at each level `IndexUpdate` would look for `oak:index` nodes. The index definitions nodes have
+following properties
+
+1. `type` - It determines the _type_ of index. For e.g. it can be `property`, `lucene`, `solr` etc. Based on the `type`
+ `IndexUpdate` would look for `IndexEditor` of given type from registered `IndexEditorProvider`
+2. `async` - It determines if the index is to be updated synchronously or asynchronously. It can have following values
+ * `sync` - Also the default value. It indicates that index is meant to be updated as part of commit
+ * `nrt` - Indicates that index is a [near real time](#nrt-indexing) index.
+ * `async` - Indicates that index is to be updated asynchronously. In such a case this value is used to determine
+ the [indexing lane](#indexing-lane)
+ * Any other value which ends in `async`.
+
+Based on above 2 properties `IndexUpdate` creates `IndexEditor` instances as it traverses the diff and registers them
+with itself passing on the callbacks for various changes
+
+#### <a name="oak-index-nodes"></a>oak:index node
+
+Indexing logic supports placing `oak:index` nodes at any path. Depending on the location such indexes would only index
+content which are present under those paths. So for e.g. if 'oak:index' is present at _'/content/oak:index'_ then indexes
+defined under that node would only index repository state present under _'/content'_
+
+Depending on type of index one can create these index definitions under root path ('/') or non root paths. Currently
+only `lucene` indexes support creating index definitions at non root paths. `property` indexes can only be created
+under root path i.e. under '/'
+
+### <a name="sync-indexing"></a> Synchronous Indexing
+
+Under synchronous indexing the index content gets updates as part of commit itself. Changes to both index content
+and main content are done atomically in single commit.
+
+This mode is currently supported by `property` and `reference` indexes
+
+### <a name="async-indexing"></a> Asynchronous Indexing
+
+Asynchronous Indexing (also referred as async indexing) is performed using periodic scheduled jobs. As part of setup
+Oak would schedule certain periodic jobs which would perform diff of the repository content and update the index content
+based on that diff.
+
+Each periodic job i.e. `AsyncIndexUpdate` is assigned to an [indexing lane](#indexing-lane) and is scheduled to run at
+certain interval. At time of execution the job would perform work
+
+1. Look for last indexed state via stored checkpoint data. If such a checkpoint exist then resolve the `NodeState` for
+ that checkpoint. If no such state exist or no such checkpoint is present then it treats it as initial indexing case where
+ base state is set to empty. This state is considered as `before` state
+2. Create a checkpoint for _current_ state and refer to this as `after` state
+3. Create an `IndexUpdate` instance bound to current _indexing lane_ and trigger a diff between the `before` and
+ `after` state
+4. `IndexUpdate` would then pick up index definitions which are bound to current indexing lane and would create
+ `IndexEditor` instances for them and pass them the diff callbacks
+5. The diff traverses in a depth first manner and at the end of diff the `IndexEditor` would do final changes for
+ current indexing run. Depending on index implementation the index data can be either stored in NodeStore itself
+ (e.g. lucene) or in any remote store (e.g. solr)
+6. `AsyncIndexUpdate` would then update the last indexed checkpoint to current checkpoint and do a commit.
+
+#### <a name="checkpoint"></a> Checkpoint
+
+Checkpoint is a mechanism whereby a client of NodeStore can request it to ensure that repository state at that time
+can be preserved and not garbage collected by revision garbage collection process. Later that state can be retrieved
+back from NodeStore by passing the checkpoint back. You can treat checkpoint like a named revision or a tag in git
+repo.
+
+Async indexing makes use of checkpoint support to access older repository state.
+
+#### <a name="indexing-lane"></a> Indexing Lane
+
+Indexing lane refers to a set of indexes which are to be indexed by given async indexer. Each index definition meant for
+async indexing defines an `async` property whose value is the name of indexing lane. For e.g. consider following 2 index
+definitions
+
+ /oak:index/userIndex
+ - jcr:primaryType = "oak:QueryIndexDefinition"
+ - async = "async"
+
+ /oak:index/assetIndex
+ - jcr:primaryType = "oak:QueryIndexDefinition"
+ - async = "fulltext-async"
+
+Here _userIndex_ is bound to "async" indexing lane while _assetIndex_ is bound to "fulltext-async" lane. Oak
+[setup](#async-index-setup) would configure 2 `AsyncIndexUpdate` jobs one for "async" and one for "fulltext-async".
+When job for "async" would run it would only process index definition where `async` value is `async` while when job
+for "fulltext-async" would run it would pick up index definitions where `async` value is `fulltext-async`.
+
+These jobs can be scheduled to run at different intervals and also on different cluster nodes. Each job would keep its
+own bookkeeping of checkpoint state and can be [paused and resumed](#async-index-mbean) separately.
+
+Prior to Oak 1.4 there was only one indexing lane `async`. In Oak 1.4 support was added to create 2 lanes `async` and
+`fulltext-async`. With 1.6 its possible to [create multiple lanes](#async-index-setup).
+
+#### <a name="async-index-lag"></a> Indexing Lag
+
+#### <a name="async-index-setup"></a> Setup
+
+`Since 1.6`
+
+#### <a name="cluster"></a> Clustered Setup
+
+#### <a name="async-index-mbean"></a> Clustered Setup
+
+## <a name="nrt-indexing"></a> Near Real Time Indexing
+
+## Index Types
+
+### Property Indexes
+
+### Lucene Indexes
+
+
+
+
\ No newline at end of file