Posted to oak-commits@jackrabbit.apache.org by ch...@apache.org on 2017/03/16 11:49:43 UTC
svn commit: r1787162 - /jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md

Author: chetanm
Date: Thu Mar 16 11:49:42 2017
New Revision: 1787162

URL: http://svn.apache.org/viewvc?rev=1787162&view=rev
Log:
OAK-5917 - Document enhancements in indexing in 1.6

Add table of contents

Modified:
    jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md

Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md?rev=1787162&r1=1787161&r2=1787162&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md Thu Mar 16 11:49:42 2017
@@ -17,6 +17,45 @@
 
 ## Lucene Index
 
+* [Index Definition](#index-definition)
+    * [Indexing Rules](#indexing-rules)
+        * [Cost Overrides](#cost-overrides)
+        * [Indexing Rule inheritance](#indexing-rule-inheritence)
+        * [Property Definitions](#property-definitions)
+        * [Evaluate Path Restrictions](#path-restrictions)
+        * [Include and Exclude paths from indexing](#include-exclude)
+    * [Aggregation](#aggregation)
+    * [Analyzers](#analyzers)
+        * [Specify analyzer class directly](#analyzer-classes)
+        * [Create analyzer via composition](#analyzer-composition)
+    * [Codec](#codec)
+    * [Boost and Search Relevancy](#boost)
+* [LuceneIndexProvider Configuration](#osgi-config)
+* [Tika Config](#tika-config)
+    * [Mime type usage](#mime-type-usage)
+* [Non Root Index Definitions](#non-root-index)
+* [Native Query and Index Selection](#native-query)
+* [CopyOnRead](#copy-on-read)
+* [CopyOnWrite](#copy-on-write)
+* [Lucene Index MBeans](#mbeans)
+* [Analyzing created Lucene Index](#luke)
+* [Pre-Extracting Text from Binaries](#text-extraction)
+* [Advanced search features](#advanced-search-features)
+    * [Suggestions](#suggestions)
+    * [Spellchecking](#spellchecking)
+    * [Facets](#facets)
+    * [Score Explanation](#score-explanation)
+    * [Custom hooks](#custom-hooks)
+* [Design Considerations](#design-considerations)
+* [Lucene Index vs Property Index](#lucene-vs-property)
+* [Examples](#examples)
+    * [A - Simple queries](#simple-queries)
+    * [B - Queries for structured content](#queries-structured-content)
+        * [UC1 - Find all assets which are having `status` as `published`](#uc1)
+        * [UC2 - Find all assets which are having `status` as `published` sorted by last modified date](#uc2)
+        * [UC3 - Find all assets where comment contains _december_](#uc3)
+        * [UC4 - Find all assets which are created by David and refer to december](#uc4)
+
 Oak supports Lucene based indexes to support both property constraint and full
 text constraints. Depending on the configuration a Lucene index can be used
 to evaluate property constraints, full text constraints, path restrictions
@@ -74,7 +113,7 @@ The Lucene index needs to be configured
               - isRegexp = true
               - nodeScopeIndex = true
 
-### Index Definition
+### <a name="index-definition"></a> Index Definition
 
 Lucene index definition consist of `indexingRules`, `analyzers` ,
 `aggregates` etc which determine which node and properties are to be indexed
@@ -162,7 +201,7 @@ compatVersion
 [maxFieldLength][OAK-2469]
 : Numbers of terms indexed per field. Defaults to 10000
 
-#### Indexing Rules
+#### <a name="indexing-rules"></a> Indexing Rules
 
 Indexing rules defines which types of node and properties are indexed. An
 index configuration can define one or more `indexingRules` for different
@@ -225,7 +264,7 @@ indexNodeName
     * //element(*, app:Asset)[fn:name() = 'kite'] 
     * //element(kite, app:Asset)
 
-##### Cost Overrides
+##### <a name="cost-overrides"></a> Cost Overrides
 
 By default, the cost of using this index is calculated follows: For each query,
 the overhead is one operation. For each entry in the index, the cost is one.
@@ -242,7 +281,7 @@ Cost per entry is the cost per node in t
 Using 0.5 means the cost is half, which means the index would be used used more often 
 (that is, even if there is a different index with similar cost).
 
-##### Indexing Rule inheritance
+##### <a name="indexing-rule-inheritence"></a>Indexing Rule inheritance
 
 `indexRules` are defined per nodeType and support nodeType inheritance. For
 example while indexing any node the indexer would lookup for applicable
@@ -255,7 +294,7 @@ which has `orderable` child nodes)
 If `inherited` is set to false on any rule then that rule would only be
 applicable if exact match is found
 
-##### Property Definitions
+##### <a name="property-definitions"></a>Property Definitions
 
 Each index rule consist of one ore more property definition defined under
 `properties`. Order of property definition node is important as some properties
@@ -395,8 +434,7 @@ Property name can be one of following
    `nodeScopeIndex=true` is akin to setting `indexNodeName=true` on indexing
    rule. (`@since Oak 1.3.15, 1.2.14`)
 
-<a name="path-restrictions"></a>
-##### Evaluate Path Restrictions
+##### <a name="path-restrictions"></a> Evaluate Path Restrictions
 
 Lucene index provides support for evaluating path restrictions natively.
 Consider a query like
@@ -412,8 +450,7 @@ would only return nodes which are under
 Enabling this feature would incur cost in terms of slight increase in index
 size. Refer to [OAK-2306][OAK-2306] for more details.
 
-<a name="include-exclude"></a>
-##### Include and Exclude paths from indexing
+##### <a name="include-exclude"></a> Include and Exclude paths from indexing
 
 `@since Oak 1.0.14, 1.2.3`
 
@@ -485,8 +522,7 @@ any overlap.
     
 Refer to [OAK-2599][OAK-2599] for more details.
 
-<a name="aggregation"></a>
-#### Aggregation
+#### <a name="aggregation"></a>Aggregation
 
 Sometimes it is useful to include the contents of descendant nodes into a single
 node to easier search on content that is scattered across multiple nodes.
@@ -595,7 +631,7 @@ defaults to 5
             - path = "renditions/original"
             - relativeNode = true
 
-#### Analyzers
+#### <a name="analyzers"></a>Analyzers
 
 `@since Oak 1.5.5, 1.4.7`
 Unless custom analyzer is configured (as documented below), in-built analyzer
@@ -623,7 +659,7 @@ The default analyzer can be configured v
             + pathText
             ...
 
-##### Specify analyzer class directly
+##### <a name="analyzer-classes"></a>Specify analyzer class directly
 
 If any of the out of the box analyzer is to be used then it can configured directly
 
@@ -644,7 +680,7 @@ the analyzer node
                 - luceneMatchVersion = "LUCENE_47" (optional)
                 + stopwords (nt:file)
 
-##### Create analyzer via composition
+##### <a name="analyzer-composition"></a>Create analyzer via composition
 
 Analyzers can also be composed based on `Tokenizers`, `TokenFilters` and
 `CharFilters`. This is similar to the support provided in Solr where you can
@@ -710,8 +746,7 @@ Points to note
 Note that currently only one analyzer can be configured per index. Its not possible to specify separate
 analyzer for query and index time currently. 
 
-<a name="codec"></a>
-#### Codec
+#### <a name="codec"></a>Codec
 
 Name of [Lucene Codec][lucene-codec] to use. By default if the index involves 
 fulltext indexing then Oak Lucene uses `OakCodec` which disables compression.
@@ -727,8 +762,7 @@ the codec to `Lucene46`
 Refer to [OAK-2853][OAK-2853] for details. Enabling the `Lucene46` codec
 would lead to smaller and compact indexes.
 
-<a name="boost"></a>
-#### Boost and Search Relevancy
+#### <a name="boost"></a>Boost and Search Relevancy
 
 `@since Oak 1.2.5`
 
@@ -788,8 +822,7 @@ Would have those node (of type app:Asset
 _jcr:title_. While those nodes where search text is found in other field
 like aggregated content would come later
 
-<a name="osgi-config"></a>
-### LuceneIndexProvider Configuration
+### <a name="osgi-config"></a>LuceneIndexProvider Configuration
 
 Some of the runtime aspects of the Oak Lucene support can be configured via OSGi
 configuration. The configuration needs to be done for PID `org.apache
@@ -818,7 +851,7 @@ debug
 : Boolean value. Defaults to `false`
 : If enabled then Lucene logging would be integrated with Slf4j
 
-### Tika Config
+### <a name="tika-config"></a>Tika Config
 
 `@since Oak 1.0.12, 1.2.3`
 
@@ -839,15 +872,15 @@ the config file via `tika/config.xml` no
     * maxExtractLength = -10, maxFieldLength = 10000 -> Actual value = 100000
     * maxExtractLength = 1000 -> Actual value = 1000
 
-#### Mime type usage
+#### <a name="mime-type-usage"></a>Mime type usage
 
 A binary would only be index if there is an associated property `jcr:mimeType` defined
 and that is supported by Tika. By default indexer uses [TypeDetector][OAK-2895]
 instead of default `DefaultDetector` which relies on the `jcr:mimeType` to pick up the
 right parser. 
 
-<a name="non-root-index"></a>
-### Non Root Index Definitions
+
+### <a name="non-root-index"></a>Non Root Index Definitions
 
 Lucene index definition can be defined at any location in repository and need
 not always be defined at root. For example if your query involves path 
@@ -859,8 +892,7 @@ Then you can create the required index d
 `/content/companya/oak:index/assetIndex`. In such a case that index would 
 contain data for the subtree under `/content/companya`
 
-<a name="native-query"></a>
-### Native Query and Index Selection
+### <a name="native-query"></a>Native Query and Index Selection
 
 Oak query engine supports native queries like
 
@@ -882,7 +914,7 @@ should be used
 
     //*[rep:native('lucene-assetIndex', 'name:(Hello OR World)')]
 
-### Persisting indexes to FileSystem
+### <a name="native-query"></a>Persisting indexes to FileSystem
 
 By default Lucene indexes are stored in the `NodeStore`. If required they can
 be stored on the file system directly
@@ -902,8 +934,8 @@ Note that this setup would only for thos
 backend `NodeStore` supports clustering then index data would not be 
 accessible on other cluster nodes
 
-<a name="copy-on-read"></a>
-### CopyOnRead
+
+### <a name="copy-on-read"></a>CopyOnRead
 
 Lucene indexes are stored in `NodeStore`. Oak Lucene provides a custom directory
 implementation which enables Lucene to load index from `NodeStore`. This 
@@ -924,8 +956,7 @@ For more details refer to [OAK-1724][OAK
 
 _With Oak 1.0.13 this feature is now enabled by default._
 
-<a name="copy-on-write"></a>
-### CopyOnWrite
+### <a name="copy-on-write"></a>CopyOnWrite
 
 `@since Oak 1.0.15, 1.2.3`
 
@@ -945,15 +976,14 @@ during the indexing process locally and
 For more details refer to [OAK-2247][OAK-2247]. This feature can be enabled via
 [Lucene Index provider service configuration](#osgi-config)
 
-### Lucene Index MBeans
+### <a name="mbeans"></a>Lucene Index MBeans
 
 Oak Lucene registers a JMX bean `LuceneIndex` which provide details about the 
 index content e.g. size of index, number of documents present in index etc
 
 ![Lucene Index MBean](lucene-index-mbean.png)
 
-<a name="luke"></a>
-### Analyzing created Lucene Index
+### <a name="luke"></a>Analyzing created Lucene Index
 
 [Luke]  is a handy development and diagnostic tool, which accesses already 
 existing Lucene indexes and allows you to display index details. In Oak 
@@ -997,8 +1027,7 @@ mentioned steps
         
 From the Luke UI shown you can access various details.
 
-<a name="text-extraction"></a>
-### Pre-Extracting Text from Binaries
+### <a name="text-extraction"></a>Pre-Extracting Text from Binaries
 
 `@since Oak 1.0.18, 1.2.3`
 
@@ -1047,9 +1076,9 @@ to validate if `PreExtractedTextProvider
 
 For more details on this feature refer to [OAK-2892][OAK-2892]
 
-### Advanced search features
+### <a name="advanced-search-features"></a>Advanced search features
 
-#### Suggestions
+#### <a name="suggestions"></a>Suggestions
 
 `@since Oak 1.1.17, 1.0.15`
 
@@ -1121,7 +1150,7 @@ or
 Note, the subset is done by filtering top 10 suggestions. So, it's possible to get no suggestions for a subtree query,
 if top 10 suggestions are not part of that subtree. For details look at [OAK-3994] and related issues.
 
-#### Spellchecking
+#### <a name="spellchecking"></a>Spellchecking
 
 `@since Oak 1.1.17, 1.0.13`
 
@@ -1165,7 +1194,7 @@ or
 Note, the subset is done by filtering top 10 spellchecks. So, it's possible to get no results for a subtree query,
 if top 10 spellchecks are not part of that subtree. For details look at [OAK-3994] and related issues.
 
-#### Facets
+#### <a name="facets"></a>Facets
 
 `@since Oak 1.3.14`
 
@@ -1213,7 +1242,7 @@ Specific facet related features for Luce
           - propertyIndex = true
 ```
 
-#### Score Explanation
+#### <a name="score-explanation"></a>Score Explanation
 
 `@since Oak 1.3.12`
 
@@ -1223,7 +1252,7 @@ e.g. `select [oak:scoreExplanation], * f
 _Note that showing explanation score is expensive. So, this feature should be used for debug purposes only_.
 
 
-#### Custom hooks
+#### <a name="custom-hooks"></a>Custom hooks
 
 `@since Oak 1.3.14`
 
@@ -1231,7 +1260,7 @@ In OSGi enviroment, implementations of `
 `org.apache.jackrabbit.oak.plugins.index.lucene.spi` (see javadoc [here][oak-lucene]) are called during indexing
 and querying as documented in javadocs.
 
-### Design Considerations
+### <a name="design-considerations"></a>Design Considerations
 
 Lucene index provides quite a few features to meet various query requirements. 
 While defining the index definition do consider the following aspects
@@ -1286,7 +1315,7 @@ nodetype as Table in your DB and all the
 in that table. Various property definitions can then be considered as index for 
 those columns. 
 
-### Lucene Index vs Property Index
+### <a name="lucene-vs-property"></a>Lucene Index vs Property Index
 
 Lucene based index can be restricted to index only specific properties and in that
 case it is similar to [Property Index](query.html#property-index). However it differs
@@ -1303,9 +1332,9 @@ from property index in following aspects
 2.  Lucene index cannot enforce uniqueness constraint - By virtue of it being asynchronous
     it cannot enforce uniqueness constraint.
 
-### Examples
+### <a name="examples"></a>Examples
 
-#### A - Simple queries
+#### <a name="simple-queries"></a>A - Simple queries
 
 In many cases the query is purely based on some specific property and is not 
 restricted to any specific nodeType
@@ -1414,7 +1443,7 @@ This can also be clubbed in same index d
           - name = "offTime"
 ```
 
-#### B - Queries for structured content
+#### <a name="queries-structured-content"></a>B - Queries for structured content
 
 Queries in previous examples were based on mostly unstructured content where no
 nodeType restrictions were applied. However in many cases the nodes being queried
@@ -1445,6 +1474,7 @@ confirm to certain structure. For exampl
 
 Content like above is then queried in multiple ways. So lets take first query
 
+<a name="uc1"></a>
 **UC1 - Find all assets which are having `status` as `published`**
 
 ```
@@ -1478,6 +1508,7 @@ Above index definition
 * Indexes all nodes of type `app:Asset` **only**
 * Indexes relative property `jcr:content/metadata/status` for all such nodes
 
+<a name="uc2"></a>
 **UC2 - Find all assets which are having `status` as `published` sorted by last 
 modified date**
 
@@ -1514,6 +1545,7 @@ Above index definition
 * Property type is set to `Date`
 * Indexes both `status` and `jcr:lastModified`
 
+<a name="uc3"></a>
 **UC3 - Find all assets where comment contains _december_**
 
 ```
@@ -1542,6 +1574,7 @@ Above index definition
 * `propertyIndex` is not enabled as this property is not going to be used to
   perform equality check
 
+<a name="uc4"></a>
 **UC4 - Find all assets which are created by David and refer to december **
 
 ```