Posted to oak-commits@jackrabbit.apache.org by th...@apache.org on 2016/09/15 09:44:00 UTC
svn commit: r1760904 - /jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/property-index.md

Author: thomasm
Date: Thu Sep 15 09:43:59 2016
New Revision: 1760904

URL: http://svn.apache.org/viewvc?rev=1760904&view=rev
Log:
OAK-301 Document Oak - property index cost

Modified:
    jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/property-index.md

Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/property-index.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/property-index.md?rev=1760904&r1=1760903&r2=1760904&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/property-index.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/property-index.md Thu Sep 15 09:43:59 2016
@@ -84,6 +84,15 @@ or to simplify you can use one of the ex
 
 #### Reindexing
 
+Usually, reindexing is only needed if the configuration of an index is changed, 
+such that the index should contain more or different data.
+For example, reindexing is needed if the property to be indexed is changed, 
+if a nodetype is added to __`declaringNodeTypes`__, or if __`includedPaths`__ is changed.
+It is not strictly needed if less data is to be indexed, for example if a nodetype is removed.
+However, to save space, it might make sense to reindex even in that case.
+Typically, if a query does not return the expected result, reindexing does not help;
+more likely, the cause lies elsewhere, and disabling the index should be tried first.
+
 Reindexing a property index happens synchronously by setting the __`reindex`__ flag to __`true`__. This means that the 
 first #save call will generate a full repository traversal with the purpose of building the index content and it might
 take a long time.
@@ -106,3 +115,44 @@ Example:
         .setProperty("reindex", true);
     }
 
+#### Cost Estimation
+
+When running a query, the property index reports its estimated cost to the query engine,
+and then the query engine picks the index with the lowest cost (cost-based query optimization).
+The algorithm to calculate the estimated cost is roughly as follows (a bit simplified):
+
+* The cost is infinity (so the index is never used) 
+  if the condition contains a fulltext constraint, 
+  contains no applicable restriction,
+  uses the wrong nodetype, or
+  if the path filtering (`includedPaths` / `excludedPaths`) does not match the query.
+* For the nodetype index, the cost is the sum of the cost for the `jcr:primaryType` lookup
+  (if the primary type is known),
+  plus the cost for the `jcr:mixinTypes` lookup (if that is known).
+* Otherwise, the cost is based on the overhead (which is 2), 
+  plus the estimated number of entries.
+* For an "x is not null" condition, 
+  the estimated number of entries is
+  either the configured `entryCount` or, if not set, the 
+  approximate number of entries in the index.
+  The approximation is an "order of magnitude" estimation (Morris' algorithm).
+* For a unique index and an "x = 1" condition, 
+  the estimated number of entries is either 0 or 1 
+  (depending on whether the key is found).
+* For a non-unique index and a "x = 1" condition,
+  if the `entryCount` and `keyCount` are set, those settings are used to estimate
+  the number of entries. If not, the 
+  approximate number of entries for the key is read (maintained using Morris' algorithm).
+  In addition to that, the path condition is used to scale down
+  the estimated count depending on the approximate number of nodes
+  in that subtree versus the approximate number of nodes
+  in the repository, using the approximation available via the `counter` index.
+
+For example, for a query with path restriction "/content/products/t-shirts" and property restriction
+"color = 'red'", if there is an index for the property "color", then
+the entry count approximation is read from the index. Let's say it is 10'000 for this value.
+Then the approximate number of nodes in the subtree "/content/products/t-shirts" is read 
+(let's say it is 20'000), and the approximate number of nodes in the repository 
+(let's say it is 1 million).
+Therefore, the estimated number of entries is scaled down (divided by 50) from 10'000 to 200.
+The estimated cost is therefore 202, due to the overhead of 2.
\ No newline at end of file
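
The arithmetic in the worked example above (non-unique index, "x = 1" condition, path restriction) can be sketched in a few lines of Java. This is only an illustration of the description in the commit, not Oak's actual implementation: the class name `PropertyIndexCostSketch`, the methods `scaleByPath` and `estimateCost`, and the `indexApplicable` flag are hypothetical names introduced here, and the three counts are assumed to have been read already from the index and the `counter` index.

```java
public class PropertyIndexCostSketch {

    /** Fixed overhead added to every finite estimate (2, as stated in the description). */
    public static final double OVERHEAD = 2.0;

    /**
     * Scales the approximate number of entries for a key by the share of
     * repository nodes that live below the query path.
     * All three inputs are approximations supplied by the caller (hypothetical API).
     */
    public static double scaleByPath(long entriesForKey, long nodesInSubtree, long nodesInRepository) {
        // Without usable statistics, or if the subtree covers the whole
        // repository, leave the estimate unscaled.
        if (nodesInRepository <= 0 || nodesInSubtree >= nodesInRepository) {
            return entriesForKey;
        }
        return (double) entriesForKey * nodesInSubtree / nodesInRepository;
    }

    /**
     * Estimated cost: infinity if the index cannot be used at all
     * (fulltext constraint, wrong nodetype, path filter mismatch, ...),
     * otherwise the overhead plus the scaled entry estimate.
     */
    public static double estimateCost(boolean indexApplicable, long entriesForKey,
                                      long nodesInSubtree, long nodesInRepository) {
        if (!indexApplicable) {
            return Double.POSITIVE_INFINITY;
        }
        return OVERHEAD + scaleByPath(entriesForKey, nodesInSubtree, nodesInRepository);
    }

    public static void main(String[] args) {
        // Numbers from the example: 10'000 entries for color = 'red',
        // 20'000 nodes below /content/products/t-shirts, 1 million nodes in total.
        System.out.println(estimateCost(true, 10_000, 20_000, 1_000_000)); // prints 202.0
    }
}
```

Because an unusable index reports infinity, a query engine that picks the lowest cost will never select it, which matches the first bullet of the algorithm.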