Posted to oak-commits@jackrabbit.apache.org by th...@apache.org on 2016/09/15 09:44:00 UTC
svn commit: r1760904 -
/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/property-index.md
Author: thomasm
Date: Thu Sep 15 09:43:59 2016
New Revision: 1760904
URL: http://svn.apache.org/viewvc?rev=1760904&view=rev
Log:
OAK-301 Document Oak - property index cost
Modified:
jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/property-index.md
Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/property-index.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/property-index.md?rev=1760904&r1=1760903&r2=1760904&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/property-index.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/property-index.md Thu Sep 15 09:43:59 2016
@@ -84,6 +84,15 @@ or to simplify you can use one of the ex
#### Reindexing
+Usually, reindexing is only needed if the configuration of an index is changed,
+such that the index should contain more or different data.
+For example, reindexing is needed if the property to be indexed is changed,
+if a nodetype is added to __`declaringNodeTypes`__, or if __`includedPaths`__ is changed.
+It is not strictly needed if less data is to be indexed, for example if a nodetype is removed.
+However, to save space, it might make sense to reindex even in that case.
+Typically, if a query does not return the expected result, reindexing does not help;
+more likely, the reason is to be found somewhere else, and disabling the index should be tried first.
+
Reindexing a property index happens synchronously by setting the __`reindex`__ flag to __`true`__. This means that the
first #save call will generate a full repository traversal with the purpose of building the index content and it might
take a long time.
@@ -106,3 +115,44 @@ Example:
.setProperty("reindex", true);
}
+#### Cost Estimation
+
+When running a query, the property index reports its estimated cost to the query engine,
+and then the query engine picks the index with the lowest cost (cost-based query optimization).
+The algorithm to calculate the estimated cost is roughly as follows (a bit simplified):
+
+* The cost is infinity (so the index is never used)
+ if the condition contains a fulltext constraint,
+ no applicable restriction,
+ the wrong nodetype, or
+ if the path filtering (`includedPaths` / `excludedPaths`) does not match the query.
+* For the nodetype index, the cost is the sum of the cost for the `jcr:primaryType` lookup
+ (if the primary type is known),
+ plus the cost for the `jcr:mixinTypes` lookup (if that is known).
+* Otherwise, the cost is based on the overhead (which is 2),
+ plus the estimated number of entries.
+* For an "x is not null" condition,
+ the estimated number of entries is
+ either the configured `entryCount` or, if not set, the
+ approximate number of entries in the index.
+ The approximation is an "order of magnitude" estimation (Morris' algorithm).
+* For a unique index and an "x = 1" condition,
+ the estimated number of entries is either 0 or 1
+ (depending on whether the key is found).
+* For a non-unique index and a "x = 1" condition,
+  if the `entryCount` and `keyCount` are set, those settings are used to estimate
+  the number of entries. If not, the
+  approximate number of entries for the key is read (maintained using Morris' algorithm).
+ In addition to that, the path condition is used to scale down
+ the estimated count depending on the approximate number of nodes
+ in that subtree versus the approximate number of entries
+ in the repository, using approximation available via the `counter` index.
+
+For example, for a query with path restriction "/content/products/t-shirts" and property restriction
+"color = 'red'", if there is an index for the property "color", then
+the entry count approximation is read from the index. Let's say it is 10'000 for this value.
+Then the approximate number of nodes in the subtree "/content/products/t-shirts" is read
+(let's say it is 20'000), and the approximate number of nodes in the repository
+(let's say it is 1 million).
+Therefore, the estimated number of entries is scaled down (divided by 50) from 10'000 to 200.
+The estimated cost is therefore 202, due to the overhead of 2.
\ No newline at end of file
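
The worked example at the end of the new "Cost Estimation" section can be sketched in Java as follows. This is only an illustration of the arithmetic described in the text, not Oak's actual implementation: the class name `PropertyIndexCostSketch`, the method names, and the fallback of not scaling when the counts are unusable are assumptions made for this sketch.

```java
public class PropertyIndexCostSketch {

    // Fixed overhead added to the estimated entry count (2, per the documentation text).
    static final double OVERHEAD = 2;

    /**
     * Scales the estimated entry count for a key by the share of the
     * repository that lies below the path restriction.
     * Hypothetical helper; the "do not scale" fallback is an assumption.
     */
    static double scaleByPath(double entriesForKey,
                              double nodesInSubtree,
                              double nodesInRepository) {
        if (nodesInRepository <= 0 || nodesInSubtree >= nodesInRepository) {
            // No usable statistics, or the subtree covers everything: keep the count.
            return entriesForKey;
        }
        // Multiply first, then divide, to keep the arithmetic exact for whole numbers.
        return entriesForKey * nodesInSubtree / nodesInRepository;
    }

    /** Estimated cost = overhead + estimated number of entries after path scaling. */
    static double estimateCost(double entriesForKey,
                               double nodesInSubtree,
                               double nodesInRepository) {
        return OVERHEAD + scaleByPath(entriesForKey, nodesInSubtree, nodesInRepository);
    }

    public static void main(String[] args) {
        // Values from the example: 10'000 entries for color = 'red',
        // 20'000 nodes below /content/products/t-shirts, 1 million nodes overall.
        double cost = estimateCost(10_000, 20_000, 1_000_000);
        System.out.println(cost); // prints 202.0
    }
}
```

With the numbers from the example, the subtree holds 1/50 of the repository, so the 10'000 entries are scaled down to 200 and the overhead of 2 brings the cost to 202.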