Posted to commits@jackrabbit.apache.org by ca...@apache.org on 2018/12/18 01:07:21 UTC

svn commit: r1849136 - in /jackrabbit/site/live/oak/docs: img/facets-statistical-error-rate-plot.png query/lucene.html

Author: catholicon
Date: Tue Dec 18 01:07:20 2018
New Revision: 1849136

URL: http://svn.apache.org/viewvc?rev=1849136&view=rev
Log:
OAK-301: Oak docu

Added:
    jackrabbit/site/live/oak/docs/img/facets-statistical-error-rate-plot.png   (with props)
Modified:
    jackrabbit/site/live/oak/docs/query/lucene.html

Added: jackrabbit/site/live/oak/docs/img/facets-statistical-error-rate-plot.png
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/img/facets-statistical-error-rate-plot.png?rev=1849136&view=auto
==============================================================================
Binary file - no diff available.

Propchange: jackrabbit/site/live/oak/docs/img/facets-statistical-error-rate-plot.png
------------------------------------------------------------------------------
    svn:mime-type = image/png

Modified: jackrabbit/site/live/oak/docs/query/lucene.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/query/lucene.html?rev=1849136&r1=1849135&r2=1849136&view=diff
==============================================================================
--- jackrabbit/site/live/oak/docs/query/lucene.html (original)
+++ jackrabbit/site/live/oak/docs/query/lucene.html Tue Dec 18 01:07:20 2018
@@ -1,13 +1,13 @@
 <!DOCTYPE html>
 <!--
- | Generated by Apache Maven Doxia Site Renderer 1.8.1 at 2018-12-09 
+ | Generated by Apache Maven Doxia Site Renderer 1.8.1 at 2018-12-18 
  | Rendered using Apache Maven Fluido Skin 1.6
 -->
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <head>
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <meta name="Date-Revision-yyyymmdd" content="20181209" />
+    <meta name="Date-Revision-yyyymmdd" content="20181218" />
     <meta http-equiv="Content-Language" content="en" />
     <title>Jackrabbit Oak &#x2013; Lucene Index</title>
     <link rel="stylesheet" href="../css/apache-maven-fluido-1.6.min.css" />
@@ -142,7 +142,7 @@
 
       <div id="breadcrumbs">
         <ul class="breadcrumb">
-        <li id="publishDate">Last Published: 2018-12-09<span class="divider">|</span>
+        <li id="publishDate">Last Published: 2018-12-18<span class="divider">|</span>
 </li>
           <li id="projectVersion">Version: 1.10-SNAPSHOT</li>
         </ul>
@@ -1413,18 +1413,73 @@ Copied 8.5 MB in 218.7 ms
           - propertyIndex = true
 </pre></div></div>
 
-<p>Specific facet related features for Lucene property index can be configured in a separate <i>facets</i> node below the index definition. By default ACL checks are always performed on facets by the Lucene property index however this can be avoided by setting the property <i>secure</i> to <i>false</i> in the <i>facets</i> configuration node. <tt>@since Oak 1.5.15</tt> The no. of facets to be retrieved is configurable via the <i>topChildren</i> property, which defaults to 10.</p>
+<p>Facet-specific features of the Lucene property index can be configured in a separate <i>facets</i> node below the index definition. <tt>@since Oak 1.5.15</tt> The number of facets to be retrieved is configurable via the <i>topChildren</i> property, which defaults to 10.</p>
 
 <div>
 <div>
-<pre class="source">/oak:index/lucene-with-unsecure-facets
+<pre class="source">/oak:index/lucene-with-more-facets
   - jcr:primaryType = &quot;oak:QueryIndexDefinition&quot;
   - compatVersion = 2
   - type = &quot;lucene&quot;
   - async = &quot;async&quot;
   + facets
     - topChildren = 100
-    - secure = false
+  + indexRules
+    - jcr:primaryType = &quot;nt:unstructured&quot;
+    + nt:base
+      + properties
+        - jcr:primaryType = &quot;nt:unstructured&quot;
+        + tags
+          - facets = true
+          - propertyIndex = true
+</pre></div></div>
+
+<p>By default, the Lucene property index always performs ACL checks on facets. How these checks are done can be configured via the <i>secure</i> property in the <i>facets</i> configuration node. <tt>@since Oak 1.6.16, 1.8.10, 1.9.13</tt> the <tt>secure</tt> property is a string with allowed values <tt>secure</tt>, <tt>statistical</tt> and <tt>insecure</tt>, with <tt>secure</tt> being the default. Before that, <tt>secure</tt> was a boolean property; to maintain compatibility, <tt>false</tt> maps to <tt>insecure</tt> while <tt>true</tt> (the default at the time) maps to <tt>secure</tt>.</p>
+<p>For <tt>insecure</tt> facets, the facet counts reported by the Lucene index are returned as is. For the <tt>secure</tt> configuration, all results of a query are checked for access permissions and the facet counts returned by the index are updated accordingly. This can be very expensive for large result sets. As a trade-off, the <tt>statistical</tt> configuration randomly samples some items (<tt>1000</tt> by default, configurable via <tt>sampleSize</tt>) and checks ACLs only for those samples. Facet counts returned by the index are then scaled in proportion to the percentage of sampled items that turned out to be accessible. Do note that the <a class="externalLink" href="https://onlinecourses.science.psu.edu/stat100/node/16/">beauty of sampling</a> is that a sample size of <tt>1000</tt> would have a 3% error rate with 95% confidence. But that&#x2019;s a theoretical limit for an infinite number of experiments. In practice, a low rate of accessible documents decreases the chances of reaching that average rate. To give a sense of the error rate to expect, here&#x2019;s how errors looked in different test scenarios with a sample size of 1000, with the error averaged over 1000 random runs for each scenario.</p>
+
+<div>
+<div>
+<pre class="source">|-----------------|-----------------------|------------------------|
+| Result set size | % of accessible nodes | Avg error in 1000 runs |
+|-----------------|-----------------------|------------------------|
+| 2000            |  5                    |  5.79                  |
+| 5000            |  5                    |  9.99                  |
+| 10000           |  5                    |  10.938                |
+| 100000          |  5                    |  11.13                 |
+|                 |                       |                        |
+| 2000            | 25                    | 2.4192004              |
+| 5000            | 25                    | 3.8087976              |
+| 10000           | 25                    | 4.096                  |
+| 100000          | 25                    | 4.3699985              |
+|                 |                       |                        |
+| 2000            | 50                    | 1.3990011              |
+| 5000            | 50                    | 2.2695997              |
+| 10000           | 50                    | 2.5303981              |
+| 100000          | 50                    | 2.594599               |
+|                 |                       |                        |
+| 2000            | 75                    | 0.80360085             |
+| 5000            | 75                    | 1.1929348              |
+| 10000           | 75                    | 1.4357346              |
+| 100000          | 75                    | 1.4272015              |
+|                 |                       |                        |
+| 2000            | 95                    | 0.30958                |
+| 5000            | 95                    | 0.52715933             |
+| 10000           | 95                    | 0.5109484              |
+| 100000          | 95                    | 0.5481065              |
+|-----------------|-----------------------|------------------------|
+</pre></div></div>
+
+<p><img src="../img/facets-statistical-error-rate-plot.png" alt="error rate plot" /></p>
+<p>Notice that the error rate does increase with larger result set sizes, but it flattens out after around 10000 results. Also note that even with 50% of results being accessible, the error rate averages less than 3%.</p>
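The proportional update used by the statistical mode can be pictured with a small Python sketch. This is only an illustration of the idea as described in the text, not Oak's actual implementation: the function name `estimate_facet_counts`, its parameters, and the `is_accessible` callback are hypothetical, and a single accessibility ratio is assumed to apply to every facet label.

```python
import random


def estimate_facet_counts(index_counts, result_ids, is_accessible,
                          sample_size=1000, rng=None):
    """Scale raw index facet counts by the accessible share of a random sample.

    index_counts  -- dict of facet label -> count as reported by the index
    result_ids    -- list of identifiers for all results of the query
    is_accessible -- hypothetical callback standing in for an ACL check
    """
    rng = rng or random.Random()
    if not result_ids:
        return {label: 0 for label in index_counts}

    # Check every result if there are fewer results than the sample size,
    # otherwise check a uniformly random subset only.
    if len(result_ids) <= sample_size:
        sample = list(result_ids)
    else:
        sample = rng.sample(result_ids, sample_size)

    accessible = sum(1 for doc_id in sample if is_accessible(doc_id))
    ratio = accessible / len(sample)

    # Update each facet count in proportion to the accessible share.
    return {label: round(count * ratio) for label, count in index_counts.items()}


if __name__ == "__main__":
    counts = {"tagA": 100, "tagB": 50}
    ids = list(range(10000))
    # Assumed ACL: only even-numbered documents are readable.
    print(estimate_facet_counts(counts, ids, lambda i: i % 2 == 0,
                                rng=random.Random(42)))
```

Only `sample_size` ACL checks are performed regardless of how many results the query matches, which is where the performance gain over the fully secure mode comes from.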
+<p>So, in most cases, a sample size of 1000 should give a fairly decent estimate of facet counts. On the off chance that error rates are intolerable for a given setup, the sample size can be configured with the <i>sampleSize</i> property under the <i>facets</i> configuration node. Error rates are generally inversely proportional to <tt>&#x221a;sample-size</tt>. So, to halve the error rate, the sample size needs to be increased 4 times.</p>
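The 3% figure and the square-root relationship both follow from the standard margin-of-error formula for a sampled proportion. The sketch below assumes the usual worst case of p = 0.5 and z = 1.96 for 95% confidence; the function name `margin_of_error` is made up for illustration and is not part of Oak.

```python
import math


def margin_of_error(sample_size, p=0.5, z=1.96):
    """Theoretical margin of error for a proportion estimated from a sample.

    p = 0.5 is the worst case and z = 1.96 corresponds to 95% confidence;
    both defaults are textbook assumptions, not Oak settings.
    """
    return z * math.sqrt(p * (1.0 - p) / sample_size)


if __name__ == "__main__":
    for n in (1000, 1500, 4000):
        print(n, round(margin_of_error(n), 4))
```

For a sample size of 1000 this gives roughly 0.031, and a sample size of 4000 gives half of that, matching the rule that the sample size must grow 4 times to halve the error.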
+<p>A canonical example of the <tt>statistical</tt> configuration would look like:</p>
+
+<div>
+<div>
+<pre class="source">/oak:index/lucene-with-statistical-facets
+  + facets
+    - secure = &quot;statistical&quot;
+    - sampleSize = 1500
   + indexRules
     - jcr:primaryType = &quot;nt:unstructured&quot;
     + nt:base