Posted to oak-issues@jackrabbit.apache.org by "Vikas Saurabh (JIRA)" <ji...@apache.org> on 2018/12/03 23:17:00 UTC

[jira] [Comment Edited] (OAK-7929) Incorrect Facet Count With Large Dataset and ACLs

    [ https://issues.apache.org/jira/browse/OAK-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706465#comment-16706465 ] 

Vikas Saurabh edited comment on OAK-7929 at 12/3/18 11:16 PM:
--------------------------------------------------------------

Checking ACLs is fairly expensive, and to get an accurate count we'd have to run the ACL check over the whole result set. So, we'd expand the current boolean form of {{secure}} to an enum: {{insecure}}, {{statistical}} and {{secure}}.

The {{insecure}} mode won't do any ACL check and would return facet counts exactly as the index returns them. The {{secure}} mode would generalize the current behavior, which checks only the first 50 documents, to check the whole result set instead.
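
As a rough illustration, the three modes could be modeled as a Java enum along the lines below. The type name {{SecureFacetMode}} and the {{fromString}} helper are hypothetical, and mapping the legacy boolean strings {{true}}/{{false}} to {{secure}}/{{insecure}} is an assumption made here for backward compatibility, not something taken from the patch.

```java
import java.util.Locale;

/** Hypothetical enum for the three proposed facet security modes. */
enum SecureFacetMode {
    INSECURE,     // no ACL check; facet counts come straight from the index
    STATISTICAL,  // ACL-check a random sample and extrapolate the counts
    SECURE;       // ACL-check every document in the result set

    /**
     * Parses a configured value, returning the fallback for missing or
     * unknown input. Accepting "true"/"false" is an assumption.
     */
    static SecureFacetMode fromString(String value, SecureFacetMode fallback) {
        if (value == null) {
            return fallback;
        }
        switch (value.trim().toLowerCase(Locale.ROOT)) {
            case "insecure":
            case "false":
                return INSECURE;
            case "statistical":
                return STATISTICAL;
            case "secure":
            case "true":
                return SECURE;
            default:
                return fallback;
        }
    }
}
```

Falling back to a caller-supplied default keeps an unrecognized value from silently weakening the ACL handling when the caller passes {{SECURE}}.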

The {{statistical}} mode would randomly sample some documents from the result set. It'd compute the ratio of accessible documents in the sample and scale the facet counts returned by the index by that ratio. A few implementation details below:
* default mode would be kept as {{secure}} (for backward compatibility).
* when {{statistical}} mode is selected, the default value of {{sampleSize}} is 1000
* both the defaults (mode and sampleSize) can be overridden system-wide using the JVM params {{oak.facets.secure}} and {{oak.facet.statistical.sampleSize}}
* one can also set {{secure}} (String) and {{sampleSize}} (long cast to int) under {{<definition>/facets}} to override these per index definition
* the sampling is done using the idea presented in https://dl.acm.org/citation.cfm?id=368159
* 1000 was picked as the default sample size because the expected error rate in sampled data is given by {{sampleSize ^ -0.5}} \[0]. For 1000, this comes out to roughly 3%.
* for the random number seed, we'd insert a random long {{seed}} under the index definition during an indexing cycle. Keeping the seed fixed gives consistent results across refreshes as long as the indexed data doesn't otherwise change. From a randomness point of view this should still be ok, as the actual generated random numbers depend on the result size, which in turn depends on the search query and indexed data. From a security point of view, the seed should be ok as index definitions are administrative data.
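
To make the statistical mode a bit more concrete, here is a minimal sketch of seeded sampling plus extrapolation. It uses one-pass selection sampling (often known as Knuth's Algorithm S); whether that matches the linked paper and the attached patch in every detail is an assumption. The class {{StatisticalFacets}}, its method names, and the {{IntPredicate}} standing in for the ACL check are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.IntPredicate;

/** Hypothetical helper illustrating statistical facet counting. */
final class StatisticalFacets {

    /** Picks min(sampleSize, total) distinct positions in [0, total), in increasing order, in one pass. */
    static List<Integer> samplePositions(int total, int sampleSize, long seed) {
        Random random = new Random(seed); // same seed and result size => same sample
        int wanted = Math.max(0, Math.min(sampleSize, total));
        List<Integer> picked = new ArrayList<>(wanted);
        for (int pos = 0; pos < total && picked.size() < wanted; pos++) {
            int remainingItems = total - pos;
            int remainingPicks = wanted - picked.size();
            // Select this position with probability remainingPicks / remainingItems.
            if (random.nextInt(remainingItems) < remainingPicks) {
                picked.add(pos);
            }
        }
        return picked;
    }

    /** Fraction of sampled positions passing the (stand-in) ACL check; 1.0 if nothing was sampled. */
    static double accessibleRatio(int total, int sampleSize, long seed, IntPredicate isAccessible) {
        List<Integer> sample = samplePositions(total, sampleSize, seed);
        if (sample.isEmpty()) {
            return 1.0;
        }
        int accessible = 0;
        for (int pos : sample) {
            if (isAccessible.test(pos)) {
                accessible++;
            }
        }
        return (double) accessible / sample.size();
    }

    /** Scales each facet count from the index by the accessible ratio. */
    static Map<String, Long> extrapolate(Map<String, Long> indexCounts, double ratio) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : indexCounts.entrySet()) {
            result.put(e.getKey(), Math.round(e.getValue() * ratio));
        }
        return result;
    }

    /** Expected error rate sampleSize ^ -0.5, e.g. about 0.03 for 1000. */
    static double expectedErrorRate(int sampleSize) {
        return 1.0 / Math.sqrt(sampleSize);
    }
}
```

When the result set is no larger than the sample size, every position gets picked, so this sketch degenerates to checking the whole result set.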

\[0]: https://onlinecourses.science.psu.edu/stat100/node/16/
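
The override order from the list above (per-index value first, then JVM param, then built-in default) could be resolved roughly as sketched here. The class {{FacetConfig}} is hypothetical, and the plain {{Map}} is only a stand-in for the {{<definition>/facets}} node; actual Oak code would read these from the index definition's node state.

```java
import java.util.Map;

/** Hypothetical resolver for the facet mode and sample size settings. */
final class FacetConfig {
    static final String MODE_JVM_PARAM = "oak.facets.secure";
    static final String SAMPLE_SIZE_JVM_PARAM = "oak.facet.statistical.sampleSize";
    static final String DEFAULT_MODE = "secure";   // kept for backward compatibility
    static final int DEFAULT_SAMPLE_SIZE = 1000;

    /** Per-index "secure" (String) wins, then the JVM param, then the default. */
    static String resolveMode(Map<String, Object> facetsNode) {
        Object perIndex = facetsNode.get("secure");
        if (perIndex instanceof String) {
            return (String) perIndex;
        }
        return System.getProperty(MODE_JVM_PARAM, DEFAULT_MODE);
    }

    /** Per-index "sampleSize" (long, narrowed to int) wins, then the JVM param, then the default. */
    static int resolveSampleSize(Map<String, Object> facetsNode) {
        Object perIndex = facetsNode.get("sampleSize");
        if (perIndex instanceof Long) {
            return ((Long) perIndex).intValue();
        }
        return Integer.getInteger(SAMPLE_SIZE_JVM_PARAM, DEFAULT_SAMPLE_SIZE);
    }
}
```

System-wide overrides would then be passed on the command line, e.g. {{-Doak.facets.secure=statistical -Doak.facet.statistical.sampleSize=500}}.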


was (Author: catholicon):
Since checking for ACL is fairly expensive and to get accurate count we'd have to do ACL check over the whole result set. So, we'd expand current boolean form of {{secure}} to an enum - {{insecure}}, {{statistical}} and {{secure}}.

The {{insecure}} mode won't do any ACL check and would return facets as were returned from index. The {{secure}} mode would generalize the current form for checking 50 documents to the whole result set instead.

The {{statistical}} mode would randomly sample some documents from the result set. It'd see ratio of accessible samples and extrapolate the facet counts with returned ratio. A few implementation details below:
* default mode would be kept {{statistical}} with default {{sampleSize}} as 1000.
* both the defaults can be over-ridden system wide using JVM param {{oak.facets.secure}} and {{oak.facet.statistical.sampleSize}}
* one can also set {{secure}} (String) and {{sampleSize}} (long casted to int) under {{<definition>/facets}} to override these per index definition
* the sampling is done using idea presented in https://dl.acm.org/citation.cfm?id=368159
* the reason to pick 1000 as default sample size as expected error rate in sampled data is given by {{sampleSize ^ -0.5}} \[0]. For 1000, this roughly comes out as 3% expected error rate.
* for random number seed, we'd insert a random long number {{seed}} under index definition during an indexing cycle. This is kept to keep consistent result across refreshes without any other change in indexed data. From random-ness pov this should still be ok as actual generated random numbers depend on result size; which in turn would depend on search query and indexed data. From security pov, the seed should be ok as index defs are administrative data.

\[0]: https://onlinecourses.science.psu.edu/stat100/node/16/

> Incorrect Facet Count With Large Dataset and ACLs
> -------------------------------------------------
>
>                 Key: OAK-7929
>                 URL: https://issues.apache.org/jira/browse/OAK-7929
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: lucene
>            Reporter: Vikas Saurabh
>            Assignee: Vikas Saurabh
>            Priority: Major
>             Fix For: 1.10
>
>         Attachments: 0001-OAK-7930-Add-tape-sampling.patch, 0002-OAK-7929-Incorrect-Facet-Count-With-Large-Dataset-an.patch
>
>
> Currently, ACL (secure) facet handling only deals with the first batch of results from the Lucene index (50 documents). So, for large result sets, the facet counts don't get decremented for a large part of the result set.


