Posted to users@solr.apache.org by k-jingyang <k....@protonmail.com.INVALID> on 2021/03/23 07:10:50 UTC

Content search and applying ACL

Hello everyone,

I have a use case for my users that I'm having trouble implementing, and I'm
hoping to find some insights here.

We are trying to let our users search for almost any content data that they
have, while respecting access control policies. My users are grouped into
teams, and policies are applied on the content based on teams.

How we are doing it now is by storing any piece of data as a document, 

{ 
type: contact_number,
value: 1234567890,
aclId: 1_contact_number
}

in our content index (6 million documents)

and 

{
aclId: 1_contact_number,
canRead: TEAM_A
}

in our acl index (2 million documents).

DocValues is enabled for aclId on both indexes.

During query, we query the content index and use the Join Query Parser in
the fq, as such: fq={!join from=aclId fromIndex=acl to=aclId}canRead:(TEAM_A
OR TEAM_B), where the user is part of TEAM_A and TEAM_B. This takes close to
8 seconds for uncached queries.
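For concreteness, the per-user filter query could be built like this (a sketch only; I'm assuming the `to` field is the content index's aclId, and grouping the teams in parentheses so both terms hit the canRead field):

```python
# Build the ACL join filter query from a user's team memberships.
# A sketch: field/collection names follow the post, everything else
# is illustrative.
def build_acl_join_fq(teams):
    clause = " OR ".join(teams)
    return "{!join from=aclId fromIndex=acl to=aclId}canRead:(" + clause + ")"

fq = build_acl_join_fq(["TEAM_A", "TEAM_B"])
print(fq)
# {!join from=aclId fromIndex=acl to=aclId}canRead:(TEAM_A OR TEAM_B)
```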

Based on my understanding, this is slow because Solr has to 
1. Retrieve all hits from the acl index
2. Comb through the entire content index, finding documents whose aclId
matches those hits from the acl index
3. Apply any remaining content query to filter the results from the content
index

We have also tried using {!join ... score=none} (based on what we Googled)

Thoughts on improving this:

- Thought of using streaming expressions, but using /export on the content
index requires sorting by fields other than score
- Querying the content index on the content alone, then filtering the
results against the ACL in our backend until we have the first 10 results.
  - This requires us to load the entire acl index into our backend
  - We may have to repeatedly query the content index if documents keep
getting dropped by the ACL filter
  - The benefit is that we don't have to comb the entire content index
  - This could be a plugin? (not sure if it's worth the effort)
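As a rough sketch of that second idea (hypothetical helpers: `search_page` stands in for the Solr content query, `acl_allows` for the in-memory ACL lookup; neither exists in our code yet):

```python
# Sketch of backend-side ACL filtering: keep paging through content hits
# until n readable documents are collected. search_page(start, rows) and
# acl_allows(acl_id, teams) are hypothetical stand-ins.
def filter_first_n(search_page, acl_allows, user_teams, n=10, page_size=50):
    results, start = [], 0
    while len(results) < n:
        page = search_page(start=start, rows=page_size)
        if not page:  # ran out of content hits
            break
        for doc in page:
            if acl_allows(doc["aclId"], user_teams):
                results.append(doc)
                if len(results) == n:
                    break
        start += page_size
    return results
```

The obvious downside, as noted above, is that a user with little readable content forces many round trips before the loop gives up.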


Am I barking up the wrong tree?



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Content search and applying ACL

Posted by k-jingyang <k....@protonmail.com.INVALID>.
Thanks for the reply, Alessandro! 

I'll try out your tips and see if anything works out.

Hope to contribute back here soon!




Re: Content search and applying ACL

Posted by k-jingyang <k....@protonmail.com.INVALID>.
After upgrading our Solr from 7.2 to 8.8, I've tested out *score=none* and
*method=topLevelDV* with two collections. The first few queries took even
longer; the first one took up to 20s, compared to 8s with *method=index*.

I guess this is in line with what the docs mention about topLevelDV: "These
data structures outperform other methods as the number of values matched in
the from field grows high. But they are also expensive to build and need to
be lazily populated after each commit, causing a sometimes-noticeable
slowdown on the first query to use them after each commit."

What I'm now curious about is what these "data structures that are expensive
to build" actually are.

I dug into /TopLevelJoinQuery.java/ and saw that the implementation
retrieves the DocValues from the respective indexes and uses them for the
join operation. Am I right to assume that the "data structures that are
expensive to build" refers to DocValues?

Would that also mean that other operations (e.g. faceting) that use
DocValues will take some time after a commit?

Anyway, to solve the problem detailed in the first post, we denormalised our
indexes (i.e. embedded the list of teams that can read the information into
our content documents), and lookups are now almost instantaneous. One
consequence of this is that whenever the ACL changes (i.e. which teams can
read the content), we'll have to re-index all the relevant content data.
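In rough form, the denormalised model looks like this (a sketch; field names follow the thread, and the exact schema details are omitted):

```python
# Denormalised content document: the readable teams are embedded directly,
# replacing the aclId indirection (field names follow the thread).
denormalised_doc = {
    "type": "contact_number",
    "value": "1234567890",
    "canRead": ["TEAM_A", "TEAM_B"],  # multi-valued field
}

# The join disappears: access control becomes an ordinary filter query.
def build_acl_fq(user_teams):
    return "canRead:(" + " OR ".join(user_teams) + ")"

print(build_acl_fq(["TEAM_A", "TEAM_B"]))
# canRead:(TEAM_A OR TEAM_B)
```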

We have judged this to be acceptable, because
1) We can run this asynchronously
2) ACL changes need not be instantaneous
3) Due to the nature of our app, the amount of time required to re-index the
relevant content data is still within an acceptable range and is unlikely to
grow beyond it.

Hope this helps any individual who comes across this! :)    





Re: Content search and applying ACL

Posted by Alessandro Benedetti <a....@sease.io>.
Hi,
it's definitely an interesting question; I happened to work personally on
ACL designs in the past.

It has been a while since I last looked at the Lucene/Solr internals of that
bit, but first of all, I suspect you'll get a performance boost if you store
documents and ACLs in the same collection (index).
https://solr.apache.org/guide/8_8/other-parsers.html#parameters
I would go with *score=none* and *method=topLevelDV*.
It would definitely be interesting to compare it with *method=index*.
Then I would take a look at the caches involved and tune them appropriately.
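Concretely, combined with the join from your original post, that would give something like the following fq (field and collection names taken from the thread, so treat this as a sketch rather than a tested query):

```
fq={!join from=aclId fromIndex=acl to=aclId score=none method=topLevelDV}canRead:(TEAM_A OR TEAM_B)
```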

Further improvements can be obtained, but that would require investigating
the internals a bit more.
Another alternative could be to denormalize and put the reader information
directly in the original document (rather than the ACL ID).
This of course brings other observations and consequences.

Cheers

--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant
www.sease.io

