Posted to solr-user@lucene.apache.org by Martyn Smith <ma...@catalyst.net.nz> on 2006/08/11 00:18:38 UTC

Searching with access controls

I'm trying to index data in a system that implements some rather nasty
access controls on the data.

Basically, there are users, and communities, and users are members of
the communities. Potentially a user could be a member of hundreds or
even thousands of communities (there's no enforced upper limit).

Now I'm trying for a solution such that a user only gets documents that
are either "public" or belong to a community that they're a member of.

I figure there are two approaches (if there are other/better ones,
please let me know).

1) For each document in the index, I store userid in a multivalued
field. I simply store every single userid that IS allowed access to the
document. This has the advantage of the query being quite simple (e.g.
useraccess:MYUSERID) but I will have to store HEAPS of data, and
potentially have to do many more updates (as users join/leave
communities).

2) For each document in the index, store the community id that it
belongs to. The obvious advantage here is fewer updates, and less
storage. HOWEVER, this means queries get bigger and bigger as users are
in more and more communities (e.g. communityid:(myCID1 OR myCID2 OR
myCID3 ....)
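
To make the trade-off concrete, here is a minimal sketch of the query
strings the two approaches would produce. The field names "useraccess"
and "communityid" and the ids are only illustrative, and the helper
functions are hypothetical, not part of Solr.

```python
def query_by_user(user_id):
    """Approach 1: every document lists the users allowed to see it."""
    return f"useraccess:{user_id}"


def query_by_communities(community_ids):
    """Approach 2: every document stores only its community id."""
    if not community_ids:
        raise ValueError("user belongs to no communities")
    # One OR clause per community, so the query grows with membership.
    return "communityid:(" + " OR ".join(community_ids) + ")"


if __name__ == "__main__":
    print(query_by_user("u42"))
    print(query_by_communities(["c1", "c2", "c3"]))
```

The first query stays the same size however many communities exist; the
second gains one clause per community the user belongs to.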

Does anyone have any thoughts on this? Are there blindingly obvious
options I'm missing that would take all this complication away? What
performance implications does each of these methods have?

Many thanks in advance for any comments or helpful suggestions :)


--
Martyn




Re: Searching with access controls

Posted by Martyn Smith <ma...@catalyst.net.nz>.
We're not really sure how big the userbase is going to get, but it could
become huge. I think initially we need to be able to cope with several
thousand users, and probably only several thousand communities.

I'll certainly have a look at "faceted browsing" :), and yeah, a query
handler that does that sounds quite useful.

I think I need to have a read on what "filters" actually are :)

Thanks though, it looks like I've got some more reading to do ...

--
Martyn


On Fri, 2006-08-11 at 00:07 -0400, Yonik Seeley wrote:
> On 8/10/06, Martyn Smith <ma...@catalyst.net.nz> wrote:
> > I was just reading about the limit on boolean operators in a query (it
> > seems to default to 1024 in Solr).
> >
> > Using option 2 would mean that a user can't be in any more than 1024
> > communities (assuming no other boolean logic in the query).
> >
> > Potentially a huge number of communities (10,000+ ?). Each community
> > could easily have say 100 documents each, and there's some other
> > "global" type documents too.
> >
> > Say 500,000 - 1,000,000 documents?
> 
> How many users for this system?
> 
> > What do you mean by "You could also store user documents in the
> > collection to avoid passing the security info" ?
> 
> Store a document of type "user" that contains the communities they belong to.
> Create a custom query handler that takes a base query in addition to
> the user id.
> Get the user document, get a filter for each community they belong to
> from the filter cache, union them all, and then do a filtered query.
> 
> If the number of users is low, you could cache the resulting filter
> from unioning all the communities.  If the number of users is high
> compared to the number of communities, cache the community filters
> instead.
> 
> Search the archives for faceted browsing... many of the techniques may
> be applicable.
> 
> -Yonik
> 


Re: Searching with access controls

Posted by Yonik Seeley <yo...@apache.org>.
On 8/10/06, Martyn Smith <ma...@catalyst.net.nz> wrote:
> I was just reading about the limit on boolean operators in a query (it
> seems to default to 1024 in Solr).
>
> Using option 2 would mean that a user can't be in any more than 1024
> communities (assuming no other boolean logic in the query).
>
> Potentially a huge number of communities (10,000+ ?). Each community
> could easily have say 100 documents each, and there's some other
> "global" type documents too.
>
> Say 500,000 - 1,000,000 documents?

How many users for this system?

> What do you mean by "You could also store user documents in the
> collection to avoid passing the security info" ?

Store a document of type "user" that contains the communities they belong to.
Create a custom query handler that takes a base query in addition to
the user id.
Get the user document, get a filter for each community they belong to
from the filter cache, union them all, and then do a filtered query.

If the number of users is low, you could cache the resulting filter
from unioning all the communities.  If the number of users is high
compared to the number of communities, cache the community filters
instead.

Search the archives for faceted browsing... many of the techniques may
be applicable.

-Yonik

Re: Searching with access controls

Posted by Chris Hostetter <ho...@fucit.org>.
: I was just reading about the limit on boolean operators in a query (it
: seems to default to 1024 in Solr).
:
: Using option 2 would mean that a user can't be in any more than 1024
: communities (assuming no other boolean logic in the query).

that limit applies to boolean query clauses which are used in scoring, and
can be changed in the solrconfig (see <maxBooleanClauses>, it's really
just a Lucene setting that helps to save you from yourself) ... but for
things like access control you don't care about scoring -- just set
membership, so you can use and combine Filters which can be cached
independently.
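
One way to keep the access restriction out of scoring is sketched
below. It assumes a Solr version whose request handler accepts the "fq"
(filter query) parameter; the field name, the ids and the helper
function are hypothetical. Note that a filter written as one big OR is
still parsed as a boolean query, so the clause limit still applies to
it unless raised; the sketch only shows the separation of the scored
query from the unscored, cacheable restriction.

```python
from urllib.parse import urlencode


def build_params(user_query, community_ids):
    """Build request parameters: scored query in q, access rule in fq."""
    access = "communityid:(" + " OR ".join(community_ids) + ")"
    # q contributes to relevance scores; fq only restricts the result set.
    return urlencode({"q": user_query, "fq": access})


if __name__ == "__main__":
    print(build_params("solr", ["c1", "c2"]))
```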

Reading up on Lucene Filters is definitely the next best step to get a
sense of how you can achieve your goal -- just don't get confused between
Filters used in searching and "TokenFilters" used when analyzing text --
they have regrettably similar names.

searching the general Lucene user groups for "access control",
"permissions" and "security" should turn up quite a few suggestions on how
to approach this problem with Lucene indexes in general, all of which can
be done in Solr as well -- many of which can be done efficiently and
much more easily in Solr because Solr takes care of the Query->Filter
conversions for you on the fly when you don't care about scoring, and
because Solr manages (and can autowarm when changes occur) your caches for
you.



-Hoss


Re: Searching with access controls

Posted by Martyn Smith <ma...@catalyst.net.nz>.
I was just reading about the limit on boolean operators in a query (it
seems to default to 1024 in Solr).

Using option 2 would mean that a user can't be in any more than 1024
communities (assuming no other boolean logic in the query).

Potentially a huge number of communities (10,000+ ?). Each community
could easily have say 100 documents each, and there's some other
"global" type documents too.

Say 500,000 - 1,000,000 documents?

What do you mean by "You could also store user documents in the
collection to avoid passing the security info" ?

I'm not really a Java programmer of any significance, but I work with
people who are, and I can bully them into helping out. (I'm a Perl guy
myself).

Thanks,



--
Martyn


On Thu, 2006-08-10 at 23:43 -0400, Yonik Seeley wrote:
> On 8/10/06, Martyn Smith <ma...@catalyst.net.nz> wrote:
> > I'm trying to index data in a system that implements some rather nasty
> > access controls on the data.
> >
> > Basically, there are users, and communities, and users are members of
> > the communities. Potentially a user could be a member of hundreds or
> > even thousands of communities (there's no enforced upper limit).
> 
> I think option 2 (storing the community id with the document) is the way to go.
> If it's not fast enough, custom query handlers and using filters may help.
> You could also store user documents in the collection to avoid passing
> the security info (this would definitely require a custom query
> handler).
> 
> What are the number of documents, and number of communities?
> 
> -Yonik
> 


Re: Searching with access controls

Posted by Yonik Seeley <yo...@apache.org>.
On 8/10/06, Martyn Smith <ma...@catalyst.net.nz> wrote:
> I'm trying to index data in a system that implements some rather nasty
> access controls on the data.
>
> Basically, there are users, and communities, and users are members of
> the communities. Potentially a user could be a member of hundreds or
> even thousands of communities (there's no enforced upper limit).

I think option 2 (storing the community id with the document) is the way to go.
If it's not fast enough, custom query handlers and using filters may help.
You could also store user documents in the collection to avoid passing
the security info (this would definitely require a custom query
handler).

What are the number of documents, and number of communities?

-Yonik