Posted to user@lucenenet.apache.org by Brian Victor <ho...@brianhv.org> on 2009/04/01 16:53:57 UTC

Filtering queries

We have a system in which access to documents is controlled by a
non-trivial authorization system most closely related to ACLs.  A user
is able to either see all the document's fields, a well-defined subset
of those fields, or none of the fields.

I need help figuring out how to get lucene to not search on fields of a
document that a user can't see.

I have found the Filter class.  In order to use this, it seems I need to
know the lucene document IDs of the documents that should be visible,
and from what I understand document IDs are not fixed so I can't store a
link between them and my database rows.

I have considered storing my database IDs in a lucene field on each
document.  What I can't figure out is how to guarantee that all search
results are in the set of database IDs that a user can see.  I can
retrieve that list of IDs; is there a way to have lucene filter on that
list?

Thanks!

-- 
Brian

Re: Filtering queries

Posted by Brian Victor <ho...@brianhv.org>.
On Wed, Apr 01, 2009 at 06:36:00PM +0300, Digy wrote:
>Assuming that you can store the access rights related to a doc in a field
>Like;
>Doc1:
> Text: --> "text1"
> AccessRights: --> "user1 user2"
>
>Doc2:
> Text: --> "text2"
> AccessRights: --> "user2 user3"
>
>You can inject a "+AccessRights:user1" into the query the user (user1 in this
>case) supplied.

Thanks for the idea, Digy.  Unfortunately, that probably won't work for
us.  Access rights to documents change fairly frequently (a few times a
day, depending on system usage).  I would expect that would lead to a
lot of reindexing and slowness.

Of course, I admit this is guesswork.  If I'm wrong, please let me know.

-- 
Brian

RE: Filtering queries

Posted by "Granroth, Neal V." <ne...@thermofisher.com>.
An alternative is to use a filtered query.


// documentIdList: the IDs of the documents this user may see
Filter fx = new CachingWrapperFilter(
        new AccessFilter(documentIdList));

FilteredQuery fq = new FilteredQuery(userQuery, fx);

// searcher: an IndexSearcher you have already opened (assumed here)
Hits h = searcher.Search(fq);


where AccessFilter is a class you write that derives from Lucene.Net.Search.Filter. Its constructor would take your list of document IDs, and the filter would supply a BitArray, for use by Lucene's FilteredQuery, marking the documents that are allowed to be found.


-- Neal
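
A minimal sketch of what such an AccessFilter would have to compute, assuming each document carries a "DatabaseID" field as proposed in the original question. It deliberately uses no Lucene.Net types: AccessBits, Build, and the docDatabaseIds list are hypothetical stand-ins for reading that field through the index reader. The point it illustrates is that the bits are derived from the database IDs at search time, so Lucene's unstable document numbers never need to be stored anywhere.

```csharp
using System.Collections;
using System.Collections.Generic;

// Hypothetical stand-in: no Lucene.Net types are used here.
public static class AccessBits
{
    // docDatabaseIds[n] is the DatabaseID stored on document number n,
    // as numbered by the reader that is open right now.
    // Returns one bit per document, set when the document is visible.
    public static BitArray Build(IList<string> docDatabaseIds,
                                 ICollection<string> allowedIds)
    {
        var allowed = new HashSet<string>(allowedIds);
        var bits = new BitArray(docDatabaseIds.Count); // all false initially
        for (int docNumber = 0; docNumber < docDatabaseIds.Count; docNumber++)
        {
            if (allowed.Contains(docDatabaseIds[docNumber]))
                bits.Set(docNumber, true);
        }
        return bits;
    }
}
```

Because the mapping is rebuilt against whichever reader is open, it stays correct after document numbers shift; caching the result, as CachingWrapperFilter does, only makes sense per reader.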




RE: Filtering queries

Posted by Digy <di...@gmail.com>.
> Let's make the problem easier for a moment.  If I add a "DatabaseID"
> field to every document, and I have a list of database IDs, can I tell
> lucene to only return documents in that list?  The list could be in the
> magnitude of thousands.

Think of DatabaseID as my AccessRights. You can search like
"other-part-of-query +(DatabaseID:id1 DatabaseID:id2)".
But if you have many (how many?) terms in your query, your search will be
slower.

Having many values in the DatabaseID field is not a problem for Lucene; you
will just get a bigger index.

DIGY.
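
A rough sketch of building that restricted query string, using only the standard library. IdClause, Restrict, and the maxClauses parameter are hypothetical names, and the IDs are assumed to be plain alphanumerics that need no query-syntax escaping. Two caveats it tries to reflect: in Lucene query syntax a clause without "+" is optional, so the user's part is wrapped as required as well; and Lucene's BooleanQuery rejects queries with more than 1024 clauses by default, which matters for ID lists in the thousands.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; it builds a query string and calls no Lucene.Net API.
public static class IdClause
{
    // Returns "+(<userQuery>) +(DatabaseID:id1 DatabaseID:id2 ...)".
    public static string Restrict(string userQuery, IList<string> ids,
                                  int maxClauses)
    {
        if (ids.Count == 0)
            throw new ArgumentException("user can see no documents");
        if (ids.Count > maxClauses)
            throw new ArgumentException("too many IDs for one boolean query");

        var terms = new List<string>();
        foreach (string id in ids)
            terms.Add("DatabaseID:" + id);

        // Both halves are required (+): the hit must match the user's
        // query AND carry one of the allowed IDs.
        return "+(" + userQuery + ") +(" + string.Join(" ", terms) + ")";
    }
}
```

For example, Restrict("title:report", new List<string> { "17", "42" }, 1024) returns "+(title:report) +(DatabaseID:17 DatabaseID:42)".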






Re: Filtering queries

Posted by Brian Victor <ho...@brianhv.org>.
On Wed, Apr 01, 2009 at 09:29:13PM +0300, Digy wrote:
>    public class MyHitCollector : Lucene.Net.Search.HitCollector
[snip]

This looks like it can be made to do what we're looking for.  Thanks for
your help with my admittedly confusing requirements!

-- 
Brian

RE: Filtering queries

Posted by Digy <di...@gmail.com>.
For the first part of your question:

> I don't think this is so.  Search results should be filtered, yes, but I
> don't think "AccessRights" is the way to do it.  I already know what
> documents a user can see; I'm just trying to figure out how to make lucene
> filter to those documents.


You can do your own filtering in a HitCollector and return a custom list of
results to the user, like below:

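    using System.Collections.Generic;   // List<T>

    public class MyResult
    {
        public string Title = "";
        public string Text = "";
    }

    public class MyHitCollector : Lucene.Net.Search.HitCollector
    {
        Lucene.Net.Index.IndexReader Reader = null;
        public MyHitCollector(Lucene.Net.Index.IndexReader r)
        {
            Reader = r;
        }
        public List<MyResult> Result = new List<MyResult>();

        // Lucene calls this once for every document that matches the query.
        public override void Collect(int doc, float score)
        {
            // Load the stored fields of the hit so the access check can use them.
            Lucene.Net.Documents.Document document = Reader.Document(doc);
            if (true /* some logic: your own access check on "document" */)
            {
                MyResult m = new MyResult();
                m.Text = "?";    // fill in only what this user may read
                m.Title = "?";
                Result.Add(m);
            }
        }
    }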


DIGY.




Re: Filtering queries

Posted by Brian Victor <ho...@brianhv.org>.
On Wed, Apr 01, 2009 at 09:00:19PM +0300, Digy wrote:
>2- Search results should be filtered (in a loop while reading the docs from
>index?) before returning to user, utilizing the field "AccessRights".

I don't think this is so.  Search results should be filtered, yes, but I
don't think "AccessRights" is the way to do it.  I already know what
documents a user can see; I'm just trying to figure out how to make
lucene filter to those documents.

Let's make the problem easier for a moment.  If I add a "DatabaseID"
field to every document, and I have a list of database IDs, can I tell
lucene to only return documents in that list?  The list could be in the
magnitude of thousands.

-- 
Brian

Re: Filtering queries

Posted by Michael Neel <mi...@gmail.com>.
I'm willing to appreciate we all get forced into bad design
sometimes, so if I were in your shoes I'd create a single field and give it
multiple values based on who can read it.

First I'd create a set of groups and assign users to these groups,
based on access rights.  This can be managed in the database and won't
require updates to the Lucene index as users change groups.  Hopefully
you have a manageable number of groups.

Next I'd add a field to every document called "Access", and add an
instance of the field for each group that can read a document:

Doc1:
   Access: all
   Access: marketing
   Access: admin

Doc2:
   Access: marketing
   Access: admin

Then on searches, add in the field to the query.  So if you added
Access: all to the query, you wouldn't get a hit on doc 2.

The downside is that adding a group means tagging every existing doc
the group needs access to.  The good news is, from a security
side, no one will see anything until that's done.

Mike
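
A small sketch of the search-time half of this scheme, standard library only. GroupAccess, Restrict, and the userGroups dictionary are hypothetical; the dictionary stands in for the user-to-group table kept in the database. Only the group names live in the index (the "Access" field above), so moving a user between groups changes the query, not the index.

```csharp
using System.Collections.Generic;

// Hypothetical helper; it builds a query string and calls no Lucene.Net API.
public static class GroupAccess
{
    // userGroups: user name -> group names, as loaded from the database.
    // Returns "+(<userQuery>) +(Access:g1 Access:g2 ...)", or null when the
    // user belongs to no group and therefore may not search anything.
    public static string Restrict(string userQuery, string user,
        IDictionary<string, List<string>> userGroups)
    {
        List<string> groups;
        if (!userGroups.TryGetValue(user, out groups) || groups.Count == 0)
            return null;

        var clauses = new List<string>();
        foreach (string g in groups)
            clauses.Add("Access:" + g);

        // A hit must match the user's query and at least one of the groups.
        return "+(" + userQuery + ") +(" + string.Join(" ", clauses) + ")";
    }
}
```

With a user who is only in "all", the clause is +(Access:all), which matches Doc1 but not Doc2 in the example above.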


RE: Filtering queries

Posted by Digy <di...@gmail.com>.
As far as I can see,

1- You should be able to store the ACL in a field (like AccessRights)
somehow, so that some docs (e.g., doc2 in a search for "Text:text*") can be
stripped out of the search results with a clause like "+AccessRights:user1".
BUT you don't want that, for performance and other reasons.

2- Search results should be filtered (in a loop while reading the docs from
index?) before returning to user, utilizing the field "AccessRights".

3- You are in trouble :-)

DIGY







Re: Filtering queries

Posted by Brian Victor <ho...@brianhv.org>.
On Wed, Apr 01, 2009 at 08:27:08PM +0300, Digy wrote:
>Doc1:
>	Title: "title1" //everyone can see.
>	Text:  "text1"  //only user1 can see
>
>Doc2:
>	Title: "title2" //everyone can see.
>	Text:  "text2"  //only user2 can see
>
>
>If I make a search(as user1) like "Title:title*" then I should get 2 hits
>but I should not read the "Text" field of Doc2.
>
>Am I correct?

Yes.  Moreover, if you're user1 and you search for "text", you should
only get one hit.  And to add yet another wrinkle, user3 may not be
allowed to see Doc1 at all.

So for any given user/document combination, one of the following
applies:

  1) User can read every field in document
  2) User can read "text" field in document
  3) User cannot read any part of document

-- 
Brian
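
The three cases can be sketched as a small data model, standard library only. AccessLevel, Redactor, Apply, and subsetFields are hypothetical names, not part of Lucene.Net. Note that this covers only what is shown for a hit: it does not stop a hidden field from producing the hit in the first place, which the "only get one hit" requirement also rules out.

```csharp
using System.Collections.Generic;

// The three user/document outcomes listed above.
public enum AccessLevel { None, Subset, All }

// Hypothetical helper; it works on plain dictionaries, not Lucene documents.
public static class Redactor
{
    // Returns the fields of one hit that the user may see, or null when
    // the document must be dropped from the results entirely.
    public static Dictionary<string, string> Apply(
        AccessLevel level,
        IDictionary<string, string> fields,
        ICollection<string> subsetFields)
    {
        if (level == AccessLevel.None)
            return null;

        var visible = new Dictionary<string, string>();
        foreach (KeyValuePair<string, string> f in fields)
        {
            // Subset access keeps only the well-defined subset of fields.
            if (level == AccessLevel.All || subsetFields.Contains(f.Key))
                visible[f.Key] = f.Value;
        }
        return visible;
    }
}
```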

RE: Filtering queries

Posted by Digy <di...@gmail.com>.
Ok, I think I overlooked the problem.

If I understand it correctly, then Brian has a case like below

Doc1:
	Title: "title1" //everyone can see.
	Text:  "text1"  //only user1 can see

Doc2:
	Title: "title2" //everyone can see.
	Text:  "text2"  //only user2 can see


If I make a search (as user1) like "Title:title*" then I should get 2 hits
but I should not be able to read the "Text" field of Doc2.

Am I correct?

DIGY.




RE: Filtering queries

Posted by "Granroth, Neal V." <ne...@thermofisher.com>.
Storing user-id within the index is not a good plan.
It would force you to update the index every time user access rights change, or as users are added and removed from the system.

Also, if I have read Brian's question correctly, it is not document access that needs to be controlled but access to specific fields.

If there is a specific set of restricted fields, then rather than filtering the results after the search, a better approach would be to programmatically alter the search criteria to search only those document fields that are not restricted.

-- Neal
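
A minimal sketch of altering the search criteria this way, standard library only. FieldScope and Expand are hypothetical names, and the term is assumed to need no query-syntax escaping. It only helps when the set of restricted fields is the same for every document a given user searches.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; it builds a query string and calls no Lucene.Net API.
public static class FieldScope
{
    // Expands one bare term over only the fields the user may search,
    // e.g. "budget" over [Title, Text] -> "(Title:budget Text:budget)".
    public static string Expand(string term, IList<string> searchableFields)
    {
        if (searchableFields.Count == 0)
            throw new ArgumentException("user may not search any field");

        var parts = new List<string>();
        foreach (string field in searchableFields)
            parts.Add(field + ":" + term);

        // The clauses are optional alternatives: a match in any one
        // permitted field is enough.
        return "(" + string.Join(" ", parts) + ")";
    }
}
```

A user restricted to the "Title" field would get "(Title:budget)", so a match in "Text" can never produce a hit for that user.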






RE: Filtering queries

Posted by Digy <di...@gmail.com>.
Assuming that you can store the access rights related to a doc in a field,
like:
Doc1:
 Text: --> "text1"
 AccessRights: --> "user1 user2"

Doc2:
 Text: --> "text2"
 AccessRights: --> "user2 user3"

You can inject a "+AccessRights:user1" into the query the user (user1 in this
case) supplied.

DIGY.


