
Posted to solr-user@lucene.apache.org by "John, Phil (CSS)" <ph...@capita.co.uk> on 2012/01/23 11:49:38 UTC

Filtering search results by an external set of values

Hi,

We're building quite a large shared index of resources using Solr. The
application that makes use of these resources is a multitenant one
(i.e., many customers using the same index). For resources that are
"private" to a customer, it's fairly easy to tag a document with their
customer ID and use a filter query to limit results to just their
"stuff".

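The private-resource case described above can be sketched as follows. This is only an illustration: it assumes a single-valued customer_id field and the standard select handler, and the host, core name ("resources") and helper name are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust to your own deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/resources/select"


def private_search_url(user_query: str, customer_id: int) -> str:
    """Build a select URL that restricts results to one customer's documents."""
    params = [
        ("q", user_query),
        # fq narrows the result set without affecting relevance scores,
        # and Solr can cache it independently of the main query.
        ("fq", f"customer_id:{customer_id}"),
    ]
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(private_search_url("title:solr", 42))
```
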
We are soon going to be adding a large number (many tens of millions) of
records that will be shared amongst customers. Not all customers will
have access to the same shared resources, e.g.:

* Shared resource 1:
  o Customer 1
  o Customer 3
* Shared resource 2:
  o Customer 2
  o Customer 1

The issue is, what is the best way to model this in Solr? Should we have
multiple customer_id fields on each record, and then use the filter
query as with "private" resources, or is there a better way of doing it?
What happens if we need to do a bulk change, e.g. adding a new customer,
or an existing customer having a large change in which shared resources
they have access to? Am I right in thinking that we'd need to go through
every shared resource, read it, make the required change, and reindex
it?
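
To make the cost of the multi-valued approach concrete, here is a small in-memory sketch (not Solr code). The dictionary stands in for the shared documents and their customer_id values from the example above; the function and variable names are hypothetical. It shows that the query-time filter stays the same as for private resources, while a bulk access change touches every document whose customer list differs.

```python
# Hypothetical in-memory stand-in for the shared part of the index:
# document id -> set of customer IDs stored in a multi-valued field.
shared_docs = {
    "shared-1": {1, 3},
    "shared-2": {2, 1},
}


def filter_for(customer_id: int) -> str:
    """The same filter works for private and shared documents."""
    return f"customer_id:{customer_id}"


def docs_to_reindex(docs, customer_id, new_access):
    """Return ids of documents whose customer list must change.

    new_access is the set of document ids the customer should be able
    to see after the bulk change.
    """
    changed = []
    for doc_id, customers in docs.items():
        has_now = customer_id in customers
        should_have = doc_id in new_access
        if has_now != should_have:
            changed.append(doc_id)
    return sorted(changed)


# Customer 3 loses shared-1 and gains shared-2: both documents change.
print(docs_to_reindex(shared_docs, 3, {"shared-2"}))
```

With tens of millions of shared records, the list returned by a function like docs_to_reindex is what would have to be read, modified, and reindexed.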

I'm wondering whether, instead of updating these resources directly, I
could construct a set of documents that would act as a query-time
filter determining which shared resources to return.

Kind regards,

Phil John

Technical Lead, Capita Software Services

Knights Court, Solihull Parkway

Birmingham Business Park B37 7YB

Office: 0870 400 5000

Fax: 0870 400 5001
email: philjohn@capita.co.uk <ma...@capita.co.uk>

Part of Capita plc www.capita.co.uk <http://www.capita.co.uk>

This email and any attachment to it are confidential. Unless you are the intended recipient, you may not use, copy or disclose either the message or any information contained in the message. If you are not the intended recipient, you should delete this email and notify the sender immediately.

Any views or opinions expressed in this email are those of the sender only, unless otherwise stated. All copyright in any Capita material in this email is reserved.

All emails, incoming and outgoing, may be recorded by Capita and monitored for legitimate business purposes.

Capita exclude all liability for any loss or damage arising or resulting from the receipt, use or transmission of this email to the fullest extent permitted by law.

Re: Filtering search results by an external set of values

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Phil,

Some time ago I posted my thoughts on a similar problem. Scroll to
part II.

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201201.mbox/%3CCANGii8egoB1_rXFfwJMheyxx72v48B_DA-6KteKOymiBrR=MJQ@mail.gmail.com%3E

Regards

On Tue, Jan 24, 2012 at 1:36 PM, John, Phil (CSS) <ph...@capita.co.uk> wrote:

> Thanks for the responses.
>
> Groups probably wouldn't work as while there will be some overlap between
> customers, each will have a very different overall set of accessible
> resources.
>
> I'll try the suggestion about simply reindexing, or using the no-cache
> option and see how I get on.
>
> Failing that, are there hooks to write custom filter modules that used
> other parts of the records to decide on whether to include them in a result
> set or not? In our use case, the documents represent articles, which have
> an "issue" field. Each customer has defined issues (or ranges of issues)
> that they have subscriptions to, so the upper bounds for "what to filter"
> would probably be fairly small (10k - 20k issues/ranges). This could
> probably be used with the no-cache option you've pointed me to.
>
> Best wishes,
>
> Phil.



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

RE: Filtering search results by an external set of values

Posted by "John, Phil (CSS)" <ph...@capita.co.uk>.

Thanks for the responses.

Groups probably wouldn't work: while there will be some overlap between customers, each will have a very different overall set of accessible resources.

I'll try the suggestion about simply reindexing, or using the no-cache option, and see how I get on.

Failing that, are there hooks for writing custom filter modules that use other parts of the records to decide whether to include them in a result set? In our use case, the documents represent articles, which have an "issue" field. Each customer has defined issues (or ranges of issues) that they have subscriptions to, so the upper bound for "what to filter" would probably be fairly small (10k - 20k issues/ranges). This could probably be used with the no-cache option you've pointed me to.
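
A rough sketch of how such a subscription filter might be assembled without any custom module. It assumes that issues are integers stored in a field named issue, that ranges are inclusive, and that the Solr release in use supports the cache=false local parameter from SOLR-2429; the function name is hypothetical.

```python
def issue_filter(issues, ranges, field="issue"):
    """Build an uncached filter query from single issues and inclusive ranges."""
    clauses = [f"{field}:{n}" for n in sorted(issues)]
    clauses += [f"{field}:[{lo} TO {hi}]" for lo, hi in sorted(ranges)]
    if not clauses:
        # Refuse to build an empty filter, which would not restrict anything.
        raise ValueError("customer has no subscriptions")
    # cache=false keeps this per-customer filter out of the filter cache.
    return "{!cache=false}" + " OR ".join(clauses)


print(issue_filter({7, 3}, [(100, 120)]))
```

Note that a filter with 10k - 20k clauses would likely exceed the maxBooleanClauses limit in solrconfig.xml (1024 by default), so that setting may need raising, or adjacent issues may need merging into ranges first.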

Best wishes,

Phil.

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: 23 January 2012 17:34
To: solr-user@lucene.apache.org
Subject: Re: Filtering search results by an external set of values

A second, but arguably quite expert option, is to use the no-cache option.
See: https://issues.apache.org/jira/browse/SOLR-2429

The idea here is that you can specify that a filter is "expensive" and it will only be run after all the other filters & etc have been applied.
Furthermore,
it will not be cached and only documents that pass through all the other filters will be matched against this filter. It has been specifically used for ACL calculations...

That said, see exactly how painful storing auth tokens is. I can index, on a relatively underpowered laptop, 11M Wiki documents in 5 minutes or so. If your worst-case rights update take 1/2 hour to re-index and it only happens once a month, why be complex?

And groups, as Jan says, often make even this unnecessary.

Best
Erick

Re: Filtering search results by an external set of values

Posted by Erick Erickson <er...@gmail.com>.

A second, though arguably quite expert, option is to use the no-cache option.
See: https://issues.apache.org/jira/browse/SOLR-2429

The idea here is that you can specify that a filter is "expensive" and it
will only be run after all the other filters etc. have been applied.
Furthermore, it will not be cached, and only documents that pass through
all the other filters will be matched against this filter. It has been
specifically used for ACL calculations...
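
As an illustration of the request shape, here is a sketch that combines an ordinary cached filter with an uncached, high-cost one using the cache and cost local parameters. The field names, filter strings, and helper name are hypothetical. One caveat, stated with some caution: in the Solr releases that include this feature, only query types implementing the PostFilter interface (frange is one) are run as true post-filters when cost is 100 or more; other queries are simply left uncached and ordered by cost.

```python
from urllib.parse import urlencode


def acl_search_params(user_query, cheap_filter, expensive_filter, cost=100):
    """Combine a normal cached filter with an uncached, high-cost one."""
    return [
        ("q", user_query),
        # Ordinary filter: cached and reused across requests.
        ("fq", cheap_filter),
        # cache=false keeps this filter out of the filter cache; the cost
        # hint asks Solr to evaluate it after cheaper filters.
        ("fq", f"{{!cache=false cost={cost}}}{expensive_filter}"),
    ]


params = acl_search_params("title:solr", "type:article", "issue:[100 TO 120]")
print(urlencode(params))
```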

That said, first see exactly how painful storing auth tokens really is. I can
index, on a relatively underpowered laptop, 11M Wiki documents in 5 minutes or
so. If your worst-case rights update takes 1/2 hour to re-index and it only
happens once a month, why be complex?
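
A back-of-the-envelope version of that estimate, using the laptop figures quoted above. The 50M document count is an assumption standing in for "many tens of millions", and real throughput depends heavily on document size, analysis chain, and hardware.

```python
def reindex_minutes(doc_count, docs_per_minute):
    """Estimated wall-clock minutes for a full reindex at a steady rate."""
    return doc_count / docs_per_minute


# 11M documents in about 5 minutes, as quoted above.
laptop_rate = 11_000_000 / 5

# Assumed size of the shared collection (not a figure from the thread).
print(round(reindex_minutes(50_000_000, laptop_rate), 1))
```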

And groups, as Jan says, often make even this unnecessary.

Best
Erick

On Mon, Jan 23, 2012 at 5:16 AM, Jan Høydahl <ja...@cominvent.com> wrote:
> Hi,
>
> Do you have any kind of "group" membership for you users? If you have, a resource's list of security access tokens could be smaller and avoid re-indexing most resources when adding "normal" users which mostly belong to groups. The common way is to add filters on the query. You may do it yourself or have some framework/plugin to it for you, see http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com

Re: Filtering search results by an external set of values

Posted by Jan Høydahl <ja...@cominvent.com>.

Hi,

Do you have any kind of "group" membership for your users? If you have, a resource's list of security access tokens could be smaller, and you could avoid re-indexing most resources when adding "normal" users, who mostly belong to groups. The common way is to add filters to the query. You may do it yourself or have some framework/plugin do it for you, see http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
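
A minimal sketch of the group-token idea, assuming documents carry group tokens in a multi-valued field (here called access_token) and group membership is kept outside Solr. All names and data are hypothetical.

```python
# Hypothetical membership table kept outside the index.
memberships = {"customer_1": {"group_a"}}


def access_filter(user, memberships, field="access_token"):
    """OR together the tokens of every group the user belongs to."""
    groups = sorted(memberships.get(user, ()))
    if not groups:
        raise ValueError(f"{user} belongs to no groups")
    return f"{field}:(" + " OR ".join(groups) + ")"


# Adding a new customer only touches the membership table, not the index.
memberships["customer_4"] = {"group_a", "group_b"}
print(access_filter("customer_4", memberships))
```

The filter is built at query time from the membership table, so documents only need re-indexing when a resource's group list changes.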

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
