Posted to solr-user@lucene.apache.org by Jamie Johnson <je...@gmail.com> on 2012/03/21 22:34:01 UTC

Grouping queries

I was wondering how much more intensive grouping queries are in
general.  I am considering using grouping queries as my primary query
because I need to store a document as pieces with varying access
controls; for instance, a user can see only a portion of a document
while an admin can see the entire thing (I'm greatly simplifying this).
My thought was to do a grouping request and group on a field containing
a key that all the pieces share, but I am worried about how well this
will perform at scale.  Any thoughts/suggestions on this would be
appreciated.

Re: Grouping queries

Posted by Martijn v Groningen <ma...@gmail.com>.

>
> Where is Join documented?  I looked at
> http://wiki.apache.org/solr/Join and see no reference to "fromIndex".
> Also does this work in a distributed environment?
>
The "fromIndex" parameter isn't documented in the wiki. It is mentioned in the
issue, and you can find it in the Solr code:
https://issues.apache.org/jira/browse/SOLR-2272?focusedCommentId=13024918&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13024918


The Solr join only works in a distributed environment if you partition your
documents properly. Documents that link to each other need to reside on the
same shard and this can be a problem in some cases.
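
As an illustration of what "partition properly" might mean in practice, here
is a minimal Python sketch. It assumes you route documents to shards yourself
at index time; shard_for, NUM_SHARDS and the sample documents are hypothetical,
and this is not Solr's own routing code. The only point is that the shard is
derived from the shared join key rather than from each document's own id.

```python
import hashlib


def shard_for(join_key: str, num_shards: int) -> int:
    """Map a join key to a shard number so linked documents share a shard."""
    digest = hashlib.md5(join_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


NUM_SHARDS = 4  # hypothetical cluster size

# A main document and an access control document that points to it.
main_doc = {"id": "doc42", "text": "..."}
acl_doc = {"id": "acl-doc42-superuser", "doc_id": "doc42", "auth": "superuser"}

# Route the main document by its id and the access control document by the
# id it points to, so both end up on the same shard.
main_shard = shard_for(main_doc["id"], NUM_SHARDS)
acl_shard = shard_for(acl_doc["doc_id"], NUM_SHARDS)
```

If the access control document were routed by its own id instead, the two
could land on different shards and the join would miss the link.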

Martijn

Re: Grouping queries

Posted by Jamie Johnson <je...@gmail.com>.

On Fri, Mar 23, 2012 at 6:37 AM, Martijn v Groningen
<ma...@gmail.com> wrote:
> On 22 March 2012 03:10, Jamie Johnson <je...@gmail.com> wrote:
>
>> I need to apologize: I believe that in my example I oversimplified the
>> problem too much, and it's not clear what I am trying to do, so I'll
>> try again.
>>
>> I have a situation where I have a set of access controls say user,
>> super user and ultra user.  These controls are not necessarily
>> hierarchical in that user < super user < ultra user.  Each of these
>> controls should only be able to see documents with some
>> combination of access controls they have.  In my actual case we have
>> many access controls and they can be combined in a number of fashions
>> so I can't simply constrain what they are searching by a query alone
>> (i.e. if it's a user the query is auth:user AND (some query)).  Now I
>> have a case where a document contains information that a user can see
>> but also contains information a super user can see.  Our current
>> system marks this document at the super user level and the user can't
>> see it.  We now have a requirement to make the pieces that are at the
>> user level available to the user while still allowing the super user
>> to see and search all the information.  My original thought was to
>> simply index the document twice. This could produce a
>> duplicate (say if a user had both user and super user), but since this
>> situation is rare it may not matter.  After coming across the grouping
>> capability in solr I figured I could execute a group query where we
>> grouped on some key which indicated that 2 documents were the same
>> just with different access controls (user and super user in this
>> example).  We could then filter out the documents in the group the
>> user isn't allowed to see and only keep the document with the access
>> controls they have.
>>
> Maybe I'm not understanding this right, but why can't you save the access
> controls as a multivalued field in your schema? In your example, if the
> current user is a normal user you can just query auth:user AND (query), and
> if the current user is a super user, auth:superuser AND (query). A document
> that is searchable for both superuser and user is then returned in either
> case (if it matches the rest of the query).
>

I'd like to avoid having duplicates. Although my access controls are
not strictly hierarchical, there are cases where a super user can see
both his docs and user docs.  The idea was to have the super user doc
be a superset of the user doc.  So in my case I really have only 1
document, but that 1 document has pieces a user can see and pieces only
a super user can see.  The idea was to index the document twice: once
with the entire doc and once with just the pieces the user could see.

>>
>> I hope this makes more sense. Unfortunately, I don't believe Join queries
>> will work: if I create a separate document for each access control, I
>> don't think I could search across these documents as if they were a
>> single document (i.e. search for something in the user document and
>> something in the super user document in a single query).  This led me
>> to believe that grouping was the way to go in this case, but again I am
>> very interested in any suggestions that the community could offer.
>>
> I wouldn't use grouping. The Solr join is still an option. Let's say you have
> many access controls and the access controls change often on your documents.
> You can then choose to store the access controls, with an id pointing to your
> logical document, as a separate document in a different Solr core (index). In
> the core where your main documents are, you don't keep the access controls.
> You can then use the Solr join to filter out documents that the current user
> isn't supposed to search. Something like this:
> q=(query)&fq={!join fromIndex=core1 from=doc_id to=id}auth:superuser
> core1 is the core containing the access control documents, and doc_id is the
> id that points to your regular documents.
>
> The benefit of this approach is that if you fine-tune core1 for high
> updatability, you can change your access controls very frequently without
> paying a big performance penalty.

Where is Join documented?  I looked at
http://wiki.apache.org/solr/Join and see no reference to "fromIndex".
Also does this work in a distributed environment?

Re: Grouping queries

Posted by Martijn v Groningen <ma...@gmail.com>.

On 22 March 2012 03:10, Jamie Johnson <je...@gmail.com> wrote:

> I need to apologize: I believe that in my example I oversimplified the
> problem too much, and it's not clear what I am trying to do, so I'll
> try again.
>
> I have a situation where I have a set of access controls say user,
> super user and ultra user.  These controls are not necessarily
> hierarchical in that user < super user < ultra user.  Each of these
> controls should only be able to see documents with some
> combination of access controls they have.  In my actual case we have
> many access controls and they can be combined in a number of fashions
> so I can't simply constrain what they are searching by a query alone
> (i.e. if it's a user the query is auth:user AND (some query)).  Now I
> have a case where a document contains information that a user can see
> but also contains information a super user can see.  Our current
> system marks this document at the super user level and the user can't
> see it.  We now have a requirement to make the pieces that are at the
> user level available to the user while still allowing the super user
> to see and search all the information.  My original thought was to
> simply index the document twice. This could produce a
> duplicate (say if a user had both user and super user), but since this
> situation is rare it may not matter.  After coming across the grouping
> capability in solr I figured I could execute a group query where we
> grouped on some key which indicated that 2 documents were the same
> just with different access controls (user and super user in this
> example).  We could then filter out the documents in the group the
> user isn't allowed to see and only keep the document with the access
> controls they have.
>
Maybe I'm not understanding this right, but why can't you save the access
controls as a multivalued field in your schema? In your example, if the
current user is a normal user you can just query auth:user AND (query), and
if the current user is a super user, auth:superuser AND (query). A document
that is searchable for both superuser and user is then returned in either
case (if it matches the rest of the query).
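
A rough Python sketch of that idea, for illustration only. The auth field name
comes from the example above; the sample document, the title:solr query and the
restricted_query helper are made up, and the code only builds the query string
rather than talking to a Solr server.

```python
from urllib.parse import urlencode

# One indexed document, visible to both roles via a multivalued auth field.
sample_doc = {"id": "doc42", "auth": ["user", "superuser"], "title": "solr"}


def restricted_query(role: str, user_query: str) -> str:
    """Combine the role restriction with the user's own query."""
    return f"auth:{role} AND ({user_query})"


# The same user query, restricted for two different roles.
user_q = restricted_query("user", "title:solr")
super_q = restricted_query("superuser", "title:solr")

# URL-encoded form, ready to append to a select request.
encoded = urlencode({"q": super_q})
```

Because sample_doc lists both roles in auth, it matches user_q as well as
super_q, so it only has to be indexed once.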

>
> I hope this makes more sense. Unfortunately, I don't believe Join queries
> will work: if I create a separate document for each access control, I
> don't think I could search across these documents as if they were a
> single document (i.e. search for something in the user document and
> something in the super user document in a single query).  This led me
> to believe that grouping was the way to go in this case, but again I am
> very interested in any suggestions that the community could offer.
>
I wouldn't use grouping. The Solr join is still an option. Let's say you have
many access controls and the access controls change often on your documents.
You can then choose to store the access controls, with an id pointing to your
logical document, as a separate document in a different Solr core (index). In
the core where your main documents are, you don't keep the access controls.
You can then use the Solr join to filter out documents that the current user
isn't supposed to search. Something like this:
q=(query)&fq={!join fromIndex=core1 from=doc_id to=id}auth:superuser
core1 is the core containing the access control documents, and doc_id is the
id that points to your regular documents.

The benefit of this approach is that if you fine-tune core1 for high
updatability, you can change your access controls very frequently without
paying a big performance penalty.
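
For anyone building this request from client code, here is a small Python
sketch of assembling and URL-encoding the parameters. The join syntax, core1,
doc_id, id and auth:superuser are taken from the example above; the join_filter
helper and the title:solr query are placeholders for illustration.

```python
from urllib.parse import urlencode


def join_filter(from_index: str, from_field: str, to_field: str,
                acl_query: str) -> str:
    """Build a join filter query in the local-params syntax shown above."""
    return (f"{{!join fromIndex={from_index} from={from_field} "
            f"to={to_field}}}{acl_query}")


params = {
    "q": "title:solr",  # placeholder for the user's actual query
    "fq": join_filter("core1", "doc_id", "id", "auth:superuser"),
}

# The braces, spaces and '!' in the filter must be URL-encoded in a request.
query_string = urlencode(params)
```

Keeping the access restriction in fq rather than in q keeps it separate from
the user's own query text.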

Re: Grouping queries

Posted by Jamie Johnson <je...@gmail.com>.

I need to apologize: I believe that in my example I oversimplified the
problem too much, and it's not clear what I am trying to do, so I'll
try again.

I have a situation where I have a set of access controls say user,
super user and ultra user.  These controls are not necessarily
hierarchical in that user < super user < ultra user.  Each of these
controls should only be able to see documents with some
combination of access controls they have.  In my actual case we have
many access controls and they can be combined in a number of fashions
so I can't simply constrain what they are searching by a query alone
(i.e. if it's a user the query is auth:user AND (some query)).  Now I
have a case where a document contains information that a user can see
but also contains information a super user can see.  Our current
system marks this document at the super user level and the user can't
see it.  We now have a requirement to make the pieces that are at the
user level available to the user while still allowing the super user
to see and search all the information.  My original thought was to
simply index the document twice. This could produce a
duplicate (say if a user had both user and super user), but since this
situation is rare it may not matter.  After coming across the grouping
capability in solr I figured I could execute a group query where we
grouped on some key which indicated that 2 documents were the same
just with different access controls (user and super user in this
example).  We could then filter out the documents in the group the
user isn't allowed to see and only keep the document with the access
controls they have.
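
To make the filtering step concrete, here is a rough Python sketch of what the
client could do with the grouped results. The data layout, the field names and
the rule of preferring the variant with the most text are all assumptions made
for illustration; this is not what Solr returns verbatim.

```python
def visible_docs(groups, user_auths):
    """Keep at most one variant per group that the user is allowed to see."""
    result = {}
    for key, variants in groups.items():
        allowed = [d for d in variants if d["auth"] in user_auths]
        if allowed:
            # Assumption: when several variants are allowed, keep the one
            # with the most content (the superset document).
            result[key] = max(allowed, key=lambda d: len(d["text"]))
    return result


# Two groups keyed by the shared document key (hypothetical data).
groups = {
    "k1": [
        {"auth": "user", "text": "public part"},
        {"auth": "superuser", "text": "public part plus restricted part"},
    ],
    "k2": [
        {"auth": "superuser", "text": "restricted only"},
    ],
}

for_user = visible_docs(groups, {"user"})
for_both = visible_docs(groups, {"user", "superuser"})
```

With this data a plain user gets only the user variant of k1, while someone
holding both controls gets the fuller variant of k1 plus k2.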

I hope this makes more sense. Unfortunately, I don't believe Join queries
will work: if I create a separate document for each access control, I
don't think I could search across these documents as if they were a
single document (i.e. search for something in the user document and
something in the super user document in a single query).  This led me
to believe that grouping was the way to go in this case, but again I am
very interested in any suggestions that the community could offer.

Re: Grouping queries

Posted by Jamie Johnson <je...@gmail.com>.

Join looks interesting for this as well. Is this currently supported
in SolrCloud?

Re: Grouping queries

Posted by Jamie Johnson <je...@gmail.com>.

What would you recommend instead? I had thought about block join, perhaps.
I'm open to suggestions, though.

Re: Grouping queries

Posted by Martijn v Groningen <ma...@gmail.com>.

I'm not sure if grouping is the right feature to use for your
requirements. Grouping does have an impact on performance, which you need
to take into account.
Depending on which grouping features you're going to use (grouped facets,
ngroups), grouping performs well on large indices if you use filter
queries well
(e.g. 100M travel offers, but during searching you are only interested in
travel offers to specific destinations or in a specific period of time).
The best way to find this out is to just try it out.
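
If you want to try it out, a grouped request narrowed by filter queries could
be assembled roughly like this in Python. The group, group.field and
group.ngroups parameters are Solr's result grouping parameters; the field names
(hotel_id, destination, departure), the query and the grouped_request helper
are invented for the travel offer example.

```python
from urllib.parse import urlencode


def grouped_request(user_query, group_field, filters):
    """Build the query string for a grouped search narrowed by filters."""
    params = [
        ("q", user_query),
        ("group", "true"),
        ("group.field", group_field),
        ("group.ngroups", "true"),  # also ask for the number of groups
    ]
    # One fq parameter per filter, so each can be cached independently.
    params.extend(("fq", f) for f in filters)
    return urlencode(params)


query_string = grouped_request(
    "beach",
    "hotel_id",
    [
        "destination:spain",
        "departure:[2012-06-01T00:00:00Z TO 2012-06-30T23:59:59Z]",
    ],
)
```

The filters shrink the set of documents that grouping has to work on, which is
where most of the performance benefit comes from.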

-- 
Met vriendelijke groet,

Martijn van Groningen