Posted to solr-user@lucene.apache.org by Jamie Johnson <je...@gmail.com> on 2011/09/01 17:10:10 UTC

Re: Post Processing Solr Results

Ok, so I feel like I'm 90% of the way there.  For standard queries
things work fine, but for distributed queries I'm running into a bit
of an issue.  Right now the queries run fine but when doing
distributed queries (using SolrCloud) the numFound is always getting
set to the number of requested rows.  Can anyone shed some light on
why this might be happening?

On Tue, Aug 30, 2011 at 8:53 AM, Jamie Johnson <je...@gmail.com> wrote:
> This might work in conjunction with post-processing to help pare
> down the results, but the logic for the actual access to the data
> is too complex to have entirely in Solr.
>
> On Mon, Aug 29, 2011 at 2:02 PM, Erick Erickson <er...@gmail.com> wrote:
>> It's reasonable, but post-filtering is often difficult: you have
>> too many documents to wade through. If you can see any way
>> at all to just include a clause in the query, you'll save a world
>> of effort...
>>
>> Is there any way you can include a value in some kind of
>> "permissions" field? Let's say you have a document that
>> is only to be visible for "tier 1" customers. If your permissions
>> field contained the tiers (e.g. tier0, tier1), then a simple
>> AND permissions:tier1 would do the trick...
>>
>> I know this is a trivial example, but you see where this is headed.
>> The documents can contain as many of these tokens in permissions
>> as you want. As long as you can string together a clause
>> like "AND permissions:(A OR B OR C)" and not have the clause
>> get ridiculously long (as in thousands of values), that works best.
>>
>> Any such scheme depends upon being able to assign the documents
>> some kind of code that doesn't change too often (because when it does
>> you have to re-index) and figure out, at query time, what permissions
>> a user has.
>>
>> Using FieldCache or low-level Lucene routines can answer the question
>> "Does doc X contain token Y in field Z" reasonably easily. What it has
>> a hard time doing is answering "For document X, what are all the values
>> in the inverted index in field Z".
>>
>> If this doesn't make sense, could you explain a bit more about your
>> permissions model?
>>
>> Hope this helps
>> Erick
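
A minimal sketch of how the "AND permissions:(A OR B OR C)" clause
described above might be assembled at query time. The class and method
names are hypothetical, and it assumes the user's tokens are already
known when the query is built. Each token is quoted so that characters
special to the query parser are not interpreted.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Solr or Lucene.
final class PermissionClause {

    // Builds e.g. permissions:("tier0" OR "tier1") from a user's tokens.
    static String build(String field, List<String> tokens) {
        if (tokens.isEmpty()) {
            // An empty clause would be a syntax error; let the caller decide.
            throw new IllegalArgumentException("user has no permission tokens");
        }
        return tokens.stream()
                // Quote each token, escaping backslashes and double quotes.
                .map(t -> "\"" + t.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(" OR ", field + ":(", ")"));
    }
}
```

The resulting string could be ANDed onto the main query or sent as a
separate filter query. As noted above, this only stays practical while
the number of tokens per user is small.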
>>
>> On Mon, Aug 29, 2011 at 11:46 AM, Jamie Johnson <je...@gmail.com> wrote:
>>> Thanks guys, perhaps I am just going about this the wrong way.  So let
>>> me explain my problem and perhaps there is a more appropriate
>>> solution.  What I need to do is basically hide certain results based
>>> on some passed in user parameter (say their service tier for
>>> instance).  What I'd like to do is have some way to plug in my custom
>>> logic to basically remove certain documents from the result set using
>>> this information.  Now that being said I technically don't need to
>>> remove the documents from the full result set, I really only need to
>>> remove them from current page (but still ensuring that a page is
>>> filled and sorted).  At present I'm trying to see if there is a way
>>> for me to add this type of logic after the QueryComponent has
>>> executed, perhaps by going through the DocIdandSet at this point and
>>> then intersecting the DocIdSet with a DocIdSet which would filter out
>>> the stuff I don't want seen.  Does this sound reasonable or like a
>>> fool's errand?
>>>
>>>
>>>
>>> On Mon, Aug 29, 2011 at 10:51 AM, Erik Hatcher <er...@gmail.com> wrote:
>>>> I haven't followed the details, but what I'm guessing you want here is Lucene's FieldCache.  Perhaps something along the lines of how faceting uses it (in SimpleFacets.java) -
>>>>
>>>>   FieldCache.DocTermsIndex si = FieldCache.DEFAULT.getTermsIndex(searcher.getIndexReader(), fieldName);
>>>>
>>>>        Erik
>>>>
>>>> On Aug 29, 2011, at 09:58 , Erick Erickson wrote:
>>>>
>>>>> If you're asking whether there's a way to find, say,
>>>>> all the values for the "auth" field associated with
>>>>> a document... no. The nature of an inverted
>>>>> index makes this hard (think of finding all
>>>>> the definitions in a dictionary where the word
>>>>> "earth" was in the definition).
>>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>> On Mon, Aug 29, 2011 at 9:21 AM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>> Thanks Erick, if I did not know the token up front that could be in
>>>>>> the index is there not an efficient way to get the field for a
>>>>>> specific document and do some custom processing on it?
>>>>>>
>>>>>> On Mon, Aug 29, 2011 at 8:34 AM, Erick Erickson <er...@gmail.com> wrote:
>>>>>>> Start here I think:
>>>>>>>
>>>>>>> http://lucene.apache.org/java/3_0_2/api/core/index.html?org/apache/lucene/index/TermDocs.html
>>>>>>>
>>>>>>> Best
>>>>>>> Erick
>>>>>>>
>>>>>>> On Mon, Aug 29, 2011 at 8:24 AM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>>> Thanks for the reply.  The fields I want are indexed, but how would I
>>>>>>>> go directly at the fields I wanted?
>>>>>>>>
>>>>>>>> In regards to indexing the auth tokens I've thought about this and am
>>>>>>>> trying to get confirmation if that is reasonable given our
>>>>>>>> constraints.
>>>>>>>>
>>>>>>>> On Mon, Aug 29, 2011 at 8:20 AM, Erick Erickson <er...@gmail.com> wrote:
>>>>>>>>> Yeah, loading the document inside a Collector is a
>>>>>>>>> definite no-no. Have you tried going directly
>>>>>>>>> at the fields you want (assuming they're
>>>>>>>>> indexed)? That *should* be much faster, but
>>>>>>>>> whether it'll be fast enough is a good question. I'm
>>>>>>>>> thinking some of the Terms methods here. You
>>>>>>>>> *might* get some joy out of making sure lazy
>>>>>>>>> field loading is enabled (and make sure the
>>>>>>>>> fields you're accessing for your logic are
>>>>>>>>> indexed), but I'm not entirely sure about
>>>>>>>>> that bit.
>>>>>>>>>
>>>>>>>>> This kind of problem is sometimes handled
>>>>>>>>> by indexing "auth tokens" with the documents
>>>>>>>>> and including an OR clause on the query
>>>>>>>>> with the authorizations for a particular
>>>>>>>>> user, but that works best if there is an upper
>>>>>>>>> limit (in the 100s) of tokens that a user can possibly
>>>>>>>>> have; often this works best with some kind of
>>>>>>>>> grouping. Making this work when a user can
>>>>>>>>> have tens of thousands of auth tokens is...er...
>>>>>>>>> contra-indicated...
>>>>>>>>>
>>>>>>>>> Hope this helps a bit...
>>>>>>>>> Erick
>>>>>>>>>
>>>>>>>>> On Sun, Aug 28, 2011 at 11:59 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>>>>> Just a bit more information.  Inside my class which extends
>>>>>>>>>> FilteredDocIdSet all of the time seems to be getting spent in
>>>>>>>>>> retrieving the document from the readerCtx, doing this
>>>>>>>>>>
>>>>>>>>>> Document doc = readerCtx.reader.document(docid);
>>>>>>>>>>
>>>>>>>>>> If I comment out this and just return true things fly along as I
>>>>>>>>>> expect.  My query is returning a total of 2 million documents also.
>>>>>>>>>>
>>>>>>>>>> On Sun, Aug 28, 2011 at 11:39 AM, Jamie Johnson <je...@gmail.com> wrote:
>>>>>>>>>>> I have a need to post process Solr results based on some access
>>>>>>>>>>> controls which are set up outside of Solr.  Currently we've written
>>>>>>>>>>> something that extends SearchComponent and in the prepare method I'm
>>>>>>>>>>> doing something like this
>>>>>>>>>>>
>>>>>>>>>>>                    QueryWrapperFilter qwf = new QueryWrapperFilter(rb.getQuery());
>>>>>>>>>>>                    Filter filter = new CustomFilter(qwf);
>>>>>>>>>>>                    FilteredQuery fq = new FilteredQuery(rb.getQuery(), filter);
>>>>>>>>>>>                    rb.setQuery(fq);
>>>>>>>>>>>
>>>>>>>>>>> Inside my CustomFilter I have a FilteredDocIdSet which checks if the
>>>>>>>>>>> document should be returned.  This works as I expect but for some
>>>>>>>>>>> reason is very very slow.  Even if I take out any of the machinery
>>>>>>>>>>> which does any logic with the document and only return true in the
>>>>>>>>>>> FilteredDocIdSet's match method the query still takes an inordinate
>>>>>>>>>>> amount of time as compared to not including this custom filter.  So my
>>>>>>>>>>> question, is this the most appropriate way of handling this?  What
>>>>>>>>>>> should the performance out of such a setup be expected to be?  Any
>>>>>>>>>>> information/pointers would be greatly appreciated.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>
>

Re: Post Processing Solr Results

Posted by Jamie Johnson <je...@gmail.com>.

Ok, I think I've got it.  The issue was that I can't modify the offset
and start params when the search is a distributed one; otherwise the
correct offset and max are lost.  A simple check in prepare fixed this.
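
The check itself is not shown in this message, so the following is only
a rough sketch of what such a guard might look like. It uses a plain Map
in place of Solr's request parameters so that it stands alone. The class
and method names are hypothetical, and the paging rewrite (start=0,
rows=start+rows) is an illustrative stand-in for whatever the component
really changes. Detecting a distributed search through the "shards" and
"isShard" request parameters is likewise an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical guard; a real component would read rb.req.getParams().
final class PagingGuard {

    // Assumption: a distributed search carries "shards" on the top-level
    // request and "isShard=true" on each per-shard sub-request.
    static boolean isDistributed(Map<String, String> params) {
        return params.containsKey("shards")
                || Boolean.parseBoolean(params.get("isShard"));
    }

    // Returns a copy of the params, rewriting paging only for local searches.
    static Map<String, String> adjust(Map<String, String> params) {
        Map<String, String> out = new HashMap<>(params);
        if (isDistributed(params)) {
            return out; // leave start/rows alone so merging across shards works
        }
        int start = Integer.parseInt(params.getOrDefault("start", "0"));
        int rows = Integer.parseInt(params.getOrDefault("rows", "10"));
        // Illustrative rewrite: fetch from the top so a later step can re-page.
        out.put("start", "0");
        out.put("rows", String.valueOf(start + rows));
        return out;
    }
}
```

With start=20 and rows=10, a local request comes back as start=0 and
rows=30, while the same request with a "shards" parameter is returned
unchanged.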
