Posted to user@lucenenet.apache.org by Omri Suissa <om...@diffdoof.com> on 2012/11/25 09:32:09 UTC

Lucene results filtering best practices

Hi all,

All the docs in my index have a field named "groupId" to enable filtering
the search results by the user's groups. Each user has several groups
(around 20-100 on average).

Now I have 2 implementation options:

1)      Add to the query 20-100 terms (with OR) of each user group (for
example: "content:cat AND (groupId:4 OR groupId:58 OR groupId:94 … OR
groupId:N)")

2)      Search with only the user's query and use a collector (I already have
one) that filters the results before scoring: look up the groupId of each
matching doc, and score and add the doc only if that groupId is in the user's
group list.

Regardless of the time and effort of the implementation, which is better (and
why)?
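
For reference, option 1 can be sketched as a small helper that builds the query string shown above. This is only an illustration in Python, not Lucene code: the function name build_group_query and its parameters are hypothetical, and the resulting string would still have to be handed to a query parser.

```python
def build_group_query(user_query, group_ids, field="groupId"):
    """AND the user's query with an OR clause over the user's group ids."""
    if not group_ids:
        # With no groups there is nothing the user is allowed to see.
        raise ValueError("user has no groups")
    clause = " OR ".join(f"{field}:{gid}" for gid in group_ids)
    return f"{user_query} AND ({clause})"


query = build_group_query("content:cat", [4, 58, 94])
print(query)  # content:cat AND (groupId:4 OR groupId:58 OR groupId:94)
```

With 20-100 groups the OR clause simply grows to 20-100 terms.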



Thanks,

Omri

Re: Lucene results filtering best practices

Posted by Omri Suissa <om...@diffdoof.com>.
Thanks! :)


On Sun, Nov 25, 2012 at 8:58 PM, Allan, Brad (Wokingham) <
Brad.Allan@fiserv.com> wrote:

> I would say it is because, with a filter, the default scorer only considers
> the candidates in the subset, so if the default scorer meets your needs
> there is no need to do anything with scoring.
>
> I also recall some performance degradation warning if you access the reader
> from the collector; it's on one of the methods of the collector.
>

Re: Lucene results filtering best practices

Posted by "Allan, Brad (Wokingham)" <Br...@Fiserv.com>.
I would say it is because, with a filter, the default scorer only considers the candidates in the subset, so if the default scorer meets your needs there is no need to do anything with scoring.

I also recall some performance degradation warning if you access the reader from the collector; it's on one of the methods of the collector.
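
The difference can be illustrated with a toy model in Python. This is not Lucene code and it does not measure real performance: the index contents, the DOC_GROUP table (standing in for a per-document read through the reader) and both function names are made up for the example. It only shows that the filter route narrows the candidate set up front, while the collector route performs one lookup per hit of the user's query.

```python
# Toy inverted index: (field, term) -> set of doc ids. Hypothetical data.
POSTINGS = {
    ("content", "cat"): {0, 1, 2, 3, 4, 5},
    ("groupId", "4"): {0, 6},
    ("groupId", "58"): {3},
    ("groupId", "94"): {7},
    ("groupId", "7"): {1, 2, 4, 5},
}
# Per-document table, standing in for reading a field value via the reader.
DOC_GROUP = {0: "4", 1: "7", 2: "7", 3: "58", 4: "7", 5: "7", 6: "4", 7: "94"}


def search_with_filter(term, groups):
    """Union the group postings once, then intersect with the query's hits."""
    allowed = set()
    for group in groups:
        allowed |= POSTINGS.get(("groupId", group), set())
    return sorted(POSTINGS.get(term, set()) & allowed)


def search_with_collector(term, groups):
    """Visit every hit of the query and check its group one doc at a time."""
    hits, lookups = [], 0
    for doc in sorted(POSTINGS.get(term, set())):
        lookups += 1  # one per-document lookup for each hit
        if DOC_GROUP[doc] in groups:
            hits.append(doc)
    return hits, lookups


user_groups = {"4", "58", "94"}
filtered = search_with_filter(("content", "cat"), user_groups)
collected, lookups = search_with_collector(("content", "cat"), user_groups)
print(filtered, collected, lookups)
```

Both routes return the same documents; the collector route needed a lookup for every document matching content:cat, including those outside the user's groups.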

(Sent from my Blackberry device)
Brad Allan
Development Lead
Risk & Compliance
Fiserv
Office: +44 (0) 845 013 1137
Mobile: +44 (0) 7866 720024
Fax: +44 (0) 845 013 1010
www.fiserv.com

----- Original Message -----
From: Omri Suissa [mailto:omri.suissa@diffdoof.com]
Sent: Sunday, November 25, 2012 10:07 AM
To: Simon Svensson <si...@devhost.se>
Cc: user@lucenenet.apache.org <us...@lucenenet.apache.org>
Subject: Re: Lucene results filtering best practices

Hi,
Thanks.
Can you tell me why TermsFilter is better than filtering with a collector?

Omri




Re: Lucene results filtering best practices

Posted by Omri Suissa <om...@diffdoof.com>.
Hi,
Thanks.
Can you tell me why TermsFilter is better than filtering with a collector?

Omri


On Sun, Nov 25, 2012 at 10:54 AM, Simon Svensson <si...@devhost.se> wrote:

>  Hi,
>
> Use a TermsFilter<http://lucene.apache.org/core/old_versioned_docs/versions/3_0_3/api/all/org/apache/lucene/search/TermsFilter.htm>.
>
>
> Constructs a filter for docs matching any of the terms added to this
> class. Unlike a RangeFilter this can be used for filtering on multiple
> terms that are not necessarily in a sequence. An example might be a
> collection of primary keys from a database query result or perhaps a choice
> of "category" labels picked by the end user. As a filter, this is much
> faster than the equivalent query (a BooleanQuery with many "should"
> TermQueries)
>
> Depending on the number of users, queries and magic domain information
> only known to you, check out the CachingWrapperFilter<http://lucene.apache.org/core/old_versioned_docs/versions/3_0_3/api/all/org/apache/lucene/search/TermsFilter.htm>
>
> Wraps another filter's result and caches it. The purpose is to allow
> filters to simply filter, and then wrap with this class to add caching.
>
> // Simon
>

Re: Lucene results filtering best practices

Posted by Simon Svensson <si...@devhost.se>.
Hi,

Use a TermsFilter
<http://lucene.apache.org/core/old_versioned_docs/versions/3_0_3/api/all/org/apache/lucene/search/TermsFilter.htm>.


    Constructs a filter for docs matching any of the terms added to this
    class. Unlike a RangeFilter this can be used for filtering on
    multiple terms that are not necessarily in a sequence. An example
    might be a collection of primary keys from a database query result
    or perhaps a choice of "category" labels picked by the end user. As
    a filter, this is much faster than the equivalent query (a
    BooleanQuery with many "should" TermQueries)

Depending on the number of users, queries and magic domain information
only known to you, check out the CachingWrapperFilter
<http://lucene.apache.org/core/old_versioned_docs/versions/3_0_3/api/all/org/apache/lucene/search/TermsFilter.htm>.


    Wraps another filter's result and caches it. The purpose is to allow
    filters to simply filter, and then wrap with this class to add caching.
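
To make the two descriptions concrete, here is a rough model of the idea in Python. It is a sketch, not the Lucene API: TermsFilterModel, CachingWrapperModel and the sample index are hypothetical names and data. The real CachingWrapperFilter keeps its cache per index reader; the sketch mimics that by keying the cache on the index object.

```python
class TermsFilterModel:
    """Matches docs that contain any of the given terms in one field."""

    def __init__(self, field, terms):
        self.field = field
        self.terms = frozenset(terms)

    def doc_ids(self, index):
        result = set()
        for term in self.terms:
            result |= index.get((self.field, term), set())
        return frozenset(result)


class CachingWrapperModel:
    """Wraps another filter and remembers its result for each index."""

    def __init__(self, inner):
        self.inner = inner
        self._cache = {}
        self.computations = 0  # how often the inner filter actually ran

    def doc_ids(self, index):
        key = id(index)  # stands in for "per reader" caching
        if key not in self._cache:
            self.computations += 1
            self._cache[key] = self.inner.doc_ids(index)
        return self._cache[key]


# Hypothetical index: (field, term) -> set of doc ids.
index = {
    ("groupId", "4"): {0, 6},
    ("groupId", "58"): {3},
    ("groupId", "94"): {7},
}
cached = CachingWrapperModel(TermsFilterModel("groupId", ["4", "58", "94"]))
first = cached.doc_ids(index)   # computed from the postings
second = cached.doc_ids(index)  # served from the cache
print(sorted(first), cached.computations)
```

The caching only pays off if the same wrapper instance is reused across searches, for example one instance per distinct set of groups.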

// Simon
