Posted to solr-user@lucene.apache.org by Pooja Verlani <po...@gmail.com> on 2011/07/11 11:07:57 UTC

Restricting the Solr Posting List (retrieved set)

Hi,

We want to search an index in such a way that, even if a clause has a long
posting list, Solr stops collecting documents for that clause after X
matching documents.

For example, if the query "India" can match 5M documents, we would like to
restrict the set to only 500K documents.

Since we post documents chronologically, we would like only the X most
recent documents to be matched for the clause.

Is this possible in any way?

Regards,
Pooja

Re: Restricting the Solr Posting List (retrieved set)

Posted by Pooja Verlani <po...@gmail.com>.
Thanks for the reply.

I have a very large index, so retrieving older documents when they are not
needed wastes time, and on top of that I would need to apply recency boosts
or a time sort. I am looking for a way to avoid that.
That is why I need to restrict my docset to the recently added documents. I
would prefer not to use the "rows" parameter for this.

Thanks,
pooja

On Mon, Jul 11, 2011 at 5:49 PM, Bob Sandiford <bob.sandiford@sirsidynix.com> wrote:

> A good answer may also depend on WHY you are wanting to restrict to 500K
> documents.
>
> Are you seeking to reduce the time spent by Solr in determining the doc
> count?  Are you just wanting to prevent people from moving too far into the
> result set?  Is it the case that you can only display 6 digits for your return
> count? :)
>
> If Solr is performing adequately, you could always just artificially
> restrict the result set.  Solr doesn't actually 'return' all 5M documents -
> it only returns the number you have specified in your query (as well as
> having some cache for the next results in anticipation of a subsequent
> query).  So, if the total count returned exceeds 500K, then just report 500K
> as the number of results, and similarly restrict how far a user can page
> through the results...
>
> (And - you can (and it sounds like you should) sort your results by descending
> post date so that you do in fact get the most recent ones coming back
> first...)
>
> Bob Sandiford | Lead Software Engineer | SirsiDynix
> P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.com
> www.sirsidynix.com
>
>
> > -----Original Message-----
> > From: Ahmet Arslan [mailto:iorixxx@yahoo.com]
> > Sent: Monday, July 11, 2011 7:43 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Restricting the Solr Posting List (retrieved set)
> >
> >
> > > We want to search an index in such a way that, even if a clause has a
> > > long posting list, Solr stops collecting documents for that clause
> > > after X matching documents.
> > >
> > > For example, if the query "India" can match 5M documents, we would
> > > like to restrict the set to only 500K documents.
> > >
> > > Since we post documents chronologically, we would like only the X
> > > most recent documents to be matched for the clause.
> > >
> > > Is this possible in any way?
> >
> > Looks like your use case is suitable for time-based sharding.
> > http://wiki.apache.org/solr/DistributedSearch
> >
> > Let's say you divide your shards by month. You will have a
> > separate core for each month.
> > http://wiki.apache.org/solr/CoreAdmin
> >
> > When a query comes in, you hit the most recent core. If you don't
> > obtain enough results, add another value (the previous month's core)
> > to the &shards= parameter.

RE: Restricting the Solr Posting List (retrieved set)

Posted by Bob Sandiford <bo...@sirsidynix.com>.
A good answer may also depend on WHY you are wanting to restrict to 500K documents.

Are you seeking to reduce the time spent by Solr in determining the doc count?  Are you just wanting to prevent people from moving too far into the result set?  Is it the case that you can only display 6 digits for your return count? :)

If Solr is performing adequately, you could always just artificially restrict the result set.  Solr doesn't actually 'return' all 5M documents - it only returns the number you have specified in your query (as well as having some cache for the next results in anticipation of a subsequent query).  So, if the total count returned exceeds 500K, then just report 500K as the number of results, and similarly restrict how far a user can page through the results...

(And - you can (and it sounds like you should) sort your results by descending post date so that you do in fact get the most recent ones coming back first...)

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.com
www.sirsidynix.com
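
A rough sketch of the suggestion above, not a definitive implementation: sort by descending post date, report at most 500K results, and refuse to page past that cap. The `q`, `sort`, `start` and `rows` parameters are standard Solr query parameters; the field name `post_date` and the helper names are assumptions made up for illustration.

```python
from urllib.parse import urlencode

# Cap taken from the 500K figure in the thread.
MAX_REPORTED = 500_000


def build_params(query, page, page_size=10, date_field="post_date"):
    """Build Solr query parameters for one page, newest documents first.

    `date_field` is a hypothetical field name; use whatever date field
    your schema defines. Paging is clamped so that `start` never goes
    past the reporting cap.
    """
    max_start = max(MAX_REPORTED - page_size, 0)
    start = min(max(page, 0) * page_size, max_start)
    return {
        "q": query,
        "sort": f"{date_field} desc",  # most recent documents come back first
        "start": start,
        "rows": page_size,
    }


def reported_count(num_found):
    """Clamp the numFound value from the Solr response to the cap."""
    return min(num_found, MAX_REPORTED)


# Query string to append to the core's /select URL.
print(urlencode(build_params("India", page=0)))
print(reported_count(5_000_000))
```

Note that this only limits what the user sees. Solr still walks the full posting list to compute the total count, so it does not reduce query time by itself.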



Re: Restricting the Solr Posting List (retrieved set)

Posted by Ahmet Arslan <io...@yahoo.com>.
 
> We want to search an index in such a way that, even if a clause has a long
> posting list, Solr stops collecting documents for that clause after X
> matching documents.
>
> For example, if the query "India" can match 5M documents, we would like to
> restrict the set to only 500K documents.
>
> Since we post documents chronologically, we would like only the X most
> recent documents to be matched for the clause.
>
> Is this possible in any way?

Looks like your use case is suitable for time-based sharding.
http://wiki.apache.org/solr/DistributedSearch

Let's say you divide your shards by month. You will have a separate core for each month.
http://wiki.apache.org/solr/CoreAdmin

When a query comes in, you hit the most recent core. If you don't obtain enough results, add another value (the previous month's core) to the &shards= parameter.
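
A minimal sketch of that widening loop, under stated assumptions: the core naming scheme (`docs_YYYYMM`), the host `localhost:8983`, and the `run_query` callable are all hypothetical. `run_query` stands in for whatever HTTP client you use; it takes the request parameters and returns the numFound value from the response. Only the comma-separated `host:port/path` form of the `shards` value follows the DistributedSearch page.

```python
from datetime import date


def month_cores(today, months, host="localhost:8983", prefix="docs_"):
    """Return shard addresses for the last `months` monthly cores, newest first.

    The host and the docs_YYYYMM naming scheme are assumptions for this sketch.
    """
    shards = []
    year, month = today.year, today.month
    for _ in range(months):
        shards.append(f"{host}/solr/{prefix}{year:04d}{month:02d}")
        month -= 1
        if month == 0:  # step back across a year boundary
            year, month = year - 1, 12
    return shards


def search_recent_first(query, wanted, run_query, today, max_months=12):
    """Add one older monthly core at a time until enough documents match."""
    shards, num_found = "", 0
    for months in range(1, max_months + 1):
        shards = ",".join(month_cores(today, months))
        num_found = run_query({"q": query, "shards": shards})
        if num_found >= wanted:
            break
    return shards, num_found


# Stand-in for a real Solr call: per-core hit counts invented for the demo.
fake_counts = {"201107": 3, "201106": 4, "201105": 10}


def fake_run_query(params):
    return sum(n for key, n in fake_counts.items() if key in params["shards"])


shards, found = search_recent_first("India", 5, fake_run_query, date(2011, 7, 11))
print(shards)
print(found)
```

Each widening step re-runs the query against all selected cores, so in practice you may prefer to start with a few months at once if a single month is rarely enough.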