Posted to solr-user@lucene.apache.org by vivek sar <vi...@gmail.com> on 2009/07/09 02:34:16 UTC

Boosting for most recent documents

Hi,

  I'm trying to find a way to get the most recent entry for a
searched word. For example, say my documents have a field named "user".
If I search for user:vivek, I want to get the document that was
indexed most recently. I can think of two ways:

1) Sort by some time stamp field - but with millions of documents this
becomes a huge memory problem as we have seen OOM with sorting before
2) Boost the most recent document - I'm not sure how to do this.
Basically, we want to have the most recent document score higher than
any other and then we can retrieve just 10 records and sort in the
application by time stamp field to get the most recent document
matching the keyword.
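
The client-side half of option 2 can be sketched as follows. This is only a minimal illustration, assuming each returned document is a dict with a hypothetical `timestamp` field holding ISO-8601 UTC strings (which compare correctly as plain strings); the helper name and the sample hits are made up.

```python
def most_recent(docs, ts_field="timestamp"):
    """Return the document with the greatest timestamp, or None if docs is empty."""
    return max(docs, key=lambda d: d[ts_field], default=None)


# Hypothetical top hits, as they might come back from a recency-boosted query.
top_hits = [
    {"id": "a", "user": "vivek", "timestamp": "2009-07-07T10:00:00Z"},
    {"id": "b", "user": "vivek", "timestamp": "2009-07-09T02:34:16Z"},
    {"id": "c", "user": "vivek", "timestamp": "2009-07-08T18:00:00Z"},
]

latest = most_recent(top_hits)
print(latest["id"])
```

Note that this only yields the truly newest match if the boost reliably places that document within the rows that were retrieved.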

Any suggestions on how this can be done?

Thanks,
-vivek

Re: Boosting for most recent documents

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Mon, Aug 3, 2009 at 2:46 PM, vivek sar<vi...@gmail.com> wrote:
> So, if I run only one sort query once in a day there would still be
> 4GB required at all time. Is there any way to tell Solr/Lucene to
> release the memory once the query has been run? Basically I don't want
> cache. I've commented out all the cache parameters in the
> solrconfig.xml, but I still see the very first time I run the sort
> query the memory jumps by 4 G and remains there.

There is currently no way to tell Lucene not to cache the FieldCache
entry it uses for sorting.
If you call commit though, a new searcher will be opened and the
memory should be released.
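
A minimal sketch of issuing such a commit over HTTP, assuming Solr's XML update handler is reachable at `/update` under a base URL like `http://localhost:8983/solr` (the host, port, and helper name are assumptions, not taken from this thread). The request is only built here, not sent.

```python
from urllib.request import Request


def build_commit_request(solr_base_url):
    """Build (but do not send) a POST of <commit/> to the update handler."""
    url = solr_base_url.rstrip("/") + "/update"
    return Request(
        url,
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# Assumed base URL; adjust to the actual deployment.
req = build_commit_request("http://localhost:8983/solr")
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`.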

-Yonik
http://www.lucidimagination.com

Re: Boosting for most recent documents

Posted by vivek sar <vi...@gmail.com>.
Hi,

 A related question on "getting the latest records first". After trying
a few of the suggested ways (function query, index-time boosting) of getting
the latest first, I settled for the simple "sort" parameter,

     sort=field+asc

As per the wiki, http://wiki.apache.org/solr/SchemaDesign?highlight=(sort),

Lucene caches "4 bytes * the number of documents" plus the unique
terms for the sorted field in the FieldCache. This is done so subsequent
sort requests can be served from the cache. So, as an example, the memory
usage with 1 billion records in one indexer instance would be:

1) 1 billion records
2) sort on time stamp field (rounded to hour) - for 1 year - 8760
unique terms. (negligible)
3) Total memory requirement  for sorting on this single field would be
around  1G * 4 = 4GB
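
The arithmetic above can be checked with a short sketch. The 4-bytes-per-document figure is the one quoted from the wiki; the function name and the 32-byte average term size are assumptions used only to show that the unique-term part is negligible.

```python
BYTES_PER_DOC = 4  # per-document entry size quoted from the wiki


def sort_cache_bytes(num_docs, unique_terms=0, avg_term_bytes=0):
    """Rough estimate: 4 bytes per document plus storage for the unique terms."""
    return num_docs * BYTES_PER_DOC + unique_terms * avg_term_bytes


hours_per_year = 365 * 24  # timestamps rounded to the hour, one year of data
estimate = sort_cache_bytes(
    1_000_000_000,
    unique_terms=hours_per_year,
    avg_term_bytes=32,  # assumed average term size, for illustration only
)
print(hours_per_year, estimate / 1e9)
```

Under these assumptions the per-document part dominates: the unique terms add well under a megabyte to roughly 4GB.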

So, even if I run only one sort query a day, 4GB would still be
required at all times. Is there any way to tell Solr/Lucene to
release the memory once the query has been run? Basically, I don't want
the cache. I've commented out all the cache parameters in
solrconfig.xml, but the very first time I run the sort
query the memory still jumps by 4GB and stays there.

Is there any way to make Lucene/Solr use less memory for sorting,
so that my application can scale (i.e., so that the sorting memory
requirement is not a function of the number of documents)?

Thanks,
-vivek





On Thu, Jul 16, 2009 at 3:10 PM, Chris
Hostetter<ho...@fucit.org> wrote:
>
> :   Does anyone know if Solr supports sorting by internal document ids,
> : i.e, like Sort.INDEXORDER in Lucene? If so, how?
>
> It does not.  In Solr, the decision to make "score desc" the default
> sort meant there is no way to request simple docId ordering.
>
> : Also, if anyone have any insight on if function query loads up unique
> : terms (like field sorts) in memory or not.
>
> It uses the exact same FieldCache as sorting.
>
>
>
>
> -Hoss
>

Re: Boosting for most recent documents

Posted by Chris Hostetter <ho...@fucit.org>.
:   Does anyone know if Solr supports sorting by internal document ids,
: i.e, like Sort.INDEXORDER in Lucene? If so, how?

It does not.  In Solr, the decision to make "score desc" the default
sort meant there is no way to request simple docId ordering.

: Also, if anyone have any insight on if function query loads up unique
: terms (like field sorts) in memory or not.

It uses the exact same FieldCache as sorting.




-Hoss

Re: Boosting for most recent documents

Posted by vivek sar <vi...@gmail.com>.
Hi,

  Does anyone know if Solr supports sorting by internal document ids,
i.e., like Sort.INDEXORDER in Lucene? If so, how?

Also, does anyone have any insight on whether a function query loads
unique terms (like field sorts) into memory or not?

Thanks,
-vivek

On Fri, Jul 10, 2009 at 10:26 AM, vivek sar<vi...@gmail.com> wrote:
> Thanks Bill. Couple of questions,
>
> 1) Would the function query load all unique terms (for that field) in
> memory the way sort (field cache) does? If so, that wouldn't work for
> us as we can have over 5 billion records spread across multiple shards
> (up to 10 indexer instances), that would surely kill the process if it
> were to load everything in memory.
>
> 2) Would the function query work on multi-shard query? For ex.,
> recip(rord(creationDate),1,1000,1000) would it automatically do the
> function on the combined result from all the shards or would it run on
> individual shard and get results from them?
>
> I would still be interested in knowing if Solr supports
> Sort.IndexOrder - if so, how?
>
> Thanks,
> -vivek

Re: Boosting for most recent documents

Posted by vivek sar <vi...@gmail.com>.
Thanks Bill. A couple of questions:

1) Would the function query load all unique terms (for that field) into
memory the way sort (FieldCache) does? If so, that wouldn't work for
us, as we can have over 5 billion records spread across multiple shards
(up to 10 indexer instances). Loading everything into memory would
surely kill the process.

2) Would the function query work on a multi-shard query? For example,
with recip(rord(creationDate),1,1000,1000), would the function be applied
to the combined result from all the shards, or would it run on each
individual shard, with the results then collected from them?
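
To make the function in question 2 concrete, here is a minimal sketch of what recip(rord(creationDate),1,1000,1000) computes, assuming recip(x,m,a,b) = a / (m*x + b) and that rord assigns reverse ordinals starting at 1 for the largest term. The helper names and sample timestamps are made up, and the sketch models a single index only; it says nothing about how shards are combined.

```python
def recip(x, m, a, b):
    """Reciprocal function: a / (m*x + b)."""
    return a / (m * x + b)


def reverse_ordinals(values):
    """Map each distinct value to its reverse ordinal (largest value -> 1)."""
    distinct = sorted(set(values), reverse=True)
    return {value: rank + 1 for rank, value in enumerate(distinct)}


# Hypothetical creationDate terms from one index.
timestamps = ["2009-07-07T10", "2009-07-08T10", "2009-07-09T10"]
rord = reverse_ordinals(timestamps)
boosts = {ts: recip(rord[ts], 1, 1000, 1000) for ts in timestamps}

for ts in timestamps:
    print(ts, rord[ts], round(boosts[ts], 6))
```

With these constants the newest term gets a boost of 1000/1001, and the value decays slowly as the reverse ordinal grows.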

I would still be interested in knowing if Solr supports
Sort.INDEXORDER - if so, how?

Thanks,
-vivek

On Thu, Jul 9, 2009 at 8:27 PM, Bill Au<bi...@gmail.com> wrote:
> With a time stamp you can use a function query to boost the score of newer
> documents:
> http://wiki.apache.org/solr/SolrRelevancyFAQ#head-b1b1cdedcb9cd9bfd9c994709b4d7e540359b1fd
>
> Bill

Re: Boosting for most recent documents

Posted by Bill Au <bi...@gmail.com>.
With a time stamp you can use a function query to boost the score of newer
documents:
http://wiki.apache.org/solr/SolrRelevancyFAQ#head-b1b1cdedcb9cd9bfd9c994709b4d7e540359b1fd
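
A rough sketch of what such a request might look like, assuming the dismax query parser and its bf (boost function) parameter. The base URL, the field names `user` and `creationDate`, and the helper name are assumptions for illustration; the function string is the recip/rord form that comes up later in this thread.

```python
from urllib.parse import parse_qs, urlencode


def build_boosted_query(base_url, text, query_field, date_field):
    """Build a dismax select URL whose bf parameter favors newer documents."""
    params = {
        "defType": "dismax",
        "q": text,
        "qf": query_field,
        "bf": f"recip(rord({date_field}),1,1000,1000)",
        "rows": 10,
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Assumed base URL and field names; adjust to the actual schema.
url = build_boosted_query("http://localhost:8983/solr", "vivek", "user", "creationDate")
print(url)
print(parse_qs(url.split("?", 1)[1])["bf"])
```

The boost is added to the relevancy score rather than replacing it, so a newer document is favored but not guaranteed to rank first.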

Bill

On Thu, Jul 9, 2009 at 5:58 PM, vivek sar <vi...@gmail.com> wrote:

> How do we sort by internal doc id (say on one index only) using Solr?
> I saw couple of threads saying it (Sort.INDEXORDER) was not supported
> in Solr,
>
>
> http://www.nabble.com/sort-by-index-id-descending--td16124009.html#a16124009
>
> http://www.nabble.com/Reverse-sorting-by-index-order-td1321032.html#a1321032
>
> Has the index order support been added in Solr 1.4? How do we use that
> - any documentation?
>
> Thanks,
> -vivek

Re: Boosting for most recent documents

Posted by vivek sar <vi...@gmail.com>.
How do we sort by internal doc id (say, on one index only) using Solr?
I saw a couple of threads saying it (Sort.INDEXORDER) is not supported
in Solr:

http://www.nabble.com/sort-by-index-id-descending--td16124009.html#a16124009
http://www.nabble.com/Reverse-sorting-by-index-order-td1321032.html#a1321032

Has index order support been added in Solr 1.4? If so, how do we use
it - is there any documentation?

Thanks,
-vivek

On Thu, Jul 9, 2009 at 2:21 PM, Otis
Gospodnetic<ot...@yahoo.com> wrote:
>
> Ah, with multiple indices you can't rely on the max Lucene doc Id.  I think you have to do with the timestamp approach.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: Boosting for most recent documents

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Ah, with multiple indices you can't rely on the max Lucene doc ID.  I think you have to go with the timestamp approach.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: vivek sar <vi...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Thursday, July 9, 2009 1:13:54 PM
> Subject: Re: Boosting for most recent documents
> 
> Thanks Otis. I got a distributed index - using Solr multi-core.
> Basically, I got 6 indexer instances running on 3 different boxes.
> Couple of questions,
> 
> 1)  Is it possible to sort on document id for multiple-shards? How is that done?
> 2) How would boost by most recent doc at index time?
> 
> Thanks,
> -vivek


Re: Boosting for most recent documents

Posted by vivek sar <vi...@gmail.com>.
Thanks Otis. I have a distributed index, using Solr multi-core.
Basically, I have 6 indexer instances running on 3 different boxes.
A couple of questions:

1) Is it possible to sort on document id across multiple shards? How is that done?
2) How would I boost the most recent doc at index time?

Thanks,
-vivek



On Wed, Jul 8, 2009 at 7:47 PM, Otis
Gospodnetic<ot...@yahoo.com> wrote:
>
> Sort by the internal Lucene document ID and pick the highest one.  That might do the job for you.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: Boosting for most recent documents

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Sort by the internal Lucene document ID and pick the highest one.  That might do the job for you.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: vivek sar <vi...@gmail.com>
> To: solr-user <so...@lucene.apache.org>
> Sent: Wednesday, July 8, 2009 8:34:16 PM
> Subject: Boosting for most recent documents
> 
> Hi,
> 
>   I'm trying to find a way to get the most recent entry for the
> searched word. For ex., if I have a document with field name "user".
> If I search for user:vivek, I want to get the document that was
> indexed most recently. Two ways I could think of,
> 
> 1) Sort by some time stamp field - but with millions of documents this
> becomes a huge memory problem as we have seen OOM with sorting before
> 2) Boost the most recent document - I'm not sure how to do this.
> Basically, we want to have the most recent document score higher than
> any other and then we can retrieve just 10 records and sort in the
> application by time stamp field to get the most recent document
> matching the keyword.
> 
> Any suggestion on how can this be done?
> 
> Thanks,
> -vivek