Posted to java-user@lucene.apache.org by ra...@yahoo.com on 2006/10/12 21:04:42 UTC

Avoiding sort by date

Hi folks,

I am using Lucene 2.0

In our application, I am indexing a stream of documents. Each document is fairly small (< 1 KB), but there can be tens of millions of documents. Each document has a Timestamp field. Users can enter free-form searches and a date/time range. They are most interested in the most recent documents (as indicated in the Timestamp field). An obvious way to achieve this is:
searcher = new IndexSearcher(indexDir);
RangeFilter rf = new RangeFilter("day", start, end, true, true);
hits = searcher.search(query,rf,new Sort(new SortField[]{
                    new SortField("timestamp",SortField.STRING,true )}));

Depending on the query, there may be millions of hits. If the same query is executed several times in quick succession, the heap quickly runs out of memory. I suspect that this is because Lucene needs to load all the millions of hits in order to sort the results.

My idea is to avoid the Sort() entirely. Is there a way, during indexing (or by setting Weights inside the query) to automatically set the score for more recent documents higher?

Thanks
--
Solidguy



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Avoiding sort by date

Posted by Graham Stead <gs...@ieee.org>.

Given that you want to score new documents higher (implicitly sorting them),
I wonder whether Solr's FunctionQuery (specifically ReciprocalFloatFunction,
http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/ReciprocalFloatFunction.html)
may also be helpful. It gives newer documents higher scores than older
documents.

I believe ReciprocalFloatFunction uses the document order within the index
to help accomplish this (see ReverseOrdFieldSource), so your code would have
to index new documents after older ones. Usually this is not a problem.
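
A minimal sketch of the reciprocal idea, assuming the boost has the shape
a / (m*x + b), where x is some measure of how old a document is (0 for the
newest). The class RecencyBoost and its parameter choices are hypothetical and
only illustrate the curve; this is not Solr's code.

```java
// Hypothetical helper, not part of Lucene or Solr. It only illustrates the
// shape of a reciprocal boost: a / (m * x + b), where x grows with age.
final class RecencyBoost {
    private final float m;
    private final float a;
    private final float b;

    RecencyBoost(float m, float a, float b) {
        // m >= 0 and b > 0 keep the denominator positive for every x >= 0.
        if (m < 0f || b <= 0f) {
            throw new IllegalArgumentException("need m >= 0 and b > 0");
        }
        this.m = m;
        this.a = a;
        this.b = b;
    }

    /** Returns the boost for age rank x: 0 for the newest document, larger for older ones. */
    float boost(float x) {
        if (x < 0f) {
            throw new IllegalArgumentException("x must be >= 0");
        }
        return a / (m * x + b);
    }
}
```

With m = 1 and a = b = 1000, the newest document gets a boost of 1.0 and a
document at age rank 1000 gets 0.5, so the boost decays smoothly instead of
cutting off.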

In your case, I'm not sure when it's better to use Sort or
ReciprocalFloatFunction. Perhaps someone with more knowledge than I could
advise?

-Graham

> -----Original Message-----
> From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf 
> Of Yonik Seeley
> Sent: Sunday, October 15, 2006 8:32 PM
> To: java-user@lucene.apache.org
> Subject: Re: Avoiding sort by date
> 
> On 10/12/06, rayvittal-lists@yahoo.com 
> <ra...@yahoo.com> wrote:
> > Does the Sort function create some kind of internal cache?
> 
> Yes, it's called the FieldCache, and there is a cache with a 
> weak reference to the index reader as a key.  As long as 
> there is a reference to the index reader (even after close() 
> has been called) the cache data will exist.
> 
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene 
> search server


Re: Avoiding sort by date

Posted by Yonik Seeley <yo...@apache.org>.

On 10/12/06, rayvittal-lists@yahoo.com <ra...@yahoo.com> wrote:
> Does the Sort function create some kind of internal cache?

Yes, it's called the FieldCache, and there is a cache with a weak
reference to the index reader as a key.  As long as there is a
reference to the index reader (even after close() has been called) the
cache data will exist.
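
The behaviour described above can be sketched with a plain WeakHashMap. This is
a simplified stand-in, not Lucene's FieldCache code: ReaderKeyedCache is a
hypothetical class, and a bare Object plays the role of the index reader.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical stand-in for a reader-keyed sort cache; not Lucene's code.
final class ReaderKeyedCache {
    // Keys are held weakly, so an entry can only be dropped once nothing
    // else strongly references the key (here: the reader object).
    private final Map<Object, int[]> cache = new WeakHashMap<>();

    /** Returns the cached per-document array for this reader, building it on first use. */
    synchronized int[] get(Object reader, int maxDoc) {
        int[] values = cache.get(reader);
        if (values == null) {
            values = new int[maxDoc]; // stands in for per-document sort data
            cache.put(reader, values);
        }
        return values;
    }

    /** True while an entry for this reader is still present. */
    synchronized boolean isCached(Object reader) {
        return cache.containsKey(reader);
    }
}
```

As long as the application keeps any strong reference to the reader, the entry
stays reachable and a full GC cannot reclaim it. Dropping every reference to
the reader is what makes the cached arrays collectable.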

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

> Observing the heap, it seems that a full garbage collection after calling
> IndexSearcher.close() still leaves a lot of memory occupied.


Re: Avoiding sort by date

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Oct 12, 2006, at 9:25 PM, <ra...@yahoo.com> wrote:
> Does the Sort function create some kind of internal cache?  
> Observing the heap, it seems that a full garbage collection after  
> calling IndexSearcher.close() still leaves a lot of memory occupied.

Yes, sorting caches, potentially a lot.  I'm not sure what could be  
going on with the excessive memory usage, but certainly the sorting  
caches can be large.
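
For a rough sense of scale, here is a back-of-the-envelope estimate. It assumes
a STRING sort caches one int per document plus one String per distinct field
value for each index reader. The class SortCacheEstimate is hypothetical, and
the per-String overhead and 2 bytes per character are guesses that vary by JVM.

```java
// Hypothetical estimator; the memory model is an assumption, not a measurement.
final class SortCacheEstimate {
    /**
     * Estimates bytes used by a string sort cache:
     * one 4-byte int per document, plus each distinct value stored once.
     */
    static long estimateBytes(long numDocs, long distinctValues,
                              int charsPerValue, int perStringOverheadBytes) {
        long orderBytes = 4L * numDocs; // one ordinal per document
        long lookupBytes = distinctValues * (perStringOverheadBytes + 2L * charsPerValue);
        return orderBytes + lookupBytes;
    }
}
```

Under these assumptions, 20 million documents with unique 17-character
timestamps and 40 bytes of overhead per String come to 1,560,000,000 bytes,
roughly 1.5 GB per reader. Most of that is the distinct values, so a coarser
timestamp (fewer distinct values) shrinks the estimate considerably.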




Re: Avoiding sort by date

Posted by ra...@yahoo.com.

Thanks, Erik for the pointer to Solr.

Since documents are added to the index frequently, we have to create new IndexSearchers anyway. We plan to 'age out' existing IndexSearchers and create new ones every few minutes. Solr's cache regeneration would be useful in this scenario.

Does the Sort function create some kind of internal cache? Observing the heap, it seems that a full garbage collection after calling IndexSearcher.close() still leaves a lot of memory occupied.

Thanks
--
Solidguy

----- Original Message ----
From: Erik Hatcher <er...@ehatchersolutions.com>
To: java-user@lucene.apache.org
Sent: Thursday, October 12, 2006 12:58:50 PM
Subject: Re: Avoiding sort by date

You really should be using the same IndexSearcher for successive  
searches.  Sorting works best when done with a "warm" searcher.  Have  
a look at Solr's warming strategy, and consider adopting that in some  
way.

    Erik



Re: Avoiding sort by date

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

You really should be using the same IndexSearcher for successive  
searches.  Sorting works best when done with a "warm" searcher.  Have  
a look at Solr's warming strategy, and consider adopting that in some  
way.

	Erik
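
One way to apply this advice is sketched below, under stated assumptions:
SearcherHolder is a hypothetical class, the type parameter S stands in for
Lucene's IndexSearcher, and the opener and warmer callbacks are supplied by the
application. It is not Solr's warming implementation.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical holder: shares one searcher across requests and swaps in a
// new one only after it has been warmed. S stands in for IndexSearcher.
final class SearcherHolder<S> {
    private final Supplier<S> opener;  // opens a new searcher on the index
    private final Consumer<S> warmer;  // e.g. runs one sorted query to fill caches
    private volatile S current;

    SearcherHolder(Supplier<S> opener, Consumer<S> warmer) {
        this.opener = opener;
        this.warmer = warmer;
        this.current = openWarm();
    }

    private S openWarm() {
        S searcher = opener.get();
        warmer.accept(searcher); // pay the cache-loading cost before serving traffic
        return searcher;
    }

    /** Every search should use this shared instance instead of opening its own. */
    S get() {
        return current;
    }

    /**
     * Opens and warms a replacement, publishes it, and returns the old
     * searcher so the caller can close it once in-flight searches finish.
     */
    synchronized S refresh() {
        S fresh = openWarm();
        S old = current;
        current = fresh;
        return old;
    }
}
```

Calling refresh() from a timer every few minutes keeps at most two searchers
alive during the swap. Releasing the old searcher and its reader afterwards
matters, since sort caches are tied to the reader.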

