Posted to java-user@lucene.apache.org by Keith Watson <ke...@gmail.com> on 2008/07/03 23:25:51 UTC

Memory Usage

Hello All,

I have something that's not exactly causing me a major problem, but I  
would appreciate help in understanding the behaviour here:

I have an internet message board, and I hope to soon revamp the code
to use Lucene for searching the threads and posts, as it's far
better than the database's fulltext capability. However, one of the
things I want is for a user to be able to request a list of posts
written by user x, ordered newest first (and it's this sorting of the
items by date that is the issue here).

To do this, I store a timestamp in the index alongside each post,
user, etc.

I find that if I use the Java SimpleDateFormat class to encode the  
timestamp like this: yyMMdd (let's not worry about the year 2100  
problem for now!), then I can measure the index cache (which is fully  
loaded, since I need to sort the results) as taking somewhere in the  
region of 30M of memory.

Now, I noticed that if I index like the above, I won't get the
correct sort order for posts made on the same day, so I changed it to
index yyMMddHHmmss, down to the second rather than just the day. I
didn't pay much attention to memory usage until I started getting
out-of-heap-space errors... When I looked into the usage I found:

(there are around 6,000,000 posts on the message board database)

Date encoded as yyMMdd: appears to be using around 30M
Date encoded as yyMMddHHmmss:  appears to be using more than 400M!

I guess I would have understood if I was seeing the usage double for  
sure, or even a little more; no idea how you guys encode the indexes,  
if at all, but it's gone up over tenfold, which I can't explain.

For now, I have just moved it back to do it on a per day basis, as  
it's not a huge deal, but can anyone help with this? Is there  
something I might be doing wrong? That's all I changed between the two  
runs, and it certainly seems to be repeatable. I tried upgrading from  
the previous version of Lucene to the latest one, but no difference.

Many thanks,

Keith.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Memory Usage

Posted by Keith Watson <ke...@gmail.com>.
Thanks very much for this; I'll give it a shot.

Keith.


On 4 Jul 2008, at 00:02, Paul Smith wrote:

>>
>>
>> (there are around 6,000,000 posts on the message board database)
>>
>> Date encoded as yyMMdd: appears to be using around 30M
>> Date encoded as yyMMddHHmmss:  appears to be using more than 400M!
>>
>> I guess I would have understood if I was seeing the usage double  
>> for sure, or even a little more; no idea how you guys encode the  
>> indexes, if at all, but it's gone up over tenfold, which I can't  
>> explain.
>
> Sort memory cost is based on the total # of unique terms for the
> given field (multiplied by the number of locales involved, if you
> have to do that too; for temporal sorting you don't).
>
> This is easier to fix than you might think: use 2 fields (date,
> time) and sort by both.  The Date field's unique term count then
> grows by only 1 term per day.  The Time field can be truncated to
> minutes (if you can get away with that), so it has at most 1,440
> unique terms, an insignificant total.  We use this at Aconex and
> have indexes with millions of records (a weekly 'work' searcher
> refreshed every 5 seconds, an archive searcher held in memory, and a
> MultiSearcher over the 2), and it works a treat.  We regularly need
> to return million+ results from a search (don't ask) using this sort
> of sorting, and the overall search time is only a few seconds.
>
> On a related note, work hard to avoid Locale-sensitive sorting on
> any other fields if you can: for large result sets the CPU penalty
> is horrific (even once you get past the synchronization bottleneck
> in the CollationKey stuff).
>
> cheers,
>
> Paul Smith




Re: Memory Usage

Posted by Paul Smith <ps...@aconex.com>.
>
>
> (there are around 6,000,000 posts on the message board database)
>
> Date encoded as yyMMdd: appears to be using around 30M
> Date encoded as yyMMddHHmmss:  appears to be using more than 400M!
>
> I guess I would have understood if I was seeing the usage double for  
> sure, or even a little more; no idea how you guys encode the  
> indexes, if at all, but it's gone up over tenfold, which I can't  
> explain.

Sort memory cost is based on the total # of unique terms for the given
field (multiplied by the number of locales involved, if you have to do
that too; for temporal sorting you don't).
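To see why the finer encoding hurts, it can help to count the distinct
terms each pattern produces. The following is only a rough sketch in
plain Java, not Lucene's sort cache: the class name, the helper methods
and the randomly generated timestamps are all assumptions made for
illustration, and the sample is much smaller than the 6,000,000 posts
mentioned above.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.TimeZone;

// Hypothetical helper class: counts how many distinct terms a date pattern yields.
class UniqueTermCount {

    /** Formats every timestamp with the pattern and counts the distinct strings. */
    static int countUniqueTerms(long[] timestampsMillis, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // fixed zone for repeatable results
        Set<String> terms = new HashSet<String>();
        for (long ts : timestampsMillis) {
            terms.add(format.format(new Date(ts)));
        }
        return terms.size();
    }

    /** Synthetic data (an assumption): 'count' timestamps spread randomly over 'days' days. */
    static long[] randomTimestamps(int count, int days, long seed) {
        Random random = new Random(seed);
        long start = 1199145600000L; // 2008-01-01T00:00:00Z
        long spanMillis = days * 24L * 60 * 60 * 1000;
        long[] result = new long[count];
        for (int i = 0; i < count; i++) {
            result[i] = start + (long) (random.nextDouble() * spanMillis);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] posts = randomTimestamps(200000, 365, 42L);
        System.out.println("yyMMdd:       " + countUniqueTerms(posts, "yyMMdd"));
        System.out.println("yyMMddHHmmss: " + countUniqueTerms(posts, "yyMMddHHmmss"));
        System.out.println("HHmm:         " + countUniqueTerms(posts, "HHmm"));
    }
}
```

The day-level count is bounded by the number of days covered, while the
second-level count stays close to the number of posts, which is the
growth in unique terms described above.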

This is easier to fix than you might think: use 2 fields (date, time)
and sort by both.  The Date field's unique term count then grows by
only 1 term per day.  The Time field can be truncated to minutes (if
you can get away with that), so it has at most 1,440 unique terms, an
insignificant total.  We use this at Aconex and have indexes with
millions of records (a weekly 'work' searcher refreshed every 5
seconds, an archive searcher held in memory, and a MultiSearcher over
the 2), and it works a treat.  We regularly need to return million+
results from a search (don't ask) using this sort of sorting, and the
overall search time is only a few seconds.
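The ordering that a two-field sort produces can be sketched in plain
Java as below. This is an illustration only, not Lucene code: the Post
class, its field names and the yyMMdd / HHmm encodings are assumptions.
In Lucene itself the equivalent would presumably be a Sort built from
two reversed sort fields, one per indexed field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of sorting on a (date, time) pair of keys.
class TwoFieldSort {

    /** A post whose timestamp is split into two low-cardinality string keys. */
    static final class Post {
        final String id;
        final String date; // assumed encoding: yyMMdd
        final String time; // assumed encoding: HHmm

        Post(String id, String date, String time) {
            this.id = id;
            this.date = date;
            this.time = time;
        }
    }

    /** Newest first: compare by date, break ties by time, then reverse the whole order. */
    static final Comparator<Post> NEWEST_FIRST =
            Comparator.comparing((Post p) -> p.date)
                      .thenComparing(p -> p.time)
                      .reversed();

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<Post> sortNewestFirst(List<Post> posts) {
        List<Post> sorted = new ArrayList<Post>(posts);
        Collections.sort(sorted, NEWEST_FIRST);
        return sorted;
    }

    public static void main(String[] args) {
        List<Post> posts = new ArrayList<Post>();
        posts.add(new Post("a", "080703", "2325"));
        posts.add(new Post("b", "080704", "0002"));
        posts.add(new Post("c", "080703", "0915"));
        for (Post p : sortNewestFirst(posts)) {
            System.out.println(p.id + " " + p.date + " " + p.time);
        }
    }
}
```

Because both keys are fixed-width, zero-padded strings, plain string
comparison gives the same order as comparing the underlying timestamps
at minute resolution.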

On a related note, work hard to avoid Locale-sensitive sorting on any
other fields if you can: for large result sets the CPU penalty is
horrific (even once you get past the synchronization bottleneck in the
CollationKey stuff).
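For anyone unsure what the distinction means in practice, here is a
small sketch using only the JDK (the class and method names are
hypothetical, and it says nothing about how Lucene implements either
kind of sort). Plain string comparison orders by character code, while
a java.text.Collator applies locale rules, which is the more expensive
path.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration of binary ordering versus locale-sensitive ordering.
class LocaleSortSketch {

    /** Sorts a copy by raw character codes, so all uppercase letters come before lowercase. */
    static String[] sortBinary(String[] values) {
        String[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** Sorts a copy using the collation rules of the given locale. */
    static String[] sortWithCollator(String[] values, Locale locale) {
        String[] copy = values.clone();
        Collator collator = Collator.getInstance(locale);
        Arrays.sort(copy, collator);
        return copy;
    }

    public static void main(String[] args) {
        String[] names = {"banana", "Apple", "apple", "Banana"};
        System.out.println(Arrays.toString(sortBinary(names)));
        System.out.println(Arrays.toString(sortWithCollator(names, Locale.ENGLISH)));
    }
}
```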

cheers,

Paul Smith
