Posted to java-user@lucene.apache.org by Doron Yaacoby <do...@gingersoftware.com> on 2012/07/15 10:41:24 UTC

In memory Lucene configuration

Hi, I have the following situation:

I have two pretty large indices. One consists of about 1 billion documents (takes ~6GB on disk) and the other has about 2 billion documents (~10GB on disk). The documents are very short (4-5 terms each in the text field, and one numeric field with a long value). This is a read-only index: I'm only going to read from it and never write. There is only one segment in each index (at least there should be; I called forceMerge(1) on them).

Search latency is the most important thing to me. I need it to be blazing fast, ~20ms per query. Queries are always of the type +term1 +term2 +term3, and I'm asking for 10 results from each index (searching is done simultaneously on both indices).
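
To make the query shape concrete: a query of the form +term1 +term2 +term3 matches only documents that contain every term, so conceptually it is an intersection of the sorted document-ID lists of the three terms. The sketch below illustrates that idea in plain Java. It is not Lucene's implementation, and the class name, method name, and sample posting lists are made up for illustration.

```java
import java.util.Arrays;

final class ConjunctionSketch {

    /**
     * Intersects sorted, duplicate-free doc-ID lists and returns at most
     * `limit` matching doc IDs, in increasing doc-ID order.
     */
    static int[] intersect(int[][] postings, int limit) {
        if (postings.length == 0 || limit <= 0) {
            return new int[0];
        }
        int[] positions = new int[postings.length]; // one read cursor per list
        int[] hits = new int[limit];
        int count = 0;
        int[] first = postings[0];

        outer:
        for (int i = 0; i < first.length && count < limit; i++) {
            int doc = first[i];
            for (int j = 1; j < postings.length; j++) {
                int[] list = postings[j];
                int p = positions[j];
                // Advance this list's cursor to the first entry >= doc.
                while (p < list.length && list[p] < doc) {
                    p++;
                }
                positions[j] = p;
                if (p == list.length) {
                    break outer;      // list exhausted: no further matches possible
                }
                if (list[p] != doc) {
                    continue outer;   // doc is missing from this list
                }
            }
            hits[count++] = doc;      // doc appeared in every list
        }
        return Arrays.copyOf(hits, count);
    }

    public static void main(String[] args) {
        int[][] postings = {
            {1, 3, 5, 7, 9},   // hypothetical postings for term1
            {3, 4, 5, 9},      // hypothetical postings for term2
            {0, 3, 9, 10}      // hypothetical postings for term3
        };
        System.out.println(Arrays.toString(intersect(postings, 10)));
    }
}
```

Note that this sketch returns the first matches in doc-ID order, whereas a real search returns the 10 best-scoring matches.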

I have a fast server (12 cores at 3 GHz each) with 32 GB RAM (running Linux), and I can keep both indices in memory when using a RAMDirectory. This didn't achieve the expected result (average query time = ~43ms). I'm seeing latency spikes, where the same query is sometimes answered in 10ms but on a different occasion takes 2-3 seconds. I'm guessing this is due to GC (as explained here <http://lucene.472066.n3.nabble.com/Plans-to-remove-RAMDirectory-td3601156.html>). Using a warmed-up MMapDirectory didn't help; the average query time was a bit slower. I tried using InstantiatedIndex, but its memory consumption is huge; I couldn't even load the smaller 6GB index.

Any ideas about what could be the ideal configuration for me?
Thanks.

RE: In memory Lucene configuration

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

just to clarify:

> In addition, I don't think loading the whole index into memory is a good idea, since
> the index size will always increase.
> In my case, I changed the Lucene code to disable MMapDirectory, since the index keeps
> growing.
> And MMapDirectory will call something like C++ shared memory to load the whole
> index into RAM.

Please read:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: In memory Lucene configuration

Posted by Doron Yaacoby <do...@gingersoftware.com>.

Thanks for the input.
I am not using Solr.
Also, my index has a fixed size, I am not going to update it.

Re: In memory Lucene configuration

Posted by googoo <li...@gmail.com>.

Doron,

To verify the actual query speed, I think you may need to:
1) stop any indexing job
2) in solrconfig.xml, set the filterCache and queryResultCache values to 0
3) restart Solr
4) run the query and check the QTime result

That may give you some idea of the actual query time.

To break down the query time, you can run the field1, field2, and field3 queries
separately to see which field's query takes longer.

In addition, I don't think loading the whole index into memory is a good idea, since
the index size will always increase.
In my case, I changed the Lucene code to disable MMapDirectory, since the index keeps
growing.
And MMapDirectory will call something like C++ shared memory to load the whole
index into RAM.
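
As background for the last point, here is a minimal sketch of what memory-mapping a file looks like at the JDK level. Mapping reserves address space rather than copying the file onto the Java heap, and the operating system pages data in only when it is touched. MMapDirectory builds on this same JDK facility (FileChannel.map). The class name, method names, and the temporary file below are illustrative only; this is plain java.nio, not Lucene code.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MmapSketch {

    /** Maps the whole file read-only and returns the byte at the given offset. */
    static byte readByteViaMmap(Path file, int offset) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // map() reserves virtual address space; no file-sized heap array is allocated.
            // A single mapping is limited to Integer.MAX_VALUE bytes, so a large
            // file has to be mapped in several chunks.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // The OS pages this region in from the file on first access.
            return buffer.get(offset);
        }
    }

    /** Writes a tiny temporary file and reads one byte back through a mapping. */
    static byte demo() {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, new byte[] {10, 20, 30, 40});
            return readByteViaMmap(file, 2);
        } catch (IOException e) {
            throw new IllegalStateException("demo failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```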





RE: In memory Lucene configuration

Posted by Doron Yaacoby <do...@gingersoftware.com>.

Another interesting fact I just found out.
Up until now I measured query execution time via my application. That is, the application would log each query it sends to Lucene and the time it takes to run it. The nature of my application is that there will be a variable number of Lucene queries per second (2-3 usually, but could be more or less), so there isn't constant 'pressure' on Lucene.
I now created a new test which runs the same queries but independently of my application. This achieved much better results: MMap implementation ~17ms, and RAMDirectory ~19ms. Moreover, the results are now reproducible, meaning there aren't any spikes in the query times. When running through my application scenario, I got the occasional spike, where a query took 2-3 seconds. In the MMap case, I guess it could be that the OS sees some caches as unused for a while and reclaims them? I can't really explain this phenomenon in the RAMDirectory case.

I'm currently trying to recreate this by sleeping a random time before each query, but still without success. Will update...
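
A minimal sketch of the kind of test described here: pause for a random interval before each query and record how long the query takes, so that idle gaps between queries resemble the application's load pattern. The Runnable stands in for the real Lucene search call; it, the class name, and the pause bound are assumptions for illustration.

```java
import java.util.Random;

final class PacedLatencyTest {

    /**
     * Runs `task` `iterations` times, sleeping a random 0..maxPauseMillis
     * before each run, and returns the elapsed nanoseconds of each run.
     */
    static long[] measure(Runnable task, int iterations, int maxPauseMillis, long seed) {
        Random random = new Random(seed); // fixed seed makes the pause pattern repeatable
        long[] nanos = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            try {
                Thread.sleep(random.nextInt(maxPauseMillis + 1));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while pausing", e);
            }
            long start = System.nanoTime();
            task.run();                    // only the task itself is timed
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) {
        Runnable fakeQuery = new Runnable() {
            public void run() {
                // Placeholder: call the real search here.
            }
        };
        long[] nanos = measure(fakeQuery, 20, 50, 42L);
        long worst = 0;
        for (long n : nanos) {
            worst = Math.max(worst, n);
        }
        System.out.println("worst latency (ms): " + worst / 1000000.0);
    }
}
```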

-----Original Message-----
From: Doron Yaacoby [mailto:dorony@gingersoftware.com] 
Sent: 15 July 2012 13:40
To: java-user@lucene.apache.org; simon.willnauer@gmail.com
Subject: RE: In memory Lucene configuration

Thanks for the quick input!
I ran a few more tests with your suggested configuration (-Xmx1G -Xms1G with MMapDirectory). The third time I ran the same test I finally got an improvement: an average of ~30ms per query, although it's still not as fast as I need it to be.
The test contains about 2200 different queries (well, some are repeated twice or thrice), and includes search time and doc loading (reading the two fields I mentioned). The queries are all straight boolean conjunctions, and yes, I am dropping the first few queries when calculating averages.
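
Since an average can hide occasional multi-second spikes, it may help to report a high percentile and the maximum alongside the mean. The sketch below drops a warm-up prefix and summarizes the remaining samples. The class name, the nearest-rank percentile choice, and the sample numbers in main are illustrative assumptions, not measurements from this thread.

```java
import java.util.Arrays;

final class LatencySummary {
    final double meanMillis;
    final double p99Millis;
    final double maxMillis;

    private LatencySummary(double meanMillis, double p99Millis, double maxMillis) {
        this.meanMillis = meanMillis;
        this.p99Millis = p99Millis;
        this.maxMillis = maxMillis;
    }

    /** Summarizes the samples after skipping the first `warmupToSkip` of them. */
    static LatencySummary of(double[] samplesMillis, int warmupToSkip) {
        if (warmupToSkip < 0 || warmupToSkip >= samplesMillis.length) {
            throw new IllegalArgumentException("need at least one sample after warm-up");
        }
        double[] kept = Arrays.copyOfRange(samplesMillis, warmupToSkip, samplesMillis.length);
        Arrays.sort(kept);
        double sum = 0;
        for (double s : kept) {
            sum += s;
        }
        // Nearest-rank 99th percentile: rank = ceil(0.99 * n), computed in integers.
        int rank = (int) ((99L * kept.length + 99) / 100);
        return new LatencySummary(sum / kept.length, kept[rank - 1], kept[kept.length - 1]);
    }

    public static void main(String[] args) {
        // Hypothetical per-query timings in ms; the first two are treated as warm-up.
        double[] samples = {500, 400, 10, 20, 30, 2000};
        LatencySummary s = of(samples, 2);
        System.out.println("mean=" + s.meanMillis + " p99=" + s.p99Millis + " max=" + s.maxMillis);
    }
}
```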

BTW, didn't mention before that I'm using Lucene 3.5 and Java 1.7.

-----Original Message-----
From: Simon Willnauer [mailto:simon.willnauer@gmail.com] 
Sent: 15 July 2012 11:56
To: java-user@lucene.apache.org
Subject: Re: In memory Lucene configuration

hey there,

It's very hard to believe that you can't get this returning results faster, though. I'd definitely recommend MMapDirectory here, or NIOFSDirectory should do too. When you measure this, do you measure a large number of different queries or just a handful? Do you discard the first queries until caches are warmed up? What are you measuring: pure search time, or search time including doc loading?
If you use MMapDirectory, how much memory do you grant to your JVM? I'd recommend summing up the term index file size (.tii) and the norms file size (.nrm) and giving the JVM something like 3x that size as -Xmx and -Xms, provided you don't need any more memory elsewhere. A guess from the given index is that -Xmx1G -Xms1G should do the job; let the filesystem cache use the rest (that is important for Lucene if you use MMapDirectory / NIOFSDirectory).
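
The sizing rule above can be sketched as a small calculation: add up the sizes of the .tii and .nrm files in the index directory and multiply by three. The class and method names are made up for illustration, and the factor of 3 is simply the rule of thumb from this message, not a guarantee.

```java
import java.io.File;

final class HeapSizeEstimate {

    /** Sums the sizes of files ending in .tii or .nrm directly inside indexDir. */
    static long termIndexAndNormBytes(File indexDir) {
        File[] files = indexDir.listFiles();
        if (files == null) {
            throw new IllegalArgumentException("not a readable directory: " + indexDir);
        }
        long total = 0;
        for (File f : files) {
            String name = f.getName();
            if (f.isFile() && (name.endsWith(".tii") || name.endsWith(".nrm"))) {
                total += f.length();
            }
        }
        return total;
    }

    /** Applies the 3x rule of thumb to the summed file sizes. */
    static long suggestedHeapBytes(long tiiPlusNrmBytes) {
        return 3 * tiiPlusNrmBytes;
    }

    public static void main(String[] args) {
        // Pass the index directory as the first argument; defaults to the current directory.
        File dir = new File(args.length > 0 ? args[0] : ".");
        long bytes = termIndexAndNormBytes(dir);
        System.out.println(".tii + .nrm bytes: " + bytes);
        System.out.println("suggested heap (MB): " + suggestedHeapBytes(bytes) / (1024 * 1024));
    }
}
```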

Are your queries straight boolean conjunctions, or do you use positions, i.e. phrase queries or spans?

simon

Re: In memory Lucene configuration

Posted by Simon Willnauer <si...@gmail.com>.

Your spikes could be due to garbage collection. Since you are on Java
1.7, you could try this command line (blind shot):

  java -server
  -Xms1G
  -Xmx1G
  -Xss128k
  -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC
  -XX:CMSInitiatingOccupancyFraction=75
  -XX:+UseCMSInitiatingOccupancyOnly


Or maybe try the new G1 collector, although it is usually only useful for larger heaps:

  java -server
  -Xms1G
  -Xmx1G
  -Xss128k
  -XX:+UseG1GC

simon
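
To check whether the spikes actually line up with garbage collection before switching collectors, the JVM's management beans can report cumulative collection time. A minimal sketch, assuming you sample the value before and after a slow query and compare; the class and method names are illustrative, and the allocation loop in main is only a stand-in workload.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcProbe {

    /** Total milliseconds spent in GC so far, summed over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcMillis();

        // Stand-in workload that produces garbage; replace with a real query.
        byte[][] garbage = new byte[1024][];
        for (int i = 0; i < 100000; i++) {
            garbage[i % garbage.length] = new byte[16 * 1024];
        }

        long after = totalGcMillis();
        System.out.println("GC ms during workload: " + (after - before));
    }
}
```

Alternatively, adding -verbose:gc to the command line logs each collection, which can then be compared against the timestamps of the slow queries.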




On Mon, Jul 16, 2012 at 8:43 AM, Doron Yaacoby
<do...@gingersoftware.com> wrote:
> I haven't tried that yet, but it's an option. The reason I'm waiting on this is that I am expecting many concurrent requests to my application anyway, so having multiple search threads per request might not be the best idea in production.
>
> -----Original Message-----
> From: Vitaly Funstein [mailto:vfunstein@gmail.com]
> Sent: 16 July 2012 08:26
> To: java-user@lucene.apache.org
> Subject: Re: In memory Lucene configuration
>
> Have you tried sharding your data? Since you have a fast multi-core box, why not split your indices N-ways, say the smaller one into 4, and the larger into 8. Then you can have a pool of dedicated search threads, executing the same query against separate physical indices within each "logical" one in parallel, then put the results together in the calling thread. Yes, it's more code to write and test in the app layer, but it may turn out to be well worth it. Due to GC overhead and poor synchronization characteristics, RAMDirectory is definitely not the way to go at this scale, as you probably already suspect.
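
A minimal sketch of the sharding idea described above: submit the same query to every shard on a fixed thread pool, then merge the per-shard hits into one top-N list in the calling thread. The Shard interface, the Hit class, and all other names here are hypothetical stand-ins for per-shard searcher calls; none of this is Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class ShardedSearch {

    /** One search hit: which shard it came from, its doc ID there, and its score. */
    static final class Hit {
        final int shard;
        final int doc;
        final float score;

        Hit(int shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    /** Hypothetical per-shard search; a real one would wrap a searcher on one physical index. */
    interface Shard {
        List<Hit> search(String query, int topN);
    }

    /** Runs the query on all shards in parallel and returns the overall best topN hits. */
    static List<Hit> searchAll(ExecutorService pool, List<Shard> shards,
                               final String query, final int topN) {
        List<Future<List<Hit>>> futures = new ArrayList<Future<List<Hit>>>();
        for (final Shard shard : shards) {
            futures.add(pool.submit(new Callable<List<Hit>>() {
                public List<Hit> call() {
                    return shard.search(query, topN);
                }
            }));
        }

        // Collect results in the calling thread.
        List<Hit> merged = new ArrayList<Hit>();
        for (Future<List<Hit>> future : futures) {
            try {
                merged.addAll(future.get());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a shard", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("shard search failed", e.getCause());
            }
        }

        // Highest score first, then keep only the best topN across all shards.
        Collections.sort(merged, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score);
            }
        });
        return new ArrayList<Hit>(merged.subList(0, Math.min(topN, merged.size())));
    }
}
```

Each shard is asked for topN hits because, in the worst case, all of the overall best hits live in a single shard.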

RE: In memory Lucene configuration

Posted by Doron Yaacoby <do...@gingersoftware.com>.

I had a threading issue in the client code calling Lucene, really nothing that has anything to do with this list :)

-----Original Message-----
From: Simon Willnauer [mailto:simon.willnauer@gmail.com] 
Sent: 18 July 2012 21:48
To: java-user@lucene.apache.org
Subject: Re: In memory Lucene configuration

doron, enlighten me please!

On Wed, Jul 18, 2012 at 1:32 PM, Doron Yaacoby <do...@gingersoftware.com> wrote:
> Glad to announce the problem was on my side, and had nothing to do with Lucene. Indeed, looks like that MMapDirectory is the best choice for me.
>
> Thanks again.

Re: In memory Lucene configuration

Posted by Simon Willnauer <si...@gmail.com>.

doron, enlighten me please!

On Wed, Jul 18, 2012 at 1:32 PM, Doron Yaacoby
<do...@gingersoftware.com> wrote:
> Glad to announce the problem was on my side, and had nothing to do with Lucene. Indeed, looks like that MMapDirectory is the best choice for me.
>
> Thanks again.

RE: In memory Lucene configuration

Posted by Doron Yaacoby <do...@gingersoftware.com>.

Glad to announce the problem was on my side, and had nothing to do with Lucene. Indeed, it looks like MMapDirectory is the best choice for me.

Thanks again.

-----Original Message-----
From: Doron Yaacoby [mailto:dorony@gingersoftware.com] 
Sent: 16 July 2012 09:43
To: java-user@lucene.apache.org
Subject: RE: In memory Lucene configuration

I haven't tried that yet, but it's an option. The reason I'm waiting on this is that I am expecting many concurrent requests to my application anyway, so having multiple search threads per request might not be the best idea in production.

-----Original Message-----
From: Vitaly Funstein [mailto:vfunstein@gmail.com]
Sent: 16 July 2012 08:26
To: java-user@lucene.apache.org
Subject: Re: In memory Lucene configuration

Have you tried sharding your data? Since you have a fast multi-core box, why not split your indices N-ways, say the smaller one into 4, and the larger into 8. Then you can have a pool of dedicated search threads, executing the same query against separate physical indices within each "logical" one in parallel, then put the results together in the calling thread. Yes, it's more code to write and test in the app layer, but it may turn out to be well worth it. Due to GC overhead and poor synchronization characteristics, RAMDirectory is definitely not the way to go at this scale, as you probably already suspect.

On Sun, Jul 15, 2012 at 3:40 AM, Doron Yaacoby <do...@gingersoftware.com> wrote:
> Thanks for the quick input!
> I ran a few more tests with your suggested configuration (-Xmx1G -Xms1G with MMapDirectory). At the third time I ran the same test I finally got an improvement - an average of ~30ms per query, although it's still not as fast as I need it to be.
> The test contains about 2200 different queries (well, some are repeated twice or thrice), and includes search time and doc loading (reading the two fields I mentioned). The queries are all straight boolean conjunctions, and yes, I am dropping the first few queries when calculating averages.
>
> BTW, didn't mention before that I'm using Lucene 3.5 and Java 1.7.
>
> -----Original Message-----
> From: Simon Willnauer [mailto:simon.willnauer@gmail.com]
> Sent: 15 July 2012 11:56
> To: java-user@lucene.apache.org
> Subject: Re: In memory Lucene configuration
>
> hey there,
>
> On Sun, Jul 15, 2012 at 10:41 AM, Doron Yaacoby <do...@gingersoftware.com> wrote:
>> Hi, I have the following situation:
>>
>> I have two pretty large indices. One consists of about 1 billion documents (takes ~6GB on disk) and the other has about 2 billion documents (~10GB on disk). The documents are very short (4-5 terms each in the text field, and one numeric field with a long value). This is a read only index - I'm only going to read from it and never write. There is only one segment in each index (At least there should be, I called forceMerge(1) on them).
>>
>> Search latency is the most important thing to me. I need it to be blazing fast, ~20ms per query. Queries are always of the type +term1 +term2 +term3, and I'm asking for 10 results from each index (searching is done simultaneously on both indices).
>>
>> I have a fast server (12 cores@3GHz each) with 32Gb RAM (running Linux) and I can keep both indices in-memory when using a RAMDirectory. This didn't achieve the expected result (average query time = ~43ms). I'm seeing latency spikes, where the same query is sometimes answered in 10ms, but in a different occasion takes 2-3 seconds. I'm guessing this is due to GC (as explained here<http://lucene.472066.n3.nabble.com/Plans-to-remove-RAMDirectory-td3601156.html>). Using a warmed up MMapDirectory didn't help; the average query time was a bit slower. I tried using InstantiatedIndex, but it has a huge memory consumption, I couldn't even load the smaller 6GB index.
>
> its very hard to believe that you can't get this returning results faster though. I'd definitely recommend you MMapDirectory here or NIO should do too. When you measure this do you measure a large number of different queries or just a handful? Do you discard the first queries until caches are warmed up? What are you measuring, pure search time including doc loading?
> If you use MMapDir how much memory do you grant to your JVM? I'd 
> recommend you to sum up the term dictionary file size (.tii) and the 
> norm file size (nrm) and give the JVM something like 3x the size as 
> Xmx and Xms provided you don't need any more memory elsewhere. A guess 
> from the given index is that Xmx1G Xms1G should do the job and let the 
> Filesystem use the rest (that is important for lucene if you use MMap 
> / NIOFS)
>
> Your queries are straight boolean conjunctions or do you use positions ie phrase queries or spans?
>
> simon
>>
>> Any ideas about what could be the ideal configuration for me?
>> Thanks.
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


RE: In memory Lucene configuration

Posted by Doron Yaacoby <do...@gingersoftware.com>.

I haven't tried that yet, but it's an option. I'm holding off because I expect many concurrent requests to my application anyway, so using multiple search threads per request might not be the best idea in production.

-----Original Message-----
From: Vitaly Funstein [mailto:vfunstein@gmail.com] 
Sent: 16 July 2012 08:26
To: java-user@lucene.apache.org
Subject: Re: In memory Lucene configuration

Have you tried sharding your data? Since you have a fast multi-core box, why not split your indices N-ways, say the smaller one into 4, and the larger into 8. Then you can have a pool of dedicated search threads, executing the same query against separate physical indices within each "logical" one in parallel, then put the results together in the calling thread. Yes, it's more code to write and test in the app layer, but it may turn out to be well worth it. Due to GC overhead and poor synchronization characteristics, RAMDirectory is definitely not the way to go at this scale, as you probably already suspect.

On Sun, Jul 15, 2012 at 3:40 AM, Doron Yaacoby <do...@gingersoftware.com> wrote:
> Thanks for the quick input!
> I ran a few more tests with your suggested configuration (-Xmx1G -Xms1G with MMapDirectory). At the third time I ran the same test I finally got an improvement - an average of ~30ms per query, although it's still not as fast as I need it to be.
> The test contains about 2200 different queries (well, some are repeated twice or thrice), and includes search time and doc loading (reading the two fields I mentioned). The queries are all straight boolean conjunctions, and yes, I am dropping the first few queries when calculating averages.
>
> BTW, didn't mention before that I'm using Lucene 3.5 and Java 1.7.
>
> -----Original Message-----
> From: Simon Willnauer [mailto:simon.willnauer@gmail.com]
> Sent: 15 July 2012 11:56
> To: java-user@lucene.apache.org
> Subject: Re: In memory Lucene configuration
>
> hey there,
>
> On Sun, Jul 15, 2012 at 10:41 AM, Doron Yaacoby <do...@gingersoftware.com> wrote:
>> Hi, I have the following situation:
>>
>> I have two pretty large indices. One consists of about 1 billion documents (takes ~6GB on disk) and the other has about 2 billion documents (~10GB on disk). The documents are very short (4-5 terms each in the text field, and one numeric field with a long value). This is a read only index - I'm only going to read from it and never write. There is only one segment in each index (At least there should be, I called forceMerge(1) on them).
>>
>> Search latency is the most important thing to me. I need it to be blazing fast, ~20ms per query. Queries are always of the type +term1 +term2 +term3, and I'm asking for 10 results from each index (searching is done simultaneously on both indices).
>>
>> I have a fast server (12 cores@3GHz each) with 32Gb RAM (running Linux) and I can keep both indices in-memory when using a RAMDirectory. This didn't achieve the expected result (average query time = ~43ms). I'm seeing latency spikes, where the same query is sometimes answered in 10ms, but in a different occasion takes 2-3 seconds. I'm guessing this is due to GC (as explained here<http://lucene.472066.n3.nabble.com/Plans-to-remove-RAMDirectory-td3601156.html>). Using a warmed up MMapDirectory didn't help; the average query time was a bit slower. I tried using InstantiatedIndex, but it has a huge memory consumption, I couldn't even load the smaller 6GB index.
>
> its very hard to believe that you can't get this returning results faster though. I'd definitely recommend you MMapDirectory here or NIO should do too. When you measure this do you measure a large number of different queries or just a handful? Do you discard the first queries until caches are warmed up? What are you measuring, pure search time including doc loading?
> If you use MMapDir how much memory do you grant to your JVM? I'd 
> recommend you to sum up the term dictionary file size (.tii) and the 
> norm file size (nrm) and give the JVM something like 3x the size as 
> Xmx and Xms provided you don't need any more memory elsewhere. A guess 
> from the given index is that Xmx1G Xms1G should do the job and let the 
> Filesystem use the rest (that is important for lucene if you use MMap 
> / NIOFS)
>
> Your queries are straight boolean conjunctions or do you use positions ie phrase queries or spans?
>
> simon
>>
>> Any ideas about what could be the ideal configuration for me?
>> Thanks.
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


Re: In memory Lucene configuration

Posted by Vitaly Funstein <vf...@gmail.com>.

Have you tried sharding your data? Since you have a fast multi-core
box, why not split your indices N ways, say the smaller one into 4
and the larger into 8? Then you can have a pool of dedicated search
threads executing the same query in parallel against the separate
physical indices within each "logical" one, and merge the results
in the calling thread. Yes, it's more code to write and test in the
app layer, but it may turn out to be well worth it. Due to GC overhead
and poor synchronization characteristics, RAMDirectory is definitely
not the way to go at this scale, as you probably already suspect.
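
A rough sketch of the fan-out and merge described above, in plain Java with no Lucene dependency. Every name here (ShardedSearch, Shard, Hit) is hypothetical; a real Shard would presumably wrap one IndexSearcher per physical index, and this is only one possible shape for the app-layer code, not a tested design.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical names throughout; none of this is Lucene API.
class ShardedSearch {

    /** One scored hit; 'value' stands in for the stored long field. */
    static final class Hit {
        final long value;
        final float score;
        Hit(long value, float score) { this.value = value; this.score = score; }
    }

    /** One physical shard of a "logical" index (assumed to be thread-safe). */
    interface Shard {
        List<Hit> search(String query, int topN);
    }

    /** Runs the query on every shard in parallel, then merges in the calling thread. */
    static List<Hit> search(List<Shard> shards, ExecutorService pool,
                            final String query, final int topN) {
        List<Callable<List<Hit>>> tasks = new ArrayList<Callable<List<Hit>>>();
        for (final Shard shard : shards) {
            tasks.add(new Callable<List<Hit>>() {
                @Override public List<Hit> call() { return shard.search(query, topN); }
            });
        }
        List<Hit> merged = new ArrayList<Hit>();
        try {
            // invokeAll blocks until every shard has answered.
            for (Future<List<Hit>> f : pool.invokeAll(tasks)) {
                merged.addAll(f.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while searching shards", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("shard search failed", e.getCause());
        }
        // Highest score first, then keep only the global top N.
        Collections.sort(merged, new Comparator<Hit>() {
            @Override public int compare(Hit a, Hit b) { return Float.compare(b.score, a.score); }
        });
        return new ArrayList<Hit>(merged.subList(0, Math.min(topN, merged.size())));
    }
}
```

Each shard is asked for the full top N because the global top N could, in the worst case, all come from a single shard. The pool is passed in rather than created per call so that it can be shared and sized once for the whole application.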

On Sun, Jul 15, 2012 at 3:40 AM, Doron Yaacoby
<do...@gingersoftware.com> wrote:
> Thanks for the quick input!
> I ran a few more tests with your suggested configuration (-Xmx1G -Xms1G with MMapDirectory). At the third time I ran the same test I finally got an improvement - an average of ~30ms per query, although it's still not as fast as I need it to be.
> The test contains about 2200 different queries (well, some are repeated twice or thrice), and includes search time and doc loading (reading the two fields I mentioned). The queries are all straight boolean conjunctions, and yes, I am dropping the first few queries when calculating averages.
>
> BTW, didn't mention before that I'm using Lucene 3.5 and Java 1.7.
>
> -----Original Message-----
> From: Simon Willnauer [mailto:simon.willnauer@gmail.com]
> Sent: 15 July 2012 11:56
> To: java-user@lucene.apache.org
> Subject: Re: In memory Lucene configuration
>
> hey there,
>
> On Sun, Jul 15, 2012 at 10:41 AM, Doron Yaacoby <do...@gingersoftware.com> wrote:
>> Hi, I have the following situation:
>>
>> I have two pretty large indices. One consists of about 1 billion documents (takes ~6GB on disk) and the other has about 2 billion documents (~10GB on disk). The documents are very short (4-5 terms each in the text field, and one numeric field with a long value). This is a read only index - I'm only going to read from it and never write. There is only one segment in each index (At least there should be, I called forceMerge(1) on them).
>>
>> Search latency is the most important thing to me. I need it to be blazing fast, ~20ms per query. Queries are always of the type +term1 +term2 +term3, and I'm asking for 10 results from each index (searching is done simultaneously on both indices).
>>
>> I have a fast server (12 cores@3GHz each) with 32Gb RAM (running Linux) and I can keep both indices in-memory when using a RAMDirectory. This didn't achieve the expected result (average query time = ~43ms). I'm seeing latency spikes, where the same query is sometimes answered in 10ms, but in a different occasion takes 2-3 seconds. I'm guessing this is due to GC (as explained here<http://lucene.472066.n3.nabble.com/Plans-to-remove-RAMDirectory-td3601156.html>). Using a warmed up MMapDirectory didn't help; the average query time was a bit slower. I tried using InstantiatedIndex, but it has a huge memory consumption, I couldn't even load the smaller 6GB index.
>
> its very hard to believe that you can't get this returning results faster though. I'd definitely recommend you MMapDirectory here or NIO should do too. When you measure this do you measure a large number of different queries or just a handful? Do you discard the first queries until caches are warmed up? What are you measuring, pure search time including doc loading?
> If you use MMapDir how much memory do you grant to your JVM? I'd recommend you to sum up the term dictionary file size (.tii) and the norm file size (nrm) and give the JVM something like 3x the size as Xmx and Xms provided you don't need any more memory elsewhere. A guess from the given index is that Xmx1G Xms1G should do the job and let the Filesystem use the rest (that is important for lucene if you use MMap / NIOFS)
>
> Your queries are straight boolean conjunctions or do you use positions ie phrase queries or spans?
>
> simon
>>
>> Any ideas about what could be the ideal configuration for me?
>> Thanks.
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


RE: In memory Lucene configuration

Posted by Doron Yaacoby <do...@gingersoftware.com>.

Thanks for the quick input!
I ran a few more tests with your suggested configuration (-Xmx1G -Xms1G with MMapDirectory). The third time I ran the same test I finally got an improvement: an average of ~30ms per query, although that's still not as fast as I need it to be.
The test contains about 2200 different queries (well, some are repeated two or three times), and the measured time includes both search and doc loading (reading the two fields I mentioned). The queries are all straight boolean conjunctions, and yes, I am dropping the first few queries when calculating averages.

BTW, I didn't mention before that I'm using Lucene 3.5 and Java 1.7.
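
For reference, the measurement described here (drop the first few queries, then average) could be sketched roughly as below. The class and method names are made up for illustration and this is not the actual test harness. It also reports percentiles and the maximum, on the assumption that an average alone can hide spikes like the 2-3 second ones mentioned at the start of the thread.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene or of the test described above.
class LatencyStats {

    /**
     * Drops the first 'warmup' samples, then returns {mean, p50, p99, max}
     * in the same unit as the input (e.g. milliseconds).
     */
    static double[] summarize(long[] latencies, int warmup) {
        if (warmup < 0 || warmup >= latencies.length) {
            throw new IllegalArgumentException("need at least one sample after warm-up");
        }
        long[] s = Arrays.copyOfRange(latencies, warmup, latencies.length);
        Arrays.sort(s);
        double sum = 0;
        for (long v : s) {
            sum += v;
        }
        return new double[] { sum / s.length, percentile(s, 50), percentile(s, 99), s[s.length - 1] };
    }

    /** Nearest-rank percentile on an already sorted array. */
    static long percentile(long[] sorted, int p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Calling LatencyStats.summarize(timesInMs, 10) would discard the first 10 measurements; the warm-up count of 10 is an arbitrary placeholder.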

-----Original Message-----
From: Simon Willnauer [mailto:simon.willnauer@gmail.com] 
Sent: 15 July 2012 11:56
To: java-user@lucene.apache.org
Subject: Re: In memory Lucene configuration

hey there,

On Sun, Jul 15, 2012 at 10:41 AM, Doron Yaacoby <do...@gingersoftware.com> wrote:
> Hi, I have the following situation:
>
> I have two pretty large indices. One consists of about 1 billion documents (takes ~6GB on disk) and the other has about 2 billion documents (~10GB on disk). The documents are very short (4-5 terms each in the text field, and one numeric field with a long value). This is a read only index - I'm only going to read from it and never write. There is only one segment in each index (At least there should be, I called forceMerge(1) on them).
>
> Search latency is the most important thing to me. I need it to be blazing fast, ~20ms per query. Queries are always of the type +term1 +term2 +term3, and I'm asking for 10 results from each index (searching is done simultaneously on both indices).
>
> I have a fast server (12 cores@3GHz each) with 32Gb RAM (running Linux) and I can keep both indices in-memory when using a RAMDirectory. This didn't achieve the expected result (average query time = ~43ms). I'm seeing latency spikes, where the same query is sometimes answered in 10ms, but in a different occasion takes 2-3 seconds. I'm guessing this is due to GC (as explained here<http://lucene.472066.n3.nabble.com/Plans-to-remove-RAMDirectory-td3601156.html>). Using a warmed up MMapDirectory didn't help; the average query time was a bit slower. I tried using InstantiatedIndex, but it has a huge memory consumption, I couldn't even load the smaller 6GB index.

its very hard to believe that you can't get this returning results faster though. I'd definitely recommend you MMapDirectory here or NIO should do too. When you measure this do you measure a large number of different queries or just a handful? Do you discard the first queries until caches are warmed up? What are you measuring, pure search time including doc loading?
If you use MMapDir how much memory do you grant to your JVM? I'd recommend you to sum up the term dictionary file size (.tii) and the norm file size (nrm) and give the JVM something like 3x the size as Xmx and Xms provided you don't need any more memory elsewhere. A guess from the given index is that Xmx1G Xms1G should do the job and let the Filesystem use the rest (that is important for lucene if you use MMap / NIOFS)

Your queries are straight boolean conjunctions or do you use positions ie phrase queries or spans?

simon
>
> Any ideas about what could be the ideal configuration for me?
> Thanks.
>


Re: In memory Lucene configuration

Posted by Simon Willnauer <si...@gmail.com>.

hey there,

On Sun, Jul 15, 2012 at 10:41 AM, Doron Yaacoby
<do...@gingersoftware.com> wrote:
> Hi, I have the following situation:
>
> I have two pretty large indices. One consists of about 1 billion documents (takes ~6GB on disk) and the other has about 2 billion documents (~10GB on disk). The documents are very short (4-5 terms each in the text field, and one numeric field with a long value). This is a read only index - I'm only going to read from it and never write. There is only one segment in each index (At least there should be, I called forceMerge(1) on them).
>
> Search latency is the most important thing to me. I need it to be blazing fast, ~20ms per query. Queries are always of the type +term1 +term2 +term3, and I'm asking for 10 results from each index (searching is done simultaneously on both indices).
>
> I have a fast server (12 cores@3GHz each) with 32Gb RAM (running Linux) and I can keep both indices in-memory when using a RAMDirectory. This didn't achieve the expected result (average query time = ~43ms). I'm seeing latency spikes, where the same query is sometimes answered in 10ms, but in a different occasion takes 2-3 seconds. I'm guessing this is due to GC (as explained here<http://lucene.472066.n3.nabble.com/Plans-to-remove-RAMDirectory-td3601156.html>). Using a warmed up MMapDirectory didn't help; the average query time was a bit slower. I tried using InstantiatedIndex, but it has a huge memory consumption, I couldn't even load the smaller 6GB index.

It's very hard to believe that you can't get this returning results
faster, though. I'd definitely recommend MMapDirectory here, or
NIOFSDirectory should do too. When you measure this, do you measure a
large number of different queries or just a handful? Do you discard
the first queries until caches are warmed up? What are you measuring:
pure search time, or search time including doc loading?
If you use MMapDirectory, how much memory do you grant to your JVM?
I'd recommend summing up the term dictionary index file size (.tii)
and the norms file size (.nrm) and giving the JVM something like 3x
that size as -Xmx and -Xms, provided you don't need any more memory
elsewhere. A guess from the given index is that -Xmx1G -Xms1G should
do the job; let the filesystem cache use the rest (that is important
for Lucene if you use MMapDirectory / NIOFSDirectory).
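
The sizing rule above (sum the .tii and .nrm files, take roughly 3x) can be sketched in plain Java as follows. HeapSizing and its methods are hypothetical names, and the 3x factor is only the rule of thumb from this message, not a guarantee.

```java
import java.io.File;

// Hypothetical helper; it only looks at file sizes and does not use Lucene.
class HeapSizing {

    /** Sums the sizes of regular files in indexDir whose names end with one of the suffixes. */
    static long sumBytes(File indexDir, String... suffixes) {
        File[] files = indexDir.listFiles();
        if (files == null) {
            return 0; // not a directory, or not readable
        }
        long total = 0;
        for (File f : files) {
            for (String suffix : suffixes) {
                if (f.isFile() && f.getName().endsWith(suffix)) {
                    total += f.length();
                    break; // count each file at most once
                }
            }
        }
        return total;
    }

    /** About 3x the summed size, rounded up to whole megabytes. */
    static long suggestedHeapMb(long bytes) {
        long mb = 1024L * 1024L;
        return (3 * bytes + mb - 1) / mb;
    }
}
```

For example, HeapSizing.suggestedHeapMb(HeapSizing.sumBytes(new File("/path/to/index"), ".tii", ".nrm")) gives a number of megabytes to use for both -Xms and -Xmx; the path is a placeholder.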

Are your queries straight boolean conjunctions, or do you use
positions, i.e. phrase queries or spans?

simon
>
> Any ideas about what could be the ideal configuration for me?
> Thanks.
>
