Posted to solr-user@lucene.apache.org by souravm <SO...@infosys.com> on 2008/11/25 00:26:25 UTC

Sorting and JVM heap size ....

Hi,

I have indexed data of size around 20GB. My JVM memory is 1.5GB.

For this data, if I run a query with the sort flag on (for a single field) I always get a Java out-of-memory exception, even when the number of hits is 0. With no sorting (or the default sorting by score) the query works perfectly fine.

I can understand that the JVM heap can max out when the number of records hit is high, but why is this happening even when the number of records hit is 0?

The same query with the sort flag on did not give me any problem with up to 1.5 GB of data.

Any explanation?

Regards,
Sourav

**************** CAUTION - Disclaimer *****************
This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely 
for the use of the addressee(s). If you are not the intended recipient, please 
notify the sender by e-mail and delete the original message. Further, you are not 
to copy, disclose, or distribute this e-mail or its contents to any other person and 
any such actions are unlawful. This e-mail may contain viruses. Infosys has taken 
every reasonable precaution to minimize this risk, but is not liable for any damage 
you may sustain as a result of any virus in this e-mail. You should carry out your 
own virus checks before opening the e-mail or attachment. Infosys reserves the 
right to monitor and review the content of all messages sent to or from this e-mail 
address. Messages sent to or from this e-mail address may be stored on the 
Infosys e-mail system.
***INFOSYS******** End of Disclaimer ********INFOSYS***

Re: Sorting and JVM heap size ....

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Sorting and JVM heap size ....
: In-Reply-To: <2c...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking



-Hoss


Re: Sorting and JVM heap size ....

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Tue, Nov 25, 2008 at 9:37 PM, souravm <SO...@infosys.com> wrote:

>
> Could you please explain a bit more how the new searcher can double the
> memory?
>

Take a look at slide 13 of Yonik's presentation available at
http://people.apache.org/~yonik/ApacheConEU2006/Solr.ppt

Each searcher in Solr maintains various caches for performance reasons. When
a new one is created, its caches are empty. If one exposes this searcher to
live requests, response times can be very long because a lot of disk
accesses may be needed. Therefore, Solr warms the new searcher's caches by
re-executing queries whose results had been cached by the old searcher. If
you sort on fields, the new searcher will create its own FieldCache for
each field you sort on. At this time, both the old and the new searcher
will hold their field caches.

>
> Based on your explanation, when a new set of documents gets committed, a new
> searcher is created. So my understanding is that this situation can occur
> only when an update/delete query and a search query run in parallel.
>

Not during updates/deletes, but when you issue a commit or optimize
command.


>
> Also, I am assuming that, like commit, optimization also happens only
> during an update/delete query.


Commit and optimize have to be called by you explicitly.

-- 
Regards,
Shalin Shekhar Mangar.
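Shalin's autowarming description can be sketched roughly like this. This is an illustrative model, not Solr's actual API: the `Searcher` class, the `warm` function, and the toy substring "search" are all invented for the example; the point is only that replaying the old cache's keys means two searchers (and two sets of caches) exist at once.

```python
class Searcher:
    def __init__(self, docs):
        self.docs = docs
        self.cache = {}                    # query -> cached results

    def search(self, query):
        if query not in self.cache:        # a miss would hit the index on disk
            self.cache[query] = [d for d in self.docs if query in d]
        return self.cache[query]

def warm(old, new):
    for query in old.cache:                # replay the old cache's keys
        new.search(query)                  # fills the new searcher's cache

old = Searcher(["solr rocks", "lucene in action"])
old.search("solr")                         # populate the old cache
new = Searcher(["solr rocks", "lucene in action", "solr 1.3"])
warm(old, new)                             # both caches are live at this moment
```

Only after warming finishes is the old searcher discarded and its memory reclaimed, which is why peak heap usage occurs during the warm-up window.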

RE: Sorting and JVM heap size ....

Posted by souravm <SO...@infosys.com>.
Hi Shalin,

Thanks for the clarifications.

Could you please explain a bit more how the new searcher can double the memory?

Based on your explanation, when a new set of documents gets committed, a new searcher is created. So my understanding is that this situation can occur only when an update/delete query and a search query run in parallel.

Also, I am assuming that, like commit, optimization also happens only during an update/delete query.

Regards,
Sourav

________________________________________
From: Shalin Shekhar Mangar [shalinmangar@gmail.com]
Sent: Tuesday, November 25, 2008 6:40 AM
To: solr-user@lucene.apache.org
Cc: souravm
Subject: Re: Sorting and JVM heap size ....

On Tue, Nov 25, 2008 at 7:49 AM, souravm <SO...@infosys.com> wrote:

3. Another case is - if there are 2 search requests concurrently hitting the server, each with sorting on the same 20 character date field, then also it would need 2x2GB memory. So if I know that I need to support at least 4 concurrent search requests, I need to start the JVM at least with 8 GB heap size.

This is a misunderstanding. Yonik said "searchers", not "searches". A single searcher handles all live search requests. When a commit/optimize happens, a new searcher is created, its caches are auto-warmed, and it is then swapped with the live searcher. It may be a bit more complicated under the hood, but that's pretty much how it works.

Considering that after commits and during auto-warming, another searcher might have been created which will have another field cache for each field you are sorting on, you'll need double the memory. The number of searchers can be controlled through the "maxWarmingSearchers" parameter in solrconfig.xml

--
Regards,
Shalin Shekhar Mangar.


Re: Sorting and JVM heap size ....

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Tue, Nov 25, 2008 at 7:49 AM, souravm <SO...@infosys.com> wrote:

>
> 3. Another case is - if there are 2 search requests concurrently hitting
> the server, each with sorting on the same 20 character date field, then also
> it would need 2x2GB memory. So if I know that I need to support at least 4
> concurrent search requests, I need to start the JVM at least with 8 GB heap
> size.
>

This is a misunderstanding. Yonik said "searchers", not "searches". A single
searcher handles all live search requests. When a commit/optimize happens, a
new searcher is created, its caches are auto-warmed, and it is then swapped
with the live searcher. It may be a bit more complicated under the hood, but
that's pretty much how it works.

Considering that after commits and during auto-warming, another searcher
might have been created which will have another field cache for each field
you are sorting on, you'll need double the memory. The number of searchers
can be controlled through the "maxWarmingSearchers" parameter in
solrconfig.xml

-- 
Regards,
Shalin Shekhar Mangar.
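A back-of-the-envelope version of the corrected sizing. This is a sketch under the thread's own assumptions (one 4-byte ordinal per document plus the value bytes, worst case every value unique, and at most one warming searcher alongside the live one); real FieldCache overhead differs in the details:

```python
def field_cache_bytes(num_docs, avg_value_bytes):
    # per-document int ordinals (4 bytes each) plus the stored values
    # (worst case: every document's value is unique)
    return num_docs * 4 + num_docs * avg_value_bytes

# The FieldCache is per *searcher*, not per search: all concurrent
# requests share one cache, so memory does not scale with request count.
one_searcher = field_cache_bytes(200_000_000, 20)   # ordinals + strings
peak = 2 * one_searcher     # live searcher + one warming searcher, not 4x
```

Under these assumptions one searcher needs roughly 4.8 GB and the warming window roughly doubles that, which is why the "8 GB for 4 concurrent searches" reasoning overestimates the per-request cost but underestimates nothing else.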

RE: Sorting and JVM heap size ....

Posted by souravm <SO...@infosys.com>.
Thanks, Yonik. That explains it.

Regards,
Sourav

-----Original Message-----
From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf Of Yonik Seeley
Sent: Monday, November 24, 2008 7:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Sorting and JVM heap size ....

On Mon, Nov 24, 2008 at 9:19 PM, souravm <SO...@infosys.com> wrote:
> Hi Yonik,
>
> Thanks again for the detail input.
>
> Let me try to re-confirm my understanding -
>
> 1. What you say is: if sorting is requested on a field, that field's value from all indexed documents would be put in memory in an un-inverted form. So, given this, if I have a String field of say 20 characters (assuming no multibyte characters, all ASCII), then for 200M documents I need at least 20 bytes x 200M, i.e. 4 GB of memory.

That's the general idea, yes.
For Strings, it's actually just the unique values in a String[], plus
an int[200000000] of offsets into that String[] for each document.
See Lucene's FieldCache and StringIndex.

-Yonik


> 2. So, if I want to have sorting on 2 such fields I need to allocate at least 8 GB of memory.
>
> 3. Another case is - if there are 2 search requests concurrently hitting the server, each with sorting on the same 20 character date field, then also it would need 2x2GB memory. So if I know that I need to support at least 4 concurrent search requests, I need to start the JVM at least with 8 GB heap size.
>
> Please let me know if my understanding is correct.
>
> Regards,
> Sourav
>
> -----Original Message-----
> From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf Of Yonik Seeley
> Sent: Monday, November 24, 2008 6:03 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Sorting and JVM heap size ....
>
> On Mon, Nov 24, 2008 at 8:48 PM, souravm <SO...@infosys.com> wrote:
>> I have around 200M documents in index. The field I'm sorting on is a date string (containing date and time in dd-mmm-yyyy  hh:mm:yy format) and the field is part of the search criteria.
>>
>> Also please note that the number of documents returned by the search criteria is much less than 200M. In fact even in case of 0 hit I found jvm out of memory exception.
>
> Right... that's just how the Lucene FieldCache used for sorting works right now.
> The entire field is un-inverted and held in memory.
>
> 200M docs is a *lot*... you might try indexing your date fields as
> integer types that would take only 4 bytes per doc - and that will
> still take up 800M.  Given that 2 searchers can overlap, that still
> adds up to more than your heap - you will need to up that.
>
> The other option is to split your index across multiple nodes and use
> distributed search.  If you want to do any faceting in the future, or
> sort on multiple fields, you will need to do this anyway.
>
> -Yonik
>

Re: Sorting and JVM heap size ....

Posted by Yonik Seeley <yo...@apache.org>.
On Mon, Nov 24, 2008 at 9:19 PM, souravm <SO...@infosys.com> wrote:
> Hi Yonik,
>
> Thanks again for the detail input.
>
> Let me try to re-confirm my understanding -
>
> 1. What you say is: if sorting is requested on a field, that field's value from all indexed documents would be put in memory in an un-inverted form. So, given this, if I have a String field of say 20 characters (assuming no multibyte characters, all ASCII), then for 200M documents I need at least 20 bytes x 200M, i.e. 4 GB of memory.

That's the general idea, yes.
For Strings, it's actually just the unique values in a String[], plus
an int[200000000] of offsets into that String[] for each document.
See Lucene's FieldCache and StringIndex.

-Yonik


> 2. So, if I want to have sorting on 2 such fields I need to allocate at least 8 GB of memory.
>
> 3. Another case is - if there are 2 search requests concurrently hitting the server, each with sorting on the same 20 character date field, then also it would need 2x2GB memory. So if I know that I need to support at least 4 concurrent search requests, I need to start the JVM at least with 8 GB heap size.
>
> Please let me know if my understanding is correct.
>
> Regards,
> Sourav
>
> -----Original Message-----
> From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf Of Yonik Seeley
> Sent: Monday, November 24, 2008 6:03 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Sorting and JVM heap size ....
>
> On Mon, Nov 24, 2008 at 8:48 PM, souravm <SO...@infosys.com> wrote:
>> I have around 200M documents in index. The field I'm sorting on is a date string (containing date and time in dd-mmm-yyyy  hh:mm:yy format) and the field is part of the search criteria.
>>
>> Also please note that the number of documents returned by the search criteria is much less than 200M. In fact even in case of 0 hit I found jvm out of memory exception.
>
> Right... that's just how the Lucene FieldCache used for sorting works right now.
> The entire field is un-inverted and held in memory.
>
> 200M docs is a *lot*... you might try indexing your date fields as
> integer types that would take only 4 bytes per doc - and that will
> still take up 800M.  Given that 2 searchers can overlap, that still
> adds up to more than your heap - you will need to up that.
>
> The other option is to split your index across multiple nodes and use
> distributed search.  If you want to do any faceting in the future, or
> sort on multiple fields, you will need to do this anyway.
>
> -Yonik
>
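Yonik's description of the FieldCache StringIndex (unique values in a String[], plus an int offset into it for each document) can be sketched like this. A simplified Python model, not Lucene's implementation:

```python
def build_string_index(doc_values):
    # doc_values: the sort field's value for every document in the index
    lookup = sorted(set(doc_values))               # unique values only
    ord_of = {v: i for i, v in enumerate(lookup)}
    order = [ord_of[v] for v in doc_values]        # one int per document
    return lookup, order

# The per-document int array exists for *every* doc, so its size is fixed
# by the index (4 bytes x 200M = ~800 MB), independent of any query's hits.
lookup, order = build_string_index(
    ["25-Nov-2008", "24-Nov-2008", "25-Nov-2008"])
```

Because comparisons during sorting are done on the small int ordinals rather than the strings, duplicated values are stored only once in `lookup`.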

RE: Sorting and JVM heap size ....

Posted by souravm <SO...@infosys.com>.
Hi Yonik,

Thanks again for the detail input.

Let me try to re-confirm my understanding -

1. What you say is: if sorting is requested on a field, that field's value from all indexed documents would be put in memory in an un-inverted form. So, given this, if I have a String field of say 20 characters (assuming no multibyte characters, all ASCII), then for 200M documents I need at least 20 bytes x 200M, i.e. 4 GB of memory.

2. So, if I want to have sorting on 2 such fields I need to allocate at least 8 GB of memory.

3. Another case is - if there are 2 search requests concurrently hitting the server, each with sorting on the same 20 character date field, then also it would need 2x2GB memory. So if I know that I need to support at least 4 concurrent search requests, I need to start the JVM at least with 8 GB heap size. 

Please let me know if my understanding is correct.

Regards,
Sourav

-----Original Message-----
From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf Of Yonik Seeley
Sent: Monday, November 24, 2008 6:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Sorting and JVM heap size ....

On Mon, Nov 24, 2008 at 8:48 PM, souravm <SO...@infosys.com> wrote:
> I have around 200M documents in index. The field I'm sorting on is a date string (containing date and time in dd-mmm-yyyy  hh:mm:yy format) and the field is part of the search criteria.
>
> Also please note that the number of documents returned by the search criteria is much less than 200M. In fact even in case of 0 hit I found jvm out of memory exception.

Right... that's just how the Lucene FieldCache used for sorting works right now.
The entire field is un-inverted and held in memory.

200M docs is a *lot*... you might try indexing your date fields as
integer types that would take only 4 bytes per doc - and that will
still take up 800M.  Given that 2 searchers can overlap, that still
adds up to more than your heap - you will need to up that.

The other option is to split your index across multiple nodes and use
distributed search.  If you want to do any faceting in the future, or
sort on multiple fields, you will need to do this anyway.

-Yonik


Re: Sorting and JVM heap size ....

Posted by Yonik Seeley <yo...@apache.org>.
On Mon, Nov 24, 2008 at 8:48 PM, souravm <SO...@infosys.com> wrote:
> I have around 200M documents in index. The field I'm sorting on is a date string (containing date and time in dd-mmm-yyyy  hh:mm:yy format) and the field is part of the search criteria.
>
> Also please note that the number of documents returned by the search criteria is much less than 200M. In fact even in case of 0 hit I found jvm out of memory exception.

Right... that's just how the Lucene FieldCache used for sorting works right now.
The entire field is un-inverted and held in memory.

200M docs is a *lot*... you might try indexing your date fields as
integer types that would take only 4 bytes per doc - and that will
still take up 800M.  Given that 2 searchers can overlap, that still
adds up to more than your heap - you will need to up that.

The other option is to split your index across multiple nodes and use
distributed search.  If you want to do any faceting in the future, or
sort on multiple fields, you will need to do this anyway.

-Yonik
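Yonik's suggestion, sketched: encode the date as a single integer so the FieldCache needs only 4 bytes per document instead of the full string. Seconds-since-epoch is one illustrative encoding (my choice here, not something the thread specifies); it stays sortable and fits a signed 32-bit int until 2038:

```python
from datetime import datetime, timezone

def date_to_epoch_seconds(s):
    # "24-Nov-2008 18:03:00" -> seconds since the Unix epoch (UTC assumed)
    dt = datetime.strptime(s, "%d-%b-%Y %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

# 4 bytes per document instead of ~20 bytes of string per document:
int_cache = 200_000_000 * 4          # the ~800 MB figure from the thread
secs = date_to_epoch_seconds("24-Nov-2008 18:03:00")
```

The conversion would be done at index time, so sorting compares plain ints and the string form never enters the FieldCache.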

RE: Sorting and JVM heap size ....

Posted by souravm <SO...@infosys.com>.
Hi Yonik,

Thanks for the reply.

I have around 200M documents in the index. The field I'm sorting on is a date string (containing date and time in dd-mmm-yyyy hh:mm:yy format) and the field is part of the search criteria.

Also, please note that the number of documents returned by the search criteria is much less than 200M. In fact, even in the case of 0 hits, I got a JVM out-of-memory exception.

Regards,
Sourav

-----Original Message-----
From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf Of Yonik Seeley
Sent: Monday, November 24, 2008 5:40 PM
To: solr-user@lucene.apache.org
Subject: Re: Sorting and JVM heap size ....

On Mon, Nov 24, 2008 at 6:26 PM, souravm <SO...@infosys.com> wrote:
> I have indexed data of size around 20GB. My JVM memory is 1.5GB.
>
> For this data, if I run a query with the sort flag on (for a single field) I always get a Java out-of-memory exception, even when the number of hits is 0. With no sorting (or the default sorting by score) the query works perfectly fine.
>
> I can understand that the JVM heap can max out when the number of records hit is high, but why is this happening even when the number of records hit is 0?
>
> The same query with the sort flag on did not give me any problem with up to 1.5 GB of data.
>
> Any explanation?

Sorting in Lucene and Solr un-inverts the field and creates a
FieldCache entry the first time the sort is used.
How many documents are in your index, and what is the type of the
field you are sorting on?

-Yonik


Re: Sorting and JVM heap size ....

Posted by Yonik Seeley <yo...@apache.org>.
On Mon, Nov 24, 2008 at 6:26 PM, souravm <SO...@infosys.com> wrote:
> I have indexed data of size around 20GB. My JVM memory is 1.5GB.
>
> For this data, if I run a query with the sort flag on (for a single field) I always get a Java out-of-memory exception, even when the number of hits is 0. With no sorting (or the default sorting by score) the query works perfectly fine.
>
> I can understand that the JVM heap can max out when the number of records hit is high, but why is this happening even when the number of records hit is 0?
>
> The same query with the sort flag on did not give me any problem with up to 1.5 GB of data.
>
> Any explanation?

Sorting in Lucene and Solr un-inverts the field and creates a
FieldCache entry the first time the sort is used.
How many documents are in your index, and what is the type of the
field you are sorting on?

-Yonik
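Yonik's point, that the FieldCache is built from the whole index the first time a sort runs, regardless of how many documents the query matches, can be sketched as follows. This is illustrative Python, not Lucene's code: "un-inverting" walks the inverted index term by term and records a value for every document containing that term.

```python
def uninvert(inverted, num_docs):
    # inverted: {term: [doc ids]} postings for one field of the whole index
    values = [None] * num_docs          # one slot for every doc in the index
    for term, postings in sorted(inverted.items()):
        for doc_id in postings:
            values[doc_id] = term       # later (greater) terms overwrite none
    return values

# The array is sized by num_docs, so the memory cost is paid even for a
# query that matches zero documents.
values = uninvert({"b": [0, 2], "a": [1]}, num_docs=3)
```

This is why the out-of-memory exception in the original question appears even on a 0-hit query: the cost is a function of index size, not result size.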