Posted to java-user@lucene.apache.org by "Woolf, Ross" <Ro...@BMC.com> on 2010/04/02 00:58:20 UTC

IndexWriter and memory usage

We are seeing a situation where the IndexWriter uses up the Java heap space and only releases memory for garbage collection upon a commit.  We are using the default RAMBufferSize of 16 MB.  We are using Lucene 2.9.1.  Our heap size is set to 512 MB.

We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is converted to a string and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to perform a commit only once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is not able to clean up any memory once we get into this state.  It is our understanding that the IndexWriter should only be holding onto 16 MB of data before it flushes, but what we are seeing is that while it does in fact write data to disk when it hits the 16 MB limit, it also holds onto some data in memory that garbage collection cannot reclaim, and this continues until garbage collection is unable to free up enough space to allow things to move faster than a crawl.

As a test we caused a commit to occur after each document is indexed, and we see the total amount of memory reduced from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that the memory is cleaned up to some extent after each document.  But of course this completely defeats the whole reason we want to commit only at the end of the run, for performance's sake.  Most of the data, as seen in heap analysis, is held in Byte, Character, and Integer classes whose GC roots are tied back to the writer objects and threads.  The instance counts, after running just 1,100 documents, seem staggering.

Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?

A side question: we always see a large amount of memory used by the IndexWriter even after our indexing has completed and all commits have taken place (basically in an idle state).  Why would this be?  Is the only way to totally clean up the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.

Any help in understanding why the IndexWriter is maxing out our heap space or what is expected from memory usage of the IndexWriter would be appreciated.
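[Editor's note: the pipeline described above can be sketched roughly as follows with the Lucene 2.9-era API. The index path, analyzer choice, field name, and the extractedTexts() helper standing in for the Tika step are all hypothetical, not from the thread.]

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/path/to/index")),    // hypothetical path
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.setRAMBufferSizeMB(16.0);  // the default flush trigger discussed in this thread

        for (String body : extractedTexts()) {   // strings produced by the Tika step
            Document doc = new Document();
            doc.add(new Field("contents", body, Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);             // flushes segments past 16 MB, but no commit
        }
        writer.commit();  // single commit at the end of the run, as described above
        writer.close();
    }

    // Placeholder for the Tika extraction pipeline described in the post.
    static Iterable<String> extractedTexts() {
        return java.util.Collections.emptyList();
    }
}
```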

Re: IndexWriter and memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
This would be a very good thing to try, given that you have some huge
documents that, indexed alone, use far more than your RAM buffer.

Mike

On Tue, Apr 13, 2010 at 12:19 AM, Lance Norskog <go...@gmail.com> wrote:
> There are some bugs where the writer data structures retain data after
> it is flushed.  The fixes were committed within maybe the past week.  If you
> can pull the trunk and try it with your use case, that would be great.
>
> On Mon, Apr 12, 2010 at 8:54 AM, Woolf, Ross <Ro...@bmc.com> wrote:
>> I was on vacation last week so just getting back to this...  Here is the info stream (as an attachment).  I'll see what I can do about reducing the heap dump (It was supplied by a colleague).
>>
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Sent: Saturday, April 03, 2010 3:39 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> Hmm why is the heap dump so immense?  Normally it contains the top N
>> (eg 100) object types and their count/aggregate RAM usage.
>>
>> Can you attach the infoStream output to an email (to java-user)?
>>
>> Mike
>>
>> On Fri, Apr 2, 2010 at 5:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>> I have this and the heap dump is 63 MB zipped.  The info stream is much smaller (31 KB zipped), but I don't know how to get them to you.
>>>
>>> We are not using the NRT readers
>>>
>>> -----Original Message-----
>>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>>> Sent: Thursday, April 01, 2010 5:21 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: IndexWriter and memory usage
>>>
>>> Hmm, not good.  Can you post a heap dump?  Also, can you turn on
>>> infoStream, index up to the OOM @ 512 MB, and post the output?
>>>
>>> IndexWriter should not hang onto much beyond the RAM buffer.  But, it
>>> does allocate and then recycle this RAM buffer, so even in an idle
>>> state (having indexed enough docs to fill up the RAM buffer at least
>>> once) it'll hold onto those 16 MB.
>>>
>>> Are you using getReader (to get your NRT readers)?  If so, are you
>>> really sure you're eventually closing the previous reader after
>>> opening a new one?
>>>
>>> Mike
>>>
>>> On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>> We are seeing a situation where the IndexWriter uses up the Java heap space and only releases memory for garbage collection upon a commit.  We are using the default RAMBufferSize of 16 MB.  We are using Lucene 2.9.1.  Our heap size is set to 512 MB.
>>>>
>>>> We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is converted to a string and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to perform a commit only once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is not able to clean up any memory once we get into this state.  It is our understanding that the IndexWriter should only be holding onto 16 MB of data before it flushes, but what we are seeing is that while it does in fact write data to disk when it hits the 16 MB limit, it also holds onto some data in memory that garbage collection cannot reclaim, and this continues until garbage collection is unable to free up enough space to allow things to move faster than a crawl.
>>>>
>>>> As a test we caused a commit to occur after each document is indexed, and we see the total amount of memory reduced from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that the memory is cleaned up to some extent after each document.  But of course this completely defeats the whole reason we want to commit only at the end of the run, for performance's sake.  Most of the data, as seen in heap analysis, is held in Byte, Character, and Integer classes whose GC roots are tied back to the writer objects and threads.  The instance counts, after running just 1,100 documents, seem staggering.
>>>>
>>>> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>>>>
>>>> A side question: we always see a large amount of memory used by the IndexWriter even after our indexing has completed and all commits have taken place (basically in an idle state).  Why would this be?  Is the only way to totally clean up the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>>>>
>>>> Any help in understanding why the IndexWriter is maxing out our heap space or what is expected from memory usage of the IndexWriter would be appreciated.
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
>
>

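[Editor's note: turning on the diagnostics Mike asks for above is a one-liner on the 2.9-era writer. A hedged sketch; the choice of System.out and the MB report are illustrative, not part of the thread.]

```java
import org.apache.lucene.index.IndexWriter;

public class IndexWriterDiagnostics {
    // Enable the verbose diagnostics requested above and report buffered RAM.
    static void enable(IndexWriter writer) {
        // Route IndexWriter's internal log (flushes, merges, RAM usage) to stdout;
        // it can be pointed at a FileOutputStream instead.
        writer.setInfoStream(System.out);

        // The writer's currently buffered RAM can also be polled directly:
        long used = writer.ramSizeInBytes();
        System.out.println("IW RAM buffer: " + (used / (1024 * 1024)) + " MB");
    }
}
```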


Re: IndexWriter and memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
Phew!  Thanks for bringing closure, Ross.  Happy indexing,

Mike

On Wed, May 19, 2010 at 12:50 PM, Woolf, Ross <Ro...@bmc.com> wrote:
> Just wanted to report that Michael was able to find the issue that was plaguing us.  He has checked fixes into the 2.9.x, 3.0.x, 3.1.x, and 4.0.x branches.  Most of the issues were related to indexing documents larger than the indexing buffer size (16 MB by default).  Now we no longer run out of memory during our large-document indexing runs.
>
> Thanks for your help Michael in resolving this.
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Friday, May 14, 2010 11:23 AM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> The patch looks correct.
>
> The 16 MB RAM buffer means the sum of the shared char[], byte[] and
> PostingList/RawPostingList memory will be kept under 16 MB.  There are
> definitely other things that require memory beyond this -- eg during a
> segment merge, SegmentReaders are opened for each segment being
> merged.  Also, if there are pending deletions, 4 bytes per doc is
> allocated.
>
> Applying deletions also opens SegmentReaders.
>
> Also: a single very large document will cause IW to blow way past the
> 16 MB limit, using up as much as is required to index that one doc.
> When that doc is finished, it will then flush and free objects until
> it's back under the 16 MB limit.  If several threads happen to index
> large docs at the same time, the problem is that much worse (they all
> must finish before IW can flush).
>
> Can you print the size of the documents you're indexing and see if
> that correlates to when you see the memory growth?
>
> Mike
>
> On Tue, May 11, 2010 at 2:57 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> Still working on some of the things you asked about, namely searching without indexing.  I need to modify our code, and the general indexing process takes 2 hours, so I won't have a quick turnaround on that.  We also have a hard time answering the question about whether items that are normal nonetheless use more than the 16 MB.  The heap dump does not allow us to quickly identify specifics on objects like we show in the images below, so we really don't know how much memory is used up by objects of this sort.  We only know that the byte[] total for all is 197891887.
>>
>> However, I have provided another image that breaks down the memory usage from the heap.  A big question we have: we talk about the 16 MB buffer, but is there other memory used by Lucene beyond that which we should expect to see?
>>
>> http://i39.tinypic.com/o0o560.jpg
>>
>> we have 197891887 used in byte[] (anyone we look at is related in some way to the index writer)
>> we have 169263904 used in char[] (these are related to the index writer too)
>> we have 72658944 used in FreqProxTermsWriter$PostingList
>> we have 37722668 used in RawPostingList[]
>>
>> All of these are well over the 16 MB, so we are a little lost as to what we should expect to see when we look at the memory usage.
>>
>> I've attached the patch and the CheckIndex files.  Unfortunately, on the patch my editor made some whitespace changes, so you get a lot of extra items in the patch that are really nothing other than tab/space differences.
>>
>> If you are open to a live share again, then maybe you could look at this data quicker than the screen shots I send.
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Sent: Monday, May 10, 2010 2:27 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> Hmmmm...
>>
>> Your usage (searching for old doc & updating it, to add new fields) is fine.
>>
>> But: what memory usage do you see if you open a searcher, and search
>> for all docs, but don't open an IndexWriter?  We need to tease apart
>> the IndexReader vs IndexWriter memory usage you are seeing.  Also, can
>> you post the output of CheckIndex (java
>> org.apache.lucene.index.CheckIndex /path/to/index) of your fully built
>> index?  That may give some hints about expected memory usage of IR (eg
>> if # unique terms is large).
>>
>> More comments below:
>>
>> On Thu, May 6, 2010 at 1:03 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>> Sorry to be so long in getting back on this. The patch you provided has improved the situation but we are still seeing some memory loss.  The following are some images from the heap dump.  I'll share with you what we are seeing now.
>>>
>>> This first image shows the memory pattern.  Our first commit takes place at about 3:54, when the steady upward trend takes a drop and a new cycle begins.  What we have found is that the 2422 fix has made the memory in the first phase before the commit much better (and I'm sure throughout the entire run).  But as you can see, after the commit we again begin to lose memory.  One piece of info about what you are seeing: we have 5 threads pushing data to our Lucene plugin.  If we drop down to 1 thread then we are much more successful and can actually index all of our data without running out of memory, but at 5 threads it gets into trouble.  We still see an upward trend in memory usage, but not as severe as with multiple threads.
>>> http://tinypic.com/view.php?pic=2w6bf68&s=5
>>
>> Can you post the output of "svn diff" on the 2.9 code base you're
>> using?  I just want to look & verify all issues we've discussed are
>> included in your changes.  The fact that 1 thread is fine and 5
>> threads are not still sounds like a symptom of LUCENE-2283.
>>
>> Also, does that heap usage graph exclude garbage?  Or, alternatively,
>> can you provoke an OOME w/ 512 MB heap and then capture the heap dump
>> at that point?
>>
>>> There is another piece of the picture that I think might be coming into play.  We have plugged Lucene into a legacy app and are subject to how we can get it to deliver the data we are indexing.  In some scenarios (like the one we are having this problem with) we build our documents progressively (adding fields to the document through the process).  What you see before the first commit is the legacy system handing us the first field for many documents.  Once we have gotten all of "field 1" for all documents, we commit that data into the index.  Then the system starts feeding us "field 2."  So we perform a search to see if the document already exists (for the scenario you are seeing, it does), retrieve the original document (we store a document ID), add the new field of data to the existing document, and "update" the document in the index.  After the first commit, the rest of the process is one where a document already exists, so the new field is added and the document is updated.  It is in this process that we start rapidly losing memory.  The following images show some examples of common areas where memory is being held.
>>>
>>> http://tinypic.com/view.php?pic=11vkwnb&s=5
>>
>> This looks like "normal" memory usage of IndexWriter -- these are the
>> recycled buffers used for holding stored fields.  However: the net RAM
>> used by this allocation should not exceed your 16 MB IW ram buffer
>> size -- does it?
>>
>>> http://tinypic.com/view.php?pic=abq9fp&s=5
>>
>> This one is the byte[] buffer used by CompoundFileReader, opened by
>> IndexReader.  It's odd that you have so many of these (if I'm reading
>> this correctly) -- are you certain all opened readers are being
>> closed?  How many segments do you have in your index?  Or... are there
>> many unique threads doing the searching?  EG do you create a new
>> thread for every search or update?
>>
>>> http://tinypic.com/view.php?pic=25pskyp&s=5
>>
>> This one is also normal memory used by IndexWriter, but as above, the
>> net RAM used by this allocation (summed w/ the above one) should not
>> exceed your 16 MB IW ram buffer size.
>>
>>> As mentioned, we are subject to how we can have the legacy app feed us the data and so this is why we do it this way.  We treat this system as a real time system and at anytime the legacy system may send us a field that needs to be added or updated to a document.  So we search for the document and if found we either add or update a field if the field is already existing in the document.  So I started to wonder if a clue in this memory loss comes from the fact that we are retrieving an existing document and then adding to it and updating.
>>>
>>> Now, if we eliminate the updating and simply add each item as a new document (which we did just to test but won't be adequate for our running system), then we still see a slight trend upward in memory usage and the following images show that now most of the  memory is consumed in char[] rather than the byte[] we saw before.  We don't know if this is normal and expected, or if it is something to be concerned about as well.
>>>
>>> http://tinypic.com/view.php?pic=vfgkyt&s=5
>>
>> That memory usage is normal -- it's used by the in-RAM terms index of
>> your opened IndexReader.  But I'd like to see the memory usage of
>> simply opening your IndexReader and searching for documents to update,
>> but not opening an IndexWriter at all.
>>
>> Mike
>>
>>
>>
>>
>>
>
>
>
>
>



RE: IndexWriter and memory usage

Posted by "Woolf, Ross" <Ro...@BMC.com>.
Just wanted to report that Michael was able to find the issue that was plaguing us.  He has checked fixes into the 2.9.x, 3.0.x, 3.1.x, and 4.0.x branches.  Most of the issues were related to indexing documents larger than the indexing buffer size (16 MB by default).  Now we no longer run out of memory during our large-document indexing runs.

Thanks for your help Michael in resolving this.

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Friday, May 14, 2010 11:23 AM
To: java-user@lucene.apache.org
Subject: Re: IndexWriter and memory usage

The patch looks correct.

The 16 MB RAM buffer means the sum of the shared char[], byte[] and
PostingList/RawPostingList memory will be kept under 16 MB.  There are
definitely other things that require memory beyond this -- eg during a
segment merge, SegmentReaders are opened for each segment being
merged.  Also, if there are pending deletions, 4 bytes per doc is
allocated.

Applying deletions also opens SegmentReaders.

Also: a single very large document will cause IW to blow way past the
16 MB limit, using up as much as is required to index that one doc.
When that doc is finished, it will then flush and free objects until
it's back under the 16 MB limit.  If several threads happen to index
large docs at the same time, the problem is that much worse (they all
must finish before IW can flush).

Can you print the size of the documents you're indexing and see if
that correlates to when you see the memory growth?

Mike
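[Editor's note: a rough way to do the size check Mike suggests, assuming each document body arrives as a single Java String as described earlier in the thread. Java strings are UTF-16, so about 2 bytes per char; the 40-byte overhead constant is an approximation, not a measured value.]

```java
public class DocSizeEstimator {
    // Approximate heap bytes held by a String document body: Java strings are
    // UTF-16 (2 bytes per char) plus roughly 40 bytes of object/array overhead.
    static long estimateStringBytes(String body) {
        return 40L + 2L * body.length();
    }

    public static void main(String[] args) {
        // An 8M-char document already occupies ~16 MB of heap before indexing starts,
        // i.e. as large as the whole default RAM buffer discussed in this thread.
        String body = new String(new char[8 * 1024 * 1024]);
        System.out.println("~" + (estimateStringBytes(body) / (1024 * 1024)) + " MB");
    }
}
```

Logging this estimate just before each addDocument call would show whether the memory growth correlates with the large documents.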

On Tue, May 11, 2010 at 2:57 PM, Woolf, Ross <Ro...@bmc.com> wrote:
> Still working on some of the things you asked about, namely searching without indexing.  I need to modify our code, and the general indexing process takes 2 hours, so I won't have a quick turnaround on that.  We also have a hard time answering the question about whether items that are normal nonetheless use more than the 16 MB.  The heap dump does not allow us to quickly identify specifics on objects like we show in the images below, so we really don't know how much memory is used up by objects of this sort.  We only know that the byte[] total for all is 197891887.
>
> However, I have provided another image that breaks down the memory usage from the heap.  A big question we have: we talk about the 16 MB buffer, but is there other memory used by Lucene beyond that which we should expect to see?
>
> http://i39.tinypic.com/o0o560.jpg
>
> we have 197891887 used in byte[] (anyone we look at is related in some way to the index writer)
> we have 169263904 used in char[] (these are related to the index writer too)
> we have 72658944 used in FreqProxTermsWriter$PostingList
> we have 37722668 used in RawPostingList[]
>
> All of these are well over the 16 MB, so we are a little lost as to what we should expect to see when we look at the memory usage.
>
> I've attached the patch and the CheckIndex files.  Unfortunately, on the patch my editor made some whitespace changes, so you get a lot of extra items in the patch that are really nothing other than tab/space differences.
>
> If you are open to a live share again, then maybe you could look at this data quicker than the screen shots I send.
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Monday, May 10, 2010 2:27 AM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> Hmmmm...
>
> Your usage (searching for old doc & updating it, to add new fields) is fine.
>
> But: what memory usage do you see if you open a searcher, and search
> for all docs, but don't open an IndexWriter?  We need to tease apart
> the IndexReader vs IndexWriter memory usage you are seeing.  Also, can
> you post the output of CheckIndex (java
> org.apache.lucene.index.CheckIndex /path/to/index) of your fully built
> index?  That may give some hints about expected memory usage of IR (eg
> if # unique terms is large).
>
> More comments below:
>
> On Thu, May 6, 2010 at 1:03 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> Sorry to be so long in getting back on this. The patch you provided has improved the situation but we are still seeing some memory loss.  The following are some images from the heap dump.  I'll share with you what we are seeing now.
>>
>> This first image shows the memory pattern.  Our first commit takes place at about 3:54, when the steady upward trend takes a drop and a new cycle begins.  What we have found is that the 2422 fix has made the memory in the first phase before the commit much better (and I'm sure throughout the entire run).  But as you can see, after the commit we again begin to lose memory.  One piece of info about what you are seeing: we have 5 threads pushing data to our Lucene plugin.  If we drop down to 1 thread then we are much more successful and can actually index all of our data without running out of memory, but at 5 threads it gets into trouble.  We still see an upward trend in memory usage, but not as severe as with multiple threads.
>> http://tinypic.com/view.php?pic=2w6bf68&s=5
>
> Can you post the output of "svn diff" on the 2.9 code base you're
> using?  I just want to look & verify all issues we've discussed are
> included in your changes.  The fact that 1 thread is fine and 5
> threads are not still sounds like a symptom of LUCENE-2283.
>
> Also, does that heap usage graph exclude garbage?  Or, alternatively,
> can you provoke an OOME w/ 512 MB heap and then capture the heap dump
> at that point?
>
>> There is another piece of the picture that I think might be coming into play.  We have plugged Lucene into a legacy app and are subject to how we can get it to deliver the data we are indexing.  In some scenarios (like the one we are having this problem with) we build our documents progressively (adding fields to the document through the process).  What you see before the first commit is the legacy system handing us the first field for many documents.  Once we have gotten all of "field 1" for all documents, we commit that data into the index.  Then the system starts feeding us "field 2."  So we perform a search to see if the document already exists (for the scenario you are seeing, it does), retrieve the original document (we store a document ID), add the new field of data to the existing document, and "update" the document in the index.  After the first commit, the rest of the process is one where a document already exists, so the new field is added and the document is updated.  It is in this process that we start rapidly losing memory.  The following images show some examples of common areas where memory is being held.
>>
>> http://tinypic.com/view.php?pic=11vkwnb&s=5
>
> This looks like "normal" memory usage of IndexWriter -- these are the
> recycled buffers used for holding stored fields.  However: the net RAM
> used by this allocation should not exceed your 16 MB IW ram buffer
> size -- does it?
>
>> http://tinypic.com/view.php?pic=abq9fp&s=5
>
> This one is the byte[] buffer used by CompoundFileReader, opened by
> IndexReader.  It's odd that you have so many of these (if I'm reading
> this correctly) -- are you certain all opened readers are being
> closed?  How many segments do you have in your index?  Or... are there
> many unique threads doing the searching?  EG do you create a new
> thread for every search or update?
>
>> http://tinypic.com/view.php?pic=25pskyp&s=5
>
> This one is also normal memory used by IndexWriter, but as above, the
> net RAM used by this allocation (summed w/ the above one) should not
> exceed your 16 MB IW ram buffer size.
>
>> As mentioned, we are subject to how we can have the legacy app feed us the data and so this is why we do it this way.  We treat this system as a real time system and at anytime the legacy system may send us a field that needs to be added or updated to a document.  So we search for the document and if found we either add or update a field if the field is already existing in the document.  So I started to wonder if a clue in this memory loss comes from the fact that we are retrieving an existing document and then adding to it and updating.
>>
>> Now, if we eliminate the updating and simply add each item as a new document (which we did just to test but won't be adequate for our running system), then we still see a slight trend upward in memory usage and the following images show that now most of the  memory is consumed in char[] rather than the byte[] we saw before.  We don't know if this is normal and expected, or if it is something to be concerned about as well.
>>
>> http://tinypic.com/view.php?pic=vfgkyt&s=5
>
> That memory usage is normal -- it's used by the in-RAM terms index of
> your opened IndexReader.  But I'd like to see the memory usage of
> simply opening your IndexReader and searching for documents to update,
> but not opening an IndexWriter at all.
>
> Mike
>
>
>
>
>





Re: IndexWriter and memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
The patch looks correct.

The 16 MB RAM buffer means the sum of the shared char[], byte[] and
PostingList/RawPostingList memory will be kept under 16 MB.  There are
definitely other things that require memory beyond this -- eg during a
segment merge, SegmentReaders are opened for each segment being
merged.  Also, if there are pending deletions, 4 bytes per doc is
allocated.

Applying deletions also opens SegmentReaders.

Also: a single very large document will cause IW to blow way past the
16 MB limit, using up as much as is required to index that one doc.
When that doc is finished, it will then flush and free objects until
it's back under the 16 MB limit.  If several threads happen to index
large docs at the same time, the problem is that much worse (they all
must finish before IW can flush).

Can you print the size of the documents you're indexing and see if
that correlates to when you see the memory growth?

Mike

On Tue, May 11, 2010 at 2:57 PM, Woolf, Ross <Ro...@bmc.com> wrote:
> Still working on some of the things you asked about, namely searching without indexing.  I need to modify our code, and the general indexing process takes 2 hours, so I won't have a quick turnaround on that.  We also have a hard time answering the question about whether items that are normal nonetheless use more than the 16 MB.  The heap dump does not allow us to quickly identify specifics on objects like we show in the images below, so we really don't know how much memory is used up by objects of this sort.  We only know that the byte[] total for all is 197891887.
>
> However, I have provided another image that breaks down the memory usage from the heap.  A big question we have: we talk about the 16 MB buffer, but is there other memory used by Lucene beyond that which we should expect to see?
>
> http://i39.tinypic.com/o0o560.jpg
>
> we have 197891887 used in byte[] (anyone we look at is related in some way to the index writer)
> we have 169263904 used in char[] (these are related to the index writer too)
> we have 72658944 used in FreqProxTermsWriter$PostingList
> we have 37722668 used in RawPostingList[]
>
> All of these are well over the 16 MB, so we are a little lost as to what we should expect to see when we look at the memory usage.
>
> I've attached the patch and the CheckIndex files.  Unfortunately, on the patch my editor made some whitespace changes, so you get a lot of extra items in the patch that are really nothing other than tab/space differences.
>
> If you are open to a live share again, then maybe you could look at this data quicker than the screen shots I send.
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Monday, May 10, 2010 2:27 AM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> Hmmmm...
>
> Your usage (searching for old doc & updating it, to add new fields) is fine.
>
> But: what memory usage do you see if you open a searcher, and search
> for all docs, but don't open an IndexWriter?  We need to tease apart
> the IndexReader vs IndexWriter memory usage you are seeing.  Also, can
> you post the output of CheckIndex (java
> org.apache.lucene.index.CheckIndex /path/to/index) of your fully built
> index?  That may give some hints about expected memory usage of IR (eg
> if # unique terms is large).
>
> More comments below:
>
> On Thu, May 6, 2010 at 1:03 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> Sorry to be so long in getting back on this. The patch you provided has improved the situation but we are still seeing some memory loss.  The following are some images from the heap dump.  I'll share with you what we are seeing now.
>>
>> This first image shows the memory pattern.  Our first commit takes place at about 3:54, when the steady upward trend takes a drop and a new cycle begins.  What we have found is the 2422 fix has made the memory in the first phase before the commit much better (and I'm sure throughout the entire run).  But as you can see, after the commit we again begin to lose memory.  One piece of info to know about what you are seeing: we have 5 threads pushing data to our Lucene plugin.  If we drop down to 1 thread then we are much more successful and can actually index all of our data without running out of memory, but at 5 threads it gets into trouble.  We still see an upward trend in memory usage, but not as severe as with multiple threads.
>> http://tinypic.com/view.php?pic=2w6bf68&s=5
>
> Can you post the output of "svn diff" on the 2.9 code base you're
> using?  I just want to look & verify all issues we've discussed are
> included in your changes.  The fact that 1 thread is fine and 5
> threads are not still sounds like a symptom of LUCENE-2283.
>
> Also, does that heap usage graph exclude garbage?  Or, alternatively,
> can you provoke an OOME w/ 512 MB heap and then capture the heap dump
> at that point?
>
>> There is another piece of the picture that I think might be coming into play.  We have plugged Lucene into a legacy app and are subject to how we can get it to deliver the data that we are indexing.  In some scenarios (like the one we are having this problem with) we are building our documents progressively (adding fields to the document through the process).  What you see before the first commit is the legacy system handing us the first field for many documents.  Once we have gotten all of "field 1" for all documents then we commit that data into the index.  Then the system starts feeding us "field 2."  So we perform a search to see if the document already exists (for the scenario you are seeing it does) and so it retrieves the original document (we store a document ID), adds the new field of data to the existing document, and we "update" the document in the index.  After the first commit, the rest of the process is one where a document already exists and so the new field is added and the document is updated.  It is in this process that we start rapidly losing memory.  The following images show some examples of common areas where memory is being held.
>>
>> http://tinypic.com/view.php?pic=11vkwnb&s=5
>
> This looks like "normal" memory usage of IndexWriter -- these are the
> recycled buffers used for holding stored fields.  However: the net RAM
> used by this allocation should not exceed your 16 MB IW ram buffer
> size -- does it?
>
>> http://tinypic.com/view.php?pic=abq9fp&s=5
>
> This one is the byte[] buffer used by CompoundFileReader, opened by
> IndexReader.  It's odd that you have so many of these (if I'm reading
> this correctly) -- are you certain all opened readers are being
> closed?  How many segments do you have in your index?  Or... are there
> many unique threads doing the searching?  EG do you create a new
> thread for every search or update?
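For what it's worth, one common way to end up with many live CompoundFileReader buffers is opening a fresh IndexReader per search and never closing it. A sketch of the 2.9-era reopen idiom that keeps a single live reader (variable names are illustrative):

```java
// Refresh the reader only when the index has changed; reopen() returns
// the same instance if nothing changed, so close the old reader only
// when a genuinely new one came back.
IndexReader newReader = reader.reopen();
if (newReader != reader) {
    reader.close();          // releases the old segments' buffers
    reader = newReader;
    searcher = new IndexSearcher(reader);
}
```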
>
>> http://tinypic.com/view.php?pic=25pskyp&s=5
>
> This one is also normal memory used by IndexWriter, but as above, the
> net RAM used by this allocation (summed w/ the above one) should not
> exceed your 16 MB IW ram buffer size.
>
>> As mentioned, we are subject to how we can have the legacy app feed us the data and so this is why we do it this way.  We treat this system as a real time system and at anytime the legacy system may send us a field that needs to be added or updated to a document.  So we search for the document and if found we either add or update a field if the field is already existing in the document.  So I started to wonder if a clue in this memory loss comes from the fact that we are retrieving an existing document and then adding to it and updating.
>>
>> Now, if we eliminate the updating and simply add each item as a new document (which we did just to test but won't be adequate for our running system), then we still see a slight trend upward in memory usage and the following images show that now most of the  memory is consumed in char[] rather than the byte[] we saw before.  We don't know if this is normal and expected, or if it is something to be concerned about as well.
>>
>> http://tinypic.com/view.php?pic=vfgkyt&s=5
>
> That memory usage is normal -- it's used by the in-RAM terms index of
> your opened IndexReader.  But I'd like to see the memory usage of
> simply opening your IndexReader and searching for documents to update,
> but not opening an IndexWriter at all.
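That reader-only baseline might look like this (a sketch against the 2.9 API; the index path is a placeholder):

```java
// Open a read-only reader, run the same kinds of lookups the indexing
// job would, and watch the heap -- with no IndexWriter in the process.
IndexReader reader = IndexReader.open(
    FSDirectory.open(new File("/path/to/index")), true /* read-only */);
IndexSearcher searcher = new IndexSearcher(reader);
try {
    TopDocs all = searcher.search(new MatchAllDocsQuery(), 10);
    System.out.println("hits: " + all.totalHits);
} finally {
    searcher.close();
    reader.close();
}
```

Whatever steady-state heap this shows is the IndexReader's share (terms index, CompoundFileReader buffers); anything above it in the full run is attributable to IndexWriter.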
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: IndexWriter and memory usage

Posted by "Woolf, Ross" <Ro...@BMC.com>.
Sorry to be so long in getting back on this. The patch you provided has improved the situation but we are still seeing some memory loss.  The following are some images from the heap dump.  I'll share with you what we are seeing now.

This first image shows the memory pattern.  Our first commit takes place at about 3:54, when the steady upward trend takes a drop and a new cycle begins.  What we have found is the 2422 fix has made the memory in the first phase before the commit much better (and I'm sure throughout the entire run).  But as you can see, after the commit we again begin to lose memory.  One piece of info to know about what you are seeing: we have 5 threads pushing data to our Lucene plugin.  If we drop down to 1 thread then we are much more successful and can actually index all of our data without running out of memory, but at 5 threads it gets into trouble.  We still see an upward trend in memory usage, but not as severe as with multiple threads.
http://tinypic.com/view.php?pic=2w6bf68&s=5

There is another piece of the picture that I think might be coming into play.  We have plugged Lucene into a legacy app and are subject to how we can get it to deliver the data that we are indexing.  In some scenarios (like the one we are having this problem with) we are building our documents progressively (adding fields to the document through the process).  What you see before the first commit is the legacy system handing us the first field for many documents.  Once we have gotten all of "field 1" for all documents then we commit that data into the index.  Then the system starts feeding us "field 2."  So we perform a search to see if the document already exists (for the scenario you are seeing it does) and so it retrieves the original document (we store a document ID), adds the new field of data to the existing document, and we "update" the document in the index.  After the first commit, the rest of the process is one where a document already exists and so the new field is added and the document is updated.  It is in this process that we start rapidly losing memory.  The following images show some examples of common areas where memory is being held.

http://tinypic.com/view.php?pic=11vkwnb&s=5
http://tinypic.com/view.php?pic=abq9fp&s=5
http://tinypic.com/view.php?pic=25pskyp&s=5

As mentioned, we are subject to how we can have the legacy app feed us the data and so this is why we do it this way.  We treat this system as a real time system and at anytime the legacy system may send us a field that needs to be added or updated to a document.  So we search for the document and if found we either add or update a field if the field is already existing in the document.  So I started to wonder if a clue in this memory loss comes from the fact that we are retrieving an existing document and then adding to it and updating.
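The retrieve-modify-update step described above looks roughly like this against the 2.9 API (a sketch; the "docId" key field, variable names, and field options are my assumptions, not taken from the thread):

```java
// Look up the existing document by its stored ID, add the new field,
// then atomically delete-and-re-add via updateDocument().
Term key = new Term("docId", id);
TopDocs hits = searcher.search(new TermQuery(key), 1);

Document doc;
if (hits.totalHits > 0) {
    doc = searcher.doc(hits.scoreDocs[0].doc);  // stored fields of old doc
} else {
    doc = new Document();
    doc.add(new Field("docId", id, Field.Store.YES,
                      Field.Index.NOT_ANALYZED));
}
doc.add(new Field(fieldName, fieldValue, Field.Store.YES,
                  Field.Index.ANALYZED));

// Deletes any document matching key, then adds doc -- one writer call.
writer.updateDocument(key, doc);
```

One caveat with this pattern: fields recovered from a stored document come back without their original index-time settings, so those settings have to be reapplied on each round trip if they matter.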

Now, if we eliminate the updating and simply add each item as a new document (which we did just to test but won't be adequate for our running system), then we still see a slight trend upward in memory usage and the following images show that now most of the  memory is consumed in char[] rather than the byte[] we saw before.  We don't know if this is normal and expected, or if it is something to be concerned about as well.

http://tinypic.com/view.php?pic=vfgkyt&s=5

Any thoughts?

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: Thursday, April 29, 2010 2:07 PM
To: java-user@lucene.apache.org
Subject: Re: IndexWriter and memory usage

OK I think you may be hitting this:

    https://issues.apache.org/jira/browse/LUCENE-2422

Since you have very large docs, the reuse that's done by
IndexInput/Output is tying up a lot of memory.

Ross can you try the patch I just attached on that issue (merge it w/
the other issues) and see if that fixes it?  Thanks.

Mike

On Thu, Apr 29, 2010 at 11:58 AM, Woolf, Ross <Ro...@bmc.com> wrote:
> I ported the patch to 2.9.2 dev but it did not seem to help.  Attached is my port of the patch.  This patch contains both 2283 and 2387, both of which I have applied in trying to resolve this issue.
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Tuesday, April 27, 2010 4:40 AM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> Oooh -- I suspect you are hitting this issue:
>
>    https://issues.apache.org/jira/browse/LUCENE-2283
>
> Your 3rd image ("fdt") jogged my memory on this one.  Can you try
> testing the trunk JAR from after that issue landed?  (Or, apply that
> patch against 3.0.x -- let me know if it does not apply cleanly and
> I'll try to back port it).
>
> But: it's spooky that you cannot repro this issue in your dev
> environment.  Are you matching the # thread and exact sequence of
> docs?
>
> Mike
>
> On Mon, Apr 26, 2010 at 4:14 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> We are still plagued by this issue.  I tried applying the patch mentioned but this did not resolve the issue.
>>
>> I once tried to attach images from the heap dump to send out to the group but the server removed them so I have posted the images on a public service with links this time.  I would appreciate someone looking at them to see if they provide any insight into what is occurring with this issue.
>>
>> When you follow the link click on the image and then once you see the image click on a link in the lower left hand corner that says "View Raw Image."  This will let you view the images at 100% resolution.
>>
>> This first image shows what we are seeing within VisualVM in regards to the memory.  As you can see, over time the memory gets consumed.  Finally we are at a point where there is no more memory available.
>> Graph
>> http://tinypic.com/view.php?pic=2ltk0h3&s=5
>>
>> This second image in VisualVM shows the classes sorted by size.  As you can see, about 70% of all memory is consumed in the bytes array.
>> Bytes
>> http://tinypic.com/view.php?pic=s10mqs&s=5
>>
>> This third image is where the real info is.  This is where one of the bytes is being examined and the option to go to the nearest GC root is chosen.  What you see here is what the majority of the bytes show if selected, so this one is representative of almost all.  As you can see, this one byte is associated with the index writer as you look at the chain of objects (and thus so too are all the other bytes that have not been released for GC).
>> Garbage Collection
>> http://tinypic.com/view.php?pic=5obalj&s=5
>>
>> I'm hoping that as you look at this that it might mean something to you or give you a clue as to what is holding on to all the memory.
>>
>> Now the mysterious thing in all of this is that our use of Lucene has been developed into a "plug-in" that we use within an application that we have.  If I just run JUnit tests around this plugin, indexing some of the same files that the actual application is indexing, I cannot ever get the memory loss in my dev environment.  Everything seems to work as expected.  However, once we are in our real situation, then we see this behavior.  Because of this I would expect that the problem lies with the application, but once we examine the heap dumps it goes back to showing that the consumed bytes are "owned" by the index writer process.  It makes no sense to me that we see this as we do, but nonetheless we do.  We see that the IndexWriter process is hanging onto a lot of data in byte arrays and it doesn't ever seem to release it.
>>
>> In addition, we would love to show this to someone via a webex if that would help in seeing what is going on.
>>
>> Please, any help appreciated and any suggestions on how to resolve or even troubleshoot.  I can provide an actual heap dump but it is 63mb in size (compressed) so we would need to work out some FTP where we can provide it if someone is willing to look at it in VisualVM (or any other profiling tool).
>>
>> BTW - If we open and close the index writer on a regular basis then we don't run into this problem.  It is only when we run continuously with an open index writer that we do see this problem (we altered the code to open/close the writer a lot, but this slows things down, so we don't want to run like this, but we wanted to test the behavior if we did so).
>>
>> Thanks,
>> Ross
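A middle ground between the extremes tried in this thread (committing per document, committing only at the end, or repeatedly closing the writer) is to commit every N documents. A small, hypothetical helper for that policy; the class name and batch size are arbitrary assumptions, not from the thread:

```java
// Track documents since the last commit; the caller invokes
// writer.commit() whenever docAdded() returns true, so flushed state
// becomes reclaimable without paying a commit per document.
public class BatchCommitPolicy {
    private final int batchSize;
    private int sinceCommit = 0;

    public BatchCommitPolicy(int batchSize) {
        this.batchSize = batchSize;
    }

    // Call after each addDocument()/updateDocument().
    public boolean docAdded() {
        if (++sinceCommit >= batchSize) {
            sinceCommit = 0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        BatchCommitPolicy policy = new BatchCommitPolicy(3);
        for (int i = 1; i <= 6; i++) {
            System.out.println("doc " + i + " commit=" + policy.docAdded());
        }
    }
}
```

Tuning the batch size trades indexing throughput against the maximum heap held between commits.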
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Sent: Wednesday, April 14, 2010 2:52 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> Run this:
>>
>>    svn co https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9
>> lucene.29x
>>
>> Then apply the patch, then, run "ant jar-core", and in that should
>> create the lucene-core-2.9.2-dev.jar.
>>
>> Mike
>>
>> On Wed, Apr 14, 2010 at 1:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>> How do I get to the 2.9.x branch?  Every link I take from the Lucene site takes me to the trunk which I assume is the 3.x version.  I've tried to look around svn but can't find anything labeled 2.9.x.  Is there a daily build of 2.9.x or do I need to build it myself.  I would like to try out the fix you put into it, but I'm not sure where I get it from.
>>>
>>> -----Original Message-----
>>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>>> Sent: Wednesday, April 14, 2010 4:12 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: IndexWriter and memory usage
>>>
>>> It looks like the mailing list software stripped your image attachments...
>>>
>>> Alas these fixes are only committed on 3.1.
>>>
>>> But I just posted the patch on LUCENE-2387 for 2.9.x -- it's a tiny
>>> fix.  I think the other issue was part of LUCENE-2074 (though this
>>> issue included many other changes) -- Uwe can you peel out just a
>>> 2.9.x patch for resetting JFlex's zzBuffer?
>>>
>>> You could also try switching analyzers (eg to WhitespaceAnalyzer) to
>>> see if in fact LUCENE-2074 (which affects StandandAnalyzer, since it
>>> uses JFlex) is [part of] your problem.
>>>
>>> Mike
>>>
>>> On Tue, Apr 13, 2010 at 6:42 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>> Since the heap dump was so big and can't be attached, I have taken a few screen shots from Java VisualVM of the heap dump.  In the first image you can see that at the time our memory has become very tight most of it is held up in bytes.  In the second image I examine one of those instances and navigate to the nearest garbage collection root.  In looking at very many of these objects, they all end up being instantiated through the IndexWriter process.
>>>>
>>>> This heap dump is the same one correlating to the infoStream that was attached in a prior message.  So while the infoStream shows the buffer being flushed, what we experience is that our memory gets consumed over time by these bytes in the IndexWriter.
>>
>>>>
>>>> I wanted to provide these images to see if they might correlate to the fixes mentioned below.  Hopefully those fixes mentioned below have rectified this problem.  And as I state in the prior message, I'm hoping these fixes are in a 2.9x branch and hoping for someone to point me to where I can get those fixes to try out.
>>>>
>>>> Thanks
>>>>
>>>> -----Original Message-----
>>>> From: Woolf, Ross [mailto:Ross_Woolf@BMC.com]
>>>> Sent: Tuesday, April 13, 2010 1:29 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: RE: IndexWriter and memory usage
>>>>
>>>> Are these fixes in 2.9x branch?  We are using 2.9x and can't move to 3x just yet.  If so, where do I specifically pick this up from?
>>>>
>>>> -----Original Message-----
>>>> From: Lance Norskog [mailto:goksron@gmail.com]
>>>> Sent: Monday, April 12, 2010 10:20 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: IndexWriter and memory usage
>>>>
>>>> There are some bugs where the writer data structures retain data after
>>>> it is flushed.  The fixes were committed maybe within the past week.  If you
>>>> can pull the trunk and try it with your use case, that would be great.
>>>>
>>>> On Mon, Apr 12, 2010 at 8:54 AM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>>> I was on vacation last week so just getting back to this...  Here is the info stream (as an attachment).  I'll see what I can do about reducing the heap dump (It was supplied by a colleague).
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>>>>> Sent: Saturday, April 03, 2010 3:39 AM
>>>>> To: java-user@lucene.apache.org
>>>>> Subject: Re: IndexWriter and memory usage
>>>>>
>>>>> Hmm why is the heap dump so immense?  Normally it contains the top N
>>>>> (eg 100) object types and their count/aggregate RAM usage.
>>>>>
>>>>> Can you attach the infoStream output to an email (to java-user)?
>>>>>
>>>>> Mike
>>>>>
>>>>> On Fri, Apr 2, 2010 at 5:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>>>> I have this and the heap dump is 63mb zipped.  The info stream is much smaller (31 kb zipped), but I don't know how to get them to you.
>>>>>>
>>>>>> We are not using the NRT readers
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>>>>>> Sent: Thursday, April 01, 2010 5:21 PM
>>>>>> To: java-user@lucene.apache.org
>>>>>> Subject: Re: IndexWriter and memory usage
>>>>>>
>>>>>> Hmm, not good.  Can you post a heap dump?  Also, can you turn on
>>>>>> infoStream, index up to the OOM @ 512 MB, and post the output?
>>>>>>
>>>>>> IndexWriter should not hang onto much beyond the RAM buffer.  But, it
>>>>>> does allocate and then recycle this RAM buffer, so even in an idle
>>>>>> state (having indexed enough docs to fill up the RAM buffer at least
>>>>>> once) it'll hold onto those 16 MB.
>>>>>>
>>>>>> Are you using getReader (to get your NRT readers)?  If so, are you
>>>>>> really sure you're eventually closing the previous reader after
>>>>>> opening a new one?
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>>>>> We are seeing a situation where the IndexWriter uses up the Java heap space and only releases memory for garbage collection upon a commit.  We are using the default RAMBufferSize of 16 MB and Lucene 2.9.1, with a heap size of 512 MB.
>>>>>>>
>>>>>>> We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is converted to a string and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to perform a commit only once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is not able to clean up any memory once we get into this state.  It is our understanding that the IndexWriter should only be holding onto 16 MB of data before it flushes, but what we are seeing is that while it does write data to disk when it hits the 16 MB limit, it also holds onto some data in memory that garbage collection cannot reclaim, and this continues until garbage collection is unable to free up enough space to allow things to move faster than a crawl.
>>>>>>>
>>>>>>> As a test we caused a commit to occur after each document is indexed, and we see the total amount of memory reduced from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that the memory is cleaned up to some extent after each document.  But of course this completely defeats the reason we want to commit only at the end of the run, for performance's sake.  Most of the data, as seen in heap analysis, is held in Byte, Character, and Integer classes whose GC roots are tied back to the writer objects and threads.  The instance counts, after running just 1,100 documents, seem staggering.
>>>>>>>
>>>>>>> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>>>>>>>
>>>>>>> A side question to this is the fact that we always see a large amount of memory used by the IndexWriter even after our indexing has completed and all commits have taken place (basically in an idle state).  Why would this be?  Is the only way to totally clean up the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>>>>>>>
>>>>>>> Any help in understanding why the IndexWriter is maxing out our heap space or what is expected from memory usage of the IndexWriter would be appreciated.
>>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>> --
>>>> Lance Norskog
>>>> goksron@gmail.com


Re: IndexWriter and memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
OK I think you may be hitting this:

    https://issues.apache.org/jira/browse/LUCENE-2422

Since you have very large docs, the reuse that's done by
IndexInput/Output is tying up a lot of memory.

Ross can you try the patch I just attached on that issue (merge it w/
the other issues) and see if that fixes it?  Thanks.

Mike
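[Editor's note: the reuse pattern described above can be illustrated with a self-contained toy. This is not Lucene's actual code; it is a sketch of how a recycled, grow-only internal buffer pins the memory of the largest document ever indexed until the writer is closed.]

```java
// Toy illustration (not Lucene's real classes) of the retention pattern:
// a writer that recycles an internal buffer grows it to fit the largest
// document ever indexed and never shrinks it, so one huge document pins
// that memory even after many small documents follow.
public class RecyclingWriterDemo {
    static class RecyclingWriter {
        private byte[] buffer = new byte[16];   // recycled across documents

        void addDocument(byte[] doc) {
            if (doc.length > buffer.length) {
                buffer = new byte[doc.length];  // grow, but never shrink
            }
            System.arraycopy(doc, 0, buffer, 0, doc.length);
            // ... a real writer would flush to disk here; the buffer is
            // kept for reuse by the next document.
        }

        int retainedBytes() {
            return buffer.length;               // memory held even when idle
        }
    }

    public static void main(String[] args) {
        RecyclingWriter writer = new RecyclingWriter();
        writer.addDocument(new byte[100]);          // small doc
        writer.addDocument(new byte[10_000_000]);   // one very large doc
        writer.addDocument(new byte[100]);          // small again
        // The ~10 MB buffer is still retained after the large doc is gone:
        System.out.println(writer.retainedBytes()); // prints 10000000
    }
}
```

This matches the symptom in the heap dumps: the byte arrays are rooted at the writer, yet the writer is behaving "correctly" by its own design, because recycling trades GC pressure for retained capacity.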

On Thu, Apr 29, 2010 at 11:58 AM, Woolf, Ross <Ro...@bmc.com> wrote:
> I ported the patch to 2.9.2 dev but it did not seem to help.  Attached is my port of the patch.  This patch contains both 2283 and 2387, both of which I have applied in trying to resolve this issue.
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Tuesday, April 27, 2010 4:40 AM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> Oooh -- I suspect you are hitting this issue:
>
>    https://issues.apache.org/jira/browse/LUCENE-2283
>
> Your 3rd image ("fdt") jogged my memory on this one.  Can you try
> testing the trunk JAR from after that issue landed?  (Or, apply that
> patch against 3.0.x -- let me know if it does not apply cleanly and
> I'll try to back port it).
>
> But: it's spooky that you cannot repro this issue in your dev
> environment.  Are you matching the number of threads and the exact
> sequence of docs?
>
> Mike
>
> On Mon, Apr 26, 2010 at 4:14 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> We are still plagued by this issue.  I tried applying the patch mentioned but this did not resolve the issue.
>>
>> I once tried to attach images from the heap dump to send out to the group but the server removed them so I have posted the images on a public service with links this time.  I would appreciate someone looking at them to see if they provide any insight into what is occurring with this issue.
>>
>> When you follow the link click on the image and then once you see the image click on a link in the lower left hand corner that says "View Raw Image."  This will let you view the images at 100% resolution.
>>
>> This first image shows what we are seeing within VisualVM in regards to the memory.  As you can see, over time the memory gets consumed.  Finally we are at a point where there is no more memory available.
>> Graph
>> http://tinypic.com/view.php?pic=2ltk0h3&s=5
>>
>> This second image in VisualVM shows the classes sorted by size.  As you can see, about 70% of all memory is consumed by byte arrays.
>> Bytes
>> http://tinypic.com/view.php?pic=s10mqs&s=5
>>
>> This third image is where the real info is.  This is where one of the bytes is being examined and the option to go to nearest GC is chosen.  What you see here is what the majority of the bytes show if selected, so this one is representative of most all.  As you can see this one byte is associated with the index writer as you look at the chain of objects (and thus so too are all the other bytes that have not been released for GC).
>> Garbage Collection
>> http://tinypic.com/view.php?pic=5obalj&s=5
>>
>> I'm hoping that as you look at this that it might mean something to you or give you a clue as to what is holding on to all the memory.
>>
>> Now the mysterious thing in all of this is that our use of Lucene has been developed into a "plug-in" that we use within an application that we have.  If I just run JUnit tests around this plugin, indexing some of the same files that the actual application is indexing, I can never reproduce the memory loss in my dev environment.  Everything seems to work as expected.  However, once we are in our real situation, we see this behavior.  Because of this I would expect that the problem lies with the application, but once we examine the heap dumps it goes back to showing that the consumed bytes are "owned" by the index writer process.  It makes no sense to me, but nonetheless it is what we see: the IndexWriter process is hanging onto a lot of data in byte arrays and never seems to release it.
>>
>> In addition, we would love to show this to someone via a webex if that would help in seeing what is going on.
>>
>> Please, any help appreciated and any suggestions on how to resolve or even troubleshoot.  I can provide an actual heap dump but it is 63mb in size (compressed) so we would need to work out some FTP where we can provide it if someone is willing to look at it in VisualVM (or any other profiling tool).
>>
>> BTW - If we open and close the index writer on a regular basis then we don't run into this problem.  It is only when we run continuously with an open index writer that we do see this problem (we altered the code to open/close the writer a lot, but this slows things down, so we don't want to run like this, but we wanted to test the behavior if we did so).
>>
>> Thanks,
>> Ross
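[Editor's note: the "BTW" workaround above — periodically closing and reopening the writer to release retained buffers — can be sketched roughly as below. This is a hedged sketch from the Lucene 2.9-era API, not code from this thread; verify the constructor and method signatures against your Lucene version, and tune `DOCS_PER_CYCLE` for your heap.]

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Sketch of the periodic close/reopen workaround (Lucene 2.9-era API).
// Closing the writer releases its recycled internal buffers, trading
// indexing throughput for bounded memory use.
public class PeriodicReopenIndexer {
    private static final int DOCS_PER_CYCLE = 1000; // hypothetical tuning knob

    public static void index(File indexDir, Iterable<Document> docs) throws Exception {
        IndexWriter writer = open(indexDir);
        int count = 0;
        try {
            for (Document doc : docs) {
                writer.addDocument(doc);
                if (++count % DOCS_PER_CYCLE == 0) {
                    writer.close();          // frees retained buffers (and commits)
                    writer = open(indexDir); // slower, but memory stays bounded
                }
            }
        } finally {
            writer.close(); // final close also commits
        }
    }

    private static IndexWriter open(File indexDir) throws Exception {
        return new IndexWriter(FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
    }
}
```

As the thread notes, this sacrifices the performance of keeping a single writer open, so it is a mitigation rather than a fix for the retention bugs discussed above.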
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Sent: Wednesday, April 14, 2010 2:52 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> Run this:
>>
>>    svn co https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9 lucene.29x
>>
>> Then apply the patch, then run "ant jar-core", which should
>> create lucene-core-2.9.2-dev.jar.
>>
>> Mike
>>
>> On Wed, Apr 14, 2010 at 1:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>> How do I get to the 2.9.x branch?  Every link I take from the Lucene site takes me to the trunk, which I assume is the 3.x version.  I've tried to look around svn but can't find anything labeled 2.9.x.  Is there a daily build of 2.9.x, or do I need to build it myself?  I would like to try out the fix you put into it, but I'm not sure where to get it.
>>>
>>> -----Original Message-----
>>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>>> Sent: Wednesday, April 14, 2010 4:12 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: IndexWriter and memory usage
>>>
>>> It looks like the mailing list software stripped your image attachments...
>>>
>>> Alas these fixes are only committed on 3.1.
>>>
>>> But I just posted the patch on LUCENE-2387 for 2.9.x -- it's a tiny
>>> fix.  I think the other issue was part of LUCENE-2074 (though this
>>> issue included many other changes) -- Uwe can you peel out just a
>>> 2.9.x patch for resetting JFlex's zzBuffer?
>>>
>>> You could also try switching analyzers (eg to WhitespaceAnalyzer) to
>>> see if in fact LUCENE-2074 (which affects StandardAnalyzer, since it
>>> uses JFlex) is [part of] your problem.
>>>
>>> Mike


RE: IndexWriter and memory usage

Posted by "Woolf, Ross" <Ro...@BMC.com>.
I ported the patch to 2.9.2 dev but it did not seem to help.  Attached is my port of the patch.  This patch contains both 2283 and 2387, both of which I have applied in trying to resolve this issue.

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: Tuesday, April 27, 2010 4:40 AM
To: java-user@lucene.apache.org
Subject: Re: IndexWriter and memory usage

Oooh -- I suspect you are hitting this issue:

    https://issues.apache.org/jira/browse/LUCENE-2283

Your 3rd image ("fdt") jogged my memory on this one.  Can you try
testing the trunk JAR from after that issue landed?  (Or, apply that
patch against 3.0.x -- let me know if it does not apply cleanly and
I'll try to back port it).

But: it's spooky that you cannot repro this issue in your dev
environment.  Are you matching the # thread and exact sequence of
docs?

Mike

On Mon, Apr 26, 2010 at 4:14 PM, Woolf, Ross <Ro...@bmc.com> wrote:
> We are still plagued by this issue.  I tried applying the patch mentioned but this did not resolve the issue.
>
> I once tried to attach images from the heap dump to send out to the group but the server removed them so I have posted the images on a public service with links this time.  I would appreciate someone looking at them to see if they provide any insight into what is occurring with this issue.
>
> When you follow the link click on the image and then once you see the image click on a link in the lower left hand corner that says "View Raw Image."  This will let you view the images at 100% resolution.
>
> This first image shows what we are seeing within VisualVM in regards to the memory.  As you can see, over time the memory gets consumed.  Finally we are at a point where there is no more memory available.
> Graph
> http://tinypic.com/view.php?pic=2ltk0h3&s=5
>
> This second image in VisualVM shows the classes sorted by size.  As you can see, about 70% of all memory is consumed in the bytes array.
> Bytes
> http://tinypic.com/view.php?pic=s10mqs&s=5
>
> This third image is where the real info is.  This is where one of the bytes is being examined and the option to go to nearest GC is chosen.  What you see here is what the majority of the bytes show if selected, so this one is representative of most all.  As you can see this one byte is associated with the index writer as you look at the chain of objects (and thus so too are all the other bytes that have not been released for GC).
> Garbage Collection
> http://tinypic.com/view.php?pic=5obalj&s=5
>
> I'm hoping that as you look at this that it might mean something to you or give you a clue as to what is holding on to all the memory.
>
> Now the mysterious thing in all of this is that our use of Lucene has been developed into a "plug-in" that we use within an application that we have.  If I just run JUnit tests around this plugin, indexing some of the same files that the actual application is indexing, I cannot ever get the memory loss in my dev environment.  Everything seems to work as expected.  However, once we are in our real situation, then we see this behavior.  Because of this I would expect that the problem lays with the application, but once we examine the heap dumps it then goes back to showing that the consumed bytes are "owned" by the index writer process.  It makes no sense to me that we see this as we do, but none the less we do.  We see that the Index Writer process is hanging onto a lot of data in byte arrays and it doesn't ever seam to release it.
>
> In addition, we would love to show this to someone via a webex if that would help in seeing what is going on.
>
> Please, any help appreciated and any suggestions on how to resolve or even troubleshoot.  I can provide an actual heap dump but it is 63mb in size (compressed) so we would need to work out some FTP where we can provide it if someone is willing to look at it in VisualVM (or any other profiling tool).
>
> BTW - If we open and close the index writer on a regular basis then we don't run into this problem.  It is only when we run continuously with an open index writer that we do see this problem (we altered the code to open/close the writer a lot, but this slows things down, so we don't want to run like this, but we wanted to test the behavior if we did so).
>
> Thanks,
> Ross
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Wednesday, April 14, 2010 2:52 PM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> Run this:
>
>    svn co https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9
> lucene.29x
>
> Then apply the patch, then, run "ant jar-core", and in that should
> create the lucene-core-2.9.2-dev.jar.
>
> Mike
>
> On Wed, Apr 14, 2010 at 1:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> How do I get to the 2.9.x branch?  Every link I take from the Lucene site takes me to the trunk which I assume is the 3.x version.  I've tried to look around svn but can't find anything labeled 2.9.x.  Is there a daily build of 2.9.x or do I need to build it myself.  I would like to try out the fix you put into it, but I'm not sure where I get it from.
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Sent: Wednesday, April 14, 2010 4:12 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> It looks like the mailing list software stripped your image attachments...
>>
>> Alas these fixes are only committed on 3.1.
>>
>> But I just posted the patch on LUCENE-2387 for 2.9.x -- it's a tiny
>> fix.  I think the other issue was part of LUCENE-2074 (though this
>> issue included many other changes) -- Uwe can you peel out just a
>> 2.9.x patch for resetting JFlex's zzBuffer?
>>
>> You could also try switching analyzers (eg to WhitespaceAnalyzer) to
>> see if in fact LUCENE-2074 (which affects StandandAnalyzer, since it
>> uses JFlex) is [part of] your problem.
>>
>> Mike
>>
>> On Tue, Apr 13, 2010 at 6:42 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>> Since the heap dump was so big and can't be attached, I have taken a few screen shots from Java VisualVM of the heap dump.  In the first image you can see that at the time our memory has become very tight most of it is held up in bytes.  In the second image I examine one of those instances and navigate to the nearest garbage collection root.  In looking at very many of these objects, they all end up being instantiated through the IndexWriter process.
>>>
>>> This heap dump is the same one correlating to the infoStream that was attached in a prior message.  So while the infoStream shows the buffer being flushed, what we experience is that our memory gets consumed over time by these bytes in the IndexWriter.
>
>>>
>>> I wanted to provide these images to see if they might correlate to the fixes mentioned below.  Hopefully those fixes mentioned below have rectified this problem.  And as I state in the prior message, I'm hoping these fixes are in a 2.9x branch and hoping for someone to point me to where I can get those fixes to try out.
>>>
>>> Thanks
>>>
>>> -----Original Message-----
>>> From: Woolf, Ross [mailto:Ross_Woolf@BMC.com]
>>> Sent: Tuesday, April 13, 2010 1:29 PM
>>> To: java-user@lucene.apache.org
>>> Subject: RE: IndexWriter and memory usage
>>>
>>> Are these fixes in 2.9x branch?  We are using 2.9x and can't move to 3x just yet.  If so, where do I specifically pick this up from?
>>>
>>> -----Original Message-----
>>> From: Lance Norskog [mailto:goksron@gmail.com]
>>> Sent: Monday, April 12, 2010 10:20 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: IndexWriter and memory usage
>>>
>>> There is some bugs where the writer data structures retain data after
>>> it is flushed. They are committed as of maybe the past week. If you
>>> can pull the trunk and try it with your use case, that would be great.
>>>
>>> On Mon, Apr 12, 2010 at 8:54 AM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>> I was on vacation last week so just getting back to this...  Here is the info stream (as an attachment).  I'll see what I can do about reducing the heap dump (It was supplied by a colleague).
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>>>> Sent: Saturday, April 03, 2010 3:39 AM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: IndexWriter and memory usage
>>>>
>>>> Hmm why is the heap dump so immense?  Normally it contains the top N
>>>> (eg 100) object types and their count/aggregate RAM usage.
>>>>
>>>> Can you attach the infoStream output to an email (to java-user)?
>>>>
>>>> Mike
>>>>
>>>> On Fri, Apr 2, 2010 at 5:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>>> I have this, and the heap dump is 63 MB zipped.  The info stream is much smaller (31 KB zipped), but I don't know how to get them to you.
>>>>>
>>>>> We are not using the NRT readers
>>>>>
>>>>> -----Original Message-----
>>>>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>>>>> Sent: Thursday, April 01, 2010 5:21 PM
>>>>> To: java-user@lucene.apache.org
>>>>> Subject: Re: IndexWriter and memory usage
>>>>>
>>>>> Hmm, not good.  Can you post a heap dump?  Also, can you turn on
>>>>> infoStream, index up to the OOM @ 512 MB, and post the output?
>>>>>
>>>>> IndexWriter should not hang onto much beyond the RAM buffer.  But, it
>>>>> does allocate and then recycle this RAM buffer, so even in an idle
>>>>> state (having indexed enough docs to fill up the RAM buffer at least
>>>>> once) it'll hold onto those 16 MB.
>>>>>
>>>>> Are you using getReader (to get your NRT readers)?  If so, are you
>>>>> really sure you're eventually closing the previous reader after
>>>>> opening a new one?
>>>>>
>>>>> Mike
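[Editor's note: a minimal sketch of turning on infoStream against the Lucene 2.9 API, as suggested above.  The index path and log file name are placeholders, not from the thread.]

```java
import java.io.File;
import java.io.PrintStream;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class InfoStreamSketch {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/tmp/test-index")),   // placeholder path
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.setRAMBufferSizeMB(16.0);                         // the default flush trigger discussed here
        writer.setInfoStream(new PrintStream("infostream.log")); // logs each flush/merge for diagnosis
        // ... index documents up to the OOM point, then inspect infostream.log ...
        writer.close();
    }
}
```

The infoStream output records every RAM buffer flush, so it can be lined up against the heap graph from VisualVM.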
>>>>>
>>>>> On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>>>> We are seeing a situation where the IndexWriter is using up the Java heap space and only releases memory for garbage collection upon a commit.  We are using the default RAMBufferSize of 16 MB.  We are using Lucene 2.9.1, with a heap size of 512 MB.
>>>>>>
>>>>>> We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is converted to a string and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to perform a commit only once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is not able to clean up any memory once we get into this state.  It is our understanding that the IndexWriter should only be holding onto 16 MB of data before it flushes it, but what we are seeing is that while it is in fact writing data to disk when it hits the 16 MB limit, it is also holding onto some data in memory and not allowing garbage collection to take place, and this continues until garbage collection is unable to free up enough space to allow things to move faster than a crawl.
>>>>>>
>>>>>> As a test we caused a commit to occur after each document is indexed, and we see the total amount of memory reduced from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that the memory is cleaned up to some extent after each document.  But of course this completely defeats the whole reason why we want to commit only at the end of the run, for performance's sake.  Most of the data, as seen using heap analysis, is held in Byte, Character, and Integer classes whose GC roots are tied back to the writer objects and threads.  The instance counts, after running just 1,100 documents, seem staggering.
>>>>>>
>>>>>> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>>>>>>
>>>>>> A side question to this is the fact that we always see a large amount of memory used by the IndexWriter even after our indexing has completed and all commits have taken place (basically an idle state).  Why would this be?  Is the only way to totally clean up the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>>>>>>
>>>>>> Any help in understanding why the IndexWriter is maxing out our heap space or what is expected from memory usage of the IndexWriter would be appreciated.
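[Editor's note: a hedged sketch of the middle ground between the two extremes described above (commit per document vs. a single commit at the end): committing every N documents bounds retained state without paying a per-document commit.  The interval, field names, and paths are illustrative, not from the thread; the API calls are from the Lucene 2.9 javadocs.]

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BatchedCommitSketch {
    private static final int COMMIT_INTERVAL = 500;  // illustrative batch size

    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/tmp/test-index")),   // placeholder path
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
        int docsSinceCommit = 0;
        for (String text : new String[] { /* extracted Tika text here */ }) {
            Document doc = new Document();
            doc.add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            if (++docsSinceCommit >= COMMIT_INTERVAL) {
                writer.commit();        // makes the batch durable and lets buffers go to GC
                docsSinceCommit = 0;
            }
        }
        writer.commit();                // final commit for the tail of the run
        writer.close();
    }
}
```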
>>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>> --
>>> Lance Norskog
>>> goksron@gmail.com


Re: IndexWriter and memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
Oooh -- I suspect you are hitting this issue:

    https://issues.apache.org/jira/browse/LUCENE-2283

Your 3rd image ("fdt") jogged my memory on this one.  Can you try
testing the trunk JAR from after that issue landed?  (Or, apply that
patch against 3.0.x -- let me know if it does not apply cleanly and
I'll try to back port it).

But: it's spooky that you cannot repro this issue in your dev
environment.  Are you matching the # of threads and the exact sequence of
docs?

Mike

On Mon, Apr 26, 2010 at 4:14 PM, Woolf, Ross <Ro...@bmc.com> wrote:
> We are still plagued by this issue.  I tried applying the patch mentioned but this did not resolve the issue.
>
> I once tried to attach images from the heap dump to send out to the group but the server removed them so I have posted the images on a public service with links this time.  I would appreciate someone looking at them to see if they provide any insight into what is occurring with this issue.
>
> When you follow the link, click on the image; once you see it, click the "View Raw Image" link in the lower left-hand corner.  This lets you view the images at 100% resolution.
>
> This first image shows what we are seeing within VisualVM with regard to memory.  As you can see, the memory gets consumed over time, until finally we reach a point where there is no more memory available.
> Graph
> http://tinypic.com/view.php?pic=2ltk0h3&s=5
>
> This second image in VisualVM shows the classes sorted by size.  As you can see, about 70% of all memory is consumed by byte arrays.
> Bytes
> http://tinypic.com/view.php?pic=s10mqs&s=5
>
> This third image is where the real info is.  Here one of the byte arrays is examined and the option to go to the nearest GC root is chosen.  What you see is what the majority of the byte arrays show when selected, so this one is representative of almost all of them.  As you can see, this byte array is associated with the IndexWriter, as the chain of objects shows (and thus so too are all the other byte arrays that have not been released for GC).
> Garbage Collection
> http://tinypic.com/view.php?pic=5obalj&s=5
>
> I'm hoping that as you look at this that it might mean something to you or give you a clue as to what is holding on to all the memory.
>
> Now the mysterious thing in all of this is that our use of Lucene has been developed into a "plug-in" that we use within an application of ours.  If I just run JUnit tests around this plug-in, indexing some of the same files the actual application is indexing, I can never reproduce the memory loss in my dev environment.  Everything seems to work as expected.  However, once we are in our real situation, we see this behavior.  Because of this I would expect the problem to lie with the application, but once we examine the heap dumps it goes back to showing that the consumed bytes are "owned" by the IndexWriter process.  It makes no sense to me that we see this as we do, but nonetheless we do.  We see that the IndexWriter process is hanging onto a lot of data in byte arrays and never seems to release it.
>
> In addition, we would love to show this to someone via a webex if that would help in seeing what is going on.
>
> Please, any help is appreciated, as are any suggestions on how to resolve or even troubleshoot this.  I can provide an actual heap dump, but it is 63 MB in size (compressed), so we would need to work out some FTP arrangement if someone is willing to look at it in VisualVM (or any other profiling tool).
>
> BTW - If we open and close the index writer on a regular basis then we don't run into this problem.  It is only when we run continuously with an open index writer that we see it.  (We altered the code to open/close the writer frequently; this slows things down, so we don't want to run that way, but we wanted to test the behavior.)
>
> Thanks,
> Ross
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Wednesday, April 14, 2010 2:52 PM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> Run this:
>
>    svn co https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9
> lucene.29x
>
> Then apply the patch and run "ant jar-core"; that should
> create lucene-core-2.9.2-dev.jar.
>
> Mike
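[Editor's note: put together, the steps above read roughly as follows.  The patch filename is a placeholder for whatever attachment you download from the JIRA issue.]

```shell
# Check out the 2.9 branch and build a patched core jar (sketch).
svn co https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9 lucene.29x
cd lucene.29x
patch -p0 < LUCENE-2387.patch   # placeholder name for the patch from JIRA
ant jar-core                    # should produce lucene-core-2.9.2-dev.jar
```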
>
> On Wed, Apr 14, 2010 at 1:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> How do I get to the 2.9.x branch?  Every link I take from the Lucene site takes me to the trunk, which I assume is the 3.x version.  I've tried to look around svn but can't find anything labeled 2.9.x.  Is there a daily build of 2.9.x, or do I need to build it myself?  I would like to try out the fix you put into it, but I'm not sure where to get it.
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Sent: Wednesday, April 14, 2010 4:12 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> It looks like the mailing list software stripped your image attachments...
>>
>> Alas these fixes are only committed on 3.1.
>>
>> But I just posted the patch on LUCENE-2387 for 2.9.x -- it's a tiny
>> fix.  I think the other issue was part of LUCENE-2074 (though this
>> issue included many other changes) -- Uwe can you peel out just a
>> 2.9.x patch for resetting JFlex's zzBuffer?
>>
>> You could also try switching analyzers (eg to WhitespaceAnalyzer) to
> see if in fact LUCENE-2074 (which affects StandardAnalyzer, since it
>> uses JFlex) is [part of] your problem.
>>
>> Mike
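[Editor's note: a sketch of that analyzer-swap experiment under the 2.9 API.  Everything except the analyzer swap itself (paths, workload) is a placeholder.]

```java
import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class AnalyzerSwapSketch {
    public static void main(String[] args) throws Exception {
        // WhitespaceAnalyzer does not use JFlex, so if the memory growth
        // disappears with it, LUCENE-2074 (StandardTokenizer's zzBuffer)
        // is implicated.
        Analyzer analyzer = new WhitespaceAnalyzer();   // in place of StandardAnalyzer
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/tmp/test-index")),  // placeholder path
                analyzer,
                IndexWriter.MaxFieldLength.UNLIMITED);
        // ... run the same indexing workload and watch heap usage ...
        writer.close();
    }
}
```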


RE: IndexWriter and memory usage

Posted by "Woolf, Ross" <Ro...@BMC.com>.
We are still plagued by this issue.  I tried applying the patch mentioned but this did not resolve the issue.  

I once tried to attach images from the heap dump to send out to the group but the server removed them so I have posted the images on a public service with links this time.  I would appreciate someone looking at them to see if they provide any insight into what is occurring with this issue.

When you follow the link, click on the image; once you see it, click the "View Raw Image" link in the lower left-hand corner.  This lets you view the images at 100% resolution.

This first image shows what we are seeing within VisualVM with regard to memory.  As you can see, the memory gets consumed over time, until finally we reach a point where there is no more memory available.
Graph
http://tinypic.com/view.php?pic=2ltk0h3&s=5

This second image in VisualVM shows the classes sorted by size.  As you can see, about 70% of all memory is consumed by byte arrays.
Bytes
http://tinypic.com/view.php?pic=s10mqs&s=5

This third image is where the real info is.  Here one of the byte arrays is examined and the option to go to the nearest GC root is chosen.  What you see is what the majority of the byte arrays show when selected, so this one is representative of almost all of them.  As you can see, this byte array is associated with the IndexWriter, as the chain of objects shows (and thus so too are all the other byte arrays that have not been released for GC).
Garbage Collection
http://tinypic.com/view.php?pic=5obalj&s=5

I'm hoping that as you look at this that it might mean something to you or give you a clue as to what is holding on to all the memory.

Now the mysterious thing in all of this is that our use of Lucene has been developed into a "plug-in" that we use within an application of ours.  If I just run JUnit tests around this plug-in, indexing some of the same files the actual application is indexing, I can never reproduce the memory loss in my dev environment.  Everything seems to work as expected.  However, once we are in our real situation, we see this behavior.  Because of this I would expect the problem to lie with the application, but once we examine the heap dumps it goes back to showing that the consumed bytes are "owned" by the IndexWriter process.  It makes no sense to me that we see this as we do, but nonetheless we do.  We see that the IndexWriter process is hanging onto a lot of data in byte arrays and never seems to release it.

In addition, we would love to show this to someone via a webex if that would help in seeing what is going on.  

Please, any help is appreciated, as are any suggestions on how to resolve or even troubleshoot this.  I can provide an actual heap dump, but it is 63 MB in size (compressed), so we would need to work out some FTP arrangement if someone is willing to look at it in VisualVM (or any other profiling tool).

BTW - If we open and close the index writer on a regular basis then we don't run into this problem.  It is only when we run continuously with an open index writer that we see it.  (We altered the code to open/close the writer frequently; this slows things down, so we don't want to run that way, but we wanted to test the behavior.)
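[Editor's note: a stopgap along those lines is to amortize the close/reopen cycle, recycling the writer every N documents rather than keeping it open forever or cycling it per document.  The interval, path, and helper names are illustrative, not from the thread.]

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WriterRecycleSketch {
    private static final int RECYCLE_INTERVAL = 10000;  // illustrative

    private IndexWriter writer;
    private int docsSinceOpen = 0;

    private IndexWriter open() throws Exception {
        return new IndexWriter(
                FSDirectory.open(new File("/tmp/test-index")),  // placeholder path
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    void add(Document doc) throws Exception {
        if (writer == null) writer = open();
        writer.addDocument(doc);
        if (++docsSinceOpen >= RECYCLE_INTERVAL) {
            writer.close();   // releases everything the writer retained
            writer = null;    // lazily reopened on the next add
            docsSinceOpen = 0;
        }
    }
}
```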

Thanks,
Ross 

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Wednesday, April 14, 2010 2:52 PM
To: java-user@lucene.apache.org
Subject: Re: IndexWriter and memory usage

Run this:

    svn co https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9
lucene.29x

Then apply the patch, then, run "ant jar-core", and in that should
create the lucene-core-2.9.2-dev.jar.

Mike

On Wed, Apr 14, 2010 at 1:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
> How do I get to the 2.9.x branch?  Every link I take from the Lucene site takes me to the trunk which I assume is the 3.x version.  I've tried to look around svn but can't find anything labeled 2.9.x.  Is there a daily build of 2.9.x or do I need to build it myself.  I would like to try out the fix you put into it, but I'm not sure where I get it from.
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Wednesday, April 14, 2010 4:12 AM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> It looks like the mailing list software stripped your image attachments...
>
> Alas these fixes are only committed on 3.1.
>
> But I just posted the patch on LUCENE-2387 for 2.9.x -- it's a tiny
> fix.  I think the other issue was part of LUCENE-2074 (though this
> issue included many other changes) -- Uwe can you peel out just a
> 2.9.x patch for resetting JFlex's zzBuffer?
>
> You could also try switching analyzers (eg to WhitespaceAnalyzer) to
> see if in fact LUCENE-2074 (which affects StandandAnalyzer, since it
> uses JFlex) is [part of] your problem.
>
> Mike
>
> On Tue, Apr 13, 2010 at 6:42 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> Since the heap dump was so big and can't be attached, I have taken a few screen shots from Java VisualVM of the heap dump.  In the first image you can see that at the time our memory has become very tight most of it is held up in bytes.  In the second image I examine one of those instances and navigate to the nearest garbage collection root.  In looking at very many of these objects, they all end up being instantiated through the IndexWriter process.
>>
>> This heap dump is the same one correlating to the infoStream that was attached in a prior message.  So while the infoStream shows the buffer being flushed, what we experience is that our memory gets consumed over time by these bytes in the IndexWriter.

>>
>> I wanted to provide these images to see if they might correlate to the fixes mentioned below.  Hopefully those fixes mentioned below have rectified this problem.  And as I state in the prior message, I'm hoping these fixes are in a 2.9x branch and hoping for someone to point me to where I can get those fixes to try out.
>>
>> Thanks
>>
>> -----Original Message-----
>> From: Woolf, Ross [mailto:Ross_Woolf@BMC.com]
>> Sent: Tuesday, April 13, 2010 1:29 PM
>> To: java-user@lucene.apache.org
>> Subject: RE: IndexWriter and memory usage
>>
>> Are these fixes in 2.9x branch?  We are using 2.9x and can't move to 3x just yet.  If so, where do I specifically pick this up from?
>>
>> -----Original Message-----
>> From: Lance Norskog [mailto:goksron@gmail.com]
>> Sent: Monday, April 12, 2010 10:20 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> There is some bugs where the writer data structures retain data after
>> it is flushed. They are committed as of maybe the past week. If you
>> can pull the trunk and try it with your use case, that would be great.
>>
>> On Mon, Apr 12, 2010 at 8:54 AM, Woolf, Ross <Ro...@bmc.com> wrote:
>>> I was on vacation last week so just getting back to this...  Here is the info stream (as an attachment).  I'll see what I can do about reducing the heap dump (It was supplied by a colleague).
>>>
>>>
>>> -----Original Message-----
>>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>>> Sent: Saturday, April 03, 2010 3:39 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: IndexWriter and memory usage
>>>
>>> Hmm why is the heap dump so immense?  Normally it contains the top N
>>> (eg 100) object types and their count/aggregate RAM usage.
>>>
>>> Can you attach the infoStream output to an email (to java-user)?
>>>
>>> Mike
>>>
>>> On Fri, Apr 2, 2010 at 5:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>> I have this and the heap dump is 63mb zipped.  The info stream is much smaller (31 kb zipped), but I don't know how to get them to you.
>>>>
>>>> We are not using the NRT readers
>>>>
>>>> -----Original Message-----
>>>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>>>> Sent: Thursday, April 01, 2010 5:21 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: IndexWriter and memory usage
>>>>
>>>> Hmm, not good.  Can you post a heap dump?  Also, can you turn on
>>>> infoStream, index up to the OOM @ 512 MB, and post the output?
>>>>
>>>> IndexWriter should not hang onto much beyond the RAM buffer.  But, it
>>>> does allocate and then recycle this RAM buffer, so even in an idle
>>>> state (having indexed enough docs to fill up the RAM buffer at least
>>>> once) it'll hold onto those 16 MB.
>>>>
>>>> Are you using getReader (to get your NRT readers)?  If so, are you
>>>> really sure you're eventually closing the previous reader after
>>>> opening a new one?
>>>>
>>>> Mike
>>>>
>>>> On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>>> We are seeing a situation where the IndexWriter uses up the Java heap space and only releases memory for garbage collection upon a commit.  We are using the default RAMBufferSize of 16 MB, Lucene 2.9.1, and a heap size of 512 MB.
>>>>>
>>>>> We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is converted to a string and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to perform a commit only once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is unable to reclaim any memory once we get into this state.  Our understanding is that the IndexWriter should only hold onto 16 MB of data before flushing it, but what we are seeing is that while it does write data to disk when it hits the 16 MB limit, it also holds onto some data in memory that garbage collection cannot reclaim, and this continues until garbage collection can no longer free enough space to allow things to move faster than a crawl.
>>>>>
>>>>> As a test we forced a commit after each document is indexed, and the total memory used dropped from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that memory is cleaned up to some extent after each document.  But of course this completely defeats the reason we want to commit only at the end of the run: performance.  Most of the data, as seen in heap analysis, is held in Byte, Character, and Integer instances whose GC roots trace back to the writer objects and threads.  The instance counts after indexing just 1,100 documents seem staggering.
>>>>>
>>>>> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>>>>>
>>>>> A side question: we always see a large amount of memory used by the IndexWriter even after our indexing has completed and all commits have taken place (basically an idle state).  Why would this be?  Is the only way to totally clean up the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>>>>>
>>>>> Any help in understanding why the IndexWriter is maxing out our heap space, or what to expect of IndexWriter memory usage, would be appreciated.
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




Re: IndexWriter and memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
Run this:

    svn co https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9 lucene.29x

Then apply the patch and run "ant jar-core"; that should create
lucene-core-2.9.2-dev.jar.

Mike

On Wed, Apr 14, 2010 at 1:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
> How do I get to the 2.9.x branch?  Every link I take from the Lucene site takes me to the trunk, which I assume is the 3.x version.  I've tried to look around svn but can't find anything labeled 2.9.x.  Is there a daily build of 2.9.x, or do I need to build it myself?  I would like to try out the fix you put into it, but I'm not sure where to get it.
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Wednesday, April 14, 2010 4:12 AM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> It looks like the mailing list software stripped your image attachments...
>
> Alas these fixes are only committed on 3.1.
>
> But I just posted the patch on LUCENE-2387 for 2.9.x -- it's a tiny
> fix.  I think the other issue was part of LUCENE-2074 (though this
> issue included many other changes) -- Uwe can you peel out just a
> 2.9.x patch for resetting JFlex's zzBuffer?
>
> You could also try switching analyzers (eg to WhitespaceAnalyzer) to
> see if in fact LUCENE-2074 (which affects StandardAnalyzer, since it
> uses JFlex) is [part of] your problem.
>
> Mike
>
> On Tue, Apr 13, 2010 at 6:42 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> Since the heap dump was so big and can't be attached, I have taken a few screenshots from Java VisualVM of the heap dump.  In the first image you can see that, at the point our memory becomes very tight, most of it is held in bytes.  In the second image I examine one of those instances and navigate to the nearest garbage collection root.  Looking at a great many of these objects, they all end up being instantiated through the IndexWriter process.
>>
>> This heap dump is the same one correlating to the infoStream that was attached in a prior message.  So while the infoStream shows the buffer being flushed, what we experience is that our memory gets consumed over time by these bytes in the IndexWriter.
>>
>> I wanted to provide these images to see if they might correlate to the fixes mentioned below.  Hopefully those fixes have rectified this problem.  And as I stated in the prior message, I'm hoping these fixes are in a 2.9.x branch, and that someone can point me to where I can get them to try out.
>>
>> Thanks
>>
>> -----Original Message-----
>> From: Woolf, Ross [mailto:Ross_Woolf@BMC.com]
>> Sent: Tuesday, April 13, 2010 1:29 PM
>> To: java-user@lucene.apache.org
>> Subject: RE: IndexWriter and memory usage
>>
>> Are these fixes in 2.9x branch?  We are using 2.9x and can't move to 3x just yet.  If so, where do I specifically pick this up from?
>>
>> -----Original Message-----
>> From: Lance Norskog [mailto:goksron@gmail.com]
>> Sent: Monday, April 12, 2010 10:20 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> There are some bugs where the writer data structures retain data after
>> it is flushed. The fixes were committed within roughly the past week. If
>> you can pull the trunk and try it with your use case, that would be great.
>>
>> On Mon, Apr 12, 2010 at 8:54 AM, Woolf, Ross <Ro...@bmc.com> wrote:
>>> I was on vacation last week so just getting back to this...  Here is the info stream (as an attachment).  I'll see what I can do about reducing the heap dump (It was supplied by a colleague).
>>>
>>>
>>> -----Original Message-----
>>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>>> Sent: Saturday, April 03, 2010 3:39 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: IndexWriter and memory usage
>>>
>>> Hmm why is the heap dump so immense?  Normally it contains the top N
>>> (eg 100) object types and their count/aggregate RAM usage.
>>>
>>> Can you attach the infoStream output to an email (to java-user)?
>>>
>>> Mike
>>>
>>> On Fri, Apr 2, 2010 at 5:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>> I have this and the heap dump is 63mb zipped.  The info stream is much smaller (31 kb zipped), but I don't know how to get them to you.
>>>>
>>>> We are not using the NRT readers
>>>>
>>>> -----Original Message-----
>>>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>>>> Sent: Thursday, April 01, 2010 5:21 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: IndexWriter and memory usage
>>>>
>>>> Hmm, not good.  Can you post a heap dump?  Also, can you turn on
>>>> infoStream, index up to the OOM @ 512 MB, and post the output?
>>>>
>>>> IndexWriter should not hang onto much beyond the RAM buffer.  But, it
>>>> does allocate and then recycle this RAM buffer, so even in an idle
>>>> state (having indexed enough docs to fill up the RAM buffer at least
>>>> once) it'll hold onto those 16 MB.
>>>>
>>>> Are you using getReader (to get your NRT readers)?  If so, are you
>>>> really sure you're eventually closing the previous reader after
>>>> opening a new one?
>>>>
>>>> Mike
>>>>
>>>> On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>>> We are seeing a situation where the IndexWriter uses up the Java heap space and only releases memory for garbage collection upon a commit.  We are using the default RAMBufferSize of 16 MB, Lucene 2.9.1, and a heap size of 512 MB.
>>>>>
>>>>> We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is converted to a string and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to perform a commit only once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is unable to reclaim any memory once we get into this state.  Our understanding is that the IndexWriter should only hold onto 16 MB of data before flushing it, but what we are seeing is that while it does write data to disk when it hits the 16 MB limit, it also holds onto some data in memory that garbage collection cannot reclaim, and this continues until garbage collection can no longer free enough space to allow things to move faster than a crawl.
>>>>>
>>>>> As a test we forced a commit after each document is indexed, and the total memory used dropped from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that memory is cleaned up to some extent after each document.  But of course this completely defeats the reason we want to commit only at the end of the run: performance.  Most of the data, as seen in heap analysis, is held in Byte, Character, and Integer instances whose GC roots trace back to the writer objects and threads.  The instance counts after indexing just 1,100 documents seem staggering.
>>>>>
>>>>> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>>>>>
>>>>> A side question: we always see a large amount of memory used by the IndexWriter even after our indexing has completed and all commits have taken place (basically an idle state).  Why would this be?  Is the only way to totally clean up the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>>>>>
>>>>> Any help in understanding why the IndexWriter is maxing out our heap space, or what to expect of IndexWriter memory usage, would be appreciated.
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: IndexWriter and memory usage

Posted by "Woolf, Ross" <Ro...@BMC.com>.
How do I get to the 2.9.x branch?  Every link I take from the Lucene site takes me to the trunk, which I assume is the 3.x version.  I've tried to look around svn but can't find anything labeled 2.9.x.  Is there a daily build of 2.9.x, or do I need to build it myself?  I would like to try out the fix you put into it, but I'm not sure where to get it.

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Wednesday, April 14, 2010 4:12 AM
To: java-user@lucene.apache.org
Subject: Re: IndexWriter and memory usage

It looks like the mailing list software stripped your image attachments...

Alas these fixes are only committed on 3.1.

But I just posted the patch on LUCENE-2387 for 2.9.x -- it's a tiny
fix.  I think the other issue was part of LUCENE-2074 (though this
issue included many other changes) -- Uwe can you peel out just a
2.9.x patch for resetting JFlex's zzBuffer?

You could also try switching analyzers (eg to WhitespaceAnalyzer) to
see if in fact LUCENE-2074 (which affects StandardAnalyzer, since it
uses JFlex) is [part of] your problem.

Mike

On Tue, Apr 13, 2010 at 6:42 PM, Woolf, Ross <Ro...@bmc.com> wrote:
> Since the heap dump was so big and can't be attached, I have taken a few screenshots from Java VisualVM of the heap dump.  In the first image you can see that, at the point our memory becomes very tight, most of it is held in bytes.  In the second image I examine one of those instances and navigate to the nearest garbage collection root.  Looking at a great many of these objects, they all end up being instantiated through the IndexWriter process.
>
> This heap dump is the same one correlating to the infoStream that was attached in a prior message.  So while the infoStream shows the buffer being flushed, what we experience is that our memory gets consumed over time by these bytes in the IndexWriter.
>
> I wanted to provide these images to see if they might correlate to the fixes mentioned below.  Hopefully those fixes have rectified this problem.  And as I stated in the prior message, I'm hoping these fixes are in a 2.9.x branch, and that someone can point me to where I can get them to try out.
>
> Thanks
>
> -----Original Message-----
> From: Woolf, Ross [mailto:Ross_Woolf@BMC.com]
> Sent: Tuesday, April 13, 2010 1:29 PM
> To: java-user@lucene.apache.org
> Subject: RE: IndexWriter and memory usage
>
> Are these fixes in 2.9x branch?  We are using 2.9x and can't move to 3x just yet.  If so, where do I specifically pick this up from?
>
> -----Original Message-----
> From: Lance Norskog [mailto:goksron@gmail.com]
> Sent: Monday, April 12, 2010 10:20 PM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> There are some bugs where the writer data structures retain data after
> it is flushed. The fixes were committed within roughly the past week. If
> you can pull the trunk and try it with your use case, that would be great.
>
> On Mon, Apr 12, 2010 at 8:54 AM, Woolf, Ross <Ro...@bmc.com> wrote:
>> I was on vacation last week so just getting back to this...  Here is the info stream (as an attachment).  I'll see what I can do about reducing the heap dump (It was supplied by a colleague).
>>
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Sent: Saturday, April 03, 2010 3:39 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> Hmm why is the heap dump so immense?  Normally it contains the top N
>> (eg 100) object types and their count/aggregate RAM usage.
>>
>> Can you attach the infoStream output to an email (to java-user)?
>>
>> Mike
>>
>> On Fri, Apr 2, 2010 at 5:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>> I have this and the heap dump is 63mb zipped.  The info stream is much smaller (31 kb zipped), but I don't know how to get them to you.
>>>
>>> We are not using the NRT readers
>>>
>>> -----Original Message-----
>>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>>> Sent: Thursday, April 01, 2010 5:21 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: IndexWriter and memory usage
>>>
>>> Hmm, not good.  Can you post a heap dump?  Also, can you turn on
>>> infoStream, index up to the OOM @ 512 MB, and post the output?
>>>
>>> IndexWriter should not hang onto much beyond the RAM buffer.  But, it
>>> does allocate and then recycle this RAM buffer, so even in an idle
>>> state (having indexed enough docs to fill up the RAM buffer at least
>>> once) it'll hold onto those 16 MB.
>>>
>>> Are you using getReader (to get your NRT readers)?  If so, are you
>>> really sure you're eventually closing the previous reader after
>>> opening a new one?
>>>
>>> Mike
>>>
>>> On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>> We are seeing a situation where the IndexWriter uses up the Java heap space and only releases memory for garbage collection upon a commit.  We are using the default RAMBufferSize of 16 MB, Lucene 2.9.1, and a heap size of 512 MB.
>>>>
>>>> We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is converted to a string and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to perform a commit only once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is unable to reclaim any memory once we get into this state.  Our understanding is that the IndexWriter should only hold onto 16 MB of data before flushing it, but what we are seeing is that while it does write data to disk when it hits the 16 MB limit, it also holds onto some data in memory that garbage collection cannot reclaim, and this continues until garbage collection can no longer free enough space to allow things to move faster than a crawl.
>>>>
>>>> As a test we forced a commit after each document is indexed, and the total memory used dropped from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that memory is cleaned up to some extent after each document.  But of course this completely defeats the reason we want to commit only at the end of the run: performance.  Most of the data, as seen in heap analysis, is held in Byte, Character, and Integer instances whose GC roots trace back to the writer objects and threads.  The instance counts after indexing just 1,100 documents seem staggering.
>>>>
>>>> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>>>>
>>>> A side question: we always see a large amount of memory used by the IndexWriter even after our indexing has completed and all commits have taken place (basically an idle state).  Why would this be?  Is the only way to totally clean up the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>>>>
>>>> Any help in understanding why the IndexWriter is maxing out our heap space, or what to expect of IndexWriter memory usage, would be appreciated.
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




Re: IndexWriter and memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
It looks like the mailing list software stripped your image attachments...

Alas these fixes are only committed on 3.1.

But I just posted the patch on LUCENE-2387 for 2.9.x -- it's a tiny
fix.  I think the other issue was part of LUCENE-2074 (though this
issue included many other changes) -- Uwe can you peel out just a
2.9.x patch for resetting JFlex's zzBuffer?

You could also try switching analyzers (eg to WhitespaceAnalyzer) to
see if in fact LUCENE-2074 (which affects StandardAnalyzer, since it
uses JFlex) is [part of] your problem.

Mike
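
For anyone wanting to try that diagnostic, here is a minimal sketch against the Lucene 2.9 API; the index path, log file name, and field name are made up for illustration.  It swaps StandardAnalyzer for WhitespaceAnalyzer and enables infoStream so flush activity can be watched while the heap is profiled:

```java
import java.io.File;
import java.io.PrintStream;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AnalyzerSwapTest {
    public static void main(String[] args) throws Exception {
        // Hypothetical index location for this test
        Directory dir = FSDirectory.open(new File("/tmp/test-index"));

        // Use WhitespaceAnalyzer instead of StandardAnalyzer: if heap growth
        // stops, the JFlex zzBuffer issue (LUCENE-2074) is likely involved.
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.setRAMBufferSizeMB(16.0);  // the 16 MB default, made explicit
        writer.setInfoStream(new PrintStream("infostream.log"));  // log flushes

        // Index one document per extracted text, as in the Tika pipeline
        Document doc = new Document();
        doc.add(new Field("body", "text extracted by Tika",
                Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.commit();
        writer.close();
    }
}
```

If memory behaves with WhitespaceAnalyzer but not with StandardAnalyzer, that points at the JFlex buffer growth described in LUCENE-2074 rather than at the RAM buffer itself.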

On Tue, Apr 13, 2010 at 6:42 PM, Woolf, Ross <Ro...@bmc.com> wrote:
> Since the heap dump was so big and can't be attached, I have taken a few screenshots from Java VisualVM of the heap dump.  In the first image you can see that, at the point our memory becomes very tight, most of it is held in bytes.  In the second image I examine one of those instances and navigate to the nearest garbage collection root.  Looking at a great many of these objects, they all end up being instantiated through the IndexWriter process.
>
> This heap dump is the same one correlating to the infoStream that was attached in a prior message.  So while the infoStream shows the buffer being flushed, what we experience is that our memory gets consumed over time by these bytes in the IndexWriter.
>
> I wanted to provide these images to see if they might correlate to the fixes mentioned below.  Hopefully those fixes have rectified this problem.  And as I stated in the prior message, I'm hoping these fixes are in a 2.9.x branch, and that someone can point me to where I can get them to try out.
>
> Thanks
>
> -----Original Message-----
> From: Woolf, Ross [mailto:Ross_Woolf@BMC.com]
> Sent: Tuesday, April 13, 2010 1:29 PM
> To: java-user@lucene.apache.org
> Subject: RE: IndexWriter and memory usage
>
> Are these fixes in 2.9x branch?  We are using 2.9x and can't move to 3x just yet.  If so, where do I specifically pick this up from?
>
> -----Original Message-----
> From: Lance Norskog [mailto:goksron@gmail.com]
> Sent: Monday, April 12, 2010 10:20 PM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> There are some bugs where the writer data structures retain data after
> it is flushed. The fixes were committed within roughly the past week. If
> you can pull the trunk and try it with your use case, that would be great.
>
> On Mon, Apr 12, 2010 at 8:54 AM, Woolf, Ross <Ro...@bmc.com> wrote:
>> I was on vacation last week so just getting back to this...  Here is the info stream (as an attachment).  I'll see what I can do about reducing the heap dump (It was supplied by a colleague).
>>
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Sent: Saturday, April 03, 2010 3:39 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> Hmm why is the heap dump so immense?  Normally it contains the top N
>> (eg 100) object types and their count/aggregate RAM usage.
>>
>> Can you attach the infoStream output to an email (to java-user)?
>>
>> Mike
>>
>> On Fri, Apr 2, 2010 at 5:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>> I have this and the heap dump is 63mb zipped.  The info stream is much smaller (31 kb zipped), but I don't know how to get them to you.
>>>
>>> We are not using the NRT readers
>>>
>>> -----Original Message-----
>>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>>> Sent: Thursday, April 01, 2010 5:21 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: IndexWriter and memory usage
>>>
>>> Hmm, not good.  Can you post a heap dump?  Also, can you turn on
>>> infoStream, index up to the OOM @ 512 MB, and post the output?
>>>
>>> IndexWriter should not hang onto much beyond the RAM buffer.  But, it
>>> does allocate and then recycle this RAM buffer, so even in an idle
>>> state (having indexed enough docs to fill up the RAM buffer at least
>>> once) it'll hold onto those 16 MB.
>>>
>>> Are you using getReader (to get your NRT readers)?  If so, are you
>>> really sure you're eventually closing the previous reader after
>>> opening a new one?
>>>
>>> Mike
>>>
>>> On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>>> We are seeing a situation where the IndexWriter uses up the Java heap space and only releases memory for garbage collection upon a commit.  We are using the default RAMBufferSize of 16 MB, Lucene 2.9.1, and a heap size of 512 MB.
>>>>
>>>> We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is converted to a string and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to perform a commit only once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is unable to reclaim any memory once we get into this state.  Our understanding is that the IndexWriter should only hold onto 16 MB of data before flushing it, but what we are seeing is that while it does write data to disk when it hits the 16 MB limit, it also holds onto some data in memory that garbage collection cannot reclaim, and this continues until garbage collection can no longer free enough space to allow things to move faster than a crawl.
>>>>
>>>> As a test we forced a commit after each document is indexed, and the total memory used dropped from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that memory is cleaned up to some extent after each document.  But of course this completely defeats the reason we want to commit only at the end of the run: performance.  Most of the data, as seen in heap analysis, is held in Byte, Character, and Integer instances whose GC roots trace back to the writer objects and threads.  The instance counts after indexing just 1,100 documents seem staggering.
>>>>
>>>> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>>>>
>>>> A side question: we always see a large amount of memory used by the IndexWriter even after our indexing has completed and all commits have taken place (basically an idle state).  Why would this be?  Is the only way to totally clean up the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>>>>
>>>> Any help in understanding why the IndexWriter is maxing out our heap space, or what to expect of IndexWriter memory usage, would be appreciated.
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: IndexWriter and memory usage

Posted by "Woolf, Ross" <Ro...@BMC.com>.
Since the heap dump was so big and can't be attached, I have taken a few screenshots from Java VisualVM of the heap dump.  In the first image you can see that, at the point our memory becomes very tight, most of it is held in bytes.  In the second image I examine one of those instances and navigate to the nearest garbage collection root.  Looking at a great many of these objects, they all end up being instantiated through the IndexWriter process.

This heap dump is the same one correlating to the infoStream that was attached in a prior message.  So while the infoStream shows the buffer being flushed, what we experience is that our memory gets consumed over time by these bytes in the IndexWriter.

I wanted to provide these images to see if they might correlate to the fixes mentioned below.  Hopefully those fixes have rectified this problem.  And as I stated in the prior message, I'm hoping these fixes are in a 2.9.x branch, and that someone can point me to where I can get them to try out.

Thanks

-----Original Message-----
From: Woolf, Ross [mailto:Ross_Woolf@BMC.com] 
Sent: Tuesday, April 13, 2010 1:29 PM
To: java-user@lucene.apache.org
Subject: RE: IndexWriter and memory usage

Are these fixes in the 2.9x branch?  We are using 2.9x and can't move to 3.x just yet.  If so, where specifically do I pick them up from?

-----Original Message-----
From: Lance Norskog [mailto:goksron@gmail.com] 
Sent: Monday, April 12, 2010 10:20 PM
To: java-user@lucene.apache.org
Subject: Re: IndexWriter and memory usage

There are some bugs where the writer's data structures retain data after
it is flushed.  Fixes for them were committed within roughly the past week.
If you can pull the trunk and try it with your use case, that would be great.

On Mon, Apr 12, 2010 at 8:54 AM, Woolf, Ross <Ro...@bmc.com> wrote:
> I was on vacation last week so just getting back to this...  Here is the info stream (as an attachment).  I'll see what I can do about reducing the heap dump (It was supplied by a colleague).
>
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Saturday, April 03, 2010 3:39 AM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> Hmm why is the heap dump so immense?  Normally it contains the top N
> (eg 100) object types and their count/aggregate RAM usage.
>
> Can you attach the infoStream output to an email (to java-user)?
>
> Mike
>
> On Fri, Apr 2, 2010 at 5:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> I have this and the heap dump is 63mb zipped.  The info stream is much smaller (31 kb zipped), but I don't know how to get them to you.
>>
>> We are not using the NRT readers
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Sent: Thursday, April 01, 2010 5:21 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> Hmm, not good.  Can you post a heap dump?  Also, can you turn on
>> infoStream, index up to the OOM @ 512 MB, and post the output?
>>
>> IndexWriter should not hang onto much beyond the RAM buffer.  But, it
>> does allocate and then recycle this RAM buffer, so even in an idle
>> state (having indexed enough docs to fill up the RAM buffer at least
>> once) it'll hold onto those 16 MB.
>>
>> Are you using getReader (to get your NRT readers)?  If so, are you
>> really sure you're eventually closing the previous reader after
>> opening a new one?
>>
>> Mike
>>
>> On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>> We are seeing a situation where the IndexWriter is using up the Java Heap space and only releases memory for garbage collection upon a commit.   We are using the default RAMBufferSize of 16 mb.  We are using Lucene 2.9.1. We are set at heap size of 512 mb.
>>>
>>> We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is converted to a string and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to perform a commit only once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is not able to clean up any memory once we get into this state.  It is our understanding that the IndexWriter should only be holding onto 16 mb of data before it flushes, but what we are seeing is that while it is in fact writing data to disk when it hits the 16 mb limit, it is also holding onto some data in memory and not allowing garbage collection to take place, and this continues until garbage collection is unable to free up enough space to allow things to move faster than a crawl.
>>>
>>> As a test we caused a commit to occur after each document is indexed, and we see the total amount of memory reduced from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that the memory is cleaned up to some extent after each document.  But of course this completely defeats the whole reason we want to commit only at the end of the run, for performance's sake.  Most of the data, as seen using heap analysis, is held in Byte, Character, and Integer classes whose GC roots are tied back to the writer objects and threads.  The instance counts, after running just 1,100 documents, seem staggering.
>>>
>>> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>>>
>>> A side question to this is the fact that we always see a large amount of memory used by the IndexWriter even after our indexing has been completed and all commits have taken place (basically in an idle state).  Why would this be?  Is the only way to totally clean up the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>>>
>>> Any help in understanding why the IndexWriter is maxing out our heap space or what is expected from memory usage of the IndexWriter would be appreciated.
>>>
>>



-- 
Lance Norskog
goksron@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: IndexWriter and memory usage

Posted by "Woolf, Ross" <Ro...@BMC.com>.
Are these fixes in the 2.9x branch?  We are using 2.9x and can't move to 3.x just yet.  If so, where specifically do I pick them up from?

-----Original Message-----
From: Lance Norskog [mailto:goksron@gmail.com] 
Sent: Monday, April 12, 2010 10:20 PM
To: java-user@lucene.apache.org
Subject: Re: IndexWriter and memory usage

There are some bugs where the writer's data structures retain data after
it is flushed.  Fixes for them were committed within roughly the past week.
If you can pull the trunk and try it with your use case, that would be great.

On Mon, Apr 12, 2010 at 8:54 AM, Woolf, Ross <Ro...@bmc.com> wrote:
> I was on vacation last week so just getting back to this...  Here is the info stream (as an attachment).  I'll see what I can do about reducing the heap dump (It was supplied by a colleague).
>
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Saturday, April 03, 2010 3:39 AM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> Hmm why is the heap dump so immense?  Normally it contains the top N
> (eg 100) object types and their count/aggregate RAM usage.
>
> Can you attach the infoStream output to an email (to java-user)?
>
> Mike
>
> On Fri, Apr 2, 2010 at 5:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> I have this and the heap dump is 63mb zipped.  The info stream is much smaller (31 kb zipped), but I don't know how to get them to you.
>>
>> We are not using the NRT readers
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Sent: Thursday, April 01, 2010 5:21 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> Hmm, not good.  Can you post a heap dump?  Also, can you turn on
>> infoStream, index up to the OOM @ 512 MB, and post the output?
>>
>> IndexWriter should not hang onto much beyond the RAM buffer.  But, it
>> does allocate and then recycle this RAM buffer, so even in an idle
>> state (having indexed enough docs to fill up the RAM buffer at least
>> once) it'll hold onto those 16 MB.
>>
>> Are you using getReader (to get your NRT readers)?  If so, are you
>> really sure you're eventually closing the previous reader after
>> opening a new one?
>>
>> Mike
>>
>> On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>> We are seeing a situation where the IndexWriter is using up the Java Heap space and only releases memory for garbage collection upon a commit.   We are using the default RAMBufferSize of 16 mb.  We are using Lucene 2.9.1. We are set at heap size of 512 mb.
>>>
>>> We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is converted to a string and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to perform a commit only once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is not able to clean up any memory once we get into this state.  It is our understanding that the IndexWriter should only be holding onto 16 mb of data before it flushes, but what we are seeing is that while it is in fact writing data to disk when it hits the 16 mb limit, it is also holding onto some data in memory and not allowing garbage collection to take place, and this continues until garbage collection is unable to free up enough space to allow things to move faster than a crawl.
>>>
>>> As a test we caused a commit to occur after each document is indexed, and we see the total amount of memory reduced from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that the memory is cleaned up to some extent after each document.  But of course this completely defeats the whole reason we want to commit only at the end of the run, for performance's sake.  Most of the data, as seen using heap analysis, is held in Byte, Character, and Integer classes whose GC roots are tied back to the writer objects and threads.  The instance counts, after running just 1,100 documents, seem staggering.
>>>
>>> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>>>
>>> A side question to this is the fact that we always see a large amount of memory used by the IndexWriter even after our indexing has been completed and all commits have taken place (basically in an idle state).  Why would this be?  Is the only way to totally clean up the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>>>
>>> Any help in understanding why the IndexWriter is maxing out our heap space or what is expected from memory usage of the IndexWriter would be appreciated.
>>>
>>



-- 
Lance Norskog
goksron@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: IndexWriter and memory usage

Posted by Lance Norskog <go...@gmail.com>.
There are some bugs where the writer's data structures retain data after
it is flushed.  Fixes for them were committed within roughly the past week.
If you can pull the trunk and try it with your use case, that would be great.
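
A minimal version of the use case described in this thread (many Tika-extracted strings indexed with a single commit at the end) might look roughly like the sketch below against the 2.9 API; the field name, directory path, and document contents here are illustrative assumptions, not taken from the original code:

```java
import java.io.File;
import java.util.Arrays;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexMemoryRepro {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("test-index")),
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.setRAMBufferSizeMB(16.0);   // the default discussed in this thread
        writer.setInfoStream(System.out);  // watch flushes while indexing

        // Stand-in for the Tika extraction step; substitute the real extracted text.
        for (String text : Arrays.asList("extracted text one", "extracted text two")) {
            Document doc = new Document();
            doc.add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);       // no commit until the very end
        }

        writer.commit();                   // single commit, as in the original setup
        writer.close();
    }
}
```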

On Mon, Apr 12, 2010 at 8:54 AM, Woolf, Ross <Ro...@bmc.com> wrote:
> I was on vacation last week so just getting back to this...  Here is the info stream (as an attachment).  I'll see what I can do about reducing the heap dump (It was supplied by a colleague).
>
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Saturday, April 03, 2010 3:39 AM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> Hmm why is the heap dump so immense?  Normally it contains the top N
> (eg 100) object types and their count/aggregate RAM usage.
>
> Can you attach the infoStream output to an email (to java-user)?
>
> Mike
>
> On Fri, Apr 2, 2010 at 5:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> I have this and the heap dump is 63mb zipped.  The info stream is much smaller (31 kb zipped), but I don't know how to get them to you.
>>
>> We are not using the NRT readers
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Sent: Thursday, April 01, 2010 5:21 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> Hmm, not good.  Can you post a heap dump?  Also, can you turn on
>> infoStream, index up to the OOM @ 512 MB, and post the output?
>>
>> IndexWriter should not hang onto much beyond the RAM buffer.  But, it
>> does allocate and then recycle this RAM buffer, so even in an idle
>> state (having indexed enough docs to fill up the RAM buffer at least
>> once) it'll hold onto those 16 MB.
>>
>> Are you using getReader (to get your NRT readers)?  If so, are you
>> really sure you're eventually closing the previous reader after
>> opening a new one?
>>
>> Mike
>>
>> On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>> We are seeing a situation where the IndexWriter is using up the Java Heap space and only releases memory for garbage collection upon a commit.   We are using the default RAMBufferSize of 16 mb.  We are using Lucene 2.9.1. We are set at heap size of 512 mb.
>>>
>>> We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is converted to a string and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to perform a commit only once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is not able to clean up any memory once we get into this state.  It is our understanding that the IndexWriter should only be holding onto 16 mb of data before it flushes, but what we are seeing is that while it is in fact writing data to disk when it hits the 16 mb limit, it is also holding onto some data in memory and not allowing garbage collection to take place, and this continues until garbage collection is unable to free up enough space to allow things to move faster than a crawl.
>>>
>>> As a test we caused a commit to occur after each document is indexed, and we see the total amount of memory reduced from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that the memory is cleaned up to some extent after each document.  But of course this completely defeats the whole reason we want to commit only at the end of the run, for performance's sake.  Most of the data, as seen using heap analysis, is held in Byte, Character, and Integer classes whose GC roots are tied back to the writer objects and threads.  The instance counts, after running just 1,100 documents, seem staggering.
>>>
>>> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>>>
>>> A side question to this is the fact that we always see a large amount of memory used by the IndexWriter even after our indexing has been completed and all commits have taken place (basically in an idle state).  Why would this be?  Is the only way to totally clean up the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>>>
>>> Any help in understanding why the IndexWriter is maxing out our heap space or what is expected from memory usage of the IndexWriter would be appreciated.
>>>
>>



-- 
Lance Norskog
goksron@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: IndexWriter and memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
The infoStream generally looks healthy.  You seem to have a contained
set of unique field names.

The one thing that's interesting is... your docs are quite large.  If
you grep for "flush: segment=" in your infoStream you can see how many
docs "fit" in 16 MB before flushing, and the count is lowish (as high
as ~300 and as low as only 1).

The "only 1" case looks like it could be the cause of your OOME, ie,
there are some docs that are so large that indexing that 1 doc causes
IndexWriter to use too much RAM and then immediately flush... Lucene
cannot flush mid-document.  So this means IndexWriter will allocate
however much RAM is needed for a single document, and flush right
after that, very easily needing to exceed the 16 MB temporarily.

Can you test whether you are hitting OOME on certain specific [very
large] documents?  The worst case poison-pill document is one that has
many many unique terms.
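
One way to flag such documents, sketched here against the 2.9 API (the 32 MB
threshold is an arbitrary illustration, not a recommended value), is to watch
the writer's buffered RAM across each add:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class PoisonPillWatcher {
    // Flags documents that cause a large jump in IndexWriter's buffered RAM.
    public static void addAndWatch(IndexWriter writer, Document doc) throws Exception {
        long before = writer.ramSizeInBytes();
        writer.addDocument(doc);
        long delta = writer.ramSizeInBytes() - before;
        // A negative delta means the RAM buffer flushed while adding this document.
        if (delta > 32L * 1024 * 1024 || delta < 0) {
            System.err.println("Possible poison-pill doc: RAM delta " + delta
                    + " bytes, " + writer.numRamDocs() + " docs now buffered");
        }
    }
}
```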

Mike

On Mon, Apr 12, 2010 at 11:54 AM, Woolf, Ross <Ro...@bmc.com> wrote:
> I was on vacation last week so just getting back to this...  Here is the info stream (as an attachment).  I'll see what I can do about reducing the heap dump (It was supplied by a colleague).
>
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Saturday, April 03, 2010 3:39 AM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> Hmm why is the heap dump so immense?  Normally it contains the top N
> (eg 100) object types and their count/aggregate RAM usage.
>
> Can you attach the infoStream output to an email (to java-user)?
>
> Mike
>
> On Fri, Apr 2, 2010 at 5:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> I have this and the heap dump is 63mb zipped.  The info stream is much smaller (31 kb zipped), but I don't know how to get them to you.
>>
>> We are not using the NRT readers
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Sent: Thursday, April 01, 2010 5:21 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: IndexWriter and memory usage
>>
>> Hmm, not good.  Can you post a heap dump?  Also, can you turn on
>> infoStream, index up to the OOM @ 512 MB, and post the output?
>>
>> IndexWriter should not hang onto much beyond the RAM buffer.  But, it
>> does allocate and then recycle this RAM buffer, so even in an idle
>> state (having indexed enough docs to fill up the RAM buffer at least
>> once) it'll hold onto those 16 MB.
>>
>> Are you using getReader (to get your NRT readers)?  If so, are you
>> really sure you're eventually closing the previous reader after
>> opening a new one?
>>
>> Mike
>>
>> On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>>> We are seeing a situation where the IndexWriter is using up the Java Heap space and only releases memory for garbage collection upon a commit.   We are using the default RAMBufferSize of 16 mb.  We are using Lucene 2.9.1. We are set at heap size of 512 mb.
>>>
>>> We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is converted to a string and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to perform a commit only once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is not able to clean up any memory once we get into this state.  It is our understanding that the IndexWriter should only be holding onto 16 mb of data before it flushes, but what we are seeing is that while it is in fact writing data to disk when it hits the 16 mb limit, it is also holding onto some data in memory and not allowing garbage collection to take place, and this continues until garbage collection is unable to free up enough space to allow things to move faster than a crawl.
>>>
>>> As a test we caused a commit to occur after each document is indexed, and we see the total amount of memory reduced from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that the memory is cleaned up to some extent after each document.  But of course this completely defeats the whole reason we want to commit only at the end of the run, for performance's sake.  Most of the data, as seen using heap analysis, is held in Byte, Character, and Integer classes whose GC roots are tied back to the writer objects and threads.  The instance counts, after running just 1,100 documents, seem staggering.
>>>
>>> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>>>
>>> A side question to this is the fact that we always see a large amount of memory used by the IndexWriter even after our indexing has been completed and all commits have taken place (basically in an idle state).  Why would this be?  Is the only way to totally clean up the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>>>
>>> Any help in understanding why the IndexWriter is maxing out our heap space or what is expected from memory usage of the IndexWriter would be appreciated.
>>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: IndexWriter and memory usage

Posted by "Woolf, Ross" <Ro...@BMC.com>.
I was on vacation last week so just getting back to this...  Here is the info stream (as an attachment).  I'll see what I can do about reducing the heap dump (It was supplied by a colleague). 
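
For reference, enabling infoStream output like the attachment in Lucene 2.9 looks roughly like this (the log file name is just an example):

```java
import java.io.FileOutputStream;
import java.io.PrintStream;
import org.apache.lucene.index.IndexWriter;

public class InfoStreamSetup {
    // Routes IndexWriter's diagnostic output (flushes, merges, RAM use) to a file.
    public static PrintStream enableInfoStream(IndexWriter writer, String path) throws Exception {
        PrintStream out = new PrintStream(new FileOutputStream(path), true);
        writer.setInfoStream(out);
        return out;  // close after indexing completes
    }
}
```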


-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Saturday, April 03, 2010 3:39 AM
To: java-user@lucene.apache.org
Subject: Re: IndexWriter and memory usage

Hmm why is the heap dump so immense?  Normally it contains the top N
(eg 100) object types and their count/aggregate RAM usage.

Can you attach the infoStream output to an email (to java-user)?

Mike

On Fri, Apr 2, 2010 at 5:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
> I have this and the heap dump is 63mb zipped.  The info stream is much smaller (31 kb zipped), but I don't know how to get them to you.
>
> We are not using the NRT readers
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Thursday, April 01, 2010 5:21 PM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> Hmm, not good.  Can you post a heap dump?  Also, can you turn on
> infoStream, index up to the OOM @ 512 MB, and post the output?
>
> IndexWriter should not hang onto much beyond the RAM buffer.  But, it
> does allocate and then recycle this RAM buffer, so even in an idle
> state (having indexed enough docs to fill up the RAM buffer at least
> once) it'll hold onto those 16 MB.
>
> Are you using getReader (to get your NRT readers)?  If so, are you
> really sure you're eventually closing the previous reader after
> opening a new one?
>
> Mike
>
> On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> We are seeing a situation where the IndexWriter is using up the Java heap space and only releases memory for garbage collection upon a commit.  We are using the default RAMBufferSize of 16 mb.  We are using Lucene 2.9.1.  We are set at a heap size of 512 mb.
>>
>> We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is converted to a string and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to perform a commit only once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection is not able to clean up any memory once we get into this state.  It is our understanding that the IndexWriter should only be holding onto 16 mb of data before it flushes, but what we are seeing is that while it is in fact writing data to disk when it hits the 16 mb limit, it is also holding onto some data in memory and not allowing garbage collection to take place, and this continues until garbage collection is unable to free up enough space to allow things to move faster than a crawl.
>>
>> As a test we caused a commit to occur after each document is indexed, and we see the total amount of memory reduced from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that the memory is cleaned up to some extent after each document.  But of course this completely defeats the whole reason we want to commit only at the end of the run, for performance's sake.  Most of the data, as seen using heap analysis, is held in Byte, Character, and Integer classes whose GC roots are tied back to the writer objects and threads.  The instance counts, after running just 1,100 documents, seem staggering.
>>
>> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>>
>> A side question to this is the fact that we always see a large amount of memory used by the IndexWriter even after our indexing has been completed and all commits have taken place (basically in an idle state).  Why would this be?  Is the only way to totally clean up the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>>
>> Any help in understanding why the IndexWriter is maxing out our heap space or what is expected from memory usage of the IndexWriter would be appreciated.
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: IndexWriter and memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hmm why is the heap dump so immense?  Normally it contains the top N
(eg 100) object types and their count/aggregate RAM usage.

Can you attach the infoStream output to an email (to java-user)?

Mike

On Fri, Apr 2, 2010 at 5:28 PM, Woolf, Ross <Ro...@bmc.com> wrote:
> I have this and the heap dump is 63mb zipped.  The info stream is much smaller (31 kb zipped), but I don't know how to get them to you.
>
> We are not using the NRT readers
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Thursday, April 01, 2010 5:21 PM
> To: java-user@lucene.apache.org
> Subject: Re: IndexWriter and memory usage
>
> Hmm, not good.  Can you post a heap dump?  Also, can you turn on
> infoStream, index up to the OOM @ 512 MB, and post the output?
>
> IndexWriter should not hang onto much beyond the RAM buffer.  But, it
> does allocate and then recycle this RAM buffer, so even in an idle
> state (having indexed enough docs to fill up the RAM buffer at least
> once) it'll hold onto those 16 MB.
>
> Are you using getReader (to get your NRT readers)?  If so, are you
> really sure you're eventually closing the previous reader after
> opening a new one?
>
> Mike
>
> On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
>> We are seeing a situation where the IndexWriter is using up the Java Heap space and only releases memory for garbage collection upon a commit.   We are using the default RAMBufferSize of 16 mb.  We are using Lucene 2.9.1. We are set at heap size of 512 mb.
>>
>> We have a large number of documents that are run through Tika and then added to the index.  The data from Tika is changed to a string, and then sent to Lucene.  Heap dumps clearly show the data in the Lucene classes and not in Tika.  Our intent is to only perform a commit once the entire indexing run is complete, but several hours into the process everything comes to a crawl.  In using both JConsole and VisualVM  we can see that the heap space is maxed out and garbage collection is not able to clean up any memory once we get into this state.  It is our understanding that the IndexWriter should be only holding onto 16 mb of data before it flushes it, but what we are seeing is that while it is in fact writing data to disk when it hits the 16 mb limit, it is also holding onto some data in memory and not allowing garbage collection to take place, and this continues until garbage collection is unable to free up enough space to all things to move faster than a crawl.
>>
>> As a test we forced a commit after each document was indexed, and total memory usage dropped from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that memory is cleaned up to some extent after each document, but of course this defeats the whole reason we want to commit only at the end of the run for performance's sake.  Heap analysis shows most of the data held in Byte, Character, and Integer instances whose GC roots trace back to the writer objects and threads.  The instance counts after indexing just 1,100 documents seem staggering.
>>
>> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>>
>> A side question: we always see a large amount of memory held by the IndexWriter even after our indexing has completed and all commits have taken place (basically an idle state).  Why would this be?  Is the only way to fully release the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>>
>> Any help in understanding why the IndexWriter is maxing out our heap space or what is expected from memory usage of the IndexWriter would be appreciated.
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



RE: IndexWriter and memory usage

Posted by "Woolf, Ross" <Ro...@BMC.com>.
I have both; the heap dump is 63 MB zipped.  The infoStream output is much smaller (31 KB zipped), but I don't know how to get them to you.

We are not using the NRT readers.
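For reference, the infoStream mentioned above can be captured roughly like this against the Lucene 2.9 API (a sketch, not the poster's actual code; the index path, analyzer choice, and log file name are placeholders):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.PrintStream;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Sketch (Lucene 2.9 API): open a writer with infoStream routed to a
// file so flush and merge activity around the RAM buffer can be
// inspected after an indexing run.
public class InfoStreamExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/path/to/index")),
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
        // Flush the PrintStream eagerly so the log survives an OOM.
        writer.setInfoStream(new PrintStream(
                new FileOutputStream("lucene-infostream.log"), true));
        // ... addDocument calls here ...
        writer.close();
    }
}
```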

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Thursday, April 01, 2010 5:21 PM
To: java-user@lucene.apache.org
Subject: Re: IndexWriter and memory usage

Hmm, not good.  Can you post a heap dump?  Also, can you turn on
infoStream, index up to the OOM @ 512 MB, and post the output?

IndexWriter should not hang onto much beyond the RAM buffer.  But, it
does allocate and then recycle this RAM buffer, so even in an idle
state (having indexed enough docs to fill up the RAM buffer at least
once) it'll hold onto those 16 MB.

Are you using getReader (to get your NRT readers)?  If so, are you
really sure you're eventually closing the previous reader after
opening a new one?

Mike

On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
> We are seeing a situation where the IndexWriter is using up the Java heap space and only releases memory for garbage collection upon a commit.  We are using the default RAMBufferSize of 16 MB and Lucene 2.9.1, with a heap size of 512 MB.
>
> We have a large number of documents that are run through Tika and then added to the index.  The Tika output is converted to a string and sent to Lucene; heap dumps clearly show the data held in the Lucene classes, not in Tika.  Our intent is to commit only once the entire indexing run is complete, but several hours into the process everything slows to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection cannot reclaim memory once we reach this state.  Our understanding is that the IndexWriter should hold only 16 MB of data before flushing, but while it does write data to disk when it hits the 16 MB limit, it also retains data in memory that garbage collection cannot reclaim, and this continues until GC can no longer free enough space to allow things to move faster than a crawl.
>
> As a test we forced a commit after each document was indexed, and total memory usage dropped from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that memory is cleaned up to some extent after each document, but of course this defeats the whole reason we want to commit only at the end of the run for performance's sake.  Heap analysis shows most of the data held in Byte, Character, and Integer instances whose GC roots trace back to the writer objects and threads.  The instance counts after indexing just 1,100 documents seem staggering.
>
> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>
> A side question: we always see a large amount of memory held by the IndexWriter even after our indexing has completed and all commits have taken place (basically an idle state).  Why would this be?  Is the only way to fully release the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>
> Any help in understanding why the IndexWriter is maxing out our heap space or what is expected from memory usage of the IndexWriter would be appreciated.
>



Re: IndexWriter and memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hmm, not good.  Can you post a heap dump?  Also, can you turn on
infoStream, index up to the OOM @ 512 MB, and post the output?

IndexWriter should not hang onto much beyond the RAM buffer.  But, it
does allocate and then recycle this RAM buffer, so even in an idle
state (having indexed enough docs to fill up the RAM buffer at least
once) it'll hold onto those 16 MB.
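The threshold being described is a per-writer setting; a minimal sketch against the 2.9 API (the class name and the 32 MB value are illustrative, not from the thread):

```java
import org.apache.lucene.index.IndexWriter;

// Sketch (Lucene 2.9 API): the flush threshold is configurable per
// writer.  The buffer is recycled rather than freed, so roughly this
// much heap stays referenced by the writer even while it sits idle.
public class RamBufferExample {
    static void configure(IndexWriter writer) {
        // Default is IndexWriter.DEFAULT_RAM_BUFFER_SIZE_MB (16.0);
        // a larger value trades heap for fewer flushes.
        writer.setRAMBufferSizeMB(32.0);
    }
}
```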

Are you using getReader (to get your NRT readers)?  If so, are you
really sure you're eventually closing the previous reader after
opening a new one?
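The pattern being asked about, sketched against the 2.9 API (the helper class and method names are illustrative):

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;

// Sketch (Lucene 2.9 API): refresh an NRT reader obtained from
// IndexWriter.getReader().  reopen() may return a brand-new reader,
// in which case the superseded one must be closed or its segment
// data (and associated heap) stays referenced.
public class NrtRefreshExample {
    static IndexReader refresh(IndexReader current) throws IOException {
        IndexReader newer = current.reopen();
        if (newer != current) {
            current.close(); // release the old reader's resources
        }
        return newer;
    }
}
```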

Mike

On Thu, Apr 1, 2010 at 6:58 PM, Woolf, Ross <Ro...@bmc.com> wrote:
> We are seeing a situation where the IndexWriter is using up the Java heap space and only releases memory for garbage collection upon a commit.  We are using the default RAMBufferSize of 16 MB and Lucene 2.9.1, with a heap size of 512 MB.
>
> We have a large number of documents that are run through Tika and then added to the index.  The Tika output is converted to a string and sent to Lucene; heap dumps clearly show the data held in the Lucene classes, not in Tika.  Our intent is to commit only once the entire indexing run is complete, but several hours into the process everything slows to a crawl.  Using both JConsole and VisualVM we can see that the heap space is maxed out and garbage collection cannot reclaim memory once we reach this state.  Our understanding is that the IndexWriter should hold only 16 MB of data before flushing, but while it does write data to disk when it hits the 16 MB limit, it also retains data in memory that garbage collection cannot reclaim, and this continues until GC can no longer free enough space to allow things to move faster than a crawl.
>
> As a test we forced a commit after each document was indexed, and total memory usage dropped from nearly 100% of the Java heap to around 70-75%.  The profiling tools now show that memory is cleaned up to some extent after each document, but of course this defeats the whole reason we want to commit only at the end of the run for performance's sake.  Heap analysis shows most of the data held in Byte, Character, and Integer instances whose GC roots trace back to the writer objects and threads.  The instance counts after indexing just 1,100 documents seem staggering.
>
> Is there additional data that the IndexWriter hangs onto regardless of when it hits the RAMBufferSize limit?  Why are we seeing the heap space all being used up?
>
> A side question: we always see a large amount of memory held by the IndexWriter even after our indexing has completed and all commits have taken place (basically an idle state).  Why would this be?  Is the only way to fully release the memory to close the writer?  Our index is also used for real-time indexing, so the IndexWriter is intended to remain open for the lifetime of the app.
>
> Any help in understanding why the IndexWriter is maxing out our heap space or what is expected from memory usage of the IndexWriter would be appreciated.
>
