Posted to java-user@lucene.apache.org by mark harwood <ma...@yahoo.co.uk> on 2009/03/09 11:44:42 UTC

A model for predicting indexing memory costs?

I've been building a large index (hundreds of millions of documents) with mainly structured data which consists of several fields with mostly unique values.
I've been hitting out of memory issues when doing periodic commits/closes which I suspect is down to the sheer number of terms.

I set the IndexWriter.setTermIndexInterval to 8 times the normal size of 128 (an interval of 1024), which delayed the onset of the issue but still failed.
I'd like to get a little more scientific about what to set here rather than simply experimenting with settings and hoping it doesn't fail again.
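For reference, this is roughly the shape of my writer setup -- a minimal sketch against the Lucene 2.9 API, where the directory path, analyzer and class scaffolding are just placeholders rather than my real code:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class WriterSetup {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/path/to/index")),   // placeholder path
                new StandardAnalyzer(Version.LUCENE_29),        // placeholder analyzer
                IndexWriter.MaxFieldLength.UNLIMITED);
            writer.setTermIndexInterval(1024);  // 8x the default of 128
            writer.setRAMBufferSizeMB(300);     // flush buffered postings at ~300MB
            writer.setMergeFactor(20);
            writer.setUseCompoundFile(false);
            // ... add documents in batches, then commit() and close() ...
            writer.close();
        }
    }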

Does anyone have a decent model worked out for how much memory is consumed at peak? I'm guessing the contributing factors are:

* Numbers of fields
* Numbers of unique terms per field
* Numbers of segments?

Cheers,
Mark



      


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by Florian Weimer <fw...@deneb.enyo.de>.
* mark harwood:

> Thanks, I have a heap dump now from a run with reduced JVM memory
> (in order to speed up a failure point) and am working through it
> offline with VisualVm.

> This test induced a proper OOM as opposed to one of those "timed out
> waiting for GC " type OOMs so may be misleading.

It wouldn't be unusual if the same root cause sometimes results in a
straight OOM, sometimes in the time-out variant.

> I have another run soldiering on with the -XX:-UseGCOverheadLimit
> setting to avoid GC -related timeouts and this has not hit OOM but
> is slowing to a crawl.

You could try to enable hprof and get a heap dump when you terminate
the VM with SIGINT/^C.
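For example, something along these lines with the Sun JVM (the hprof options are from memory, so double-check them with "java -agentlib:hprof=help"; the main class and heap size are just examples):

    java -agentlib:hprof=heap=dump,format=b,file=indexer.hprof,depth=16 -Xmx1g test.MultiIndexAndRun ...

The resulting binary .hprof file can then be copied off and opened in VisualVM or jhat on a different machine.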

In the end, you probably have to do this with a realistically-sized run
to get good data.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by mark harwood <ma...@yahoo.co.uk>.
OK, it's early days and I'm holding my breath, but I'm currently progressing further through my content without an OOM, just by using a different GC setting.

Thanks to advice here and colleagues at work I've gone with a GC setting of -XX:+UseSerialGC for this indexing task.

The rationale that is emerging is that there are potentially two different GC strategies that should be applied for Lucene work - one for indexing and one for search:

1) Search GC model
Search typically generates much less state than indexing tasks and each search can often have an SLA associated with it (e.g. results returned in <0.5 seconds). 
In this environment a non-invasive background GC is useful (e.g. the parallel collector) to avoid long interrupts on individual searches. The parallel collector will throw an OOM if these background GC tasks appear to lock up ( http://java.sun.com/j2se/1.5.0/docs/guide/vm/gc-ergonomics.html ). I think this is the problem I was getting originally.
2) Index GC model
Indexing generates large volumes of objects which are kept in RAM while indexing and then flushed. Individual document adds are typically not subject to an SLA but overall indexing time for a batch is. In this scenario a "stop-the-world" garbage collect mid-indexing is a welcome cleansing and has no business impact if it takes a while. This is what I believe I'm now getting from the -XX:+UseSerialGC setting.


This does suggest that it might be a good idea to use a different VM for indexing than the one used for searching, so that you can impose these different GC models.
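To make that concrete, the sort of launch split I have in mind (heap sizes are illustrative and the search-side main class is a made-up stand-in):

Indexing JVM - stop-the-world collector, GC overhead limit disabled:

    java -Xmx3g -XX:+UseSerialGC -XX:-UseGCOverheadLimit test.MultiIndexAndRun ...

Search JVM - parallel collector with default ergonomics:

    java -Xmx2g -XX:+UseParallelGC my.hypothetical.SearchServer ...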

That's the theory at least and I'm hoping this works out for me in practice......






----- Original Message ----
From: Michael McCandless <lu...@mikemccandless.com>
To: java-user@lucene.apache.org
Sent: Wednesday, 11 March, 2009 13:04:56
Subject: Re: A model for predicting indexing memory costs?


Mark Miller wrote:

> Michael McCandless wrote:
>> 
>> Ie, it's still not clear if you are running out of memory vs hitting some weird "it's too hard for GC to deal" kind of massive heap fragmentation situation or something.  It reminds me of the special ("I cannot be played on record player X") record (your application) that cannot be played on a given record player X (your JRE) in Gödel, Escher, Bach ;)
>> 
> Perhaps it's been too long since I've seen that book, but if I remember right, he only has to get himself a JRE Omega version and he should be all set...


That's right ;)  But JRE Omega will still have a [different] record that causes its GC to grind to a halt... (or maybe I don't remember the book very well!).

Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


      


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Mark Miller wrote:

> Michael McCandless wrote:
>>
>> Ie, it's still not clear if you are running out of memory vs  
>> hitting some weird "it's too hard for GC to deal" kind of massive  
>> heap fragmentation situation or something.  It reminds me of the  
>> special ("I cannot be played on record player X") record (your  
>> application) that cannot be played on a given record player X (your  
>> JRE) in Gödel, Escher, Bach ;)
>>
> Perhaps it's been too long since I've seen that book, but if I  
> remember right, he only has to get himself a JRE Omega version and  
> he should be all set...


That's right ;)  But JRE Omega will still have a [different] record  
that causes its GC to grind to a halt... (or maybe I don't remember  
the book very well!).

Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by Mark Miller <ma...@gmail.com>.
Michael McCandless wrote:
>
> Ie, it's still not clear if you are running out of memory vs hitting 
> some weird "it's too hard for GC to deal" kind of massive heap 
> fragmentation situation or something.  It reminds me of the special 
> ("I cannot be played on record player X") record (your application) 
> that cannot be played on a given record player X (your JRE) in Gödel, 
> Escher, Bach ;)
>
Perhaps it's been too long since I've seen that book, but if I remember 
right, he only has to get himself a JRE Omega version and he should be 
all set...

-- 
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by Michael McCandless <lu...@mikemccandless.com>.
mark harwood wrote:

>
> Thanks, I have a heap dump now from a run with reduced JVM memory  
> (in order to speed up a failure point) and am working through it  
> offline with VisualVm.
> This test induced a proper OOM as opposed to one of those "timed out  
> waiting for GC " type OOMs so may be misleading.

Hmm -- it's not good that the problem changed on you ;)  You may
simply not be giving the app enough memory now?

> The main culprit in this particular dump looks to be
> FreqProxTermsWriter$PostingsList but the number of instances is in  
> line
> with the volumes of terms I would have expected at that stage of
> indexing.
> I'll report back on my findings as I discover more.

Yeah this is expected to be a top user of RAM when you have mostly  
unique terms.

> I have another run soldiering on with the -XX:-UseGCOverheadLimit   
> setting to avoid GC -related timeouts and this has not hit OOM but  
> is slowing to a crawl.
>
> I'll try capturing InfoStream too if it doesn't generate terabytes  
> of data.

Be careful, though: if your root cause is "GC takes too long to  
run" (that "timed out waiting for GC" exception), then running with  
this setting changes the game.

Ie, it's still not clear if you are running out of memory vs hitting  
some weird "it's too hard for GC to deal" kind of massive heap  
fragmentation situation or something.  It reminds me of the special  
("I cannot be played on record player X") record (your application)  
that cannot be played on a given record player X (your JRE) in Gödel,  
Escher, Bach ;)

Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by mark harwood <ma...@yahoo.co.uk>.
Thanks, I have a heap dump now from a run with reduced JVM memory (in order to speed up a failure point) and am working through it offline with VisualVm.
This test induced a proper OOM as opposed to one of those "timed out waiting for GC " type OOMs so may be misleading.
The main culprit in this particular dump looks to be
FreqProxTermsWriter$PostingsList but the number of instances is in line
with the volumes of terms I would have expected at that stage of
indexing.
I'll report back on my findings as I discover more.

I have another run soldiering on with the -XX:-UseGCOverheadLimit setting to avoid GC-related timeouts, and this has not hit OOM but is slowing to a crawl.

I'll try capturing InfoStream too if it doesn't generate terabytes of data.

Cheers
Mark







----- Original Message ----
From: Florian Weimer <fw...@deneb.enyo.de>
To: java-user@lucene.apache.org
Sent: Wednesday, 11 March, 2009 10:42:33
Subject: Re: A model for predicting indexing memory costs?

* mark harwood:

>>>Could you get a heap dump (eg with YourKit) of what's using up all the memory when you hit OOM?
>
> On this particular machine I have a JRE, no admin rights and
> therefore limited profiling capability :(

Maybe this could give you a heap dump which you can analyze on a
different box?

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/file.dump

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


      


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by Florian Weimer <fw...@deneb.enyo.de>.
* mark harwood:

>>>Could you get a heap dump (eg with YourKit) of what's using up all the memory when you hit OOM?
>
> On this particular machine I have a JRE, no admin rights and
> therefore limited profiling capability :(

Maybe this could give you a heap dump which you can analyze on a
different box?

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/file.dump

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by Michael McCandless <lu...@mikemccandless.com>.
mark harwood wrote:

>
>>> Could you get a heap dump (eg with YourKit) of what's using up all  
>>> the memory when you hit OOM?
>
> On this particular machine I have a JRE, no admin rights and  
> therefore limited profiling capability :(
> That's why I was trying to come up with some formula for estimating  
> memory usage.

Hmm, OK.  infoStream?

Or maybe you could use jmap -histo <pid>?
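E.g. something like this (rough sketch -- send the infoStream wherever is convenient):

    writer.setInfoStream(System.out);   // IndexWriter then logs flush/merge details

and, as a poor-man's profiler, dump the biggest object types with:

    jmap -histo <pid> | head -30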

>>> When you say "write session", are you closing & opening a new  
>>> IndexWriter each time?
>
> Yes, commit then close.

OK.

>>> It seems likely this has something to do with merging,
>
> Presumably (norms + deleted arrays discounted) RAM usage for merges  
> is not proportional to number of terms or docs? I imagine the  
> structures being merged are streamed rather than loaded in whole or  
> as a fixed percentage of the whole.

Oh yeah, right.  No deleted docs to load, and no docMaps to create.
And no norms.  So merging should use negligible amounts of RAM.

So the exception really means "GC is working too hard", and your
graph (plus the super-slow GC when you run with
-XX:-UseGCOverheadLimit) seems to indicate GC is taking a lot of time to
collect.  But: how much RAM is actually in use?  Can you plot that?

I suppose you could be hitting some kind of horrible degenerate
fragmentation craziness worst-case-scenario situation, that's swamping
the GC...

Maybe start binary-searching -- turn features off (like Trie*) and see
if the problem goes away, then go back and drill down?

If you take that 98th batch of docs and index it into a fresh index, do
you also hit the GC-working-too-hard exception?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by Michael McCandless <lu...@mikemccandless.com>.
mark harwood wrote:

> I think a modelling/sizing spreadsheet would be a useful addition to  
> our documentation.

IW will simply use up to the RAM you told it to, and then flush.

Add onto that RAM consumed by merging, which in the presence of
deletes is totDocCount * 4.125 bytes, plus numberOfFieldsWithNorms *
totDocCount * 1 byte.
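For example, with made-up numbers -- say 300M docs in the segments being
merged and 2 fields with norms -- that would be roughly:

    300,000,000 docs * 4.125 bytes      ~= 1.24 GB  (int[] docMaps + deletion bits; only when there are deletes)
    2 * 300,000,000 docs * 1 byte       ~= 0.60 GB  (norms)

i.e. on the order of 1.8 GB on top of the configured RAM buffer, in that
worst case.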

I guess plus all the BufferedIndexInput buffers held open during  
merging,
though that's likely smallish (4 KB per).

I think that's roughly it?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by mark harwood <ma...@yahoo.co.uk>.
> I get really belligerent when being told to solve problems while wearing a ball-and-chain.

I seem to have touched quite a nerve there then, Erick ;)
I appreciate your sympathy.

To be fair I haven't exhausted all possible avenues in changing the environment, but I do remain interested in understanding more about memory consumption.
I think a modelling/sizing spreadsheet would be a useful addition to our documentation.




----- Original Message ----
From: Erick Erickson <er...@gmail.com>
To: java-user@lucene.apache.org
Sent: Tuesday, 10 March, 2009 15:08:24
Subject: Re: A model for predicting indexing memory costs?

You have my sympathy. Let's see, you're being told "we can't give
you the tools you need to diagnose/fix the problem, but fix it anyway".
Probably with the addendum "And fix it by Friday".

You might want to consider staging a mutiny until "the powers that be"
can give you a solution. Perhaps working with the system admins to
set up profiling. Perhaps a temporary ID that has enough privileges to
do what you need that can then be deleted. Perhaps.......

In case you're wondering, I've been in that situation often
enough that I get really belligerent when being told to solve problems
while wearing a ball-and-chain. Invariably it takes more company
time/resources than getting what I need to go forward directly.

I mean, you'll be in a situation where you'll only be able to say
"Well, I changed some things. Whether they were the right things
I really can't say because I don't understand the root of the problem
because you silly people won't let me use a profiler". And, even worse,
if the problem goes away after you change something, you won't be
able to say very much about whether or not it'll come back since you
really don't know whether you've actually fixed anything or just masked
the problem temporarily with the possibility that it'll come back when
you add document N+1. By which time you (and others) will have to
reconstruct all you know now, which is expensive.

So, be much more diplomatic than this note, please <G>. But you may
want to point out that without appropriate tools, your company may
well spend significant time, yours and others, repeatedly trying to
fix this issue. Through n+1 rounds. I've actually had good results
by pointing out that it's not only *your* time that's at risk, but
customers' time too. Whether you define customers as internal
or external is irrelevant. Every round of diagnosis/fix carries the risk
that N people waste time (and get paid for it). All to avoid a little
up-front cost due to admin privileges (in this case).


OK, enough ranting..

Best
Erick

On Tue, Mar 10, 2009 at 10:50 AM, mark harwood <ma...@yahoo.co.uk>wrote:

>
> >>Could you get a heap dump (eg with YourKit) of what's using up all the
> memory when you hit OOM?
>
> On this particular machine I have a JRE, no admin rights and therefore
> limited profiling capability :(
> That's why I was trying to come up with some formula for estimating memory
> usage.
>
> >>When you say "write session", are you closing & opening a new IndexWriter
> each time?
>
> Yes, commit then close.
>
> >>It seems likely this has something to do with merging,
>
> Presumably (norms + deleted arrays discounted) RAM usage for merges is not
> proportional to number of terms or docs? I imagine the structures being
> merged are streamed rather than loaded in whole or as a fixed percentage of
> the whole.
>
>
>
> ----- Original Message ----
> From: Michael McCandless <lu...@mikemccandless.com>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 10 March, 2009 14:23:59
> Subject: Re: A model for predicting indexing memory costs?
>
>
> Mark,
>
> Could you get a heap dump (eg with YourKit) of what's using up all the
> memory when you hit OOM?
>
> Also, can you turn on infoStream and post the output leading up to the OOM?
>
> When you say "write session", are you closing & opening a new IndexWriter
> each time?  Or, just calling .commit() and then re-using the same writer?
>
> It seems likely this has something to do with merging, though from your
> listing I count 14 segments which shouldn't have been doing any merging at
> mergeFactor=20, so that's confusing.
>
> Mike
>
> mark harwood wrote:
>
> >
> >>> But.... how come setting IW's RAM buffer doesn't prevent the OOMs?
> >
> > I've been setting the IndexWriter RAM buffer to 300 meg and giving the
> JVM 1gig.
> >
> > Last run I gave the JVM 3 gig, with writer settings of  RAM buffer=300
> meg, merge factor=20, term interval=8192, usecompound=false. All fields are
> ANALYZED_NO_NORMS.
> > Lucene version is a 2.9 build,  JVM is Sun 64bit 1.6.0_07.
> >
> > This graphic shows timings for 100 consecutive write sessions, each
> adding 30,000 documents, committing and then closing :
> >    http://tinyurl.com/anzcjw
> > You can see the periodic merge costs and then a big spike towards the end
> before it crashed.
> >
> > The crash details are here after adding ~3 million documents in 98 write
> sessions:
> >
> > This batch index session added 3000 of 30000 docs : 10% complete
> > Exception in thread "Thread-280" java.lang.OutOfMemoryError: GC overhead
> limit exceeded
> >    at java.util.Arrays.copyOf(Unknown Source)
> >    at java.lang.String.<init>(Unknown Source)
> >    at
> org.apache.lucene.search.trie.TrieUtils.longToPrefixCoded(TrieUtils.java:148)
> >    at
> org.apache.lucene.search.trie.TrieUtils.trieCodeLong(TrieUtils.java:302)
> >    at
> test.LongTrieAnalyzer$LongTrieTokenStream.next(LongTrieAnalyzer.java:49)
> >    at
> org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:159)
> >    at
> org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
> >    at
> org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
> >    at
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:762)
> >    at
> org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:740)
> >    at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
> >    at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
> >    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:319)
> > Exception in thread "Thread-281" java.lang.OutOfMemoryError: GC overhead
> limit exceeded
> >    at org.apache.commons.csv.CharBuffer.toString(CharBuffer.java:177)
> >    at org.apache.commons.csv.CSVParser.getLine(CSVParser.java:242)
> >    at test.IndexMarksFile.getLuceneDocument(IndexMarksFile.java:272)
> >    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:314)
> > Committing
> > Closing
> > Exception in thread "main" java.lang.IllegalStateException: this writer
> hit an OutOfMemoryError; cannot commit
> >    at
> org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3569)
> >    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3660)
> >    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3634)
> >    at test.IndexMarksFile.run(IndexMarksFile.java:176)
> >    at test.IndexMarksFile.main(IndexMarksFile.java:101)
> >    at test.MultiIndexAndRun.main(MultiIndexAndRun.java:49)
> >
> >
> > For each write session I have a single writer, and 2 indexing threads
> adding documents through this writer. There are no updates/deletes - only
> adds. When both indexing threads complete the primary thread commits and
> closes the writer.
> > I then open a searcher run some search benchmarks, close the searcher and
> start another write session.
> > The documents have ~12 fields and are all the same size so I don't think
> this OOM is down to rogue data. Each field has 100 near-unique tokens.
> >
> > The files on disk after the crash are as follows:
> > 1930004059 Mar  9 13:32 _106.fdt
> >    2731084 Mar  9 13:32 _106.fdx
> >        175 Mar  9 13:30 _106.fnm
> > 1190042394 Mar  9 13:39 _106.frq
> >  814748995 Mar  9 13:39 _106.prx
> >   16512596 Mar  9 13:39 _106.tii
> > 1151364311 Mar  9 13:39 _106.tis
> > 1949444533 Mar  9 14:53 _139.fdt
> >    2758580 Mar  9 14:53 _139.fdx
> >        175 Mar  9 14:51 _139.fnm
> > 1202044423 Mar  9 15:00 _139.frq
> >  822954002 Mar  9 15:00 _139.prx
> >   16629104 Mar  9 15:00 _139.tii
> > 1159392207 Mar  9 15:00 _139.tis
> > 1930102055 Mar  9 16:15 _16c.fdt
> >    2731084 Mar  9 16:15 _16c.fdx
> >        175 Mar  9 16:13 _16c.fnm
> > 1190090014 Mar  9 16:22 _16c.frq
> >  814763781 Mar  9 16:22 _16c.prx
> >   16514967 Mar  9 16:22 _16c.tii
> > 1151524173 Mar  9 16:22 _16c.tis
> > 1928053697 Mar  9 17:52 _19e.fdt
> >    2728260 Mar  9 17:52 _19e.fdx
> >        175 Mar  9 17:46 _19e.fnm
> > 1188837093 Mar  9 18:08 _19e.frq
> >  813915820 Mar  9 18:08 _19e.prx
> >   16501902 Mar  9 18:08 _19e.tii
> > 1150623773 Mar  9 18:08 _19e.tis
> > 1951474247 Mar  9 20:22 _1cj.fdt
> >    2761396 Mar  9 20:22 _1cj.fdx
> >        175 Mar  9 20:18 _1cj.fnm
> > 1203285781 Mar  9 20:39 _1cj.frq
> >  823797656 Mar  9 20:39 _1cj.prx
> >   16639997 Mar  9 20:39 _1cj.tii
> > 1160143978 Mar  9 20:39 _1cj.tis
> > 1929978366 Mar 10 01:02 _1fm.fdt
> >    2731060 Mar 10 01:02 _1fm.fdx
> >        175 Mar 10 00:43 _1fm.fnm
> > 1190031780 Mar 10 02:36 _1fm.frq
> >  814741146 Mar 10 02:36 _1fm.prx
> >   16513189 Mar 10 02:36 _1fm.tii
> > 1151399139 Mar 10 02:36 _1fm.tis
> >  189073186 Mar 10 01:51 _1ft.fdt
> >     267556 Mar 10 01:51 _1ft.fdx
> >        175 Mar 10 01:50 _1ft.fnm
> >  110750150 Mar 10 02:04 _1ft.frq
> >   79818488 Mar 10 02:04 _1ft.prx
> >    2326691 Mar 10 02:04 _1ft.tii
> >  165932844 Mar 10 02:04 _1ft.tis
> >  212500024 Mar 10 03:16 _1g5.fdt
> >     300684 Mar 10 03:16 _1g5.fdx
> >        175 Mar 10 03:16 _1g5.fnm
> >  125179984 Mar 10 03:28 _1g5.frq
> >   89703062 Mar 10 03:28 _1g5.prx
> >    2594360 Mar 10 03:28 _1g5.tii
> >  184495760 Mar 10 03:28 _1g5.tis
> >   64323505 Mar 10 04:09 _1gc.fdt
> >      91020 Mar 10 04:09 _1gc.fdx
> >  105283820 Mar 10 04:48 _1gf.fdt
> >     148988 Mar 10 04:48 _1gf.fdx
> >        175 Mar 10 04:09 _1gf.fnm
> >       1491 Mar 10 04:09 _1gf.frq
> >          4 Mar 10 04:09 _1gf.nrm
> >       2388 Mar 10 04:09 _1gf.prx
> >        254 Mar 10 04:09 _1gf.tii
> >      15761 Mar 10 04:09 _1gf.tis
> >  191035191 Mar 10 04:09 _1gg.fdt
> >     270332 Mar 10 04:09 _1gg.fdx
> >        175 Mar 10 04:09 _1gg.fnm
> >  111958741 Mar 10 04:24 _1gg.frq
> >   80645411 Mar 10 04:24 _1gg.prx
> >    2349153 Mar 10 04:24 _1gg.tii
> >  167494232 Mar 10 04:24 _1gg.tis
> >        175 Mar 10 04:20 _1gh.fnm
> >   10223275 Mar 10 04:20 _1gh.frq
> >          4 Mar 10 04:20 _1gh.nrm
> >    9056546 Mar 10 04:20 _1gh.prx
> >     329012 Mar 10 04:20 _1gh.tii
> >   23846511 Mar 10 04:20 _1gh.tis
> >        175 Mar 10 04:28 _1gi.fnm
> >   10221888 Mar 10 04:28 _1gi.frq
> >          4 Mar 10 04:28 _1gi.nrm
> >    9054280 Mar 10 04:28 _1gi.prx
> >     328980 Mar 10 04:28 _1gi.tii
> >   23843209 Mar 10 04:28 _1gi.tis
> >        175 Mar 10 04:35 _1gj.fnm
> >   10222776 Mar 10 04:35 _1gj.frq
> >          4 Mar 10 04:35 _1gj.nrm
> >    9054943 Mar 10 04:35 _1gj.prx
> >     329060 Mar 10 04:35 _1gj.tii
> >   23849395 Mar 10 04:35 _1gj.tis
> >        175 Mar 10 04:42 _1gk.fnm
> >   10220381 Mar 10 04:42 _1gk.frq
> >          4 Mar 10 04:42 _1gk.nrm
> >    9052810 Mar 10 04:42 _1gk.prx
> >     329029 Mar 10 04:42 _1gk.tii
> >   23845373 Mar 10 04:42 _1gk.tis
> >        175 Mar 10 04:48 _1gl.fnm
> >    9274170 Mar 10 04:48 _1gl.frq
> >          4 Mar 10 04:48 _1gl.nrm
> >    8226681 Mar 10 04:48 _1gl.prx
> >     303327 Mar 10 04:48 _1gl.tii
> >   21996826 Mar 10 04:48 _1gl.tis
> >   22418126 Mar 10 04:58 _1gm.fdt
> >      31732 Mar 10 04:58 _1gm.fdx
> >        175 Mar 10 04:57 _1gm.fnm
> >   10216672 Mar 10 04:57 _1gm.frq
> >          4 Mar 10 04:57 _1gm.nrm
> >    9049487 Mar 10 04:57 _1gm.prx
> >     328813 Mar 10 04:57 _1gm.tii
> >   23829627 Mar 10 04:57 _1gm.tis
> >        175 Mar 10 04:58 _1gn.fnm
> >     392014 Mar 10 04:58 _1gn.frq
> >          4 Mar 10 04:58 _1gn.nrm
> >     415225 Mar 10 04:58 _1gn.prx
> >      24695 Mar 10 04:58 _1gn.tii
> >    1816750 Mar 10 04:58 _1gn.tis
> >        683 Mar 10 04:58 segments_7t
> >         20 Mar 10 04:58 segments.gen
> > 1935727800 Mar  9 11:17 _u1.fdt
> >    2739180 Mar  9 11:17 _u1.fdx
> >        175 Mar  9 11:15 _u1.fnm
> > 1193583522 Mar  9 11:25 _u1.frq
> >  817164507 Mar  9 11:25 _u1.prx
> >   16547464 Mar  9 11:25 _u1.tii
> > 1153764013 Mar  9 11:25 _u1.tis
> > 1949493315 Mar  9 12:21 _x3.fdt
> >    2758580 Mar  9 12:21 _x3.fdx
> >        175 Mar  9 12:18 _x3.fnm
> > 1202068425 Mar  9 12:29 _x3.frq
> >  822963200 Mar  9 12:29 _x3.prx
> >   16629485 Mar  9 12:29 _x3.tii
> > 1159419149 Mar  9 12:29 _x3.tis
> >
> >
> > Any ideas? I'm out of settings to tweak here.
> >
> > Cheers,
> > Mark
> >
> >
> >
> >
> > ----- Original Message ----
> > From: Michael McCandless <lu...@mikemccandless.com>
> > To: java-user@lucene.apache.org
> > Sent: Tuesday, 10 March, 2009 0:01:30
> > Subject: Re: A model for predicting indexing memory costs?
> >
> >
> > mark harwood wrote:
> >
> >>
> >> I've been building a large index (hundreds of millions) with mainly
> structured data which consists of several fields with mostly unique values.
> >> I've been hitting out of memory issues when doing periodic
> commits/closes which I suspect is down to the sheer number of terms.
> >>
> >> I set the IndexWriter.setTermIndexInterval to 8 times the normal size
> of 128 (an interval of 1024) which delayed the onset of the issue but still
> failed.
> >
> > I think that setting won't change how much RAM is used when writing.
> >
> >> I'd like to get a little more scientific about what to set here rather
> than simply experimenting with settings and hoping it doesn't fail again.
> >>
> >> Does anyone have a decent model worked out for how much memory is
> consumed at peak? I'm guessing the contributing factors are:
> >>
> >> * Numbers of fields
> >> * Numbers of unique terms per field
> >> * Numbers of segments?
> >
> > Number of net unique terms (across all fields) is a big driver, but also
> net number of term occurrences, and how many docs.  Lots of tiny docs take
> more RAM than fewer large docs, when # occurrences are equal.
> >
> > But... how come setting IW's RAM buffer doesn't prevent the OOMs?  IW
> should simply flush when it's used that much RAM.
> >
> > I don't think number of segments is a factor.
> >
> > Though mergeFactor is, since during merging the SegmentMerger holds
> SegmentReaders open, and int[] maps (if there are any deletes) for each
> segment.  Do you have a large merge taking place when you hit the OOMs?
> >
> > Mike
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



      


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by Erick Erickson <er...@gmail.com>.
You have my sympathy. Let's see, you're being told "we can't give
you the tools you need to diagnose/fix the problem, but fix it anyway".
Probably with the addendum "And fix it by Friday".

You might want to consider staging a mutiny until "the powers that be"
can give you a solution. Perhaps working with the system admins to
set up profiling. Perhaps a temporary ID that has enough privileges to
do what you need that can then be deleted. Perhaps.......

In case you're wondering, I've been in that situation often
enough that I get really belligerent when being told to solve problems
while wearing a ball-and-chain. Invariably it takes more company
time/resources than getting what I need to go forward directly.

I mean, you'll be in a situation where you'll only be able to say
"Well, I changed some things. Whether they were the right things
I really can't say because I don't understand the root of the problem
because you silly people won't let me use a profiler". And, even worse,
if the problem goes away after you change something, you won't be
able to say very much about whether or not it'll come back since you
really don't know whether you've actually fixed anything or just masked
the problem temporarily with the possibility that it'll come back when
you add document N+1. By which time you (and others) will have to
reconstruct all you know now, which is expensive.

So, be much more diplomatic than this note, please <G>. But you may
want to point out that without appropriate tools, your company may
well spend significant time, yours and others, repeatedly trying to
fix this issue. Through n+1 rounds. I've actually had good results
by pointing out that it's not only *your* time that's at risk, but
customers' time too. Whether you define customers as internal
or external is irrelevant. Every round of diagnosis/fix carries the risk
that N people waste time (and get paid for it). All to avoid a little
up-front cost due to admin privileges (in this case).


OK, enough ranting..

Best
Erick

On Tue, Mar 10, 2009 at 10:50 AM, mark harwood <ma...@yahoo.co.uk>wrote:

>
> >>Could you get a heap dump (eg with YourKit) of what's using up all the
> memory when you hit OOM?
>
> On this particular machine I have a JRE, no admin rights and therefore
> limited profiling capability :(
> That's why I was trying to come up with some formula for estimating memory
> usage.
>
> >>When you say "write session", are you closing & opening a new IndexWriter
> each time?
>
> Yes, commit then close.
>
> >>It seems likely this has something to do with merging,
>
> Presumably (norms + deleted arrays discounted) RAM usage for merges is not
> proportional to number of terms or docs? I imagine the structures being
> merged are streamed rather than loaded in whole or as a fixed percentage of
> the whole.
>
>
>
> ----- Original Message ----
> From: Michael McCandless <lu...@mikemccandless.com>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 10 March, 2009 14:23:59
> Subject: Re: A model for predicting indexing memory costs?
>
>
> Mark,
>
> Could you get a heap dump (eg with YourKit) of what's using up all the
> memory when you hit OOM?
>
> Also, can you turn on infoStream and post the output leading up to the OOM?
>
> When you say "write session", are you closing & opening a new IndexWriter
> each time?  Or, just calling .commit() and then re-using the same writer?
>
> It seems likely this has something to do with merging, though from your
> listing I count 14 segments which shouldn't have been doing any merging at
> mergeFactor=20, so that's confusing.
>
> Mike
>
> mark harwood wrote:
>
> >
> >>> But... how come setting IW's RAM buffer doesn't prevent the OOMs?
> >
> > I've been setting the IndexWriter RAM buffer to 300 meg and giving the
> JVM 1gig.
> >
> > Last run I gave the JVM 3 gig, with writer settings of  RAM buffer=300
> meg, merge factor=20, term interval=8192, usecompound=false. All fields are
> ANALYZED_NO_NORMS.
> > Lucene version is a 2.9 build,  JVM is Sun 64bit 1.6.0_07.
> >
> > This graphic shows timings for 100 consecutive write sessions, each
> adding 30,000 documents, committing and then closing :
> >    http://tinyurl.com/anzcjw
> > You can see the periodic merge costs and then a big spike towards the end
> before it crashed.
> >
> > The crash details are here after adding ~3 million documents in 98 write
> sessions:
> >
> > This batch index session added 3000 of 30000 docs : 10% complete
> > Exception in thread "Thread-280" java.lang.OutOfMemoryError: GC overhead
> limit exceeded
> >    at java.util.Arrays.copyOf(Unknown Source)
> >    at java.lang.String.<init>(Unknown Source)
> >    at
> org.apache.lucene.search.trie.TrieUtils.longToPrefixCoded(TrieUtils.java:148)
> >    at
> org.apache.lucene.search.trie.TrieUtils.trieCodeLong(TrieUtils.java:302)
> >    at
> test.LongTrieAnalyzer$LongTrieTokenStream.next(LongTrieAnalyzer.java:49)
> >    at
> org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:159)
> >    at
> org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
> >    at
> org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
> >    at
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:762)
> >    at
> org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:740)
> >    at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
> >    at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
> >    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:319)
> > Exception in thread "Thread-281" java.lang.OutOfMemoryError: GC overhead
> limit exceeded
> >    at org.apache.commons.csv.CharBuffer.toString(CharBuffer.java:177)
> >    at org.apache.commons.csv.CSVParser.getLine(CSVParser.java:242)
> >    at test.IndexMarksFile.getLuceneDocument(IndexMarksFile.java:272)
> >    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:314)
> > Committing
> > Closing
> > Exception in thread "main" java.lang.IllegalStateException: this writer
> hit an OutOfMemoryError; cannot commit
> >    at
> org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3569)
> >    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3660)
> >    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3634)
> >    at test.IndexMarksFile.run(IndexMarksFile.java:176)
> >    at test.IndexMarksFile.main(IndexMarksFile.java:101)
> >    at test.MultiIndexAndRun.main(MultiIndexAndRun.java:49)
> >
> >
> > For each write session I have a single writer, and 2 indexing threads
> adding documents through this writer. There are no updates/deletes - only
> adds. When both indexing threads complete the primary thread commits and
> closes the writer.
> > I then open a searcher run some search benchmarks, close the searcher and
> start another write session.
> > The documents have ~12 fields and are all the same size so I don't think
> this OOM is down to rogue data. Each field has 100 near-unique tokens.
> >
> > The files on disk after the crash are as follows:
> > 1930004059 Mar  9 13:32 _106.fdt
> >    2731084 Mar  9 13:32 _106.fdx
> >        175 Mar  9 13:30 _106.fnm
> > 1190042394 Mar  9 13:39 _106.frq
> >  814748995 Mar  9 13:39 _106.prx
> >   16512596 Mar  9 13:39 _106.tii
> > 1151364311 Mar  9 13:39 _106.tis
> > 1949444533 Mar  9 14:53 _139.fdt
> >    2758580 Mar  9 14:53 _139.fdx
> >        175 Mar  9 14:51 _139.fnm
> > 1202044423 Mar  9 15:00 _139.frq
> >  822954002 Mar  9 15:00 _139.prx
> >   16629104 Mar  9 15:00 _139.tii
> > 1159392207 Mar  9 15:00 _139.tis
> > 1930102055 Mar  9 16:15 _16c.fdt
> >    2731084 Mar  9 16:15 _16c.fdx
> >        175 Mar  9 16:13 _16c.fnm
> > 1190090014 Mar  9 16:22 _16c.frq
> >  814763781 Mar  9 16:22 _16c.prx
> >   16514967 Mar  9 16:22 _16c.tii
> > 1151524173 Mar  9 16:22 _16c.tis
> > 1928053697 Mar  9 17:52 _19e.fdt
> >    2728260 Mar  9 17:52 _19e.fdx
> >        175 Mar  9 17:46 _19e.fnm
> > 1188837093 Mar  9 18:08 _19e.frq
> >  813915820 Mar  9 18:08 _19e.prx
> >   16501902 Mar  9 18:08 _19e.tii
> > 1150623773 Mar  9 18:08 _19e.tis
> > 1951474247 Mar  9 20:22 _1cj.fdt
> >    2761396 Mar  9 20:22 _1cj.fdx
> >        175 Mar  9 20:18 _1cj.fnm
> > 1203285781 Mar  9 20:39 _1cj.frq
> >  823797656 Mar  9 20:39 _1cj.prx
> >   16639997 Mar  9 20:39 _1cj.tii
> > 1160143978 Mar  9 20:39 _1cj.tis
> > 1929978366 Mar 10 01:02 _1fm.fdt
> >    2731060 Mar 10 01:02 _1fm.fdx
> >        175 Mar 10 00:43 _1fm.fnm
> > 1190031780 Mar 10 02:36 _1fm.frq
> >  814741146 Mar 10 02:36 _1fm.prx
> >   16513189 Mar 10 02:36 _1fm.tii
> > 1151399139 Mar 10 02:36 _1fm.tis
> >  189073186 Mar 10 01:51 _1ft.fdt
> >     267556 Mar 10 01:51 _1ft.fdx
> >        175 Mar 10 01:50 _1ft.fnm
> >  110750150 Mar 10 02:04 _1ft.frq
> >   79818488 Mar 10 02:04 _1ft.prx
> >    2326691 Mar 10 02:04 _1ft.tii
> >  165932844 Mar 10 02:04 _1ft.tis
> >  212500024 Mar 10 03:16 _1g5.fdt
> >     300684 Mar 10 03:16 _1g5.fdx
> >        175 Mar 10 03:16 _1g5.fnm
> >  125179984 Mar 10 03:28 _1g5.frq
> >   89703062 Mar 10 03:28 _1g5.prx
> >    2594360 Mar 10 03:28 _1g5.tii
> >  184495760 Mar 10 03:28 _1g5.tis
> >   64323505 Mar 10 04:09 _1gc.fdt
> >      91020 Mar 10 04:09 _1gc.fdx
> >  105283820 Mar 10 04:48 _1gf.fdt
> >     148988 Mar 10 04:48 _1gf.fdx
> >        175 Mar 10 04:09 _1gf.fnm
> >       1491 Mar 10 04:09 _1gf.frq
> >          4 Mar 10 04:09 _1gf.nrm
> >       2388 Mar 10 04:09 _1gf.prx
> >        254 Mar 10 04:09 _1gf.tii
> >      15761 Mar 10 04:09 _1gf.tis
> >  191035191 Mar 10 04:09 _1gg.fdt
> >     270332 Mar 10 04:09 _1gg.fdx
> >        175 Mar 10 04:09 _1gg.fnm
> >  111958741 Mar 10 04:24 _1gg.frq
> >   80645411 Mar 10 04:24 _1gg.prx
> >    2349153 Mar 10 04:24 _1gg.tii
> >  167494232 Mar 10 04:24 _1gg.tis
> >        175 Mar 10 04:20 _1gh.fnm
> >   10223275 Mar 10 04:20 _1gh.frq
> >          4 Mar 10 04:20 _1gh.nrm
> >    9056546 Mar 10 04:20 _1gh.prx
> >     329012 Mar 10 04:20 _1gh.tii
> >   23846511 Mar 10 04:20 _1gh.tis
> >        175 Mar 10 04:28 _1gi.fnm
> >   10221888 Mar 10 04:28 _1gi.frq
> >          4 Mar 10 04:28 _1gi.nrm
> >    9054280 Mar 10 04:28 _1gi.prx
> >     328980 Mar 10 04:28 _1gi.tii
> >   23843209 Mar 10 04:28 _1gi.tis
> >        175 Mar 10 04:35 _1gj.fnm
> >   10222776 Mar 10 04:35 _1gj.frq
> >          4 Mar 10 04:35 _1gj.nrm
> >    9054943 Mar 10 04:35 _1gj.prx
> >     329060 Mar 10 04:35 _1gj.tii
> >   23849395 Mar 10 04:35 _1gj.tis
> >        175 Mar 10 04:42 _1gk.fnm
> >   10220381 Mar 10 04:42 _1gk.frq
> >          4 Mar 10 04:42 _1gk.nrm
> >    9052810 Mar 10 04:42 _1gk.prx
> >     329029 Mar 10 04:42 _1gk.tii
> >   23845373 Mar 10 04:42 _1gk.tis
> >        175 Mar 10 04:48 _1gl.fnm
> >    9274170 Mar 10 04:48 _1gl.frq
> >          4 Mar 10 04:48 _1gl.nrm
> >    8226681 Mar 10 04:48 _1gl.prx
> >     303327 Mar 10 04:48 _1gl.tii
> >   21996826 Mar 10 04:48 _1gl.tis
> >   22418126 Mar 10 04:58 _1gm.fdt
> >      31732 Mar 10 04:58 _1gm.fdx
> >        175 Mar 10 04:57 _1gm.fnm
> >   10216672 Mar 10 04:57 _1gm.frq
> >          4 Mar 10 04:57 _1gm.nrm
> >    9049487 Mar 10 04:57 _1gm.prx
> >     328813 Mar 10 04:57 _1gm.tii
> >   23829627 Mar 10 04:57 _1gm.tis
> >        175 Mar 10 04:58 _1gn.fnm
> >     392014 Mar 10 04:58 _1gn.frq
> >          4 Mar 10 04:58 _1gn.nrm
> >     415225 Mar 10 04:58 _1gn.prx
> >      24695 Mar 10 04:58 _1gn.tii
> >    1816750 Mar 10 04:58 _1gn.tis
> >        683 Mar 10 04:58 segments_7t
> >         20 Mar 10 04:58 segments.gen
> > 1935727800 Mar  9 11:17 _u1.fdt
> >    2739180 Mar  9 11:17 _u1.fdx
> >        175 Mar  9 11:15 _u1.fnm
> > 1193583522 Mar  9 11:25 _u1.frq
> >  817164507 Mar  9 11:25 _u1.prx
> >   16547464 Mar  9 11:25 _u1.tii
> > 1153764013 Mar  9 11:25 _u1.tis
> > 1949493315 Mar  9 12:21 _x3.fdt
> >    2758580 Mar  9 12:21 _x3.fdx
> >        175 Mar  9 12:18 _x3.fnm
> > 1202068425 Mar  9 12:29 _x3.frq
> >  822963200 Mar  9 12:29 _x3.prx
> >   16629485 Mar  9 12:29 _x3.tii
> > 1159419149 Mar  9 12:29 _x3.tis
> >
> >
> > Any ideas? I'm out of settings to tweak here.
> >
> > Cheers,
> > Mark
> >
> >
> >
> >
> > ----- Original Message ----
> > From: Michael McCandless <lu...@mikemccandless.com>
> > To: java-user@lucene.apache.org
> > Sent: Tuesday, 10 March, 2009 0:01:30
> > Subject: Re: A model for predicting indexing memory costs?
> >
> >
> > mark harwood wrote:
> >
> >>
> >> I've been building a large index (hundreds of millions) with mainly
> structured data which consists of several fields with mostly unique values.
> >> I've been hitting out of memory issues when doing periodic
> commits/closes which I suspect is down to the sheer number of terms.
> >>
> >> I set the IndexWriter.setTermIndexInterval to 8 times the normal size
> of 128 (an interval of 1024) which delayed the onset of the issue but still
> failed.
> >
> > I think that setting won't change how much RAM is used when writing.
> >
> >> I'd like to get a little more scientific about what to set here rather
> than simply experimenting with settings and hoping it doesn't fail again.
> >>
> >> Does anyone have a decent model worked out for how much memory is
> consumed at peak? I'm guessing the contributing factors are:
> >>
> >> * Numbers of fields
> >> * Numbers of unique terms per field
> >> * Numbers of segments?
> >
> > Number of net unique terms (across all fields) is a big driver, but also
> net number of term occurrences, and how many docs.  Lots of tiny docs take
> more RAM than fewer large docs, when # occurrences are equal.
> >
> > But... how come setting IW's RAM buffer doesn't prevent the OOMs?  IW
> should simply flush when it's used that much RAM.
> >
> > I don't think number of segments is a factor.
> >
> > Though mergeFactor is, since during merging the SegmentMerger holds
> SegmentReaders open, and int[] maps (if there are any deletes) for each
> segment.  Do you have a large merge taking place when you hit the OOMs?
> >
> > Mike
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: A model for predicting indexing memory costs?

Posted by mark harwood <ma...@yahoo.co.uk>.
>>Could you get a heap dump (eg with YourKit) of what's using up all the memory when you hit OOM?

On this particular machine I have a JRE, no admin rights and therefore limited profiling capability :(
That's why I was trying to come up with some formula for estimating memory usage.

>>When you say "write session", are you closing & opening a new IndexWriter each time? 

Yes, commit then close.

>>It seems likely this has something to do with merging,

Presumably (norms + deleted arrays discounted) RAM usage for merges is not proportional to number of terms or docs? I imagine the structures being merged are streamed rather than loaded in whole or as a fixed percentage of the whole.



----- Original Message ----
From: Michael McCandless <lu...@mikemccandless.com>
To: java-user@lucene.apache.org
Sent: Tuesday, 10 March, 2009 14:23:59
Subject: Re: A model for predicting indexing memory costs?


Mark,

Could you get a heap dump (eg with YourKit) of what's using up all the memory when you hit OOM?

Also, can you turn on infoStream and post the output leading up to the OOM?

When you say "write session", are you closing & opening a new IndexWriter each time?  Or, just calling .commit() and then re-using the same writer?

It seems likely this has something to do with merging, though from your listing I count 14 segments which shouldn't have been doing any merging at mergeFactor=20, so that's confusing.

Mike

mark harwood wrote:

> 
>>> But... how come setting IW's RAM buffer doesn't prevent the OOMs?
> 
> I've been setting the IndexWriter RAM buffer to 300 meg and giving the JVM 1gig.
> 
> Last run I gave the JVM 3 gig, with writer settings of  RAM buffer=300 meg, merge factor=20, term interval=8192, usecompound=false. All fields are ANALYZED_NO_NORMS.
> Lucene version is a 2.9 build,  JVM is Sun 64bit 1.6.0_07.
> 
> This graphic shows timings for 100 consecutive write sessions, each adding 30,000 documents, committing and then closing :
>    http://tinyurl.com/anzcjw
> You can see the periodic merge costs and then a big spike towards the end before it crashed.
> 
> The crash details are here after adding ~3 million documents in 98 write sessions:
> 
> This batch index session added 3000 of 30000 docs : 10% complete
> Exception in thread "Thread-280" java.lang.OutOfMemoryError: GC overhead limit exceeded
>    at java.util.Arrays.copyOf(Unknown Source)
>    at java.lang.String.<init>(Unknown Source)
>    at org.apache.lucene.search.trie.TrieUtils.longToPrefixCoded(TrieUtils.java:148)
>    at org.apache.lucene.search.trie.TrieUtils.trieCodeLong(TrieUtils.java:302)
>    at test.LongTrieAnalyzer$LongTrieTokenStream.next(LongTrieAnalyzer.java:49)
>    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:159)
>    at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
>    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
>    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:762)
>    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:740)
>    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
>    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
>    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:319)
> Exception in thread "Thread-281" java.lang.OutOfMemoryError: GC overhead limit exceeded
>    at org.apache.commons.csv.CharBuffer.toString(CharBuffer.java:177)
>    at org.apache.commons.csv.CSVParser.getLine(CSVParser.java:242)
>    at test.IndexMarksFile.getLuceneDocument(IndexMarksFile.java:272)
>    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:314)
> Committing
> Closing
> Exception in thread "main" java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
>    at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3569)
>    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3660)
>    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3634)
>    at test.IndexMarksFile.run(IndexMarksFile.java:176)
>    at test.IndexMarksFile.main(IndexMarksFile.java:101)
>    at test.MultiIndexAndRun.main(MultiIndexAndRun.java:49)
> 
> 
> For each write session I have a single writer, and 2 indexing threads adding documents through this writer. There are no updates/deletes - only adds. When both indexing threads complete the primary thread commits and closes the writer.
> I then open a searcher run some search benchmarks, close the searcher and start another write session.
> The documents have ~12 fields and are all the same size so I don't think this OOM is down to rogue data. Each field has 100 near-unique tokens.
> 
> The files on disk after the crash are as follows:
> 1930004059 Mar  9 13:32 _106.fdt
>    2731084 Mar  9 13:32 _106.fdx
>        175 Mar  9 13:30 _106.fnm
> 1190042394 Mar  9 13:39 _106.frq
>  814748995 Mar  9 13:39 _106.prx
>   16512596 Mar  9 13:39 _106.tii
> 1151364311 Mar  9 13:39 _106.tis
> 1949444533 Mar  9 14:53 _139.fdt
>    2758580 Mar  9 14:53 _139.fdx
>        175 Mar  9 14:51 _139.fnm
> 1202044423 Mar  9 15:00 _139.frq
>  822954002 Mar  9 15:00 _139.prx
>   16629104 Mar  9 15:00 _139.tii
> 1159392207 Mar  9 15:00 _139.tis
> 1930102055 Mar  9 16:15 _16c.fdt
>    2731084 Mar  9 16:15 _16c.fdx
>        175 Mar  9 16:13 _16c.fnm
> 1190090014 Mar  9 16:22 _16c.frq
>  814763781 Mar  9 16:22 _16c.prx
>   16514967 Mar  9 16:22 _16c.tii
> 1151524173 Mar  9 16:22 _16c.tis
> 1928053697 Mar  9 17:52 _19e.fdt
>    2728260 Mar  9 17:52 _19e.fdx
>        175 Mar  9 17:46 _19e.fnm
> 1188837093 Mar  9 18:08 _19e.frq
>  813915820 Mar  9 18:08 _19e.prx
>   16501902 Mar  9 18:08 _19e.tii
> 1150623773 Mar  9 18:08 _19e.tis
> 1951474247 Mar  9 20:22 _1cj.fdt
>    2761396 Mar  9 20:22 _1cj.fdx
>        175 Mar  9 20:18 _1cj.fnm
> 1203285781 Mar  9 20:39 _1cj.frq
>  823797656 Mar  9 20:39 _1cj.prx
>   16639997 Mar  9 20:39 _1cj.tii
> 1160143978 Mar  9 20:39 _1cj.tis
> 1929978366 Mar 10 01:02 _1fm.fdt
>    2731060 Mar 10 01:02 _1fm.fdx
>        175 Mar 10 00:43 _1fm.fnm
> 1190031780 Mar 10 02:36 _1fm.frq
>  814741146 Mar 10 02:36 _1fm.prx
>   16513189 Mar 10 02:36 _1fm.tii
> 1151399139 Mar 10 02:36 _1fm.tis
>  189073186 Mar 10 01:51 _1ft.fdt
>     267556 Mar 10 01:51 _1ft.fdx
>        175 Mar 10 01:50 _1ft.fnm
>  110750150 Mar 10 02:04 _1ft.frq
>   79818488 Mar 10 02:04 _1ft.prx
>    2326691 Mar 10 02:04 _1ft.tii
>  165932844 Mar 10 02:04 _1ft.tis
>  212500024 Mar 10 03:16 _1g5.fdt
>     300684 Mar 10 03:16 _1g5.fdx
>        175 Mar 10 03:16 _1g5.fnm
>  125179984 Mar 10 03:28 _1g5.frq
>   89703062 Mar 10 03:28 _1g5.prx
>    2594360 Mar 10 03:28 _1g5.tii
>  184495760 Mar 10 03:28 _1g5.tis
>   64323505 Mar 10 04:09 _1gc.fdt
>      91020 Mar 10 04:09 _1gc.fdx
>  105283820 Mar 10 04:48 _1gf.fdt
>     148988 Mar 10 04:48 _1gf.fdx
>        175 Mar 10 04:09 _1gf.fnm
>       1491 Mar 10 04:09 _1gf.frq
>          4 Mar 10 04:09 _1gf.nrm
>       2388 Mar 10 04:09 _1gf.prx
>        254 Mar 10 04:09 _1gf.tii
>      15761 Mar 10 04:09 _1gf.tis
>  191035191 Mar 10 04:09 _1gg.fdt
>     270332 Mar 10 04:09 _1gg.fdx
>        175 Mar 10 04:09 _1gg.fnm
>  111958741 Mar 10 04:24 _1gg.frq
>   80645411 Mar 10 04:24 _1gg.prx
>    2349153 Mar 10 04:24 _1gg.tii
>  167494232 Mar 10 04:24 _1gg.tis
>        175 Mar 10 04:20 _1gh.fnm
>   10223275 Mar 10 04:20 _1gh.frq
>          4 Mar 10 04:20 _1gh.nrm
>    9056546 Mar 10 04:20 _1gh.prx
>     329012 Mar 10 04:20 _1gh.tii
>   23846511 Mar 10 04:20 _1gh.tis
>        175 Mar 10 04:28 _1gi.fnm
>   10221888 Mar 10 04:28 _1gi.frq
>          4 Mar 10 04:28 _1gi.nrm
>    9054280 Mar 10 04:28 _1gi.prx
>     328980 Mar 10 04:28 _1gi.tii
>   23843209 Mar 10 04:28 _1gi.tis
>        175 Mar 10 04:35 _1gj.fnm
>   10222776 Mar 10 04:35 _1gj.frq
>          4 Mar 10 04:35 _1gj.nrm
>    9054943 Mar 10 04:35 _1gj.prx
>     329060 Mar 10 04:35 _1gj.tii
>   23849395 Mar 10 04:35 _1gj.tis
>        175 Mar 10 04:42 _1gk.fnm
>   10220381 Mar 10 04:42 _1gk.frq
>          4 Mar 10 04:42 _1gk.nrm
>    9052810 Mar 10 04:42 _1gk.prx
>     329029 Mar 10 04:42 _1gk.tii
>   23845373 Mar 10 04:42 _1gk.tis
>        175 Mar 10 04:48 _1gl.fnm
>    9274170 Mar 10 04:48 _1gl.frq
>          4 Mar 10 04:48 _1gl.nrm
>    8226681 Mar 10 04:48 _1gl.prx
>     303327 Mar 10 04:48 _1gl.tii
>   21996826 Mar 10 04:48 _1gl.tis
>   22418126 Mar 10 04:58 _1gm.fdt
>      31732 Mar 10 04:58 _1gm.fdx
>        175 Mar 10 04:57 _1gm.fnm
>   10216672 Mar 10 04:57 _1gm.frq
>          4 Mar 10 04:57 _1gm.nrm
>    9049487 Mar 10 04:57 _1gm.prx
>     328813 Mar 10 04:57 _1gm.tii
>   23829627 Mar 10 04:57 _1gm.tis
>        175 Mar 10 04:58 _1gn.fnm
>     392014 Mar 10 04:58 _1gn.frq
>          4 Mar 10 04:58 _1gn.nrm
>     415225 Mar 10 04:58 _1gn.prx
>      24695 Mar 10 04:58 _1gn.tii
>    1816750 Mar 10 04:58 _1gn.tis
>        683 Mar 10 04:58 segments_7t
>         20 Mar 10 04:58 segments.gen
> 1935727800 Mar  9 11:17 _u1.fdt
>    2739180 Mar  9 11:17 _u1.fdx
>        175 Mar  9 11:15 _u1.fnm
> 1193583522 Mar  9 11:25 _u1.frq
>  817164507 Mar  9 11:25 _u1.prx
>   16547464 Mar  9 11:25 _u1.tii
> 1153764013 Mar  9 11:25 _u1.tis
> 1949493315 Mar  9 12:21 _x3.fdt
>    2758580 Mar  9 12:21 _x3.fdx
>        175 Mar  9 12:18 _x3.fnm
> 1202068425 Mar  9 12:29 _x3.frq
>  822963200 Mar  9 12:29 _x3.prx
>   16629485 Mar  9 12:29 _x3.tii
> 1159419149 Mar  9 12:29 _x3.tis
> 
> 
> Any ideas? I'm out of settings to tweak here.
> 
> Cheers,
> Mark
> 
> 
> 
> 
> ----- Original Message ----
> From: Michael McCandless <lu...@mikemccandless.com>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 10 March, 2009 0:01:30
> Subject: Re: A model for predicting indexing memory costs?
> 
> 
> mark harwood wrote:
> 
>> 
>> I've been building a large index (hundreds of millions) with mainly structured data which consists of several fields with mostly unique values.
>> I've been hitting out of memory issues when doing periodic commits/closes which I suspect is down to the sheer number of terms.
>> 
>> I set the IndexWriter.setTermIndexInterval to 8 times the normal size of 128 (an interval of 1024) which delayed the onset of the issue but still failed.
> 
> I think that setting won't change how much RAM is used when writing.
> 
>> I'd like to get a little more scientific about what to set here rather than simply experimenting with settings and hoping it doesn't fail again.
>> 
>> Does anyone have a decent model worked out for how much memory is consumed at peak? I'm guessing the contributing factors are:
>> 
>> * Numbers of fields
>> * Numbers of unique terms per field
>> * Numbers of segments?
> 
> Number of net unique terms (across all fields) is a big driver, but also net number of term occurrences, and how many docs.  Lots of tiny docs take more RAM than fewer large docs, when # occurrences are equal.
> 
> But... how come setting IW's RAM buffer doesn't prevent the OOMs?  IW should simply flush when it's used that much RAM.
> 
> I don't think number of segments is a factor.
> 
> Though mergeFactor is, since during merging the SegmentMerger holds SegmentReaders open, and int[] maps (if there are any deletes) for each segment.  Do you have a large merge taking place when you hit the OOMs?
> 
> Mike
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


      


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Mark,

Could you get a heap dump (eg with YourKit) of what's using up all the  
memory when you hit OOM?

Also, can you turn on infoStream and post the output leading up to the  
OOM?

When you say "write session", are you closing & opening a new  
IndexWriter each time?  Or, just calling .commit() and then re-using  
the same writer?

It seems likely this has something to do with merging, though from  
your listing I count 14 segments which shouldn't have been doing any  
merging at mergeFactor=20, so that's confusing.
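
For concreteness, here is a minimal sketch of a 2.9-era IndexWriter set up
with the settings described in this thread and with infoStream turned on
(the directory path and analyzer are placeholders, not your actual code):

    import java.io.File;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class InfoStreamSketch {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/index")),  // placeholder path
            new WhitespaceAnalyzer(),                      // placeholder analyzer
            true,                                          // create a new index
            IndexWriter.MaxFieldLength.UNLIMITED);

        // The settings reported earlier in the thread
        writer.setRAMBufferSizeMB(300);
        writer.setMergeFactor(20);
        writer.setTermIndexInterval(8192);
        writer.setUseCompoundFile(false);

        // Send flush/merge diagnostics to stdout so the events leading
        // up to an OOM can be inspected afterwards
        writer.setInfoStream(System.out);

        // ... addDocument() calls go here ...

        writer.commit();
        writer.close();
      }
    }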

Mike

mark harwood wrote:

>
>>> But... how come setting IW's RAM buffer doesn't prevent the OOMs?
>
> I've been setting the IndexWriter RAM buffer to 300 meg and giving  
> the JVM 1gig.
>
> Last run I gave the JVM 3 gig, with writer settings of  RAM  
> buffer=300 meg, merge factor=20, term interval=8192,  
> usecompound=false. All fields are ANALYZED_NO_NORMS.
> Lucene version is a 2.9 build,  JVM is Sun 64bit 1.6.0_07.
>
> This graphic shows timings for 100 consecutive write sessions, each  
> adding 30,000 documents, committing and then closing :
>     http://tinyurl.com/anzcjw
> You can see the periodic merge costs and then a big spike towards  
> the end before it crashed.
>
> The crash details are here after adding ~3 million documents in 98  
> write sessions:
>
> This batch index session added 3000 of 30000 docs : 10% complete
> Exception in thread "Thread-280" java.lang.OutOfMemoryError: GC  
> overhead limit exceeded
>    at java.util.Arrays.copyOf(Unknown Source)
>    at java.lang.String..<init>(Unknown Source)
>    at  
> org 
> .apache 
> .lucene.search.trie.TrieUtils.longToPrefixCoded(TrieUtils.java:148)
>    at  
> org.apache.lucene.search.trie.TrieUtils.trieCodeLong(TrieUtils.java: 
> 302)
>    at test.LongTrieAnalyzer 
> $LongTrieTokenStream.next(LongTrieAnalyzer.java:49)
>    at  
> org 
> .apache 
> .lucene 
> .index.DocInverterPerField.processFields(DocInverterPerField.java:159)
>    at  
> org 
> .apache 
> .lucene 
> .index 
> .DocFieldConsumersPerField 
> .processFields(DocFieldConsumersPerField.java:36)
>    at  
> org 
> .apache 
> .lucene 
> .index 
> .DocFieldProcessorPerThread 
> .processDocument(DocFieldProcessorPerThread.java:234)
>    at  
> org 
> .apache 
> .lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:762)
>    at  
> org 
> .apache 
> .lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:740)
>    at  
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
>    at  
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
>    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:319)
> Exception in thread "Thread-281" java.lang.OutOfMemoryError: GC  
> overhead limit exceeded
>    at org.apache.commons.csv.CharBuffer.toString(CharBuffer.java:177)
>    at org.apache.commons.csv.CSVParser.getLine(CSVParser.java:242)
>    at test.IndexMarksFile.getLuceneDocument(IndexMarksFile.java:272)
>    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:314)
> Committing
> Closing
> Exception in thread "main" java.lang.IllegalStateException: this  
> writer hit an OutOfMemoryError; cannot commit
>    at  
> org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java: 
> 3569)
>    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java: 
> 3660)
>    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java: 
> 3634)
>    at test.IndexMarksFile.run(IndexMarksFile.java:176)
>    at test.IndexMarksFile.main(IndexMarksFile.java:101)
>    at test.MultiIndexAndRun.main(MultiIndexAndRun.java:49)
>
>
> For each write session I have a single writer, and 2 indexing  
> threads adding documents through this writer. There are no updates/ 
> deletes - only adds. When both indexing threads complete the primary  
> thread commits and closes the writer.
> I then open a searcher run some search benchmarks, close the  
> searcher and start another write session.
> The documents have ~12 fields and are all the same size so I don't  
> think this OOM is down to rogue data. Each field has 100 near-unique  
> tokens.
>
> The files on disk after the crash are as follows:
> 1930004059 Mar  9 13:32 _106.fdt
>    2731084 Mar  9 13:32 _106.fdx
>        175 Mar  9 13:30 _106.fnm
> 1190042394 Mar  9 13:39 _106.frq
>  814748995 Mar  9 13:39 _106.prx
>   16512596 Mar  9 13:39 _106.tii
> 1151364311 Mar  9 13:39 _106.tis
> 1949444533 Mar  9 14:53 _139.fdt
>    2758580 Mar  9 14:53 _139.fdx
>        175 Mar  9 14:51 _139.fnm
> 1202044423 Mar  9 15:00 _139.frq
>  822954002 Mar  9 15:00 _139.prx
>   16629104 Mar  9 15:00 _139.tii
> 1159392207 Mar  9 15:00 _139.tis
> 1930102055 Mar  9 16:15 _16c.fdt
>    2731084 Mar  9 16:15 _16c.fdx
>        175 Mar  9 16:13 _16c.fnm
> 1190090014 Mar  9 16:22 _16c.frq
>  814763781 Mar  9 16:22 _16c.prx
>   16514967 Mar  9 16:22 _16c.tii
> 1151524173 Mar  9 16:22 _16c.tis
> 1928053697 Mar  9 17:52 _19e.fdt
>    2728260 Mar  9 17:52 _19e.fdx
>        175 Mar  9 17:46 _19e.fnm
> 1188837093 Mar  9 18:08 _19e.frq
>  813915820 Mar  9 18:08 _19e.prx
>   16501902 Mar  9 18:08 _19e.tii
> 1150623773 Mar  9 18:08 _19e.tis
> 1951474247 Mar  9 20:22 _1cj.fdt
>    2761396 Mar  9 20:22 _1cj.fdx
>        175 Mar  9 20:18 _1cj.fnm
> 1203285781 Mar  9 20:39 _1cj.frq
>  823797656 Mar  9 20:39 _1cj.prx
>   16639997 Mar  9 20:39 _1cj.tii
> 1160143978 Mar  9 20:39 _1cj.tis
> 1929978366 Mar 10 01:02 _1fm.fdt
>    2731060 Mar 10 01:02 _1fm.fdx
>        175 Mar 10 00:43 _1fm.fnm
> 1190031780 Mar 10 02:36 _1fm.frq
>  814741146 Mar 10 02:36 _1fm.prx
>   16513189 Mar 10 02:36 _1fm.tii
> 1151399139 Mar 10 02:36 _1fm.tis
>  189073186 Mar 10 01:51 _1ft.fdt
>     267556 Mar 10 01:51 _1ft.fdx
>        175 Mar 10 01:50 _1ft.fnm
>  110750150 Mar 10 02:04 _1ft.frq
>   79818488 Mar 10 02:04 _1ft.prx
>    2326691 Mar 10 02:04 _1ft.tii
>  165932844 Mar 10 02:04 _1ft.tis
>  212500024 Mar 10 03:16 _1g5.fdt
>     300684 Mar 10 03:16 _1g5.fdx
>        175 Mar 10 03:16 _1g5.fnm
>  125179984 Mar 10 03:28 _1g5.frq
>   89703062 Mar 10 03:28 _1g5.prx
>    2594360 Mar 10 03:28 _1g5.tii
>  184495760 Mar 10 03:28 _1g5.tis
>   64323505 Mar 10 04:09 _1gc.fdt
>      91020 Mar 10 04:09 _1gc.fdx
>  105283820 Mar 10 04:48 _1gf.fdt
>     148988 Mar 10 04:48 _1gf.fdx
>        175 Mar 10 04:09 _1gf.fnm
>       1491 Mar 10 04:09 _1gf.frq
>          4 Mar 10 04:09 _1gf.nrm
>       2388 Mar 10 04:09 _1gf.prx
>        254 Mar 10 04:09 _1gf.tii
>      15761 Mar 10 04:09 _1gf.tis
>  191035191 Mar 10 04:09 _1gg.fdt
>     270332 Mar 10 04:09 _1gg.fdx
>        175 Mar 10 04:09 _1gg.fnm
>  111958741 Mar 10 04:24 _1gg.frq
>   80645411 Mar 10 04:24 _1gg.prx
>    2349153 Mar 10 04:24 _1gg.tii
>  167494232 Mar 10 04:24 _1gg.tis
>        175 Mar 10 04:20 _1gh.fnm
>   10223275 Mar 10 04:20 _1gh.frq
>          4 Mar 10 04:20 _1gh.nrm
>    9056546 Mar 10 04:20 _1gh.prx
>     329012 Mar 10 04:20 _1gh.tii
>   23846511 Mar 10 04:20 _1gh.tis
>        175 Mar 10 04:28 _1gi.fnm
>   10221888 Mar 10 04:28 _1gi.frq
>          4 Mar 10 04:28 _1gi.nrm
>    9054280 Mar 10 04:28 _1gi.prx
>     328980 Mar 10 04:28 _1gi.tii
>   23843209 Mar 10 04:28 _1gi.tis
>        175 Mar 10 04:35 _1gj.fnm
>   10222776 Mar 10 04:35 _1gj.frq
>          4 Mar 10 04:35 _1gj.nrm
>    9054943 Mar 10 04:35 _1gj.prx
>     329060 Mar 10 04:35 _1gj.tii
>   23849395 Mar 10 04:35 _1gj.tis
>        175 Mar 10 04:42 _1gk.fnm
>   10220381 Mar 10 04:42 _1gk.frq
>          4 Mar 10 04:42 _1gk.nrm
>    9052810 Mar 10 04:42 _1gk.prx
>     329029 Mar 10 04:42 _1gk.tii
>   23845373 Mar 10 04:42 _1gk.tis
>        175 Mar 10 04:48 _1gl.fnm
>    9274170 Mar 10 04:48 _1gl.frq
>          4 Mar 10 04:48 _1gl.nrm
>    8226681 Mar 10 04:48 _1gl.prx
>     303327 Mar 10 04:48 _1gl.tii
>   21996826 Mar 10 04:48 _1gl.tis
>   22418126 Mar 10 04:58 _1gm.fdt
>      31732 Mar 10 04:58 _1gm.fdx
>        175 Mar 10 04:57 _1gm.fnm
>   10216672 Mar 10 04:57 _1gm.frq
>          4 Mar 10 04:57 _1gm.nrm
>    9049487 Mar 10 04:57 _1gm.prx
>     328813 Mar 10 04:57 _1gm.tii
>   23829627 Mar 10 04:57 _1gm.tis
>        175 Mar 10 04:58 _1gn.fnm
>     392014 Mar 10 04:58 _1gn.frq
>          4 Mar 10 04:58 _1gn.nrm
>     415225 Mar 10 04:58 _1gn.prx
>      24695 Mar 10 04:58 _1gn.tii
>    1816750 Mar 10 04:58 _1gn.tis
>        683 Mar 10 04:58 segments_7t
>         20 Mar 10 04:58 segments.gen
> 1935727800 Mar  9 11:17 _u1.fdt
>    2739180 Mar  9 11:17 _u1.fdx
>        175 Mar  9 11:15 _u1.fnm
> 1193583522 Mar  9 11:25 _u1.frq
>  817164507 Mar  9 11:25 _u1.prx
>   16547464 Mar  9 11:25 _u1.tii
> 1153764013 Mar  9 11:25 _u1.tis
> 1949493315 Mar  9 12:21 _x3.fdt
>    2758580 Mar  9 12:21 _x3.fdx
>        175 Mar  9 12:18 _x3.fnm
> 1202068425 Mar  9 12:29 _x3.frq
>  822963200 Mar  9 12:29 _x3.prx
>   16629485 Mar  9 12:29 _x3.tii
> 1159419149 Mar  9 12:29 _x3.tis
>
>
> Any ideas? I'm out of settings to tweak here.
>
> Cheers,
> Mark
>
>
>
>
> ----- Original Message ----
> From: Michael McCandless <lu...@mikemccandless.com>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 10 March, 2009 0:01:30
> Subject: Re: A model for predicting indexing memory costs?
>
>
> mark harwood wrote:
>
>>
>> I've been building a large index (hundreds of millions) with mainly  
>> structured data which consists of several fields with mostly unique  
>> values.
>> I've been hitting out of memory issues when doing periodic commits/ 
>> closes which I suspect is down to the sheer number of terms.
>>
>> I set the IndexWriter..setTermIndexInterval to 8 times the normal  
>> size of 128 (an intervalof 1024) which delayed the onset of the  
>> issue but still failed.
>
> I think that setting won't change how much RAM is used when writing.
>
>> I'd like to get a little more scientific about what to set here  
>> rather than simply experimenting with settings and hoping it  
>> doesn't fail again.
>>
>> Does anyone have a decent model worked out for how much memory is  
>> consumed at peak? I'm guessing the contributing factors are:
>>
>> * Numbers of fields
>> * Numbers of unique terms per field
>> * Numbers of segments?
>
> Number of net unique terms (across all fields) is a big driver, but  
> also net number of term occurrences, and how many docs.  Lots of  
> tiny docs take more RAM than fewer large docs, when # occurrences  
> are equal.
>
> But... how come setting IW's RAM buffer doesn't prevent the OOMs?   
> IW should simply flush when it's used that much RAM.
>
> I don't think number of segments is a factor.
>
> Though mergeFactor is, since during merging the SegmentMerger holds  
> SegmentReaders open, and int[] maps (if there are any deletes) for  
> each segment.  Do you have a large merge taking place when you hit  
> the OOMs?
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: A model for predicting indexing memory costs?

Posted by Jon Loken <jo...@bipsolutions.com>.
Hi,

I haven't followed the whole thread, so pardon me if I am off topic.

In terms of OutOfMemoryErrors, why not attempt to alleviate this in your code rather than relying entirely on garbage collection? In other words: set big objects to null when you are finished with them, in particular in loops (like in C++).

What we do in the indexing loop is to set myDocument = null; etc. In addition, every n loops we also do a garbage collection. Otherwise we would also have faced memory issues.
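
Roughly what that looks like inside the indexing loop, as a sketch only:
makeDocument() is a hypothetical helper, the interval is arbitrary, and
'writer' and 'rows' are assumed to be in scope already.

    int gcInterval = 10000;                        // arbitrary choice
    for (int i = 0; i < rows.size(); i++) {
        Document doc = makeDocument(rows.get(i));  // hypothetical helper
        writer.addDocument(doc);
        doc = null;                                // drop the reference explicitly
        if (i > 0 && i % gcInterval == 0) {
            System.gc();                           // a hint only; the JVM may ignore it
        }
    }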

Jon

-----Original Message-----
From: Ian Lea [mailto:ian.lea@gmail.com]
Sent: 10 March 2009 10:54
To: java-user@lucene.apache.org
Subject: Re: A model for predicting indexing memory costs?

That's not the usual OOM message is it? java.lang.OutOfMemoryError: GC overhead limit exceeded.

Looks like you might be able to work round it with -XX:-UseGCOverheadLimit

http://java-monitor.com/forum/archive/index.php/t-54.html
http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom


--
Ian.


On Tue, Mar 10, 2009 at 10:45 AM, mark harwood <ma...@yahoo.co.uk> wrote:
>
>>>But... how come setting IW's RAM buffer doesn't prevent the OOMs?
>
> I've been setting the IndexWriter RAM buffer to 300 meg and giving the JVM 1gig.
>
> Last run I gave the JVM 3 gig, with writer settings of  RAM buffer=300 meg, merge factor=20, term interval=8192, usecompound=false. All fields are ANALYZED_NO_NORMS.
> Lucene version is a 2.9 build,  JVM is Sun 64bit 1.6.0_07.
>
> This graphic shows timings for 100 consecutive write sessions, each adding 30,000 documents, committing and then closing :
>     http://tinyurl.com/anzcjw
> You can see the periodic merge costs and then a big spike towards the end before it crashed.
>
> The crash details are here after adding ~3 million documents in 98 write sessions:
>
> This batch index session added 3000 of 30000 docs : 10% complete
> Exception in thread "Thread-280" java.lang.OutOfMemoryError: GC
> overhead limit exceeded
>    at java.util.Arrays.copyOf(Unknown Source)
>    at java.lang.String..<init>(Unknown Source)
>    at
> org.apache.lucene.search.trie.TrieUtils.longToPrefixCoded(TrieUtils.ja
> va:148)
>    at
> org.apache.lucene.search.trie.TrieUtils.trieCodeLong(TrieUtils.java:30
> 2)
>    at
> test.LongTrieAnalyzer$LongTrieTokenStream.next(LongTrieAnalyzer.java:4
> 9)
>    at
> org.apache.lucene.index.DocInverterPerField.processFields(DocInverterP
> erField.java:159)
>    at
> org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFie
> ldConsumersPerField.java:36)
>    at
> org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(Doc
> FieldProcessorPerThread.java:234)
>    at
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter
> .java:762)
>    at
> org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.ja
> va:740)
>    at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
>    at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
>    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:319)
> Exception in thread "Thread-281" java.lang.OutOfMemoryError: GC
> overhead limit exceeded
>    at org.apache.commons.csv.CharBuffer.toString(CharBuffer.java:177)
>    at org.apache.commons.csv.CSVParser.getLine(CSVParser.java:242)
>    at test.IndexMarksFile.getLuceneDocument(IndexMarksFile.java:272)
>    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:314)
> Committing
> Closing
> Exception in thread "main" java.lang.IllegalStateException: this
> writer hit an OutOfMemoryError; cannot commit
>    at
> org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:356
> 9)
>    at
> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3660)
>    at
> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3634)
>    at test.IndexMarksFile.run(IndexMarksFile.java:176)
>    at test.IndexMarksFile.main(IndexMarksFile.java:101)
>    at test.MultiIndexAndRun.main(MultiIndexAndRun.java:49)
>
>
> For each write session I have a single writer, and 2 indexing threads adding documents through this writer. There are no updates/deletes - only adds. When both indexing threads complete the primary thread commits and closes the writer.
> I then open a searcher run some search benchmarks, close the searcher and start another write session.
> The documents have ~12 fields and are all the same size so I don't think this OOM is down to rogue data. Each field has 100 near-unique tokens.
>
> The files on disk after the crash are as follows:
>  1930004059 Mar  9 13:32 _106.fdt
>    2731084 Mar  9 13:32 _106.fdx
>        175 Mar  9 13:30 _106.fnm
>  1190042394 Mar  9 13:39 _106.frq
>  814748995 Mar  9 13:39 _106.prx
>   16512596 Mar  9 13:39 _106.tii
>  1151364311 Mar  9 13:39 _106.tis
>  1949444533 Mar  9 14:53 _139.fdt
>    2758580 Mar  9 14:53 _139.fdx
>        175 Mar  9 14:51 _139.fnm
>  1202044423 Mar  9 15:00 _139.frq
>  822954002 Mar  9 15:00 _139.prx
>   16629104 Mar  9 15:00 _139.tii
>  1159392207 Mar  9 15:00 _139.tis
>  1930102055 Mar  9 16:15 _16c.fdt
>    2731084 Mar  9 16:15 _16c.fdx
>        175 Mar  9 16:13 _16c.fnm
>  1190090014 Mar  9 16:22 _16c.frq
>  814763781 Mar  9 16:22 _16c.prx
>   16514967 Mar  9 16:22 _16c.tii
>  1151524173 Mar  9 16:22 _16c.tis
>  1928053697 Mar  9 17:52 _19e.fdt
>    2728260 Mar  9 17:52 _19e.fdx
>        175 Mar  9 17:46 _19e.fnm
>  1188837093 Mar  9 18:08 _19e.frq
>  813915820 Mar  9 18:08 _19e.prx
>   16501902 Mar  9 18:08 _19e.tii
>  1150623773 Mar  9 18:08 _19e.tis
>  1951474247 Mar  9 20:22 _1cj.fdt
>    2761396 Mar  9 20:22 _1cj.fdx
>        175 Mar  9 20:18 _1cj.fnm
>  1203285781 Mar  9 20:39 _1cj.frq
>  823797656 Mar  9 20:39 _1cj.prx
>   16639997 Mar  9 20:39 _1cj.tii
>  1160143978 Mar  9 20:39 _1cj.tis
>  1929978366 Mar 10 01:02 _1fm.fdt
>    2731060 Mar 10 01:02 _1fm.fdx
>        175 Mar 10 00:43 _1fm.fnm
>  1190031780 Mar 10 02:36 _1fm.frq
>  814741146 Mar 10 02:36 _1fm.prx
>   16513189 Mar 10 02:36 _1fm.tii
>  1151399139 Mar 10 02:36 _1fm.tis
>  189073186 Mar 10 01:51 _1ft.fdt
>     267556 Mar 10 01:51 _1ft.fdx
>        175 Mar 10 01:50 _1ft.fnm
>  110750150 Mar 10 02:04 _1ft.frq
>   79818488 Mar 10 02:04 _1ft.prx
>    2326691 Mar 10 02:04 _1ft.tii
>  165932844 Mar 10 02:04 _1ft.tis
>  212500024 Mar 10 03:16 _1g5.fdt
>     300684 Mar 10 03:16 _1g5.fdx
>        175 Mar 10 03:16 _1g5.fnm
>  125179984 Mar 10 03:28 _1g5.frq
>   89703062 Mar 10 03:28 _1g5.prx
>    2594360 Mar 10 03:28 _1g5.tii
>  184495760 Mar 10 03:28 _1g5.tis
>   64323505 Mar 10 04:09 _1gc.fdt
>      91020 Mar 10 04:09 _1gc.fdx
>  105283820 Mar 10 04:48 _1gf.fdt
>     148988 Mar 10 04:48 _1gf.fdx
>        175 Mar 10 04:09 _1gf.fnm
>       1491 Mar 10 04:09 _1gf.frq
>          4 Mar 10 04:09 _1gf.nrm
>       2388 Mar 10 04:09 _1gf.prx
>        254 Mar 10 04:09 _1gf.tii
>      15761 Mar 10 04:09 _1gf.tis
>  191035191 Mar 10 04:09 _1gg.fdt
>     270332 Mar 10 04:09 _1gg.fdx
>        175 Mar 10 04:09 _1gg.fnm
>  111958741 Mar 10 04:24 _1gg.frq
>   80645411 Mar 10 04:24 _1gg.prx
>    2349153 Mar 10 04:24 _1gg.tii
>  167494232 Mar 10 04:24 _1gg.tis
>        175 Mar 10 04:20 _1gh.fnm
>   10223275 Mar 10 04:20 _1gh.frq
>          4 Mar 10 04:20 _1gh.nrm
>    9056546 Mar 10 04:20 _1gh.prx
>     329012 Mar 10 04:20 _1gh.tii
>   23846511 Mar 10 04:20 _1gh.tis
>        175 Mar 10 04:28 _1gi.fnm
>   10221888 Mar 10 04:28 _1gi.frq
>          4 Mar 10 04:28 _1gi.nrm
>    9054280 Mar 10 04:28 _1gi.prx
>     328980 Mar 10 04:28 _1gi.tii
>   23843209 Mar 10 04:28 _1gi.tis
>        175 Mar 10 04:35 _1gj.fnm
>   10222776 Mar 10 04:35 _1gj.frq
>          4 Mar 10 04:35 _1gj.nrm
>    9054943 Mar 10 04:35 _1gj.prx
>     329060 Mar 10 04:35 _1gj.tii
>   23849395 Mar 10 04:35 _1gj.tis
>        175 Mar 10 04:42 _1gk.fnm
>   10220381 Mar 10 04:42 _1gk.frq
>          4 Mar 10 04:42 _1gk.nrm
>    9052810 Mar 10 04:42 _1gk.prx
>     329029 Mar 10 04:42 _1gk.tii
>   23845373 Mar 10 04:42 _1gk.tis
>        175 Mar 10 04:48 _1gl.fnm
>    9274170 Mar 10 04:48 _1gl.frq
>          4 Mar 10 04:48 _1gl.nrm
>    8226681 Mar 10 04:48 _1gl.prx
>     303327 Mar 10 04:48 _1gl.tii
>   21996826 Mar 10 04:48 _1gl.tis
>   22418126 Mar 10 04:58 _1gm.fdt
>      31732 Mar 10 04:58 _1gm.fdx
>        175 Mar 10 04:57 _1gm.fnm
>   10216672 Mar 10 04:57 _1gm.frq
>          4 Mar 10 04:57 _1gm.nrm
>    9049487 Mar 10 04:57 _1gm.prx
>     328813 Mar 10 04:57 _1gm.tii
>   23829627 Mar 10 04:57 _1gm.tis
>        175 Mar 10 04:58 _1gn.fnm
>     392014 Mar 10 04:58 _1gn.frq
>          4 Mar 10 04:58 _1gn.nrm
>     415225 Mar 10 04:58 _1gn.prx
>      24695 Mar 10 04:58 _1gn.tii
>    1816750 Mar 10 04:58 _1gn.tis
>        683 Mar 10 04:58 segments_7t
>         20 Mar 10 04:58 segments.gen
>  1935727800 Mar  9 11:17 _u1.fdt
>    2739180 Mar  9 11:17 _u1.fdx
>        175 Mar  9 11:15 _u1.fnm
>  1193583522 Mar  9 11:25 _u1.frq
>  817164507 Mar  9 11:25 _u1.prx
>   16547464 Mar  9 11:25 _u1.tii
>  1153764013 Mar  9 11:25 _u1.tis
>  1949493315 Mar  9 12:21 _x3.fdt
>    2758580 Mar  9 12:21 _x3.fdx
>        175 Mar  9 12:18 _x3.fnm
>  1202068425 Mar  9 12:29 _x3.frq
>  822963200 Mar  9 12:29 _x3.prx
>   16629485 Mar  9 12:29 _x3.tii
>  1159419149 Mar  9 12:29 _x3.tis
>
>
> Any ideas? I'm out of settings to tweak here.
>
> Cheers,
> Mark
>
>
>
>
> ----- Original Message ----
> From: Michael McCandless <lu...@mikemccandless.com>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 10 March, 2009 0:01:30
> Subject: Re: A model for predicting indexing memory costs?
>
>
> mark harwood wrote:
>
>>
>> I've been building a large index (hundreds of millions) with mainly structured data which consists of several fields with mostly unique values.
>> I've been hitting out of memory issues when doing periodic commits/closes which I suspect is down to the sheer number of terms.
>>
>> I set the IndexWriter..setTermIndexInterval to 8 times the normal size of 128 (an intervalof 1024) which delayed the onset of the issue but still failed.
>
> I think that setting won't change how much RAM is used when writing.
>
>> I'd like to get a little more scientific about what to set here rather than simply experimenting with settings and hoping it doesn't fail again.
>>
>> Does anyone have a decent model worked out for how much memory is consumed at peak? I'm guessing the contributing factors are:
>>
>> * Numbers of fields
>> * Numbers of unique terms per field
>> * Numbers of segments?
>
> Number of net unique terms (across all fields) is a big driver, but also net number of term occurrences, and how many docs.  Lots of tiny docs take more RAM than fewer large docs, when # occurrences are equal.
>
> But... how come setting IW's RAM buffer doesn't prevent the OOMs?  IW should simply flush when it's used that much RAM.
>
> I don't think number of segments is a factor.
>
> Though mergeFactor is, since during merging the SegmentMerger holds SegmentReaders open, and int[] maps (if there are any deletes) for each segment.  Do you have a large merge taking place when you hit the OOMs?
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by Grant Ingersoll <gs...@apache.org>.
On Mar 10, 2009, at 7:55 AM, mark harwood wrote:

>
>>> It does not indefinitely hang,
>
> I guess I just need to be more patient.
> Thanks for the GC settings. I don't currently have the luxury of "15  
> other" processors but this will definitely be of use in other  
> environments.

It is also usually helpful to look at what is causing all the garbage
in the first place.  I've seen that kind of major collection pause
(even on a multicore machine using parallel GC) induced when switching
over IndexReaders and having to load FieldCaches (for sorting, etc.).
Simply put, the Eden space and the old generation can fill up faster
than even the parallel collector can keep up with.  Taking the time to
analyze what actually needs the memory almost always solves the problem,
versus spending copious amounts of time tweaking GC parameters that are
under-documented and have non-obvious interactions with one another.
Plus, it undoubtedly plays to your strength as a search designer rather
than a JVM GC expert.

Mark M has a post on helpful tools at http://www.lucidimagination.com/blog/2009/02/09/investigating-oom-and-other-jvm-issues/
and I've found http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html
to be really useful at times.
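
To make the FieldCache point concrete, one option is to warm a reopened
reader before it serves queries, so the big allocations happen at a
predictable time rather than under search load. This is only a sketch and
the sort field names are made up:

    IndexReader newReader = oldReader.reopen();
    if (newReader != oldReader) {
        // Populate the sort caches now, not on the first query
        FieldCache.DEFAULT.getLongs(newReader, "price");    // hypothetical sort field
        FieldCache.DEFAULT.getStrings(newReader, "title");  // hypothetical sort field
        IndexSearcher newSearcher = new IndexSearcher(newReader);
        // ... swap newSearcher in, then release the old reader/searcher ...
        oldReader.close();
    }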

Just my two cents,
Grant


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by mark harwood <ma...@yahoo.co.uk>.
>>OK. What do you think about LUCENE-1541, does the more complicated API rectify the space improvement and reduced term number?


I don't see the Trie terms being the main contributor to the term pool. Using the Luke vocabulary-growth plugin I can see the number of unique terms tailing off fairly rapidly.
It's mainly other fields.

As for LUCENE-1541, I have used variable-sized steps in some related technology and stored the settings in my Analyzer.
All my analyzers are required to support the XML BeanEncoder/Decoder serialisation mechanism to record any state, so my query logic can read the serialized analyzer settings for a particular index back from disk.

I've found this to be a generally useful way of managing/retrieving index settings.
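
For anyone unfamiliar with that mechanism, the java.beans plumbing looks
roughly like this (MyAnalyzer and the settings file name are stand-ins for
whatever your application actually uses):

    import java.beans.XMLDecoder;
    import java.beans.XMLEncoder;
    import java.io.*;

    // At index time: write the analyzer's bean properties alongside the index
    XMLEncoder enc = new XMLEncoder(new BufferedOutputStream(
        new FileOutputStream("index/analyzer-settings.xml")));
    enc.writeObject(myAnalyzer);
    enc.close();

    // At query time: read the same settings back and rebuild the analyzer
    XMLDecoder dec = new XMLDecoder(new BufferedInputStream(
        new FileInputStream("index/analyzer-settings.xml")));
    MyAnalyzer restored = (MyAnalyzer) dec.readObject();
    dec.close();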


Cheers
Mark






----- Original Message ----
From: Uwe Schindler <uw...@thetaphi.de>
To: java-user@lucene.apache.org
Sent: Tuesday, 10 March, 2009 12:53:19
Subject: RE: A model for predicting indexing memory costs?

> >>It does not indefinitely hang,
> 
> I guess I just need to be more patient.
> Thanks for the GC settings. I don't currently have the luxury of "15
> other" processors but this will definitely be of use in other
> environments.

Even with one processor, a parallel GC is sometimes better. The traditional
GC does not really run decoupled in its own thread, so it often blocks the
normal code flow when memory is tight. The parallel GC runs almost completely
decoupled, reducing the wait time for the actual running Java code. But if a
large number of small instances is using too much memory, GC may be too slow
in freeing it up and you can still hit OOMs. Sometimes enabling verbose GC
output helps in diagnosing such problems.

> >>How works TrieRange for you?
> 
> I used it back when it was tucked away in PanFMP and so am very happy to
> see it making its way into core.
> I'm still struggling with these OOM issues here so it's too early to
> comment on query performance for this particular app - I need to get the
> index built first.

OK. What do you think about LUCENE-1541, does the more complicated API
rectify the space improvement and reduced term number?

Cheers,
Uwe

> ----- Original Message ----
> From: Uwe Schindler <uw...@thetaphi.de>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 10 March, 2009 11:32:48
> Subject: RE: A model for predicting indexing memory costs?
> 
> It does not indefinitely hang, I think the problem is, that the GC takes
> up
> all processor resources and nothing else runs any more. You should also
> enable the parallel GC. We had similar problems on the searching side,
> when
> the webserver suddenly stopped for about 20 minutes (!) and doing nothing
> more than garbage collecting (64bit JVM, Java 1.5.0_17, Solaris).
> Changing
> the web container's GC to parallel helped:
> 
> -Xms4096M -Xmx8192M -XX:MaxPermSize=512M -Xrs -XX:+UseConcMarkSweepGC
> -XX:+UseParNewGC -XX:+UseLargePages
> 
> Maybe add -XX:-UseGCOverheadLimit
> 
> One processor of this machine now always garbage collects :-), the other
> 15
> are serving searches...
> 
> Something other:
> From the stack trace I can see usage of TrieRange. The OOM happened there
> (when converting the char[] to a String). When we do not need two field
> names "field" and "field#trie" (because of sorting) (hope we can fix the
> sorting some time, see the corresponding JIRA issue), it would be better
> to
> index all trie values into one field. For that, a simpler API using a
> TrieTokenStream (like SOLR-940 uses for Trie, but because of that Solr is
> not able to sort at the moment) for indexing could be supplied. This API
> could directly use the buffers in the Token class when creating the trie
> encoded fields.
> 
> How works TrieRange for you? Are you happy, does searches work well with
> 30
> mio docs, which precisionStep do you use?
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> > -----Original Message-----
> > From: mark harwood [mailto:markharw00d@yahoo.co.uk]
> > Sent: Tuesday, March 10, 2009 12:07 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: A model for predicting indexing memory costs?
> >
> >
> > Thanks, Ian.
> >
> > I forgot to mention I tried that setting and it then seemed to hang
> > indefinitely.
> > I then switched back to a strategy of trying to minimise memory usage or
> > at least gain an understanding of how much memory would be required by
> my
> > application.
> >
> > Cheers
> > Mark
> >
> >
> >
> > ----- Original Message ----
> > From: Ian Lea <ia...@gmail..com>
> > To: java-user@lucene.apache.org
> > Sent: Tuesday, 10 March, 2009 10:54:05
> > Subject: Re: A model for predicting indexing memory costs?
> >
> > That's not the usual OOM message is it? java.lang.OutOfMemoryError: GC
> > overhead limit exceeded.
> >
> > Looks like you might be able to work round it with -XX:-
> UseGCOverheadLimit
> >
> > http://java-monitor.com/forum/archive/index.php/t-54.html
> >
> http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc
> > .oom
> >
> >
> > --
> > Ian.
> >
> >
> > On Tue, Mar 10, 2009 at 10:45 AM, mark harwood <ma...@yahoo.co.uk>
> > wrote:
> > >
> > >>>But... how come setting IW's RAM buffer doesn't prevent the OOMs?
> > >
> > > I've been setting the IndexWriter RAM buffer to 300 meg and giving the
> > JVM 1gig.
> > >
> > > Last run I gave the JVM 3 gig, with writer settings of  RAM buffer=300
> > meg, merge factor=20, term interval=8192, usecompound=false. All fields
> > are ANALYZED_NO_NORMS.
> > > Lucene version is a 2.9 build,  JVM is Sun 64bit 1.6.0_07.
> > >
> > > This graphic shows timings for 100 consecutive write sessions, each
> > adding 30,000 documents, committing and then closing :
> > >    http://tinyurl.com/anzcjw
> > > You can see the periodic merge costs and then a big spike towards the
> > end before it crashed.
> > >
> > > The crash details are here after adding ~3 million documents in 98
> write
> > sessions:
> > >
> > > This batch index session added 3000 of 30000 docs : 10% complete
> > > Exception in thread "Thread-280" java.lang.OutOfMemoryError: GC
> overhead
> > limit exceeded
> > >    at java.util.Arrays.copyOf(Unknown Source)
> > >    at java.lang.String..<init>(Unknown Source)
> > >    at
> >
> org.apache.lucene.search.trie.TrieUtils.longToPrefixCoded(TrieUtils.java:1
> > 48)
> > >    at
> > org.apache.lucene.search.trie.TrieUtils.trieCodeLong(TrieUtils.java:302)
> > >    at
> > test.LongTrieAnalyzer$LongTrieTokenStream.next(LongTrieAnalyzer.java:49)
> > >    at
> >
> org.apache.lucene.index.DocInverterPerField..processFields(DocInverterPerFi
> > eld.java:159)
> > >    at
> >
> org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldCo
> > nsumersPerField.java:36)
> > >    at
> >
> org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFiel
> > dProcessorPerThread.java:234)
> > >    at
> >
> org.apache..lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.ja
> > va:762)
> > >    at
> >
> org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:7
> > 40)
> > >    at
> > org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
> > >    at
> > org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
> > >    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:319)
> > > Exception in thread "Thread-281" java.lang.OutOfMemoryError: GC
> overhead
> > limit exceeded
> > >    at org.apache.commons.csv.CharBuffer..toString(CharBuffer.java:177)
> > >    at org.apache.commons.csv.CSVParser..getLine(CSVParser..java:242)
> > >    at test.IndexMarksFile.getLuceneDocument(IndexMarksFile.java:272)
> > >    at test.IndexMarksFile$IndexingThread..run(IndexMarksFile.java:314)
> > > Committing
> > > Closing
> > > Exception in thread "main" java.lang.IllegalStateException: this
> writer
> > hit an OutOfMemoryError; cannot commit
> > >    at
> > org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3569)
> > >    at
> org.apache.lucene..index.IndexWriter.commit(IndexWriter.java:3660)
> > >    at
> org.apache.lucene.index.IndexWriter..commit(IndexWriter.java:3634)
> > >    at test.IndexMarksFile.run(IndexMarksFile.java:176)
> > >    at test..IndexMarksFile.main(IndexMarksFile.java:101)
> > >    at test.MultiIndexAndRun.main(MultiIndexAndRun.java:49)
> > >
> > >
> > > For each write session I have a single writer, and 2 indexing threads
> > adding documents through this writer. There are no updates/deletes -
> only
> > adds. When both indexing threads complete the primary thread commits and
> > closes the writer.
> > > I then open a searcher run some search benchmarks, close the searcher
> > and start another write session.
> > > The documents have ~12 fields and are all the same size so I don't
> think
> > this OOM is down to rogue data. Each field has 100 near-unique tokens.
> > >
> > > The files on disk after the crash are as follows:
> > >  1930004059 Mar  9 13:32 _106.fdt
> > >    2731084 Mar  9 13:32 _106.fdx
> > >        175 Mar  9 13:30 _106.fnm
> > >  1190042394 Mar  9 13:39 _106.frq
> > >  814748995 Mar  9 13:39 _106.prx
> > >   16512596 Mar  9 13:39 _106.tii
> > >  1151364311 Mar  9 13:39 _106.tis
> > >  1949444533 Mar  9 14:53 _139.fdt
> > >    2758580 Mar  9 14:53 _139.fdx
> > >        175 Mar  9 14:51 _139.fnm
> > >  1202044423 Mar  9 15:00 _139.frq
> > >  822954002 Mar  9 15:00 _139.prx
> > >   16629104 Mar  9 15:00 _139.tii
> > >  1159392207 Mar  9 15:00 _139.tis
> > >  1930102055 Mar  9 16:15 _16c.fdt
> > >    2731084 Mar  9 16:15 _16c.fdx
> > >        175 Mar  9 16:13 _16c.fnm
> > >  1190090014 Mar  9 16:22 _16c.frq
> > >  814763781 Mar  9 16:22 _16c.prx
> > >   16514967 Mar  9 16:22 _16c.tii
> > >  1151524173 Mar  9 16:22 _16c.tis
> > >  1928053697 Mar  9 17:52 _19e.fdt
> > >    2728260 Mar  9 17:52 _19e.fdx
> > >        175 Mar  9 17:46 _19e.fnm
> > >  1188837093 Mar  9 18:08 _19e.frq
> > >  813915820 Mar  9 18:08 _19e.prx
> > >   16501902 Mar  9 18:08 _19e.tii
> > >  1150623773 Mar  9 18:08 _19e.tis
> > >  1951474247 Mar  9 20:22 _1cj.fdt
> > >    2761396 Mar  9 20:22 _1cj.fdx
> > >        175 Mar  9 20:18 _1cj.fnm
> > >  1203285781 Mar  9 20:39 _1cj.frq
> > >  823797656 Mar  9 20:39 _1cj.prx
> > >   16639997 Mar  9 20:39 _1cj.tii
> > >  1160143978 Mar  9 20:39 _1cj.tis
> > >  1929978366 Mar 10 01:02 _1fm.fdt
> > >    2731060 Mar 10 01:02 _1fm.fdx
> > >        175 Mar 10 00:43 _1fm.fnm
> > >  1190031780 Mar 10 02:36 _1fm.frq
> > >  814741146 Mar 10 02:36 _1fm.prx
> > >   16513189 Mar 10 02:36 _1fm.tii
> > >  1151399139 Mar 10 02:36 _1fm.tis
> > >  189073186 Mar 10 01:51 _1ft.fdt
> > >     267556 Mar 10 01:51 _1ft.fdx
> > >        175 Mar 10 01:50 _1ft.fnm
> > >  110750150 Mar 10 02:04 _1ft.frq
> > >   79818488 Mar 10 02:04 _1ft.prx
> > >    2326691 Mar 10 02:04 _1ft.tii
> > >  165932844 Mar 10 02:04 _1ft.tis
> > >  212500024 Mar 10 03:16 _1g5.fdt
> > >     300684 Mar 10 03:16 _1g5.fdx
> > >        175 Mar 10 03:16 _1g5.fnm
> > >  125179984 Mar 10 03:28 _1g5.frq
> > >   89703062 Mar 10 03:28 _1g5.prx
> > >    2594360 Mar 10 03:28 _1g5.tii
> > >  184495760 Mar 10 03:28 _1g5.tis
> > >   64323505 Mar 10 04:09 _1gc.fdt
> > >      91020 Mar 10 04:09 _1gc.fdx
> > >  105283820 Mar 10 04:48 _1gf.fdt
> > >     148988 Mar 10 04:48 _1gf.fdx
> > >        175 Mar 10 04:09 _1gf.fnm
> > >       1491 Mar 10 04:09 _1gf..frq
> > >          4 Mar 10 04:09 _1gf.nrm
> > >       2388 Mar 10 04:09 _1gf.prx
> > >        254 Mar 10 04:09 _1gf.tii
> > >      15761 Mar 10 04:09 _1gf.tis
> > >  191035191 Mar 10 04:09 _1gg.fdt
> > >     270332 Mar 10 04:09 _1gg.fdx
> > >        175 Mar 10 04:09 _1gg.fnm
> > >  111958741 Mar 10 04:24 _1gg.frq
> > >   80645411 Mar 10 04:24 _1gg.prx
> > >    2349153 Mar 10 04:24 _1gg.tii
> > >  167494232 Mar 10 04:24 _1gg.tis
> > >        175 Mar 10 04:20 _1gh.fnm
> > >   10223275 Mar 10 04:20 _1gh.frq
> > >          4 Mar 10 04:20 _1gh..nrm
> > >    9056546 Mar 10 04:20 _1gh.prx
> > >     329012 Mar 10 04:20 _1gh.tii
> > >   23846511 Mar 10 04:20 _1gh.tis
> > >        175 Mar 10 04:28 _1gi.fnm
> > >   10221888 Mar 10 04:28 _1gi.frq
> > >          4 Mar 10 04:28 _1gi.nrm
> > >    9054280 Mar 10 04:28 _1gi.prx
> > >     328980 Mar 10 04:28 _1gi.tii
> > >   23843209 Mar 10 04:28 _1gi.tis
> > >        175 Mar 10 04:35 _1gj.fnm
> > >   10222776 Mar 10 04:35 _1gj.frq
> > >          4 Mar 10 04:35 _1gj.nrm
> > >    9054943 Mar 10 04:35 _1gj.prx
> > >     329060 Mar 10 04:35 _1gj.tii
> > >   23849395 Mar 10 04:35 _1gj.tis
> > >        175 Mar 10 04:42 _1gk.fnm
> > >   10220381 Mar 10 04:42 _1gk.frq
> > >          4 Mar 10 04:42 _1gk.nrm
> > >    9052810 Mar 10 04:42 _1gk.prx
> > >     329029 Mar 10 04:42 _1gk.tii
> > >   23845373 Mar 10 04:42 _1gk.tis
> > >        175 Mar 10 04:48 _1gl.fnm
> > >    9274170 Mar 10 04:48 _1gl.frq
> > >          4 Mar 10 04:48 _1gl.nrm
> > >    8226681 Mar 10 04:48 _1gl.prx
> > >     303327 Mar 10 04:48 _1gl.tii
> > >   21996826 Mar 10 04:48 _1gl.tis
> > >   22418126 Mar 10 04:58 _1gm..fdt
> > >      31732 Mar 10 04:58 _1gm.fdx
> > >        175 Mar 10 04:57 _1gm.fnm
> > >   10216672 Mar 10 04:57 _1gm.frq
> > >          4 Mar 10 04:57 _1gm.nrm
> > >    9049487 Mar 10 04:57 _1gm.prx
> > >     328813 Mar 10 04:57 _1gm.tii
> > >   23829627 Mar 10 04:57 _1gm.tis
> > >        175 Mar 10 04:58 _1gn.fnm
> > >     392014 Mar 10 04:58 _1gn.frq
> > >          4 Mar 10 04:58 _1gn.nrm
> > >     415225 Mar 10 04:58 _1gn.prx
> > >      24695 Mar 10 04:58 _1gn.tii
> > >    1816750 Mar 10 04:58 _1gn.tis
> > >        683 Mar 10 04:58 segments_7t
> > >         20 Mar 10 04:58 segments.gen
> > >  1935727800 Mar  9 11:17 _u1.fdt
> > >    2739180 Mar  9 11:17 _u1.fdx
> > >        175 Mar  9 11:15 _u1.fnm
> > >  1193583522 Mar  9 11:25 _u1.frq
> > >  817164507 Mar  9 11:25 _u1.prx
> > >   16547464 Mar  9 11:25 _u1..tii
> > >  1153764013 Mar  9 11:25 _u1.tis
> > >  1949493315 Mar  9 12:21 _x3.fdt
> > >    2758580 Mar  9 12:21 _x3.fdx
> > >        175 Mar  9 12:18 _x3.fnm
> > >  1202068425 Mar  9 12:29 _x3.frq
> > >  822963200 Mar  9 12:29 _x3.prx
> > >   16629485 Mar  9 12:29 _x3.tii
> > >  1159419149 Mar  9 12:29 _x3.tis
> > >
> > >
> > > Any ideas? I'm out of settings to tweak here.
> > >
> > > Cheers,
> > > Mark
> > >
> > >
> > >
> > >
> > > ----- Original Message ----
> > > From: Michael McCandless <lu...@mikemccandless.com>
> > > To: java-user@lucene.apache.org
> > > Sent: Tuesday, 10 March, 2009 0:01:30
> > > Subject: Re: A model for predicting indexing memory costs?
> > >
> > >
> > > mark harwood wrote:
> > >
> > >>
> > >> I've been building a large index (hundreds of millions) with mainly
> > structured data which consists of several fields with mostly unique
> > values.
> > >> I've been hitting out of memory issues when doing periodic
> > commits/closes which I suspect is down to the sheer number of terms.
> > >>
> > >> I set the IndexWriter..setTermIndexInterval to 8 times the normal
> size
> > of 128 (an intervalof 1024) which delayed the onset of the issue but
> still
> > failed.
> > >
> > > I think that setting won't change how much RAM is used when writing.
> > >
> > >> I'd like to get a little more scientific about what to set here
> rather
> > than simply experimenting with settings and hoping it doesn't fail
> again.
> > >>
> > >> Does anyone have a decent model worked out for how much memory is
> > consumed at peak? I'm guessing the contributing factors are:
> > >>
> > >> * Numbers of fields
> > >> * Numbers of unique terms per field
> > >> * Numbers of segments?
> > >
> > > Number of net unique terms (across all fields) is a big driver, but
> also
> > net number of term occurrences, and how many docs.  Lots of tiny docs
> take
> > more RAM than fewer large docs, when # occurrences are equal.
> > >
> > > But... how come setting IW's RAM buffer doesn't prevent the OOMs?  IW
> > should simply flush when it's used that much RAM.
> > >
> > > I don't think number of segments is a factor.
> > >
> > > Though mergeFactor is, since during merging the SegmentMerger holds
> > SegmentReaders open, and int[] maps (if there are any deletes) for each
> > segment.  Do you have a large merge taking place when you hit the OOMs?
> > >
> > > Mike
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache..org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache..org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


      


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: A model for predicting indexing memory costs?

Posted by Uwe Schindler <uw...@thetaphi.de>.
> >>It does not indefinitely hang,
> 
> I guess I just need to be more patient.
> Thanks for the GC settings. I don't currently have the luxury of "15
> other" processors but this will definitely be of use in other
> environments.

Even with one processor, a parallel GC is sometimes better. The traditional
GC does not really run decoupled in its own thread, so it often blocks the
normal code flow when memory is tight. The parallel GC runs almost completely
decoupled, reducing the wait time for the actual running Java code. But if a
large number of small instances is using too much memory, GC may be too slow
in freeing it up and you can still hit OOMs. Sometimes enabling verbose GC
output helps in diagnosing such problems.
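
For example, something along these lines on the indexing JVM (the heap size
is a placeholder; the main class name is taken from the stack traces earlier
in the thread):

    java -Xmx3g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC test.MultiIndexAndRun

-verbose:gc with -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps prints each
collection with its timings, which makes it much easier to see whether the
collector is falling behind just before the OOM.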

> >>How works TrieRange for you?
> 
> I used it back when it was tucked away in PanFMP and so am very happy to
> see it making its way into core.
> I'm still struggling with these OOM issues here so it's too early to
> comment on query performance for this particular app - I need to get the
> index built first.

OK. What do you think about LUCENE-1541, does the more complicated API
rectify the space improvement and reduced term number?

Cheers,
Uwe

> ----- Original Message ----
> From: Uwe Schindler <uw...@thetaphi.de>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 10 March, 2009 11:32:48
> Subject: RE: A model for predicting indexing memory costs?
> 
> It does not indefinitely hang, I think the problem is, that the GC takes
> up
> all processor resources and nothing else runs any more. You should also
> enable the parallel GC. We had similar problems on the searching side,
> when
> the webserver suddenly stopped for about 20 minutes (!) and doing nothing
> more than garbage collecting (64bit JVM, Java 1.5.0_17, Solaris).
> Changing
> the web container's GC to parallel helped:
> 
> -Xms4096M -Xmx8192M -XX:MaxPermSize=512M -Xrs -XX:+UseConcMarkSweepGC
> -XX:+UseParNewGC -XX:+UseLargePages
> 
> Maybe add -XX:-UseGCOverheadLimit
> 
> One processor of this machine now always garbage collects :-), the other
> 15
> are serving searches...
> 
> Something other:
> From the stack trace I can see usage of TrieRange. The OOM happened there
> (when converting the char[] to a String). When we do not need two field
> names "field" and "field#trie" (because of sorting) (hope we can fix the
> sorting some time, see the corresponding JIRA issue), it would be better
> to
> index all trie values into one field. For that, a simpler API using a
> TrieTokenStream (like SOLR-940 uses for Trie, but because of that Solr is
> not able to sort at the moment) for indexing could be supplied. This API
> could directly use the buffers in the Token class when creating the trie
> encoded fields.
> 
> How works TrieRange for you? Are you happy, does searches work well with
> 30
> mio docs, which precisionStep do you use?
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> > -----Original Message-----
> > From: mark harwood [mailto:markharw00d@yahoo.co.uk]
> > Sent: Tuesday, March 10, 2009 12:07 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: A model for predicting indexing memory costs?
> >
> >
> > Thanks, Ian.
> >
> > I forgot to mention I tried that setting and it then seemed to hang
> > indefinitely.
> > I then switched back to a strategy of trying to minimise memory usage or
> > at least gain an understanding of how much memory would be required by
> my
> > application.
> >
> > Cheers
> > Mark
> >
> >
> >
> > ----- Original Message ----
> > From: Ian Lea <ia...@gmail.com>
> > To: java-user@lucene.apache.org
> > Sent: Tuesday, 10 March, 2009 10:54:05
> > Subject: Re: A model for predicting indexing memory costs?
> >
> > That's not the usual OOM message is it? java.lang.OutOfMemoryError: GC
> > overhead limit exceeded.
> >
> > Looks like you might be able to work round it with -XX:-
> UseGCOverheadLimit
> >
> > http://java-monitor.com/forum/archive/index.php/t-54.html
> >
> http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc
> > .oom
> >
> >
> > --
> > Ian.
> >
> >
> > On Tue, Mar 10, 2009 at 10:45 AM, mark harwood <ma...@yahoo.co.uk>
> > wrote:
> > >
> > >>>But... how come setting IW's RAM buffer doesn't prevent the OOMs?
> > >
> > > I've been setting the IndexWriter RAM buffer to 300 meg and giving the
> > JVM 1gig.
> > >
> > > Last run I gave the JVM 3 gig, with writer settings of  RAM buffer=300
> > meg, merge factor=20, term interval=8192, usecompound=false. All fields
> > are ANALYZED_NO_NORMS.
> > > Lucene version is a 2.9 build,  JVM is Sun 64bit 1.6.0_07.
> > >
> > > This graphic shows timings for 100 consecutive write sessions, each
> > adding 30,000 documents, committing and then closing :
> > >    http://tinyurl.com/anzcjw
> > > You can see the periodic merge costs and then a big spike towards the
> > end before it crashed.
> > >
> > > The crash details are here after adding ~3 million documents in 98
> write
> > sessions:
> > >
> > > This batch index session added 3000 of 30000 docs : 10% complete
> > > Exception in thread "Thread-280" java.lang.OutOfMemoryError: GC
> overhead
> > limit exceeded
> > >    at java.util.Arrays.copyOf(Unknown Source)
> > >    at java.lang.String..<init>(Unknown Source)
> > >    at
> >
> org.apache.lucene.search.trie.TrieUtils.longToPrefixCoded(TrieUtils.java:1
> > 48)
> > >    at
> > org.apache.lucene.search.trie.TrieUtils.trieCodeLong(TrieUtils.java:302)
> > >    at
> > test.LongTrieAnalyzer$LongTrieTokenStream.next(LongTrieAnalyzer.java:49)
> > >    at
> >
> org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerFi
> > eld.java:159)
> > >    at
> >
> org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldCo
> > nsumersPerField.java:36)
> > >    at
> >
> org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFiel
> > dProcessorPerThread.java:234)
> > >    at
> >
> org.apache..lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.ja
> > va:762)
> > >    at
> >
> org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:7
> > 40)
> > >    at
> > org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
> > >    at
> > org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
> > >    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:319)
> > > Exception in thread "Thread-281" java.lang.OutOfMemoryError: GC
> overhead
> > limit exceeded
> > >    at org.apache.commons.csv.CharBuffer..toString(CharBuffer.java:177)
> > >    at org.apache.commons.csv.CSVParser..getLine(CSVParser.java:242)
> > >    at test.IndexMarksFile.getLuceneDocument(IndexMarksFile.java:272)
> > >    at test.IndexMarksFile$IndexingThread..run(IndexMarksFile.java:314)
> > > Committing
> > > Closing
> > > Exception in thread "main" java.lang.IllegalStateException: this
> writer
> > hit an OutOfMemoryError; cannot commit
> > >    at
> > org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3569)
> > >    at
> org.apache.lucene..index.IndexWriter.commit(IndexWriter.java:3660)
> > >    at
> org.apache.lucene.index.IndexWriter..commit(IndexWriter.java:3634)
> > >    at test.IndexMarksFile.run(IndexMarksFile.java:176)
> > >    at test.IndexMarksFile.main(IndexMarksFile.java:101)
> > >    at test.MultiIndexAndRun.main(MultiIndexAndRun.java:49)
> > >
> > >
> > > For each write session I have a single writer, and 2 indexing threads
> > adding documents through this writer. There are no updates/deletes -
> only
> > adds. When both indexing threads complete the primary thread commits and
> > closes the writer.
> > > I then open a searcher run some search benchmarks, close the searcher
> > and start another write session.
> > > The documents have ~12 fields and are all the same size so I don't
> think
> > this OOM is down to rogue data. Each field has 100 near-unique tokens.
> > >
> > > The files on disk after the crash are as follows:
> > >  1930004059 Mar  9 13:32 _106.fdt
> > >    2731084 Mar  9 13:32 _106.fdx
> > >        175 Mar  9 13:30 _106.fnm
> > >  1190042394 Mar  9 13:39 _106.frq
> > >  814748995 Mar  9 13:39 _106.prx
> > >   16512596 Mar  9 13:39 _106.tii
> > >  1151364311 Mar  9 13:39 _106.tis
> > >  1949444533 Mar  9 14:53 _139.fdt
> > >    2758580 Mar  9 14:53 _139.fdx
> > >        175 Mar  9 14:51 _139.fnm
> > >  1202044423 Mar  9 15:00 _139.frq
> > >  822954002 Mar  9 15:00 _139.prx
> > >   16629104 Mar  9 15:00 _139.tii
> > >  1159392207 Mar  9 15:00 _139.tis
> > >  1930102055 Mar  9 16:15 _16c.fdt
> > >    2731084 Mar  9 16:15 _16c.fdx
> > >        175 Mar  9 16:13 _16c.fnm
> > >  1190090014 Mar  9 16:22 _16c.frq
> > >  814763781 Mar  9 16:22 _16c.prx
> > >   16514967 Mar  9 16:22 _16c.tii
> > >  1151524173 Mar  9 16:22 _16c.tis
> > >  1928053697 Mar  9 17:52 _19e.fdt
> > >    2728260 Mar  9 17:52 _19e.fdx
> > >        175 Mar  9 17:46 _19e.fnm
> > >  1188837093 Mar  9 18:08 _19e.frq
> > >  813915820 Mar  9 18:08 _19e.prx
> > >   16501902 Mar  9 18:08 _19e.tii
> > >  1150623773 Mar  9 18:08 _19e.tis
> > >  1951474247 Mar  9 20:22 _1cj.fdt
> > >    2761396 Mar  9 20:22 _1cj.fdx
> > >        175 Mar  9 20:18 _1cj.fnm
> > >  1203285781 Mar  9 20:39 _1cj.frq
> > >  823797656 Mar  9 20:39 _1cj.prx
> > >   16639997 Mar  9 20:39 _1cj.tii
> > >  1160143978 Mar  9 20:39 _1cj.tis
> > >  1929978366 Mar 10 01:02 _1fm.fdt
> > >    2731060 Mar 10 01:02 _1fm.fdx
> > >        175 Mar 10 00:43 _1fm.fnm
> > >  1190031780 Mar 10 02:36 _1fm.frq
> > >  814741146 Mar 10 02:36 _1fm.prx
> > >   16513189 Mar 10 02:36 _1fm.tii
> > >  1151399139 Mar 10 02:36 _1fm.tis
> > >  189073186 Mar 10 01:51 _1ft.fdt
> > >     267556 Mar 10 01:51 _1ft.fdx
> > >        175 Mar 10 01:50 _1ft.fnm
> > >  110750150 Mar 10 02:04 _1ft.frq
> > >   79818488 Mar 10 02:04 _1ft.prx
> > >    2326691 Mar 10 02:04 _1ft.tii
> > >  165932844 Mar 10 02:04 _1ft.tis
> > >  212500024 Mar 10 03:16 _1g5.fdt
> > >     300684 Mar 10 03:16 _1g5.fdx
> > >        175 Mar 10 03:16 _1g5.fnm
> > >  125179984 Mar 10 03:28 _1g5.frq
> > >   89703062 Mar 10 03:28 _1g5.prx
> > >    2594360 Mar 10 03:28 _1g5.tii
> > >  184495760 Mar 10 03:28 _1g5.tis
> > >   64323505 Mar 10 04:09 _1gc.fdt
> > >      91020 Mar 10 04:09 _1gc.fdx
> > >  105283820 Mar 10 04:48 _1gf.fdt
> > >     148988 Mar 10 04:48 _1gf.fdx
> > >        175 Mar 10 04:09 _1gf.fnm
> > >       1491 Mar 10 04:09 _1gf.frq
> > >          4 Mar 10 04:09 _1gf.nrm
> > >       2388 Mar 10 04:09 _1gf.prx
> > >        254 Mar 10 04:09 _1gf.tii
> > >      15761 Mar 10 04:09 _1gf.tis
> > >  191035191 Mar 10 04:09 _1gg.fdt
> > >     270332 Mar 10 04:09 _1gg.fdx
> > >        175 Mar 10 04:09 _1gg.fnm
> > >  111958741 Mar 10 04:24 _1gg.frq
> > >   80645411 Mar 10 04:24 _1gg.prx
> > >    2349153 Mar 10 04:24 _1gg.tii
> > >  167494232 Mar 10 04:24 _1gg.tis
> > >        175 Mar 10 04:20 _1gh.fnm
> > >   10223275 Mar 10 04:20 _1gh.frq
> > >          4 Mar 10 04:20 _1gh..nrm
> > >    9056546 Mar 10 04:20 _1gh.prx
> > >     329012 Mar 10 04:20 _1gh.tii
> > >   23846511 Mar 10 04:20 _1gh.tis
> > >        175 Mar 10 04:28 _1gi.fnm
> > >   10221888 Mar 10 04:28 _1gi.frq
> > >          4 Mar 10 04:28 _1gi.nrm
> > >    9054280 Mar 10 04:28 _1gi.prx
> > >     328980 Mar 10 04:28 _1gi.tii
> > >   23843209 Mar 10 04:28 _1gi.tis
> > >        175 Mar 10 04:35 _1gj.fnm
> > >   10222776 Mar 10 04:35 _1gj.frq
> > >          4 Mar 10 04:35 _1gj.nrm
> > >    9054943 Mar 10 04:35 _1gj.prx
> > >     329060 Mar 10 04:35 _1gj.tii
> > >   23849395 Mar 10 04:35 _1gj.tis
> > >        175 Mar 10 04:42 _1gk.fnm
> > >   10220381 Mar 10 04:42 _1gk.frq
> > >          4 Mar 10 04:42 _1gk.nrm
> > >    9052810 Mar 10 04:42 _1gk.prx
> > >     329029 Mar 10 04:42 _1gk.tii
> > >   23845373 Mar 10 04:42 _1gk.tis
> > >        175 Mar 10 04:48 _1gl.fnm
> > >    9274170 Mar 10 04:48 _1gl.frq
> > >          4 Mar 10 04:48 _1gl.nrm
> > >    8226681 Mar 10 04:48 _1gl.prx
> > >     303327 Mar 10 04:48 _1gl.tii
> > >   21996826 Mar 10 04:48 _1gl.tis
> > >   22418126 Mar 10 04:58 _1gm.fdt
> > >      31732 Mar 10 04:58 _1gm.fdx
> > >        175 Mar 10 04:57 _1gm.fnm
> > >   10216672 Mar 10 04:57 _1gm.frq
> > >          4 Mar 10 04:57 _1gm.nrm
> > >    9049487 Mar 10 04:57 _1gm.prx
> > >     328813 Mar 10 04:57 _1gm.tii
> > >   23829627 Mar 10 04:57 _1gm.tis
> > >        175 Mar 10 04:58 _1gn.fnm
> > >     392014 Mar 10 04:58 _1gn.frq
> > >          4 Mar 10 04:58 _1gn.nrm
> > >     415225 Mar 10 04:58 _1gn.prx
> > >      24695 Mar 10 04:58 _1gn.tii
> > >    1816750 Mar 10 04:58 _1gn.tis
> > >        683 Mar 10 04:58 segments_7t
> > >         20 Mar 10 04:58 segments.gen
> > >  1935727800 Mar  9 11:17 _u1.fdt
> > >    2739180 Mar  9 11:17 _u1.fdx
> > >        175 Mar  9 11:15 _u1.fnm
> > >  1193583522 Mar  9 11:25 _u1.frq
> > >  817164507 Mar  9 11:25 _u1.prx
> > >   16547464 Mar  9 11:25 _u1.tii
> > >  1153764013 Mar  9 11:25 _u1.tis
> > >  1949493315 Mar  9 12:21 _x3.fdt
> > >    2758580 Mar  9 12:21 _x3.fdx
> > >        175 Mar  9 12:18 _x3.fnm
> > >  1202068425 Mar  9 12:29 _x3.frq
> > >  822963200 Mar  9 12:29 _x3.prx
> > >   16629485 Mar  9 12:29 _x3.tii
> > >  1159419149 Mar  9 12:29 _x3.tis
> > >
> > >
> > > Any ideas? I'm out of settings to tweak here.
> > >
> > > Cheers,
> > > Mark
> > >
> > >
> > >
> > >
> > > ----- Original Message ----
> > > From: Michael McCandless <lu...@mikemccandless.com>
> > > To: java-user@lucene.apache.org
> > > Sent: Tuesday, 10 March, 2009 0:01:30
> > > Subject: Re: A model for predicting indexing memory costs?
> > >
> > >
> > > mark harwood wrote:
> > >
> > >>
> > >> I've been building a large index (hundreds of millions) with mainly
> > structured data which consists of several fields with mostly unique
> > values.
> > >> I've been hitting out of memory issues when doing periodic
> > commits/closes which I suspect is down to the sheer number of terms.
> > >>
> > >> I set the IndexWriter.setTermIndexInterval to 8 times the normal
> size
> > of 128 (an interval of 1024) which delayed the onset of the issue but
> still
> > failed.
> > >
> > > I think that setting won't change how much RAM is used when writing.
> > >
> > >> I'd like to get a little more scientific about what to set here
> rather
> > than simply experimenting with settings and hoping it doesn't fail
> again.
> > >>
> > >> Does anyone have a decent model worked out for how much memory is
> > consumed at peak? I'm guessing the contributing factors are:
> > >>
> > >> * Numbers of fields
> > >> * Numbers of unique terms per field
> > >> * Numbers of segments?
> > >
> > > Number of net unique terms (across all fields) is a big driver, but
> also
> > net number of term occurrences, and how many docs.  Lots of tiny docs
> take
> > more RAM than fewer large docs, when # occurrences are equal.
> > >
> > > But... how come setting IW's RAM buffer doesn't prevent the OOMs?  IW
> > should simply flush when it's used that much RAM.
> > >
> > > I don't think number of segments is a factor.
> > >
> > > Though mergeFactor is, since during merging the SegmentMerger holds
> > SegmentReaders open, and int[] maps (if there are any deletes) for each
> > segment.  Do you have a large merge taking place when you hit the OOMs?
> > >
> > > Mike
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by mark harwood <ma...@yahoo.co.uk>.
>>It does not indefinitely hang,

I guess I just need to be more patient. 
Thanks for the GC settings. I don't currently have the luxury of "15 other" processors but this will definitely be of use in other environments.

>>How does TrieRange work for you?

I used it back when it was tucked away in PanFMP and so am very happy to see it making its way into core.
I'm still struggling with these OOM issues here so it's too early to comment on query performance for this particular app - I need to get the index built first.

Cheers
Mark


----- Original Message ----
From: Uwe Schindler <uw...@thetaphi.de>
To: java-user@lucene.apache.org
Sent: Tuesday, 10 March, 2009 11:32:48
Subject: RE: A model for predicting indexing memory costs?

It does not indefinitely hang; I think the problem is that the GC takes up
all processor resources and nothing else runs any more. You should also
enable the parallel GC. We had similar problems on the searching side, when
the webserver suddenly stopped for about 20 minutes (!) and did nothing
more than garbage collect (64bit JVM, Java 1.5.0_17, Solaris). Changing
the web container's GC to parallel helped:

-Xms4096M -Xmx8192M -XX:MaxPermSize=512M -Xrs -XX:+UseConcMarkSweepGC
-XX:+UseParNewGC -XX:+UseLargePages

Maybe add -XX:-UseGCOverheadLimit

One processor of this machine now always garbage collects :-), the other 15
are serving searches...

Something else:
From the stack trace I can see usage of TrieRange. The OOM happened there
(when converting the char[] to a String). Once we no longer need the two
field names "field" and "field#trie" (they only exist because of sorting;
hope we can fix the sorting some time, see the corresponding JIRA issue),
it would be better to index all trie values into one field. For that, a
simpler indexing API using a TrieTokenStream (like SOLR-940 uses for Trie,
though because of that Solr is not able to sort at the moment) could be
supplied. This API could directly use the buffers in the Token class when
creating the trie-encoded fields.

How does TrieRange work for you? Are you happy, do searches work well with
30 million docs, and which precisionStep do you use?

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: mark harwood [mailto:markharw00d@yahoo.co.uk]
> Sent: Tuesday, March 10, 2009 12:07 PM
> To: java-user@lucene.apache.org
> Subject: Re: A model for predicting indexing memory costs?
> 
> 
> Thanks, Ian.
> 
> I forgot to mention I tried that setting and it then seemed to hang
> indefinitely.
> I then switched back to a strategy of trying to minimise memory usage or
> at least gain an understanding of how much memory would be required by my
> application.
> 
> Cheers
> Mark
> 
> 
> 
> ----- Original Message ----
> From: Ian Lea <ia...@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 10 March, 2009 10:54:05
> Subject: Re: A model for predicting indexing memory costs?
> 
> That's not the usual OOM message is it? java.lang.OutOfMemoryError: GC
> overhead limit exceeded.
> 
> Looks like you might be able to work round it with -XX:-UseGCOverheadLimit
> 
> http://java-monitor.com/forum/archive/index.php/t-54.html
> http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc
> .oom
> 
> 
> --
> Ian.
> 
> 
> On Tue, Mar 10, 2009 at 10:45 AM, mark harwood <ma...@yahoo.co.uk>
> wrote:
> >
> >>>But... how come setting IW's RAM buffer doesn't prevent the OOMs?
> >
> > I've been setting the IndexWriter RAM buffer to 300 meg and giving the
> JVM 1gig.
> >
> > Last run I gave the JVM 3 gig, with writer settings of  RAM buffer=300
> meg, merge factor=20, term interval=8192, usecompound=false. All fields
> are ANALYZED_NO_NORMS.
> > Lucene version is a 2.9 build,  JVM is Sun 64bit 1.6.0_07.
> >
> > This graphic shows timings for 100 consecutive write sessions, each
> adding 30,000 documents, committing and then closing :
> >    http://tinyurl.com/anzcjw
> > You can see the periodic merge costs and then a big spike towards the
> end before it crashed.
> >
> > The crash details are here after adding ~3 million documents in 98 write
> sessions:
> >
> > This batch index session added 3000 of 30000 docs : 10% complete
> > Exception in thread "Thread-280" java.lang.OutOfMemoryError: GC overhead
> limit exceeded
> >    at java.util.Arrays.copyOf(Unknown Source)
> >    at java.lang.String.<init>(Unknown Source)
> >    at
> org.apache.lucene.search.trie.TrieUtils.longToPrefixCoded(TrieUtils.java:1
> 48)
> >    at
> org.apache.lucene.search.trie.TrieUtils.trieCodeLong(TrieUtils.java:302)
> >    at
> test.LongTrieAnalyzer$LongTrieTokenStream.next(LongTrieAnalyzer.java:49)
> >    at
> org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerFi
> eld.java:159)
> >    at
> org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldCo
> nsumersPerField.java:36)
> >    at
> org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFiel
> dProcessorPerThread.java:234)
> >    at
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.ja
> va:762)
> >    at
> org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:7
> 40)
> >    at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
> >    at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
> >    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:319)
> > Exception in thread "Thread-281" java.lang.OutOfMemoryError: GC overhead
> limit exceeded
> >    at org.apache.commons.csv.CharBuffer.toString(CharBuffer.java:177)
> >    at org.apache.commons.csv.CSVParser.getLine(CSVParser.java:242)
> >    at test.IndexMarksFile.getLuceneDocument(IndexMarksFile.java:272)
> >    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:314)
> > Committing
> > Closing
> > Exception in thread "main" java.lang.IllegalStateException: this writer
> hit an OutOfMemoryError; cannot commit
> >    at
> org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3569)
> >    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3660)
> >    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3634)
> >    at test.IndexMarksFile.run(IndexMarksFile.java:176)
> >    at test.IndexMarksFile.main(IndexMarksFile.java:101)
> >    at test.MultiIndexAndRun.main(MultiIndexAndRun.java:49)
> >
> >
> > For each write session I have a single writer, and 2 indexing threads
> adding documents through this writer. There are no updates/deletes - only
> adds. When both indexing threads complete the primary thread commits and
> closes the writer.
> > I then open a searcher run some search benchmarks, close the searcher
> and start another write session.
> > The documents have ~12 fields and are all the same size so I don't think
> this OOM is down to rogue data. Each field has 100 near-unique tokens.
> >
> > The files on disk after the crash are as follows:
> >  1930004059 Mar  9 13:32 _106.fdt
> >    2731084 Mar  9 13:32 _106.fdx
> >        175 Mar  9 13:30 _106.fnm
> >  1190042394 Mar  9 13:39 _106.frq
> >  814748995 Mar  9 13:39 _106.prx
> >   16512596 Mar  9 13:39 _106.tii
> >  1151364311 Mar  9 13:39 _106.tis
> >  1949444533 Mar  9 14:53 _139.fdt
> >    2758580 Mar  9 14:53 _139.fdx
> >        175 Mar  9 14:51 _139.fnm
> >  1202044423 Mar  9 15:00 _139.frq
> >  822954002 Mar  9 15:00 _139.prx
> >   16629104 Mar  9 15:00 _139.tii
> >  1159392207 Mar  9 15:00 _139.tis
> >  1930102055 Mar  9 16:15 _16c.fdt
> >    2731084 Mar  9 16:15 _16c.fdx
> >        175 Mar  9 16:13 _16c.fnm
> >  1190090014 Mar  9 16:22 _16c.frq
> >  814763781 Mar  9 16:22 _16c.prx
> >   16514967 Mar  9 16:22 _16c.tii
> >  1151524173 Mar  9 16:22 _16c.tis
> >  1928053697 Mar  9 17:52 _19e.fdt
> >    2728260 Mar  9 17:52 _19e.fdx
> >        175 Mar  9 17:46 _19e.fnm
> >  1188837093 Mar  9 18:08 _19e.frq
> >  813915820 Mar  9 18:08 _19e.prx
> >   16501902 Mar  9 18:08 _19e.tii
> >  1150623773 Mar  9 18:08 _19e.tis
> >  1951474247 Mar  9 20:22 _1cj.fdt
> >    2761396 Mar  9 20:22 _1cj.fdx
> >        175 Mar  9 20:18 _1cj.fnm
> >  1203285781 Mar  9 20:39 _1cj.frq
> >  823797656 Mar  9 20:39 _1cj.prx
> >   16639997 Mar  9 20:39 _1cj.tii
> >  1160143978 Mar  9 20:39 _1cj.tis
> >  1929978366 Mar 10 01:02 _1fm.fdt
> >    2731060 Mar 10 01:02 _1fm.fdx
> >        175 Mar 10 00:43 _1fm.fnm
> >  1190031780 Mar 10 02:36 _1fm.frq
> >  814741146 Mar 10 02:36 _1fm.prx
> >   16513189 Mar 10 02:36 _1fm.tii
> >  1151399139 Mar 10 02:36 _1fm.tis
> >  189073186 Mar 10 01:51 _1ft.fdt
> >     267556 Mar 10 01:51 _1ft.fdx
> >        175 Mar 10 01:50 _1ft.fnm
> >  110750150 Mar 10 02:04 _1ft.frq
> >   79818488 Mar 10 02:04 _1ft.prx
> >    2326691 Mar 10 02:04 _1ft.tii
> >  165932844 Mar 10 02:04 _1ft.tis
> >  212500024 Mar 10 03:16 _1g5.fdt
> >     300684 Mar 10 03:16 _1g5.fdx
> >        175 Mar 10 03:16 _1g5.fnm
> >  125179984 Mar 10 03:28 _1g5.frq
> >   89703062 Mar 10 03:28 _1g5.prx
> >    2594360 Mar 10 03:28 _1g5.tii
> >  184495760 Mar 10 03:28 _1g5.tis
> >   64323505 Mar 10 04:09 _1gc.fdt
> >      91020 Mar 10 04:09 _1gc.fdx
> >  105283820 Mar 10 04:48 _1gf.fdt
> >     148988 Mar 10 04:48 _1gf.fdx
> >        175 Mar 10 04:09 _1gf.fnm
> >       1491 Mar 10 04:09 _1gf.frq
> >          4 Mar 10 04:09 _1gf.nrm
> >       2388 Mar 10 04:09 _1gf.prx
> >        254 Mar 10 04:09 _1gf.tii
> >      15761 Mar 10 04:09 _1gf.tis
> >  191035191 Mar 10 04:09 _1gg.fdt
> >     270332 Mar 10 04:09 _1gg.fdx
> >        175 Mar 10 04:09 _1gg.fnm
> >  111958741 Mar 10 04:24 _1gg.frq
> >   80645411 Mar 10 04:24 _1gg.prx
> >    2349153 Mar 10 04:24 _1gg.tii
> >  167494232 Mar 10 04:24 _1gg.tis
> >        175 Mar 10 04:20 _1gh.fnm
> >   10223275 Mar 10 04:20 _1gh.frq
> >          4 Mar 10 04:20 _1gh.nrm
> >    9056546 Mar 10 04:20 _1gh.prx
> >     329012 Mar 10 04:20 _1gh.tii
> >   23846511 Mar 10 04:20 _1gh.tis
> >        175 Mar 10 04:28 _1gi.fnm
> >   10221888 Mar 10 04:28 _1gi.frq
> >          4 Mar 10 04:28 _1gi.nrm
> >    9054280 Mar 10 04:28 _1gi.prx
> >     328980 Mar 10 04:28 _1gi.tii
> >   23843209 Mar 10 04:28 _1gi.tis
> >        175 Mar 10 04:35 _1gj.fnm
> >   10222776 Mar 10 04:35 _1gj.frq
> >          4 Mar 10 04:35 _1gj.nrm
> >    9054943 Mar 10 04:35 _1gj.prx
> >     329060 Mar 10 04:35 _1gj.tii
> >   23849395 Mar 10 04:35 _1gj.tis
> >        175 Mar 10 04:42 _1gk.fnm
> >   10220381 Mar 10 04:42 _1gk.frq
> >          4 Mar 10 04:42 _1gk.nrm
> >    9052810 Mar 10 04:42 _1gk.prx
> >     329029 Mar 10 04:42 _1gk.tii
> >   23845373 Mar 10 04:42 _1gk.tis
> >        175 Mar 10 04:48 _1gl.fnm
> >    9274170 Mar 10 04:48 _1gl.frq
> >          4 Mar 10 04:48 _1gl.nrm
> >    8226681 Mar 10 04:48 _1gl.prx
> >     303327 Mar 10 04:48 _1gl.tii
> >   21996826 Mar 10 04:48 _1gl.tis
> >   22418126 Mar 10 04:58 _1gm.fdt
> >      31732 Mar 10 04:58 _1gm.fdx
> >        175 Mar 10 04:57 _1gm.fnm
> >   10216672 Mar 10 04:57 _1gm.frq
> >          4 Mar 10 04:57 _1gm.nrm
> >    9049487 Mar 10 04:57 _1gm.prx
> >     328813 Mar 10 04:57 _1gm.tii
> >   23829627 Mar 10 04:57 _1gm.tis
> >        175 Mar 10 04:58 _1gn.fnm
> >     392014 Mar 10 04:58 _1gn.frq
> >          4 Mar 10 04:58 _1gn.nrm
> >     415225 Mar 10 04:58 _1gn.prx
> >      24695 Mar 10 04:58 _1gn.tii
> >    1816750 Mar 10 04:58 _1gn.tis
> >        683 Mar 10 04:58 segments_7t
> >         20 Mar 10 04:58 segments.gen
> >  1935727800 Mar  9 11:17 _u1.fdt
> >    2739180 Mar  9 11:17 _u1.fdx
> >        175 Mar  9 11:15 _u1.fnm
> >  1193583522 Mar  9 11:25 _u1.frq
> >  817164507 Mar  9 11:25 _u1.prx
> >   16547464 Mar  9 11:25 _u1.tii
> >  1153764013 Mar  9 11:25 _u1.tis
> >  1949493315 Mar  9 12:21 _x3.fdt
> >    2758580 Mar  9 12:21 _x3.fdx
> >        175 Mar  9 12:18 _x3.fnm
> >  1202068425 Mar  9 12:29 _x3.frq
> >  822963200 Mar  9 12:29 _x3.prx
> >   16629485 Mar  9 12:29 _x3.tii
> >  1159419149 Mar  9 12:29 _x3.tis
> >
> >
> > Any ideas? I'm out of settings to tweak here.
> >
> > Cheers,
> > Mark
> >
> >
> >
> >
> > ----- Original Message ----
> > From: Michael McCandless <lu...@mikemccandless.com>
> > To: java-user@lucene.apache.org
> > Sent: Tuesday, 10 March, 2009 0:01:30
> > Subject: Re: A model for predicting indexing memory costs?
> >
> >
> > mark harwood wrote:
> >
> >>
> >> I've been building a large index (hundreds of millions) with mainly
> structured data which consists of several fields with mostly unique
> values.
> >> I've been hitting out of memory issues when doing periodic
> commits/closes which I suspect is down to the sheer number of terms.
> >>
> >> I set the IndexWriter.setTermIndexInterval to 8 times the normal size
> of 128 (an interval of 1024) which delayed the onset of the issue but still
> failed.
> >
> > I think that setting won't change how much RAM is used when writing.
> >
> >> I'd like to get a little more scientific about what to set here rather
> than simply experimenting with settings and hoping it doesn't fail again.
> >>
> >> Does anyone have a decent model worked out for how much memory is
> consumed at peak? I'm guessing the contributing factors are:
> >>
> >> * Numbers of fields
> >> * Numbers of unique terms per field
> >> * Numbers of segments?
> >
> > Number of net unique terms (across all fields) is a big driver, but also
> net number of term occurrences, and how many docs.  Lots of tiny docs take
> more RAM than fewer large docs, when # occurrences are equal.
> >
> > But... how come setting IW's RAM buffer doesn't prevent the OOMs?  IW
> should simply flush when it's used that much RAM.
> >
> > I don't think number of segments is a factor.
> >
> > Though mergeFactor is, since during merging the SegmentMerger holds
> SegmentReaders open, and int[] maps (if there are any deletes) for each
> segment.  Do you have a large merge taking place when you hit the OOMs?
> >
> > Mike
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


      


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: A model for predicting indexing memory costs?

Posted by Uwe Schindler <uw...@thetaphi.de>.
It does not indefinitely hang; I think the problem is that the GC takes up
all processor resources and nothing else runs any more. You should also
enable the parallel GC. We had similar problems on the searching side, when
the webserver suddenly stopped for about 20 minutes (!) and did nothing
more than garbage collect (64bit JVM, Java 1.5.0_17, Solaris). Changing
the web container's GC to parallel helped:

-Xms4096M -Xmx8192M -XX:MaxPermSize=512M -Xrs -XX:+UseConcMarkSweepGC
-XX:+UseParNewGC -XX:+UseLargePages

Maybe add -XX:-UseGCOverheadLimit

One processor of this machine now always garbage collects :-), the other 15
are serving searches...
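
Applied to an indexing JVM rather than a web container, the same idea would be
something like the following command line. This is only a sketch: the heap
sizes are placeholders, and the main class is simply the one from your stack
trace:

java -Xms1g -Xmx3g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:-UseGCOverheadLimit test.MultiIndexAndRun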

Something else:
From the stack trace I can see usage of TrieRange. The OOM happened there
(when converting the char[] to a String). Once we no longer need the two
field names "field" and "field#trie" (they only exist because of sorting;
hope we can fix the sorting some time, see the corresponding JIRA issue),
it would be better to index all trie values into one field. For that, a
simpler indexing API using a TrieTokenStream (like SOLR-940 uses for Trie,
though because of that Solr is not able to sort at the moment) could be
supplied. This API could directly use the buffers in the Token class when
creating the trie-encoded fields.
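
To make the buffer-reuse idea concrete, here is a rough, untested sketch of a
stream that copies pre-encoded char[] values straight into the reusable
Token's term buffer. The class name and the way the encoded values are
produced are made up for illustration; this is not the actual TrieRange code:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class CharBufferTokenStream extends TokenStream {
  // pre-encoded forms of one value, e.g. the trie-encoded variants of a long
  private final char[][] encodedValues;
  private int upto = 0;

  public CharBufferTokenStream(char[][] encodedValues) {
    this.encodedValues = encodedValues;
  }

  public Token next(final Token reusableToken) throws IOException {
    if (upto == encodedValues.length) {
      return null;                        // stream exhausted
    }
    final char[] value = encodedValues[upto++];
    reusableToken.clear();
    // grow the token's internal buffer if needed, then copy the chars in
    // place -- no per-value String is ever created
    final char[] buffer = reusableToken.resizeTermBuffer(value.length);
    System.arraycopy(value, 0, buffer, 0, value.length);
    reusableToken.setTermLength(value.length);
    return reusableToken;
  }
}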

How does TrieRange work for you? Are you happy, do searches work well with
30 million docs, and which precisionStep do you use?

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: mark harwood [mailto:markharw00d@yahoo.co.uk]
> Sent: Tuesday, March 10, 2009 12:07 PM
> To: java-user@lucene.apache.org
> Subject: Re: A model for predicting indexing memory costs?
> 
> 
> Thanks, Ian.
> 
> I forgot to mention I tried that setting and it then seemed to hang
> indefinitely.
> I then switched back to a strategy of trying to minimise memory usage or
> at least gain an understanding of how much memory would be required by my
> application.
> 
> Cheers
> Mark
> 
> 
> 
> ----- Original Message ----
> From: Ian Lea <ia...@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 10 March, 2009 10:54:05
> Subject: Re: A model for predicting indexing memory costs?
> 
> That's not the usual OOM message is it? java.lang.OutOfMemoryError: GC
> overhead limit exceeded.
> 
> Looks like you might be able to work round it with -XX:-UseGCOverheadLimit
> 
> http://java-monitor.com/forum/archive/index.php/t-54.html
> http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc
> .oom
> 
> 
> --
> Ian.
> 
> 
> On Tue, Mar 10, 2009 at 10:45 AM, mark harwood <ma...@yahoo.co.uk>
> wrote:
> >
> >>>But... how come setting IW's RAM buffer doesn't prevent the OOMs?
> >
> > I've been setting the IndexWriter RAM buffer to 300 meg and giving the
> JVM 1gig.
> >
> > Last run I gave the JVM 3 gig, with writer settings of  RAM buffer=300
> meg, merge factor=20, term interval=8192, usecompound=false. All fields
> are ANALYZED_NO_NORMS.
> > Lucene version is a 2.9 build,  JVM is Sun 64bit 1.6.0_07.
> >
> > This graphic shows timings for 100 consecutive write sessions, each
> adding 30,000 documents, committing and then closing :
> >     http://tinyurl.com/anzcjw
> > You can see the periodic merge costs and then a big spike towards the
> end before it crashed.
> >
> > The crash details are here after adding ~3 million documents in 98 write
> sessions:
> >
> > This batch index session added 3000 of 30000 docs : 10% complete
> > Exception in thread "Thread-280" java.lang.OutOfMemoryError: GC overhead
> limit exceeded
> >    at java.util.Arrays.copyOf(Unknown Source)
> >    at java.lang.String.<init>(Unknown Source)
> >    at
> org.apache.lucene.search.trie.TrieUtils.longToPrefixCoded(TrieUtils.java:1
> 48)
> >    at
> org.apache.lucene.search.trie.TrieUtils.trieCodeLong(TrieUtils.java:302)
> >    at
> test.LongTrieAnalyzer$LongTrieTokenStream.next(LongTrieAnalyzer.java:49)
> >    at
> org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerFi
> eld.java:159)
> >    at
> org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldCo
> nsumersPerField.java:36)
> >    at
> org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFiel
> dProcessorPerThread.java:234)
> >    at
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.ja
> va:762)
> >    at
> org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:7
> 40)
> >    at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
> >    at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
> >    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:319)
> > Exception in thread "Thread-281" java.lang.OutOfMemoryError: GC overhead
> limit exceeded
> >    at org.apache.commons.csv.CharBuffer.toString(CharBuffer.java:177)
> >    at org.apache.commons.csv.CSVParser.getLine(CSVParser.java:242)
> >    at test.IndexMarksFile.getLuceneDocument(IndexMarksFile.java:272)
> >    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:314)
> > Committing
> > Closing
> > Exception in thread "main" java.lang.IllegalStateException: this writer
> hit an OutOfMemoryError; cannot commit
> >    at
> org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3569)
> >    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3660)
> >    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3634)
> >    at test.IndexMarksFile.run(IndexMarksFile.java:176)
> >    at test.IndexMarksFile.main(IndexMarksFile.java:101)
> >    at test.MultiIndexAndRun.main(MultiIndexAndRun.java:49)
> >
> >
> > For each write session I have a single writer, and 2 indexing threads
> adding documents through this writer. There are no updates/deletes - only
> adds. When both indexing threads complete the primary thread commits and
> closes the writer.
> > I then open a searcher run some search benchmarks, close the searcher
> and start another write session.
> > The documents have ~12 fields and are all the same size so I don't think
> this OOM is down to rogue data. Each field has 100 near-unique tokens.
> >
> > The files on disk after the crash are as follows:
> >  1930004059 Mar  9 13:32 _106.fdt
> >    2731084 Mar  9 13:32 _106.fdx
> >        175 Mar  9 13:30 _106.fnm
> >  1190042394 Mar  9 13:39 _106.frq
> >  814748995 Mar  9 13:39 _106.prx
> >   16512596 Mar  9 13:39 _106.tii
> >  1151364311 Mar  9 13:39 _106.tis
> >  1949444533 Mar  9 14:53 _139.fdt
> >    2758580 Mar  9 14:53 _139.fdx
> >        175 Mar  9 14:51 _139.fnm
> >  1202044423 Mar  9 15:00 _139.frq
> >  822954002 Mar  9 15:00 _139.prx
> >   16629104 Mar  9 15:00 _139.tii
> >  1159392207 Mar  9 15:00 _139.tis
> >  1930102055 Mar  9 16:15 _16c.fdt
> >    2731084 Mar  9 16:15 _16c.fdx
> >        175 Mar  9 16:13 _16c.fnm
> >  1190090014 Mar  9 16:22 _16c.frq
> >  814763781 Mar  9 16:22 _16c.prx
> >   16514967 Mar  9 16:22 _16c.tii
> >  1151524173 Mar  9 16:22 _16c.tis
> >  1928053697 Mar  9 17:52 _19e.fdt
> >    2728260 Mar  9 17:52 _19e.fdx
> >        175 Mar  9 17:46 _19e.fnm
> >  1188837093 Mar  9 18:08 _19e.frq
> >  813915820 Mar  9 18:08 _19e.prx
> >   16501902 Mar  9 18:08 _19e.tii
> >  1150623773 Mar  9 18:08 _19e.tis
> >  1951474247 Mar  9 20:22 _1cj.fdt
> >    2761396 Mar  9 20:22 _1cj.fdx
> >        175 Mar  9 20:18 _1cj.fnm
> >  1203285781 Mar  9 20:39 _1cj.frq
> >  823797656 Mar  9 20:39 _1cj.prx
> >   16639997 Mar  9 20:39 _1cj.tii
> >  1160143978 Mar  9 20:39 _1cj.tis
> >  1929978366 Mar 10 01:02 _1fm.fdt
> >    2731060 Mar 10 01:02 _1fm.fdx
> >        175 Mar 10 00:43 _1fm.fnm
> >  1190031780 Mar 10 02:36 _1fm.frq
> >  814741146 Mar 10 02:36 _1fm.prx
> >   16513189 Mar 10 02:36 _1fm.tii
> >  1151399139 Mar 10 02:36 _1fm.tis
> >  189073186 Mar 10 01:51 _1ft.fdt
> >     267556 Mar 10 01:51 _1ft.fdx
> >        175 Mar 10 01:50 _1ft.fnm
> >  110750150 Mar 10 02:04 _1ft.frq
> >   79818488 Mar 10 02:04 _1ft.prx
> >    2326691 Mar 10 02:04 _1ft.tii
> >  165932844 Mar 10 02:04 _1ft.tis
> >  212500024 Mar 10 03:16 _1g5.fdt
> >     300684 Mar 10 03:16 _1g5.fdx
> >        175 Mar 10 03:16 _1g5.fnm
> >  125179984 Mar 10 03:28 _1g5.frq
> >   89703062 Mar 10 03:28 _1g5.prx
> >    2594360 Mar 10 03:28 _1g5.tii
> >  184495760 Mar 10 03:28 _1g5.tis
> >   64323505 Mar 10 04:09 _1gc.fdt
> >      91020 Mar 10 04:09 _1gc.fdx
> >  105283820 Mar 10 04:48 _1gf.fdt
> >     148988 Mar 10 04:48 _1gf.fdx
> >        175 Mar 10 04:09 _1gf.fnm
> >       1491 Mar 10 04:09 _1gf.frq
> >          4 Mar 10 04:09 _1gf.nrm
> >       2388 Mar 10 04:09 _1gf.prx
> >        254 Mar 10 04:09 _1gf.tii
> >      15761 Mar 10 04:09 _1gf.tis
> >  191035191 Mar 10 04:09 _1gg.fdt
> >     270332 Mar 10 04:09 _1gg.fdx
> >        175 Mar 10 04:09 _1gg.fnm
> >  111958741 Mar 10 04:24 _1gg.frq
> >   80645411 Mar 10 04:24 _1gg.prx
> >    2349153 Mar 10 04:24 _1gg.tii
> >  167494232 Mar 10 04:24 _1gg.tis
> >        175 Mar 10 04:20 _1gh.fnm
> >   10223275 Mar 10 04:20 _1gh.frq
> >          4 Mar 10 04:20 _1gh.nrm
> >    9056546 Mar 10 04:20 _1gh.prx
> >     329012 Mar 10 04:20 _1gh.tii
> >   23846511 Mar 10 04:20 _1gh.tis
> >        175 Mar 10 04:28 _1gi.fnm
> >   10221888 Mar 10 04:28 _1gi.frq
> >          4 Mar 10 04:28 _1gi.nrm
> >    9054280 Mar 10 04:28 _1gi.prx
> >     328980 Mar 10 04:28 _1gi.tii
> >   23843209 Mar 10 04:28 _1gi.tis
> >        175 Mar 10 04:35 _1gj.fnm
> >   10222776 Mar 10 04:35 _1gj.frq
> >          4 Mar 10 04:35 _1gj.nrm
> >    9054943 Mar 10 04:35 _1gj.prx
> >     329060 Mar 10 04:35 _1gj.tii
> >   23849395 Mar 10 04:35 _1gj.tis
> >        175 Mar 10 04:42 _1gk.fnm
> >   10220381 Mar 10 04:42 _1gk.frq
> >          4 Mar 10 04:42 _1gk.nrm
> >    9052810 Mar 10 04:42 _1gk.prx
> >     329029 Mar 10 04:42 _1gk.tii
> >   23845373 Mar 10 04:42 _1gk.tis
> >        175 Mar 10 04:48 _1gl.fnm
> >    9274170 Mar 10 04:48 _1gl.frq
> >          4 Mar 10 04:48 _1gl.nrm
> >    8226681 Mar 10 04:48 _1gl.prx
> >     303327 Mar 10 04:48 _1gl.tii
> >   21996826 Mar 10 04:48 _1gl.tis
> >   22418126 Mar 10 04:58 _1gm.fdt
> >      31732 Mar 10 04:58 _1gm.fdx
> >        175 Mar 10 04:57 _1gm.fnm
> >   10216672 Mar 10 04:57 _1gm.frq
> >          4 Mar 10 04:57 _1gm.nrm
> >    9049487 Mar 10 04:57 _1gm.prx
> >     328813 Mar 10 04:57 _1gm.tii
> >   23829627 Mar 10 04:57 _1gm.tis
> >        175 Mar 10 04:58 _1gn.fnm
> >     392014 Mar 10 04:58 _1gn.frq
> >          4 Mar 10 04:58 _1gn.nrm
> >     415225 Mar 10 04:58 _1gn.prx
> >      24695 Mar 10 04:58 _1gn.tii
> >    1816750 Mar 10 04:58 _1gn.tis
> >        683 Mar 10 04:58 segments_7t
> >         20 Mar 10 04:58 segments.gen
> >  1935727800 Mar  9 11:17 _u1.fdt
> >    2739180 Mar  9 11:17 _u1.fdx
> >        175 Mar  9 11:15 _u1.fnm
> >  1193583522 Mar  9 11:25 _u1.frq
> >  817164507 Mar  9 11:25 _u1.prx
> >   16547464 Mar  9 11:25 _u1.tii
> >  1153764013 Mar  9 11:25 _u1.tis
> >  1949493315 Mar  9 12:21 _x3.fdt
> >    2758580 Mar  9 12:21 _x3.fdx
> >        175 Mar  9 12:18 _x3.fnm
> >  1202068425 Mar  9 12:29 _x3.frq
> >  822963200 Mar  9 12:29 _x3.prx
> >   16629485 Mar  9 12:29 _x3.tii
> >  1159419149 Mar  9 12:29 _x3.tis
> >
> >
> > Any ideas? I'm out of settings to tweak here.
> >
> > Cheers,
> > Mark
> >
> >
> >
> >
> > ----- Original Message ----
> > From: Michael McCandless <lu...@mikemccandless.com>
> > To: java-user@lucene.apache.org
> > Sent: Tuesday, 10 March, 2009 0:01:30
> > Subject: Re: A model for predicting indexing memory costs?
> >
> >
> > mark harwood wrote:
> >
> >>
> >> I've been building a large index (hundreds of millions) with mainly
> structured data which consists of several fields with mostly unique
> values.
> >> I've been hitting out of memory issues when doing periodic
> commits/closes which I suspect is down to the sheer number of terms.
> >>
> >> I set the IndexWriter.setTermIndexInterval to 8 times the normal size
> of 128 (an interval of 1024) which delayed the onset of the issue but still
> failed.
> >
> > I think that setting won't change how much RAM is used when writing.
> >
> >> I'd like to get a little more scientific about what to set here rather
> than simply experimenting with settings and hoping it doesn't fail again.
> >>
> >> Does anyone have a decent model worked out for how much memory is
> consumed at peak? I'm guessing the contributing factors are:
> >>
> >> * Numbers of fields
> >> * Numbers of unique terms per field
> >> * Numbers of segments?
> >
> > Number of net unique terms (across all fields) is a big driver, but also
> net number of term occurrences, and how many docs.  Lots of tiny docs take
> more RAM than fewer large docs, when # occurrences are equal.
> >
> > But... how come setting IW's RAM buffer doesn't prevent the OOMs?  IW
> should simply flush when it's used that much RAM.
> >
> > I don't think number of segments is a factor.
> >
> > Though mergeFactor is, since during merging the SegmentMerger holds
> SegmentReaders open, and int[] maps (if there are any deletes) for each
> segment.  Do you have a large merge taking place when you hit the OOMs?
> >
> > Mike
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by mark harwood <ma...@yahoo.co.uk>.
Thanks, Ian.

I forgot to mention I tried that setting and it then seemed to hang indefinitely.
I then switched back to a strategy of trying to minimise memory usage or at least gain an understanding of how much memory would be required by my application.

Cheers
Mark



----- Original Message ----
From: Ian Lea <ia...@gmail.com>
To: java-user@lucene.apache.org
Sent: Tuesday, 10 March, 2009 10:54:05
Subject: Re: A model for predicting indexing memory costs?

That's not the usual OOM message is it? java.lang.OutOfMemoryError: GC
overhead limit exceeded.

Looks like you might be able to work round it with -XX:-UseGCOverheadLimit

http://java-monitor.com/forum/archive/index.php/t-54.html
http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom


--
Ian.


On Tue, Mar 10, 2009 at 10:45 AM, mark harwood <ma...@yahoo.co.uk> wrote:
>
>>>But... how come setting IW's RAM buffer doesn't prevent the OOMs?
>
> I've been setting the IndexWriter RAM buffer to 300 meg and giving the JVM 1gig.
>
> Last run I gave the JVM 3 gig, with writer settings of  RAM buffer=300 meg, merge factor=20, term interval=8192, usecompound=false. All fields are ANALYZED_NO_NORMS.
> Lucene version is a 2.9 build,  JVM is Sun 64bit 1.6.0_07.
>
> This graphic shows timings for 100 consecutive write sessions, each adding 30,000 documents, committing and then closing :
>     http://tinyurl.com/anzcjw
> You can see the periodic merge costs and then a big spike towards the end before it crashed.
>
> The crash details are here after adding ~3 million documents in 98 write sessions:
>
> This batch index session added 3000 of 30000 docs : 10% complete
> Exception in thread "Thread-280" java.lang.OutOfMemoryError: GC overhead limit exceeded
>    at java.util.Arrays.copyOf(Unknown Source)
>    at java.lang.String.<init>(Unknown Source)
>    at org.apache.lucene.search.trie.TrieUtils.longToPrefixCoded(TrieUtils.java:148)
>    at org.apache.lucene.search.trie.TrieUtils.trieCodeLong(TrieUtils.java:302)
>    at test.LongTrieAnalyzer$LongTrieTokenStream.next(LongTrieAnalyzer.java:49)
>    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:159)
>    at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
>    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
>    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:762)
>    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:740)
>    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
>    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
>    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:319)
> Exception in thread "Thread-281" java.lang.OutOfMemoryError: GC overhead limit exceeded
>    at org.apache.commons.csv.CharBuffer.toString(CharBuffer.java:177)
>    at org.apache.commons.csv.CSVParser.getLine(CSVParser.java:242)
>    at test.IndexMarksFile.getLuceneDocument(IndexMarksFile.java:272)
>    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:314)
> Committing
> Closing
> Exception in thread "main" java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
>    at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3569)
>    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3660)
>    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3634)
>    at test.IndexMarksFile.run(IndexMarksFile.java:176)
>    at test.IndexMarksFile.main(IndexMarksFile.java:101)
>    at test.MultiIndexAndRun.main(MultiIndexAndRun.java:49)
>
>
> For each write session I have a single writer, and 2 indexing threads adding documents through this writer. There are no updates/deletes - only adds. When both indexing threads complete the primary thread commits and closes the writer.
> I then open a searcher run some search benchmarks, close the searcher and start another write session.
> The documents have ~12 fields and are all the same size so I don't think this OOM is down to rogue data. Each field has 100 near-unique tokens.
>
> The files on disk after the crash are as follows:
>  1930004059 Mar  9 13:32 _106.fdt
>    2731084 Mar  9 13:32 _106.fdx
>        175 Mar  9 13:30 _106.fnm
>  1190042394 Mar  9 13:39 _106.frq
>  814748995 Mar  9 13:39 _106.prx
>   16512596 Mar  9 13:39 _106.tii
>  1151364311 Mar  9 13:39 _106.tis
>  1949444533 Mar  9 14:53 _139.fdt
>    2758580 Mar  9 14:53 _139.fdx
>        175 Mar  9 14:51 _139.fnm
>  1202044423 Mar  9 15:00 _139.frq
>  822954002 Mar  9 15:00 _139.prx
>   16629104 Mar  9 15:00 _139.tii
>  1159392207 Mar  9 15:00 _139.tis
>  1930102055 Mar  9 16:15 _16c.fdt
>    2731084 Mar  9 16:15 _16c.fdx
>        175 Mar  9 16:13 _16c.fnm
>  1190090014 Mar  9 16:22 _16c.frq
>  814763781 Mar  9 16:22 _16c.prx
>   16514967 Mar  9 16:22 _16c.tii
>  1151524173 Mar  9 16:22 _16c.tis
>  1928053697 Mar  9 17:52 _19e.fdt
>    2728260 Mar  9 17:52 _19e.fdx
>        175 Mar  9 17:46 _19e.fnm
>  1188837093 Mar  9 18:08 _19e.frq
>  813915820 Mar  9 18:08 _19e.prx
>   16501902 Mar  9 18:08 _19e.tii
>  1150623773 Mar  9 18:08 _19e.tis
>  1951474247 Mar  9 20:22 _1cj.fdt
>    2761396 Mar  9 20:22 _1cj.fdx
>        175 Mar  9 20:18 _1cj.fnm
>  1203285781 Mar  9 20:39 _1cj.frq
>  823797656 Mar  9 20:39 _1cj.prx
>   16639997 Mar  9 20:39 _1cj.tii
>  1160143978 Mar  9 20:39 _1cj.tis
>  1929978366 Mar 10 01:02 _1fm.fdt
>    2731060 Mar 10 01:02 _1fm.fdx
>        175 Mar 10 00:43 _1fm.fnm
>  1190031780 Mar 10 02:36 _1fm.frq
>  814741146 Mar 10 02:36 _1fm.prx
>   16513189 Mar 10 02:36 _1fm.tii
>  1151399139 Mar 10 02:36 _1fm.tis
>  189073186 Mar 10 01:51 _1ft.fdt
>     267556 Mar 10 01:51 _1ft.fdx
>        175 Mar 10 01:50 _1ft.fnm
>  110750150 Mar 10 02:04 _1ft.frq
>   79818488 Mar 10 02:04 _1ft.prx
>    2326691 Mar 10 02:04 _1ft.tii
>  165932844 Mar 10 02:04 _1ft.tis
>  212500024 Mar 10 03:16 _1g5.fdt
>     300684 Mar 10 03:16 _1g5.fdx
>        175 Mar 10 03:16 _1g5.fnm
>  125179984 Mar 10 03:28 _1g5.frq
>   89703062 Mar 10 03:28 _1g5.prx
>    2594360 Mar 10 03:28 _1g5.tii
>  184495760 Mar 10 03:28 _1g5.tis
>   64323505 Mar 10 04:09 _1gc.fdt
>      91020 Mar 10 04:09 _1gc.fdx
>  105283820 Mar 10 04:48 _1gf.fdt
>     148988 Mar 10 04:48 _1gf.fdx
>        175 Mar 10 04:09 _1gf.fnm
>       1491 Mar 10 04:09 _1gf.frq
>          4 Mar 10 04:09 _1gf.nrm
>       2388 Mar 10 04:09 _1gf.prx
>        254 Mar 10 04:09 _1gf.tii
>      15761 Mar 10 04:09 _1gf.tis
>  191035191 Mar 10 04:09 _1gg.fdt
>     270332 Mar 10 04:09 _1gg.fdx
>        175 Mar 10 04:09 _1gg.fnm
>  111958741 Mar 10 04:24 _1gg.frq
>   80645411 Mar 10 04:24 _1gg.prx
>    2349153 Mar 10 04:24 _1gg.tii
>  167494232 Mar 10 04:24 _1gg.tis
>        175 Mar 10 04:20 _1gh.fnm
>   10223275 Mar 10 04:20 _1gh.frq
>          4 Mar 10 04:20 _1gh.nrm
>    9056546 Mar 10 04:20 _1gh.prx
>     329012 Mar 10 04:20 _1gh.tii
>   23846511 Mar 10 04:20 _1gh.tis
>        175 Mar 10 04:28 _1gi.fnm
>   10221888 Mar 10 04:28 _1gi.frq
>          4 Mar 10 04:28 _1gi.nrm
>    9054280 Mar 10 04:28 _1gi.prx
>     328980 Mar 10 04:28 _1gi.tii
>   23843209 Mar 10 04:28 _1gi.tis
>        175 Mar 10 04:35 _1gj.fnm
>   10222776 Mar 10 04:35 _1gj.frq
>          4 Mar 10 04:35 _1gj.nrm
>    9054943 Mar 10 04:35 _1gj.prx
>     329060 Mar 10 04:35 _1gj.tii
>   23849395 Mar 10 04:35 _1gj.tis
>        175 Mar 10 04:42 _1gk.fnm
>   10220381 Mar 10 04:42 _1gk.frq
>          4 Mar 10 04:42 _1gk.nrm
>    9052810 Mar 10 04:42 _1gk.prx
>     329029 Mar 10 04:42 _1gk.tii
>   23845373 Mar 10 04:42 _1gk.tis
>        175 Mar 10 04:48 _1gl.fnm
>    9274170 Mar 10 04:48 _1gl.frq
>          4 Mar 10 04:48 _1gl.nrm
>    8226681 Mar 10 04:48 _1gl.prx
>     303327 Mar 10 04:48 _1gl.tii
>   21996826 Mar 10 04:48 _1gl.tis
>   22418126 Mar 10 04:58 _1gm.fdt
>      31732 Mar 10 04:58 _1gm.fdx
>        175 Mar 10 04:57 _1gm.fnm
>   10216672 Mar 10 04:57 _1gm.frq
>          4 Mar 10 04:57 _1gm.nrm
>    9049487 Mar 10 04:57 _1gm.prx
>     328813 Mar 10 04:57 _1gm.tii
>   23829627 Mar 10 04:57 _1gm.tis
>        175 Mar 10 04:58 _1gn.fnm
>     392014 Mar 10 04:58 _1gn.frq
>          4 Mar 10 04:58 _1gn.nrm
>     415225 Mar 10 04:58 _1gn.prx
>      24695 Mar 10 04:58 _1gn.tii
>    1816750 Mar 10 04:58 _1gn.tis
>        683 Mar 10 04:58 segments_7t
>         20 Mar 10 04:58 segments.gen
>  1935727800 Mar  9 11:17 _u1.fdt
>    2739180 Mar  9 11:17 _u1.fdx
>        175 Mar  9 11:15 _u1.fnm
>  1193583522 Mar  9 11:25 _u1.frq
>  817164507 Mar  9 11:25 _u1.prx
>   16547464 Mar  9 11:25 _u1.tii
>  1153764013 Mar  9 11:25 _u1.tis
>  1949493315 Mar  9 12:21 _x3.fdt
>    2758580 Mar  9 12:21 _x3.fdx
>        175 Mar  9 12:18 _x3.fnm
>  1202068425 Mar  9 12:29 _x3.frq
>  822963200 Mar  9 12:29 _x3.prx
>   16629485 Mar  9 12:29 _x3.tii
>  1159419149 Mar  9 12:29 _x3.tis
>
>
> Any ideas? I'm out of settings to tweak here.
>
> Cheers,
> Mark
>
>
>
>
> ----- Original Message ----
> From: Michael McCandless <lu...@mikemccandless.com>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 10 March, 2009 0:01:30
> Subject: Re: A model for predicting indexing memory costs?
>
>
> mark harwood wrote:
>
>>
>> I've been building a large index (hundreds of millions) with mainly structured data which consists of several fields with mostly unique values.
>> I've been hitting out of memory issues when doing periodic commits/closes which I suspect is down to the sheer number of terms.
>>
>> I set the IndexWriter.setTermIndexInterval to 8 times the normal size of 128 (an interval of 1024) which delayed the onset of the issue but still failed.
>
> I think that setting won't change how much RAM is used when writing.
>
>> I'd like to get a little more scientific about what to set here rather than simply experimenting with settings and hoping it doesn't fail again.
>>
>> Does anyone have a decent model worked out for how much memory is consumed at peak? I'm guessing the contributing factors are:
>>
>> * Numbers of fields
>> * Numbers of unique terms per field
>> * Numbers of segments?
>
> Number of net unique terms (across all fields) is a big driver, but also net number of term occurrences, and how many docs.  Lots of tiny docs take more RAM than fewer large docs, when # occurrences are equal.
>
> But... how come setting IW's RAM buffer doesn't prevent the OOMs?  IW should simply flush when it's used that much RAM.
>
> I don't think number of segments is a factor.
>
> Though mergeFactor is, since during merging the SegmentMerger holds SegmentReaders open, and int[] maps (if there are any deletes) for each segment.  Do you have a large merge taking place when you hit the OOMs?
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


      


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by Ian Lea <ia...@gmail.com>.
That's not the usual OOM message is it? java.lang.OutOfMemoryError: GC
overhead limit exceeded.

Looks like you might be able to work round it with -XX:-UseGCOverheadLimit

http://java-monitor.com/forum/archive/index.php/t-54.html
http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom
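
In practice that just means adding the flag to the java command line for the
indexing run, e.g. something along these lines (the heap size is only a
placeholder; -verbose:gc is optional but shows how much time the collector is
really spending):

java -Xmx3g -XX:-UseGCOverheadLimit -verbose:gc test.MultiIndexAndRun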


--
Ian.


On Tue, Mar 10, 2009 at 10:45 AM, mark harwood <ma...@yahoo.co.uk> wrote:
>
>>>But... how come setting IW's RAM buffer doesn't prevent the OOMs?
>
> I've been setting the IndexWriter RAM buffer to 300 meg and giving the JVM 1gig.
>
> Last run I gave the JVM 3 gig, with writer settings of  RAM buffer=300 meg, merge factor=20, term interval=8192, usecompound=false. All fields are ANALYZED_NO_NORMS.
> Lucene version is a 2.9 build,  JVM is Sun 64bit 1.6.0_07.
>
> This graphic shows timings for 100 consecutive write sessions, each adding 30,000 documents, committing and then closing :
>     http://tinyurl.com/anzcjw
> You can see the periodic merge costs and then a big spike towards the end before it crashed.
>
> The crash details are here after adding ~3 million documents in 98 write sessions:
>
> This batch index session added 3000 of 30000 docs : 10% complete
> Exception in thread "Thread-280" java.lang.OutOfMemoryError: GC overhead limit exceeded
>    at java.util.Arrays.copyOf(Unknown Source)
>    at java.lang.String.<init>(Unknown Source)
>    at org.apache.lucene.search.trie.TrieUtils.longToPrefixCoded(TrieUtils.java:148)
>    at org.apache.lucene.search.trie.TrieUtils.trieCodeLong(TrieUtils.java:302)
>    at test.LongTrieAnalyzer$LongTrieTokenStream.next(LongTrieAnalyzer.java:49)
>    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:159)
>    at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
>    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
>    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:762)
>    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:740)
>    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
>    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
>    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:319)
> Exception in thread "Thread-281" java.lang.OutOfMemoryError: GC overhead limit exceeded
>    at org.apache.commons.csv.CharBuffer.toString(CharBuffer.java:177)
>    at org.apache.commons.csv.CSVParser.getLine(CSVParser.java:242)
>    at test.IndexMarksFile.getLuceneDocument(IndexMarksFile.java:272)
>    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:314)
> Committing
> Closing
> Exception in thread "main" java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
>    at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3569)
>    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3660)
>    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3634)
>    at test.IndexMarksFile.run(IndexMarksFile.java:176)
>    at test.IndexMarksFile.main(IndexMarksFile.java:101)
>    at test.MultiIndexAndRun.main(MultiIndexAndRun.java:49)
>
>
> For each write session I have a single writer, and 2 indexing threads adding documents through this writer. There are no updates/deletes - only adds. When both indexing threads complete the primary thread commits and closes the writer.
> I then open a searcher run some search benchmarks, close the searcher and start another write session.
> The documents have ~12 fields and are all the same size so I don't think this OOM is down to rogue data. Each field has 100 near-unique tokens.
>
> The files on disk after the crash are as follows:
>  1930004059 Mar  9 13:32 _106.fdt
>    2731084 Mar  9 13:32 _106.fdx
>        175 Mar  9 13:30 _106.fnm
>  1190042394 Mar  9 13:39 _106.frq
>  814748995 Mar  9 13:39 _106.prx
>   16512596 Mar  9 13:39 _106.tii
>  1151364311 Mar  9 13:39 _106.tis
>  1949444533 Mar  9 14:53 _139.fdt
>    2758580 Mar  9 14:53 _139.fdx
>        175 Mar  9 14:51 _139.fnm
>  1202044423 Mar  9 15:00 _139.frq
>  822954002 Mar  9 15:00 _139.prx
>   16629104 Mar  9 15:00 _139.tii
>  1159392207 Mar  9 15:00 _139.tis
>  1930102055 Mar  9 16:15 _16c.fdt
>    2731084 Mar  9 16:15 _16c.fdx
>        175 Mar  9 16:13 _16c.fnm
>  1190090014 Mar  9 16:22 _16c.frq
>  814763781 Mar  9 16:22 _16c.prx
>   16514967 Mar  9 16:22 _16c.tii
>  1151524173 Mar  9 16:22 _16c.tis
>  1928053697 Mar  9 17:52 _19e.fdt
>    2728260 Mar  9 17:52 _19e.fdx
>        175 Mar  9 17:46 _19e.fnm
>  1188837093 Mar  9 18:08 _19e.frq
>  813915820 Mar  9 18:08 _19e.prx
>   16501902 Mar  9 18:08 _19e.tii
>  1150623773 Mar  9 18:08 _19e.tis
>  1951474247 Mar  9 20:22 _1cj.fdt
>    2761396 Mar  9 20:22 _1cj.fdx
>        175 Mar  9 20:18 _1cj.fnm
>  1203285781 Mar  9 20:39 _1cj.frq
>  823797656 Mar  9 20:39 _1cj.prx
>   16639997 Mar  9 20:39 _1cj.tii
>  1160143978 Mar  9 20:39 _1cj.tis
>  1929978366 Mar 10 01:02 _1fm.fdt
>    2731060 Mar 10 01:02 _1fm.fdx
>        175 Mar 10 00:43 _1fm.fnm
>  1190031780 Mar 10 02:36 _1fm.frq
>  814741146 Mar 10 02:36 _1fm.prx
>   16513189 Mar 10 02:36 _1fm.tii
>  1151399139 Mar 10 02:36 _1fm.tis
>  189073186 Mar 10 01:51 _1ft.fdt
>     267556 Mar 10 01:51 _1ft.fdx
>        175 Mar 10 01:50 _1ft.fnm
>  110750150 Mar 10 02:04 _1ft.frq
>   79818488 Mar 10 02:04 _1ft.prx
>    2326691 Mar 10 02:04 _1ft.tii
>  165932844 Mar 10 02:04 _1ft.tis
>  212500024 Mar 10 03:16 _1g5.fdt
>     300684 Mar 10 03:16 _1g5.fdx
>        175 Mar 10 03:16 _1g5.fnm
>  125179984 Mar 10 03:28 _1g5.frq
>   89703062 Mar 10 03:28 _1g5.prx
>    2594360 Mar 10 03:28 _1g5.tii
>  184495760 Mar 10 03:28 _1g5.tis
>   64323505 Mar 10 04:09 _1gc.fdt
>      91020 Mar 10 04:09 _1gc.fdx
>  105283820 Mar 10 04:48 _1gf.fdt
>     148988 Mar 10 04:48 _1gf.fdx
>        175 Mar 10 04:09 _1gf.fnm
>       1491 Mar 10 04:09 _1gf.frq
>          4 Mar 10 04:09 _1gf.nrm
>       2388 Mar 10 04:09 _1gf.prx
>        254 Mar 10 04:09 _1gf.tii
>      15761 Mar 10 04:09 _1gf.tis
>  191035191 Mar 10 04:09 _1gg.fdt
>     270332 Mar 10 04:09 _1gg.fdx
>        175 Mar 10 04:09 _1gg.fnm
>  111958741 Mar 10 04:24 _1gg.frq
>   80645411 Mar 10 04:24 _1gg.prx
>    2349153 Mar 10 04:24 _1gg.tii
>  167494232 Mar 10 04:24 _1gg.tis
>        175 Mar 10 04:20 _1gh.fnm
>   10223275 Mar 10 04:20 _1gh.frq
>          4 Mar 10 04:20 _1gh.nrm
>    9056546 Mar 10 04:20 _1gh.prx
>     329012 Mar 10 04:20 _1gh.tii
>   23846511 Mar 10 04:20 _1gh.tis
>        175 Mar 10 04:28 _1gi.fnm
>   10221888 Mar 10 04:28 _1gi.frq
>          4 Mar 10 04:28 _1gi.nrm
>    9054280 Mar 10 04:28 _1gi.prx
>     328980 Mar 10 04:28 _1gi.tii
>   23843209 Mar 10 04:28 _1gi.tis
>        175 Mar 10 04:35 _1gj.fnm
>   10222776 Mar 10 04:35 _1gj.frq
>          4 Mar 10 04:35 _1gj.nrm
>    9054943 Mar 10 04:35 _1gj.prx
>     329060 Mar 10 04:35 _1gj.tii
>   23849395 Mar 10 04:35 _1gj.tis
>        175 Mar 10 04:42 _1gk.fnm
>   10220381 Mar 10 04:42 _1gk.frq
>          4 Mar 10 04:42 _1gk.nrm
>    9052810 Mar 10 04:42 _1gk.prx
>     329029 Mar 10 04:42 _1gk.tii
>   23845373 Mar 10 04:42 _1gk.tis
>        175 Mar 10 04:48 _1gl.fnm
>    9274170 Mar 10 04:48 _1gl.frq
>          4 Mar 10 04:48 _1gl.nrm
>    8226681 Mar 10 04:48 _1gl.prx
>     303327 Mar 10 04:48 _1gl.tii
>   21996826 Mar 10 04:48 _1gl.tis
>   22418126 Mar 10 04:58 _1gm.fdt
>      31732 Mar 10 04:58 _1gm.fdx
>        175 Mar 10 04:57 _1gm.fnm
>   10216672 Mar 10 04:57 _1gm.frq
>          4 Mar 10 04:57 _1gm.nrm
>    9049487 Mar 10 04:57 _1gm.prx
>     328813 Mar 10 04:57 _1gm.tii
>   23829627 Mar 10 04:57 _1gm.tis
>        175 Mar 10 04:58 _1gn.fnm
>     392014 Mar 10 04:58 _1gn.frq
>          4 Mar 10 04:58 _1gn.nrm
>     415225 Mar 10 04:58 _1gn.prx
>      24695 Mar 10 04:58 _1gn.tii
>    1816750 Mar 10 04:58 _1gn.tis
>        683 Mar 10 04:58 segments_7t
>         20 Mar 10 04:58 segments.gen
>  1935727800 Mar  9 11:17 _u1.fdt
>    2739180 Mar  9 11:17 _u1.fdx
>        175 Mar  9 11:15 _u1.fnm
>  1193583522 Mar  9 11:25 _u1.frq
>  817164507 Mar  9 11:25 _u1.prx
>   16547464 Mar  9 11:25 _u1.tii
>  1153764013 Mar  9 11:25 _u1.tis
>  1949493315 Mar  9 12:21 _x3.fdt
>    2758580 Mar  9 12:21 _x3.fdx
>        175 Mar  9 12:18 _x3.fnm
>  1202068425 Mar  9 12:29 _x3.frq
>  822963200 Mar  9 12:29 _x3.prx
>   16629485 Mar  9 12:29 _x3.tii
>  1159419149 Mar  9 12:29 _x3.tis
>
>
> Any ideas? I'm out of settings to tweak here.
>
> Cheers,
> Mark
>
>
>
>
> ----- Original Message ----
> From: Michael McCandless <lu...@mikemccandless.com>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 10 March, 2009 0:01:30
> Subject: Re: A model for predicting indexing memory costs?
>
>
> mark harwood wrote:
>
>>
>> I've been building a large index (hundreds of millions) with mainly structured data which consists of several fields with mostly unique values.
>> I've been hitting out of memory issues when doing periodic commits/closes which I suspect is down to the sheer number of terms.
>>
>> I set the IndexWriter.setTermIndexInterval to 8 times the normal size of 128 (an interval of 1024) which delayed the onset of the issue but still failed.
>
> I think that setting won't change how much RAM is used when writing.
>
>> I'd like to get a little more scientific about what to set here rather than simply experimenting with settings and hoping it doesn't fail again.
>>
>> Does anyone have a decent model worked out for how much memory is consumed at peak? I'm guessing the contributing factors are:
>>
>> * Numbers of fields
>> * Numbers of unique terms per field
>> * Numbers of segments?
>
> Number of net unique terms (across all fields) is a big driver, but also net number of term occurrences, and how many docs.  Lots of tiny docs take more RAM than fewer large docs, when # occurrences are equal.
>
> But... how come setting IW's RAM buffer doesn't prevent the OOMs?  IW should simply flush when it's used that much RAM.
>
> I don't think number of segments is a factor.
>
> Though mergeFactor is, since during merging the SegmentMerger holds SegmentReaders open, and int[] maps (if there are any deletes) for each segment.  Do you have a large merge taking place when you hit the OOMs?
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by mark harwood <ma...@yahoo.co.uk>.
>>But... how come setting IW's RAM buffer doesn't prevent the OOMs?

I've been setting the IndexWriter RAM buffer to 300 meg and giving the JVM 1 gig. 

Last run I gave the JVM 3 gig, with writer settings of RAM buffer=300 meg, merge factor=20, term index interval=8192, useCompoundFile=false. All fields are ANALYZED_NO_NORMS.
Lucene version is a 2.9 build, JVM is Sun 64-bit 1.6.0_07.
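
For reference, the writer setup is essentially this (a trimmed-down sketch - the path and StandardAnalyzer are placeholders, the real code uses a custom trie analyzer, and checked-exception handling is elided):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    Directory dir = FSDirectory.getDirectory("/path/to/index");   // placeholder path
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true,
                                         IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setRAMBufferSizeMB(300);      // flush by RAM usage, not doc count
    writer.setMergeFactor(20);
    writer.setTermIndexInterval(8192);
    writer.setUseCompoundFile(false);

    Document doc = new Document();
    doc.add(new Field("field1", "some value", Field.Store.YES,    // stored or not, per field
                      Field.Index.ANALYZED_NO_NORMS));            // each of the ~12 fields is ANALYZED_NO_NORMS
    writer.addDocument(doc);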

This graphic shows timings for 100 consecutive write sessions, each adding 30,000 documents, committing and then closing:
     http://tinyurl.com/anzcjw
You can see the periodic merge costs and then a big spike towards the end before it crashed.

The crash details are here after adding ~3 million documents in 98 write sessions:

This batch index session added 3000 of 30000 docs : 10% complete
Exception in thread "Thread-280" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOf(Unknown Source)
    at java.lang.String.<init>(Unknown Source)
    at org.apache.lucene.search.trie.TrieUtils.longToPrefixCoded(TrieUtils.java:148)
    at org.apache.lucene.search.trie.TrieUtils.trieCodeLong(TrieUtils.java:302)
    at test.LongTrieAnalyzer$LongTrieTokenStream.next(LongTrieAnalyzer.java:49)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:159)
    at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:762)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:740)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2039)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2013)
    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:319)
Exception in thread "Thread-281" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.commons.csv.CharBuffer.toString(CharBuffer.java:177)
    at org.apache.commons.csv.CSVParser.getLine(CSVParser.java:242)
    at test.IndexMarksFile.getLuceneDocument(IndexMarksFile.java:272)
    at test.IndexMarksFile$IndexingThread.run(IndexMarksFile.java:314)
Committing
Closing
Exception in thread "main" java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
    at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3569)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3660)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3634)
    at test.IndexMarksFile.run(IndexMarksFile.java:176)
    at test.IndexMarksFile.main(IndexMarksFile.java:101)
    at test.MultiIndexAndRun.main(MultiIndexAndRun.java:49)


For each write session I have a single writer, and 2 indexing threads adding documents through this writer. There are no updates/deletes - only adds. When both indexing threads complete, the primary thread commits and closes the writer.
I then open a searcher, run some search benchmarks, close the searcher and start another write session.
The documents have ~12 fields and are all the same size, so I don't think this OOM is down to rogue data. Each field has 100 near-unique tokens.
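
In outline a write session looks like this (simplified sketch - nextDocument() stands in for the CSV parsing, and imports/exception handling are trimmed):

    // One write session: two indexing threads share the single IndexWriter.
    Thread[] indexers = new Thread[2];
    for (int i = 0; i < indexers.length; i++) {
        indexers[i] = new Thread(new Runnable() {
            public void run() {
                try {
                    Document doc;
                    while ((doc = nextDocument()) != null) {   // nextDocument() = stand-in for the CSV reader
                        writer.addDocument(doc);               // adds only - no updates or deletes
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });
        indexers[i].start();
    }
    for (Thread t : indexers) {
        t.join();             // wait for both indexing threads to finish
    }
    writer.commit();          // primary thread commits...
    writer.close();           // ...and closes, then the search benchmarks run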

The files on disk after the crash are as follows:
 1930004059 Mar  9 13:32 _106.fdt
    2731084 Mar  9 13:32 _106.fdx
        175 Mar  9 13:30 _106.fnm
 1190042394 Mar  9 13:39 _106.frq
  814748995 Mar  9 13:39 _106.prx
   16512596 Mar  9 13:39 _106.tii
 1151364311 Mar  9 13:39 _106.tis
 1949444533 Mar  9 14:53 _139.fdt
    2758580 Mar  9 14:53 _139.fdx
        175 Mar  9 14:51 _139.fnm
 1202044423 Mar  9 15:00 _139.frq
  822954002 Mar  9 15:00 _139.prx
   16629104 Mar  9 15:00 _139.tii
 1159392207 Mar  9 15:00 _139.tis
 1930102055 Mar  9 16:15 _16c.fdt
    2731084 Mar  9 16:15 _16c.fdx
        175 Mar  9 16:13 _16c.fnm
 1190090014 Mar  9 16:22 _16c.frq
  814763781 Mar  9 16:22 _16c.prx
   16514967 Mar  9 16:22 _16c.tii
 1151524173 Mar  9 16:22 _16c.tis
 1928053697 Mar  9 17:52 _19e.fdt
    2728260 Mar  9 17:52 _19e.fdx
        175 Mar  9 17:46 _19e.fnm
 1188837093 Mar  9 18:08 _19e.frq
  813915820 Mar  9 18:08 _19e.prx
   16501902 Mar  9 18:08 _19e.tii
 1150623773 Mar  9 18:08 _19e.tis
 1951474247 Mar  9 20:22 _1cj.fdt
    2761396 Mar  9 20:22 _1cj.fdx
        175 Mar  9 20:18 _1cj.fnm
 1203285781 Mar  9 20:39 _1cj.frq
  823797656 Mar  9 20:39 _1cj.prx
   16639997 Mar  9 20:39 _1cj.tii
 1160143978 Mar  9 20:39 _1cj.tis
 1929978366 Mar 10 01:02 _1fm.fdt
    2731060 Mar 10 01:02 _1fm.fdx
        175 Mar 10 00:43 _1fm.fnm
 1190031780 Mar 10 02:36 _1fm.frq
  814741146 Mar 10 02:36 _1fm.prx
   16513189 Mar 10 02:36 _1fm.tii
 1151399139 Mar 10 02:36 _1fm.tis
  189073186 Mar 10 01:51 _1ft.fdt
     267556 Mar 10 01:51 _1ft.fdx
        175 Mar 10 01:50 _1ft.fnm
  110750150 Mar 10 02:04 _1ft.frq
   79818488 Mar 10 02:04 _1ft.prx
    2326691 Mar 10 02:04 _1ft.tii
  165932844 Mar 10 02:04 _1ft.tis
  212500024 Mar 10 03:16 _1g5.fdt
     300684 Mar 10 03:16 _1g5.fdx
        175 Mar 10 03:16 _1g5.fnm
  125179984 Mar 10 03:28 _1g5.frq
   89703062 Mar 10 03:28 _1g5.prx
    2594360 Mar 10 03:28 _1g5.tii
  184495760 Mar 10 03:28 _1g5.tis
   64323505 Mar 10 04:09 _1gc.fdt
      91020 Mar 10 04:09 _1gc.fdx
  105283820 Mar 10 04:48 _1gf.fdt
     148988 Mar 10 04:48 _1gf.fdx
        175 Mar 10 04:09 _1gf.fnm
       1491 Mar 10 04:09 _1gf.frq
          4 Mar 10 04:09 _1gf.nrm
       2388 Mar 10 04:09 _1gf.prx
        254 Mar 10 04:09 _1gf.tii
      15761 Mar 10 04:09 _1gf.tis
  191035191 Mar 10 04:09 _1gg.fdt
     270332 Mar 10 04:09 _1gg.fdx
        175 Mar 10 04:09 _1gg.fnm
  111958741 Mar 10 04:24 _1gg.frq
   80645411 Mar 10 04:24 _1gg.prx
    2349153 Mar 10 04:24 _1gg.tii
  167494232 Mar 10 04:24 _1gg.tis
        175 Mar 10 04:20 _1gh.fnm
   10223275 Mar 10 04:20 _1gh.frq
          4 Mar 10 04:20 _1gh.nrm
    9056546 Mar 10 04:20 _1gh.prx
     329012 Mar 10 04:20 _1gh.tii
   23846511 Mar 10 04:20 _1gh.tis
        175 Mar 10 04:28 _1gi.fnm
   10221888 Mar 10 04:28 _1gi.frq
          4 Mar 10 04:28 _1gi.nrm
    9054280 Mar 10 04:28 _1gi.prx
     328980 Mar 10 04:28 _1gi.tii
   23843209 Mar 10 04:28 _1gi.tis
        175 Mar 10 04:35 _1gj.fnm
   10222776 Mar 10 04:35 _1gj.frq
          4 Mar 10 04:35 _1gj.nrm
    9054943 Mar 10 04:35 _1gj.prx
     329060 Mar 10 04:35 _1gj.tii
   23849395 Mar 10 04:35 _1gj.tis
        175 Mar 10 04:42 _1gk.fnm
   10220381 Mar 10 04:42 _1gk.frq
          4 Mar 10 04:42 _1gk.nrm
    9052810 Mar 10 04:42 _1gk.prx
     329029 Mar 10 04:42 _1gk.tii
   23845373 Mar 10 04:42 _1gk.tis
        175 Mar 10 04:48 _1gl.fnm
    9274170 Mar 10 04:48 _1gl.frq
          4 Mar 10 04:48 _1gl.nrm
    8226681 Mar 10 04:48 _1gl.prx
     303327 Mar 10 04:48 _1gl.tii
   21996826 Mar 10 04:48 _1gl.tis
   22418126 Mar 10 04:58 _1gm.fdt
      31732 Mar 10 04:58 _1gm.fdx
        175 Mar 10 04:57 _1gm.fnm
   10216672 Mar 10 04:57 _1gm.frq
          4 Mar 10 04:57 _1gm.nrm
    9049487 Mar 10 04:57 _1gm.prx
     328813 Mar 10 04:57 _1gm.tii
   23829627 Mar 10 04:57 _1gm.tis
        175 Mar 10 04:58 _1gn.fnm
     392014 Mar 10 04:58 _1gn.frq
          4 Mar 10 04:58 _1gn.nrm
     415225 Mar 10 04:58 _1gn.prx
      24695 Mar 10 04:58 _1gn.tii
    1816750 Mar 10 04:58 _1gn.tis
        683 Mar 10 04:58 segments_7t
         20 Mar 10 04:58 segments.gen
 1935727800 Mar  9 11:17 _u1.fdt
    2739180 Mar  9 11:17 _u1.fdx
        175 Mar  9 11:15 _u1.fnm
 1193583522 Mar  9 11:25 _u1.frq
  817164507 Mar  9 11:25 _u1.prx
   16547464 Mar  9 11:25 _u1.tii
 1153764013 Mar  9 11:25 _u1.tis
 1949493315 Mar  9 12:21 _x3.fdt
    2758580 Mar  9 12:21 _x3.fdx
        175 Mar  9 12:18 _x3.fnm
 1202068425 Mar  9 12:29 _x3.frq
  822963200 Mar  9 12:29 _x3.prx
   16629485 Mar  9 12:29 _x3.tii
 1159419149 Mar  9 12:29 _x3.tis


Any ideas? I'm out of settings to tweak here.

Cheers,
Mark




----- Original Message ----
From: Michael McCandless <lu...@mikemccandless.com>
To: java-user@lucene.apache.org
Sent: Tuesday, 10 March, 2009 0:01:30
Subject: Re: A model for predicting indexing memory costs?


mark harwood wrote:

> 
> I've been building a large index (hundreds of millions) with mainly structured data which consists of several fields with mostly unique values.
> I've been hitting out of memory issues when doing periodic commits/closes which I suspect is down to the sheer number of terms.
> 
> I set the IndexWriter.setTermIndexInterval to 8 times the normal size of 128 (an interval of 1024) which delayed the onset of the issue but still failed.

I think that setting won't change how much RAM is used when writing.

> I'd like to get a little more scientific about what to set here rather than simply experimenting with settings and hoping it doesn't fail again.
> 
> Does anyone have a decent model worked out for how much memory is consumed at peak? I'm guessing the contributing factors are:
> 
> * Numbers of fields
> * Numbers of unique terms per field
> * Numbers of segments?

Number of net unique terms (across all fields) is a big driver, but also net number of term occurrences, and how many docs.  Lots of tiny docs take more RAM than fewer large docs, when # occurrences are equal.

But... how come setting IW's RAM buffer doesn't prevent the OOMs?  IW should simply flush when it's used that much RAM.

I don't think number of segments is a factor.

Though mergeFactor is, since during merging the SegmentMerger holds SegmentReaders open, and int[] maps (if there are any deletes) for each segment.  Do you have a large merge taking place when you hit the OOMs?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


      


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A model for predicting indexing memory costs?

Posted by Michael McCandless <lu...@mikemccandless.com>.
mark harwood wrote:

>
> I've been building a large index (hundreds of millions) with mainly  
> structured data which consists of several fields with mostly unique  
> values.
> I've been hitting out of memory issues when doing periodic commits/ 
> closes which I suspect is down to the sheer number of terms.
>
> I set the IndexWriter.setTermIndexInterval to 8 times the normal  
> size of 128 (an interval of 1024) which delayed the onset of the  
> issue but still failed.

I think that setting won't change how much RAM is used when writing.

> I'd like to get a little more scientific about what to set here  
> rather than simply experimenting with settings and hoping it doesn't  
> fail again.
>
> Does anyone have a decent model worked out for how much memory is  
> consumed at peak? I'm guessing the contributing factors are:
>
> * Numbers of fields
> * Numbers of unique terms per field
> * Numbers of segments?

Number of net unique terms (across all fields) is a big driver, but  
also net number of term occurrences, and how many docs.  Lots of tiny  
docs take more RAM than fewer large docs, when # occurrences are equal.

But... how come setting IW's RAM buffer doesn't prevent the OOMs?  IW  
should simply flush when it's used that much RAM.
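
One quick way to check is to watch IndexWriter's own accounting as you add docs, something along these lines (rough sketch - nextDocument() is whatever produces your documents):

    // Track the writer's estimate of buffered RAM while adding documents.
    long maxRam = 0;
    Document doc;
    while ((doc = nextDocument()) != null) {
        writer.addDocument(doc);
        long used = writer.ramSizeInBytes();   // writer's estimate of RAM buffered so far
        if (used > maxRam) {
            maxRam = used;
            System.out.println("IW buffered RAM: " + (used / (1024 * 1024)) + " MB");
        }
    }

Calling writer.setInfoStream(System.out) will also show you when flushes and merges kick off.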

I don't think number of segments is a factor.

Though mergeFactor is, since during merging the SegmentMerger holds  
SegmentReaders open, and int[] maps (if there are any deletes) for  
each segment.  Do you have a large merge taking place when you hit the  
OOMs?
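
If a big merge does turn out to be in flight, a couple of knobs that can reduce its peak RAM (just a sketch of the idea):

    writer.setMergeFactor(10);                              // the default; fewer segments held open per merge
    writer.setMergeScheduler(new SerialMergeScheduler());   // run one merge at a time, on the calling thread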

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org