Posted to java-user@lucene.apache.org by Tom Burton-West <tb...@umich.edu> on 2015/01/09 22:15:07 UTC

Details on setting block parameters for Lucene41PostingsFormat

Hello all,

We have over 3 billion unique terms in our indexes and with Solr 3.x we set
the TermIndexInterval to about 8 times its default value in order to index
without OOMs
(http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).

We are now working with Solr 4 and running into memory issues and are
wondering if we need to do something analogous for Solr 4.

The javadoc for IndexWriterConfig
(http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29)
indicates that the Lucene 4.1 postings format has some parameters which may
be set:
"...To configure its parameters (the minimum and maximum size for a block),
you would instead use Lucene41PostingsFormat.Lucene41PostingsFormat(int, int)
<https://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Lucene41PostingsFormat%28int,%20int%29>"

Is there documentation or discussion somewhere about how to determine
appropriate parameters or some detail about what setting the maxBlockSize
and minBlockSize does?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

Re: Details on setting block parameters for Lucene41PostingsFormat

Posted by Tom Burton-West <tb...@umich.edu>.
Thanks Mike,

Do you know how I can configure Solr to use the min=200 and
max=398 block sizes you suggested?  Or should I ask on the Solr list?

Tom

On Sat, Jan 10, 2015 at 4:46 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> The first int to Lucene41PostingsFormat is the min block size (default
> 25) and the second is the max (default 48) for the block tree terms
> dict.
>
> The max must be >= 2*(min-1).
>
> Since you were using 8X the default before, maybe try min=200 and
> max=398?  However, block tree should have been more RAM efficient than
> 3.x's terms index... if you run CheckIndex with -verbose it will print
> additional details about the block structure of your terms indices...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Details on setting block parameters for Lucene41PostingsFormat

Posted by Chris Hostetter <ho...@fucit.org>.
: 
: The first int to Lucene41PostingsFormat is the min block size (default
: 25) and the second is the max (default 48) for the block tree terms
: dict.

we were discussing over on the solr-user mailing list how Tom would/could
go about configuring Solr to use a custom subclass of
Lucene41PostingsFormat where he overrode those min/max constructor params,
but i realized i have no idea how he's supposed to leverage the plumbing in
PostingsFormat to override the "name" of the format so it's used properly
in SPI.

Lucene41PostingsFormat's constructor options only allow overriding the
block sizes, not the "name" that gets propagated up to the PostingsFormat()
constructor ... so what is the expected way to write a subclass?


-Hoss
http://www.lucidworks.com/

Re: Details on setting block parameters for Lucene41PostingsFormat

Posted by Erick Erickson <er...@gmail.com>.
Tom:

I'll be very interested to see your final numbers. I did a worst-case test
at one point and saw a 2/3 reduction, but that was deliberately "worst
case": I used a bunch of string/text types, did some faceting on them, etc.
IOW, not real-world at all. So it'll be cool to see what you come up with.

The other benefit is that you have many, many fewer objects allocated on
the heap. I was seeing two orders of magnitude fewer. That's right, a 99%
reduction. Again, though, I was deliberately doing really bad stuff....

Best,
Erick

On Sat, Jan 10, 2015 at 4:58 PM, Tom Burton-West <tb...@umich.edu> wrote:
> Thanks Mike,
>
> We run our Solr 3.x indexing with 10GB/shard.  I've been testing Solr 4
> with 4, 6, and 8GB for heap.  As of Friday night when the indexes were about
> half done (about 400GB on disk) only the 4GB had issues.  I'll find out on
> Monday if the other runs had issues.  If we can go from 10GB in Solr 3.x to
> 6GB with Solr 4.x, that will be a significant change.
>
> With TermIndexInterval we traded lower memory use for an increased chance
> of disk seeks and more data to be read per seek (and, if I remember right,
> that extra data was scanned sequentially rather than binary searched).
> What is the trade-off when increasing the block size?
>
> Tom

Re: Details on setting block parameters for Lucene41PostingsFormat

Posted by Tom Burton-West <tb...@umich.edu>.
Thanks Mike,


> OK.  It would be good to know where all your RAM is being consumed,
> and how much of that is really the terms index: it ought to be a very
> small part of it.

I made a bunch of heap dumps.  I just watched with jconsole and ran jmap
-histo when memory use got high.
I've appended a bit more from the error trace and the top memory users
from one of the heap dumps below.

I tried to send a bunch of heap dumps to the mailing list but the message
got rejected. I'll send them directly to you.


Tom



----
java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:212)
        at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:230)
        at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:48)
        at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:252)
        at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:292)
        at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:659)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
---
top memory users from one of the heap dumps:

   1:       1131932     2546933736  [B
   2:        308670      743033280  [I
   3:        696803      203038680  [C
   4:        383039       36771744  org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState
   5:       1089864       26156736  org.apache.lucene.util.AttributeSource$State
   6:        544870       26153760  org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl
   7:        687500       16500000  org.apache.lucene.util.BytesRef
   8:        135820        9779040  org.apache.lucene.util.fst.FST$Arc
   9:        382519        9180456  org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$PendingTerm
  10:        382037        9168888  org.apache.lucene.codecs.TermStats
  11:        544952        8719232  org.apache.lucene.util.BytesRefBuilder
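
For anyone who wants to total up a histogram like the one above, here is a
minimal sketch in plain Java. It assumes each data row has the shape
"rank: instances bytes class-name" (the columns jmap -histo prints), and it
skips anything else. The class and method names (HistoRow, JmapHisto,
parseRow, totalBytes) are made up for this illustration and are not part of
any JDK or Lucene API. In the JVM's notation, [B, [I and [C are byte[],
int[] and char[].

```java
import java.util.Arrays;
import java.util.List;

// One data row of a class histogram (hypothetical helper type).
final class HistoRow {
    final long instances;
    final long bytes;
    final String className;

    HistoRow(long instances, long bytes, String className) {
        this.instances = instances;
        this.bytes = bytes;
        this.className = className;
    }
}

final class JmapHisto {
    // Parses a row such as "   1:   1131932   2546933736  [B".
    // Returns null for headers, blank lines, or anything without a "N:" rank.
    static HistoRow parseRow(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length < 4 || !f[0].matches("\\d+:")) {
            return null;
        }
        return new HistoRow(Long.parseLong(f[1]), Long.parseLong(f[2]), f[3]);
    }

    // Sums the bytes column over every line that parses as a data row.
    static long totalBytes(List<String> lines) {
        long total = 0;
        for (String line : lines) {
            HistoRow row = parseRow(line);
            if (row != null) {
                total += row.bytes;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // The three largest rows from the histogram in this message.
        List<String> top = Arrays.asList(
                "   1:       1131932     2546933736  [B",
                "   2:        308670      743033280  [I",
                "   3:        696803      203038680  [C");
        for (String line : top) {
            HistoRow row = parseRow(line);
            System.out.printf("%-4s %,d bytes in %,d instances%n",
                    row.className, row.bytes, row.instances);
        }
        System.out.printf("total: %,d bytes%n", totalBytes(top));
    }
}
```

The primitive arrays in the first three rows add up to far more than all of
the Lucene-specific classes listed after them, which is the kind of
comparison the sketch is meant to make easy.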

Re: Details on setting block parameters for Lucene41PostingsFormat

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sat, Jan 10, 2015 at 7:58 PM, Tom Burton-West <tb...@umich.edu> wrote:
> Thanks Mike,
>
> We run our Solr 3.x indexing with 10GB/shard.  I've been testing Solr 4
> with 4, 6, and 8GB for heap.  As of Friday night when the indexes were about
> half done (about 400GB on disk) only the 4GB had issues.  I'll find out on
> Monday if the other runs had issues.  If we can go from 10GB in Solr 3.x to
> 6GB with Solr 4.x, that will be a significant change.

OK.  It would be good to know where all your RAM is being consumed,
and how much of that is really the terms index: it ought to be a very
small part of it.

> With TermIndexInterval we traded lower memory use for an increased chance
> of disk seeks and more data to be read per seek (and, if I remember right,
> that extra data was scanned sequentially rather than binary searched).
> What is the trade-off when increasing the block size?

It's exactly the same tradeoff: blocks will be larger, so there will
be fewer blocks that the terms index must reference (making it
smaller), but more scanning to find your exact term within a given
block.

Mike McCandless

http://blog.mikemccandless.com
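
A toy model of the tradeoff described above, in plain Java. It assumes every
block holds exactly the same number of terms and that a lookup scans a block
linearly for a term that is equally likely to sit at any position. Real
block tree blocks vary in size between the min and the max, so the figures
are only a rough illustration. BlockTradeoff, blockCount and averageScan are
hypothetical names, not Lucene API.

```java
final class BlockTradeoff {
    // Blocks needed if every block held exactly blockSize terms (rounded up).
    static long blockCount(long terms, int blockSize) {
        return (terms + blockSize - 1) / blockSize;
    }

    // Average number of terms examined by a linear scan within one block,
    // assuming the target is equally likely to be at any position.
    static double averageScan(int blockSize) {
        return (blockSize + 1) / 2.0;
    }

    public static void main(String[] args) {
        long terms = 3_000_000_000L;      // "over 3 billion unique terms"
        int[] sizes = {48, 398};          // default max vs. suggested max
        for (int size : sizes) {
            System.out.printf("block size %d: %,d blocks, ~%.1f terms scanned%n",
                    size, blockCount(terms, size), averageScan(size));
        }
    }
}
```

Under these assumptions, going from 48 to 398 terms per block cuts the
number of blocks the terms index has to reference by roughly a factor of 8,
and the average scan inside a block grows by about the same factor.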

Re: Details on setting block parameters for Lucene41PostingsFormat

Posted by Tom Burton-West <tb...@umich.edu>.
Thanks Mike,

We run our Solr 3.x indexing with 10GB/shard.  I've been testing Solr 4
with 4, 6, and 8GB for heap.  As of Friday night when the indexes were about
half done (about 400GB on disk) only the 4GB had issues.  I'll find out on
Monday if the other runs had issues.  If we can go from 10GB in Solr 3.x to
6GB with Solr 4.x, that will be a significant change.

With TermIndexInterval we traded lower memory use for an increased chance
of disk seeks and more data to be read per seek (and, if I remember right,
that extra data was scanned sequentially rather than binary searched).
What is the trade-off when increasing the block size?

Tom

Re: Details on setting block parameters for Lucene41PostingsFormat

Posted by Michael McCandless <lu...@mikemccandless.com>.
The first int to Lucene41PostingsFormat is the min block size (default
25) and the second is the max (default 48) for the block tree terms
dict.

The max must be >= 2*(min-1).

Since you were using 8X the default before, maybe try min=200 and
max=398?  However, block tree should have been more RAM efficient than
3.x's terms index... if you run CheckIndex with -verbose it will print
additional details about the block structure of your terms indices...

Mike McCandless

http://blog.mikemccandless.com
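
The arithmetic behind min=200 and max=398 can be sketched in a few lines of
plain Java. This only checks the constraint quoted above and scales the
default min by 8; it does not touch Lucene. BlockSizes and its methods are
hypothetical names for this illustration. The resulting pair is what would be
passed to the Lucene41PostingsFormat(int, int) constructor mentioned in the
javadoc.

```java
final class BlockSizes {
    // Defaults quoted in this thread for the block tree terms dict.
    static final int DEFAULT_MIN = 25;
    static final int DEFAULT_MAX = 48;

    // Constraint quoted in this thread: max must be >= 2*(min-1).
    static boolean satisfiesConstraint(int min, int max) {
        return max >= 2 * (min - 1);
    }

    // Smallest max that the constraint allows for a given min.
    static int smallestMaxFor(int min) {
        return 2 * (min - 1);
    }

    public static void main(String[] args) {
        int min = DEFAULT_MIN * 8;        // 8X the default min
        int max = smallestMaxFor(min);
        System.out.println("min=" + min + " max=" + max
                + " ok=" + satisfiesConstraint(min, max));
        // prints: min=200 max=398 ok=true
    }
}
```

The defaults sit exactly on the same boundary: 2*(25-1) is 48.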

