Posted to java-user@lucene.apache.org by Tom Burton-West <tb...@umich.edu> on 2015/01/09 22:15:07 UTC
Details on setting block parameters for Lucene41PostingsFormat
Hello all,
We have over 3 billion unique terms in our indexes and with Solr 3.x we set
the TermIndexInterval to about 8 times its default value in order to index
without OOMs. (
http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again)
We are now working with Solr 4 and running into memory issues and are
wondering if we need to do something analogous for Solr 4.
The javadoc for IndexWriterConfig (
http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
)
indicates that the Lucene 4.1 postings format has some parameters which may
be set:
"...To configure its parameters (the minimum and maximum size for a block),
you would instead use Lucene41PostingsFormat.Lucene41PostingsFormat(int, int)
<https://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Lucene41PostingsFormat%28int,%20int%29>"
Is there documentation or discussion somewhere about how to determine
appropriate parameters or some detail about what setting the maxBlockSize
and minBlockSize does?
Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
Re: Details on setting block parameters for Lucene41PostingsFormat
Posted by Tom Burton-West <tb...@umich.edu>.
Thanks Mike,
Do you know how I can configure Solr to use the min=200 and
max=398 block sizes you suggested? Or should I ask on the Solr list?
Tom
On Sat, Jan 10, 2015 at 4:46 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:
> The first int to Lucene41PostingsFormat is the min block size (default
> 25) and the second is the max (default 48) for the block tree terms
> dict.
>
> The max must be >= 2*(min-1).
>
> Since you were using 8X the default before, maybe try min=200 and
> max=398? However, block tree should have been more RAM efficient than
> 3.x's terms index... if you run CheckIndex with -verbose it will print
> additional details about the block structure of your terms indices...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
Re: Details on setting block parameters for Lucene41PostingsFormat
Posted by Chris Hostetter <ho...@fucit.org>.
:
: The first int to Lucene41PostingsFormat is the min block size (default
: 25) and the second is the max (default 48) for the block tree terms
: dict.
we were discussing over on the solr-user mailing list how Tom would/could
go about configuring Solr to use a custom subclass of
Lucene41PostingsFormat where he overrode those min/max constructor params,
but i realized i have no idea how he's supposed to leverage the plumbing in
PostingsFormat to override the "name" of the format so it's used properly
in SPI.
Lucene41PostingsFormat's constructor options only allow overriding the
block sizes, not the "name" that gets propagated up to the PostingsFormat()
constructor ... so what is the expected way to write a subclass?
-Hoss
http://www.lucidworks.com/
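The naming question above can be pictured with a small, self-contained sketch. Everything in it is a hypothetical stand-in: Format, Stock41, BigBlock41 and the map-based registry are NOT Lucene classes, and the registry only imitates lookup by name. It shows one possible shape, a wrapper that passes its own name to the base constructor and delegates the real work to the stock format built with min=200 and max=398. In Lucene 4.x the analogous wrapper would presumably extend PostingsFormat and forward fieldsConsumer and fieldsProducer, but the thread does not confirm that, nor whether a distinct name is needed at all when only the block sizes differ.

```java
import java.util.HashMap;
import java.util.Map;

public class NamedFormatSketch {
    /** Hypothetical stand-in for a postings format: identified by a name fixed at construction. */
    abstract static class Format {
        private final String name;
        protected Format(String name) { this.name = name; }
        String getName() { return name; }
        abstract String describe();
    }

    /** Stand-in for the stock format: callers can change block sizes but not the name. */
    static final class Stock41 extends Format {
        private final int minBlockSize;
        private final int maxBlockSize;
        Stock41() { this(25, 48); }
        Stock41(int minBlockSize, int maxBlockSize) {
            super("Lucene41");
            this.minBlockSize = minBlockSize;
            this.maxBlockSize = maxBlockSize;
        }
        @Override String describe() {
            return "blocktree(min=" + minBlockSize + ", max=" + maxBlockSize + ")";
        }
    }

    /** Wrapper with its own name that delegates to the stock format with larger blocks. */
    static final class BigBlock41 extends Format {
        private final Format delegate = new Stock41(200, 398);
        BigBlock41() { super("BigBlock41"); }
        @Override String describe() { return delegate.describe(); }
    }

    /** Map-based lookup that imitates resolving a format by name. */
    static final Map<String, Format> REGISTRY = new HashMap<>();

    static void register(Format format) { REGISTRY.put(format.getName(), format); }

    static Format forName(String name) {
        Format format = REGISTRY.get(name);
        if (format == null) {
            throw new IllegalArgumentException("no format named " + name);
        }
        return format;
    }

    public static void main(String[] args) {
        register(new Stock41());
        register(new BigBlock41());
        System.out.println("Lucene41   -> " + forName("Lucene41").describe());
        System.out.println("BigBlock41 -> " + forName("BigBlock41").describe());
    }
}
```

Delegation is used here instead of inheritance because the stand-in stock class fixes its name in its own constructor, which is the limitation described above.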
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Details on setting block parameters for Lucene41PostingsFormat
Posted by Erick Erickson <er...@gmail.com>.
Tom:
I'll be very interested to see your final numbers. I did a worst-case test at
one point and saw a 2/3 reduction, but... that was deliberately "worst case":
I used a bunch of string/text types, did some faceting on them, etc. IOW, not
real-world at all. So it'll be cool to see what you come up with.
The other benefit is that you have many, many fewer objects allocated on the
heap; I was seeing two orders of magnitude fewer. That's right, a 99%
reduction. Again, though, I was deliberately doing really bad stuff...
Best,
Erick
On Sat, Jan 10, 2015 at 4:58 PM, Tom Burton-West <tb...@umich.edu> wrote:
> Thanks Mike,
>
> We run our Solr 3.x indexing with 10GB/shard. I've been testing Solr 4
> with 4, 6, and 8GB for heap. As of Friday night when the indexes were about
> half done (about 400GB on disk) only the 4GB had issues. I'll find out on
> Monday if the other runs had issues. If we can go from 10GB in Solr 3.x to
> 6GB with Solr 4.x, that will be a significant change.
>
> With TermIndexInterval we traded off less memory use for increased chance
> of disk seeks and more data to be read per seek (and if I remember right,
> that more data was scanned sequentially rather than binary searched.)
> What is the trade-off when increasing the block size?
>
> Tom
Re: Details on setting block parameters for Lucene41PostingsFormat
Posted by Tom Burton-West <tb...@umich.edu>.
Thanks Mike,
> OK. It would be good to know where all your RAM is being consumed,
> and how much of that is really the terms index: it ought to be a very
> small part of it.
I made a bunch of heap dumps. I just watched with jconsole and ran jmap -histo
when memory use got high.
I've appended a bit more from the error trace and the top memory users
from one of the heap dumps below.
I tried to send a bunch of heap dumps to the mailing list but the message
got rejected. I'll send them directly to you.
Tom
----
java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:212)
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:230)
    at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:48)
    at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:252)
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:292)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:659)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
---
top memory users from one of the heap dumps:
 num   #instances       #bytes  class name
   1:     1131932   2546933736  [B
   2:      308670    743033280  [I
   3:      696803    203038680  [C
   4:      383039     36771744  org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState
   5:     1089864     26156736  org.apache.lucene.util.AttributeSource$State
   6:      544870     26153760  org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl
   7:      687500     16500000  org.apache.lucene.util.BytesRef
   8:      135820      9779040  org.apache.lucene.util.fst.FST$Arc
   9:      382519      9180456  org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$PendingTerm
  10:      382037      9168888  org.apache.lucene.codecs.TermStats
  11:      544952      8719232  org.apache.lucene.util.BytesRefBuilder
Re: Details on setting block parameters for Lucene41PostingsFormat
Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sat, Jan 10, 2015 at 7:58 PM, Tom Burton-West <tb...@umich.edu> wrote:
> Thanks Mike,
>
> We run our Solr 3.x indexing with 10GB/shard. I've been testing Solr 4
> with 4, 6, and 8GB for heap. As of Friday night when the indexes were about
> half done (about 400GB on disk) only the 4GB had issues. I'll find out on
> Monday if the other runs had issues. If we can go from 10GB in Solr 3.x to
> 6GB with Solr 4.x, that will be a significant change.
OK. It would be good to know where all your RAM is being consumed,
and how much of that is really the terms index: it ought to be a very
small part of it.
> With TermIndexInterval we traded off less memory use for increased chance
> of disk seeks and more data to be read per seek (and if I remember right,
> that more data was scanned sequentially rather than binary searched.)
> What is the trade-off when increasing the block size?
It's exactly the same tradeoff: blocks will be larger, so there will
be fewer blocks that the terms index must reference (making it
smaller), but more scanning to find your exact term within a given
block.
Mike McCandless
http://blog.mikemccandless.com
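A back-of-the-envelope sketch of the trade-off just described, under loud assumptions: it treats the 3 billion terms from the original question as one flat term dictionary, assumes every block holds between min and max entries, and ignores sub-blocks, per-field dictionaries, and undersized blocks. BlockCountEstimate and its methods are hypothetical helpers for illustration, not part of Lucene, and the figures are rough bounds only.

```java
public class BlockCountEstimate {
    /** Rough upper bound on block count: every block holds at least minItemsPerBlock entries. */
    static long upperBoundBlocks(long entries, int minItemsPerBlock) {
        return entries / minItemsPerBlock;
    }

    /** Rough lower bound on block count: every block holds at most maxItemsPerBlock entries. */
    static long lowerBoundBlocks(long entries, int maxItemsPerBlock) {
        return (entries + maxItemsPerBlock - 1) / maxItemsPerBlock; // ceiling division
    }

    public static void main(String[] args) {
        long terms = 3_000_000_000L; // figure from the original question, treated as one dictionary
        int[][] settings = {{25, 48}, {200, 398}}; // defaults, then the suggested values
        for (int[] s : settings) {
            System.out.printf(
                "min=%d max=%d: about %d to %d blocks, up to %d entries to scan per block%n",
                s[0], s[1],
                lowerBoundBlocks(terms, s[1]),
                upperBoundBlocks(terms, s[0]),
                s[1]);
        }
    }
}
```

Under these assumptions the larger setting leaves the terms index with roughly an eighth as many blocks to reference, while the worst-case scan inside a single block grows by about the same factor.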
Re: Details on setting block parameters for Lucene41PostingsFormat
Posted by Tom Burton-West <tb...@umich.edu>.
Thanks Mike,
We run our Solr 3.x indexing with 10GB/shard. I've been testing Solr 4
with 4, 6, and 8GB for heap. As of Friday night when the indexes were about
half done (about 400GB on disk) only the 4GB had issues. I'll find out on
Monday if the other runs had issues. If we can go from 10GB in Solr 3.x to
6GB with Solr 4.x, that will be a significant change.
With TermIndexInterval we traded off less memory use for increased chance
of disk seeks and more data to be read per seek (and if I remember right,
that more data was scanned sequentially rather than binary searched.)
What is the trade-off when increasing the block size?
Tom
Re: Details on setting block parameters for Lucene41PostingsFormat
Posted by Michael McCandless <lu...@mikemccandless.com>.
The first int to Lucene41PostingsFormat is the min block size (default
25) and the second is the max (default 48) for the block tree terms
dict.
The max must be >= 2*(min-1).
Since you were using 8X the default before, maybe try min=200 and
max=398? However, block tree should have been more RAM efficient than
3.x's terms index... if you run CheckIndex with -verbose it will print
additional details about the block structure of your terms indices...
Mike McCandless
http://blog.mikemccandless.com
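The arithmetic behind the suggested values can be checked with a short sketch. It assumes only what is stated above: defaults of 25 and 48, and the rule that max must be >= 2*(min-1). The class and method names are hypothetical, and the extra sanity checks (min at least 2, max not below min) are assumptions added for the sketch. Scaling both defaults by 8 gives 200 and 384, but 2*(200-1) is 398, so the max has to be raised to at least 398.

```java
public class BlockSizeCheck {
    static final int DEFAULT_MIN = 25; // default min block size quoted in the thread
    static final int DEFAULT_MAX = 48; // default max block size quoted in the thread

    /** True if the pair satisfies the rule from the thread: max >= 2*(min-1). */
    static boolean isValid(int min, int max) {
        // The first two conditions are extra sanity checks assumed for this sketch.
        return min >= 2 && max >= min && max >= 2 * (min - 1);
    }

    /** Scales the defaults by a factor, raising max where needed to keep the pair valid. */
    static int[] scaled(int factor) {
        int min = DEFAULT_MIN * factor;
        int max = Math.max(DEFAULT_MAX * factor, 2 * (min - 1));
        return new int[] {min, max};
    }

    public static void main(String[] args) {
        int[] sizes = scaled(8);
        System.out.println("min=" + sizes[0] + " max=" + sizes[1]
            + " valid=" + isValid(sizes[0], sizes[1]));
        System.out.println("naive 8x pair (200, 384) valid=" + isValid(200, 384));
    }
}
```

The resulting pair would then be passed to the two-int Lucene41PostingsFormat constructor mentioned in the original question.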