Posted to java-user@lucene.apache.org by Steven Schlansker <st...@likeness.com> on 2013/10/02 20:11:59 UTC

DocValues formats hold large byte[][]s even when using MMapDirectory

Hi,

I have a search application using Lucene 4.4.0 with various BinaryDocValues and SortedSetDocValues.
We use MMapDirectory to help keep the Java heap small / GC pause times short and instead rely on the OS buffer cache to keep things fast, which I gather is generally considered a "best practice" around here.
As our index grows, I've noticed GC pauses and, later, OOM errors when reloading a new index. The cause is gigabytes of byte[][]s held by Lucene42DocValuesProducer, specifically PagedBytes.Reader.blocks allocated from within Lucene42DocValuesProducer.loadBinary.

I would have expected DocValues fields to use mapped bytes instead of copying into the Java heap much as the "main" index data is.  Is this a technical limitation, a "we haven't gotten there yet" feature request, or something different entirely?

Thanks for helping my understanding,
Steven
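
To make the distinction in the question concrete, here is a minimal sketch in plain Java of the two loading strategies: copying file contents into a heap byte[] versus memory-mapping the file so the pages stay in the OS buffer cache. This is not Lucene code and does not reproduce what Lucene42DocValuesProducer or MMapDirectory actually do; the class name HeapVsMapped and its methods are hypothetical names chosen only for illustration.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; not part of Lucene.
class HeapVsMapped {

    // Copies the entire file into a byte[] on the Java heap.
    // The array counts against -Xmx and must be traced by the GC.
    static byte[] loadOnHeap(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    // Maps the file read-only. The data is paged in by the OS on demand
    // and lives outside the Java heap; only a small buffer object is on-heap.
    static MappedByteBuffer mapOffHeap(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("docvalues-demo", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, new byte[] {10, 20, 30, 40});

        byte[] heap = loadOnHeap(file);
        MappedByteBuffer mapped = mapOffHeap(file);

        System.out.println("heap copy length: " + heap.length);
        System.out.println("mapped capacity:  " + mapped.capacity());
        System.out.println("mapped is direct: " + mapped.isDirect());
    }
}
```

With the first strategy, a multi-gigabyte file becomes a multi-gigabyte heap allocation, which is the pattern that shows up as GC pressure. With the second, heap usage stays roughly constant regardless of file size, at the cost of depending on the OS buffer cache for read performance.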


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: DocValues formats hold large byte[][]s even when using MMapDirectory

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Oct 2, 2013 at 2:37 PM, Steven Schlansker <st...@likeness.com> wrote:
>
> On Oct 2, 2013, at 11:16 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
>
>> In Lucene 4.5 (coming out any day now) we've switched by default to a
>> "mostly on disk" impl for doc values.
>>
>
> Awesome!  Looking forward to that then.
>
>> Before that, you can use DiskDocValuesFormat instead.
>>
>> But you'll need to re-index (or create a new index and use
>> IW.addIndexes) to cutover your current index to the DiskDVFormat.
>>
>
> I see a few references scattered on the internet but it's not in my Lucene jars.

It should be in the codecs module/JAR.

> The one reference I saw to it indicated that every patch release of Lucene will require a full reindex when using this, which is a serious bummer.

Right, only the default codec offers back compatibility; but now
that Disk has become the default, it will have back compatibility
going forward.

> So I think I'll hold out for 4.5 and hope that that solves my problem.

Sounds good!

> Thanks for the help!

You're welcome!

Mike McCandless

http://blog.mikemccandless.com



Re: DocValues formats hold large byte[][]s even when using MMapDirectory

Posted by Steven Schlansker <st...@likeness.com>.
On Oct 2, 2013, at 11:16 AM, Michael McCandless <lu...@mikemccandless.com> wrote:

> In Lucene 4.5 (coming out any day now) we've switched by default to a
> "mostly on disk" impl for doc values.
> 

Awesome!  Looking forward to that then.

> Before that, you can use DiskDocValuesFormat instead.
> 
> But you'll need to re-index (or create a new index and use
> IW.addIndexes) to cutover your current index to the DiskDVFormat.
> 

I see a few references scattered on the internet but it's not in my Lucene jars.  The one reference I saw to it indicated that every patch release of Lucene will require a full reindex when using this, which is a serious bummer.

So I think I'll hold out for 4.5 and hope that that solves my problem.
Thanks for the help!


> 
> On Wed, Oct 2, 2013 at 2:11 PM, Steven Schlansker <st...@likeness.com> wrote:
>> Hi,
>> 
>> I have a search application using Lucene 4.4.0 with various BinaryDocValues and SortedSetDocValues.
>> We use MMapDirectory to help keep the Java heap small / GC pause times short and instead rely on the OS buffer cache to keep things fast, which I gather is generally considered a "best practice" around here.
>> As our index grows, I've noticed that we are getting GC pauses and later OOM errors when reloading a new index due to gigabytes of byte[][]s held by Lucene42DocValuesProducer, specifically the PagedBytes.Reader.blocks from within Lucene42DocValuesProducer.loadBinary
>> 
>> I would have expected DocValues fields to use mapped bytes instead of copying into the Java heap much as the "main" index data is.  Is this a technical limitation, a "we haven't gotten there yet" feature request, or something different entirely?
>> 
>> Thanks for helping my understanding,
>> Steven
>> 
>> 
> 




Re: DocValues formats hold large byte[][]s even when using MMapDirectory

Posted by Michael McCandless <lu...@mikemccandless.com>.
In Lucene 4.5 (coming out any day now) we've switched by default to a
"mostly on disk" impl for doc values.

Before that, you can use DiskDocValuesFormat instead.

But you'll need to re-index (or create a new index and use
IW.addIndexes) to cut your current index over to the DiskDVFormat.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Oct 2, 2013 at 2:11 PM, Steven Schlansker <st...@likeness.com> wrote:
> Hi,
>
> I have a search application using Lucene 4.4.0 with various BinaryDocValues and SortedSetDocValues.
> We use MMapDirectory to help keep the Java heap small / GC pause times short and instead rely on the OS buffer cache to keep things fast, which I gather is generally considered a "best practice" around here.
> As our index grows, I've noticed that we are getting GC pauses and later OOM errors when reloading a new index due to gigabytes of byte[][]s held by Lucene42DocValuesProducer, specifically the PagedBytes.Reader.blocks from within Lucene42DocValuesProducer.loadBinary
>
> I would have expected DocValues fields to use mapped bytes instead of copying into the Java heap much as the "main" index data is.  Is this a technical limitation, a "we haven't gotten there yet" feature request, or something different entirely?
>
> Thanks for helping my understanding,
> Steven
>
>
>
