Posted to java-user@lucene.apache.org by Steven Schlansker <st...@likeness.com> on 2013/10/02 20:11:59 UTC
DocValues formats hold large byte[][]s even when using MMapDirectory
Hi,
I have a search application using Lucene 4.4.0 with various BinaryDocValues and SortedSetDocValues.
We use MMapDirectory to help keep the Java heap small / GC pause times short and instead rely on the OS buffer cache to keep things fast, which I gather is generally considered a "best practice" around here.
As our index grows, we are seeing GC pauses and, later, OOM errors when reloading a new index. The cause is gigabytes of byte[][]s held by Lucene42DocValuesProducer, specifically the PagedBytes.Reader.blocks created from within Lucene42DocValuesProducer.loadBinary.
I would have expected DocValues fields to use mapped bytes instead of copying into the Java heap, much as the "main" index data does. Is this a technical limitation, a "we haven't gotten there yet" feature request, or something different entirely?
Thanks for helping my understanding,
Steven
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
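The distinction the question above draws, between bytes copied into the Java heap and bytes that are memory-mapped, can be illustrated with plain JDK calls. This is only a sketch of the general idea, not Lucene's code: the class name HeapVsMapped and its two helper methods are hypothetical, and the only link to the thread is that MMapDirectory builds on the same FileChannel.map facility.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class; nothing here is part of Lucene.
class HeapVsMapped {

    // Copies the whole file into a byte[] on the Java heap.
    // The array counts against -Xmx and is tracked by the garbage collector.
    static byte[] readToHeap(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    // Maps the file read-only. The bytes are backed by the OS page cache
    // rather than by a heap array.
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("docvalues-sketch", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, new byte[] {10, 20, 30, 40});

        byte[] onHeap = readToHeap(file);
        MappedByteBuffer mapped = mapReadOnly(file);

        System.out.println("heap copy length: " + onHeap.length);
        System.out.println("mapped buffer is direct: " + mapped.isDirect());
        System.out.println("same byte both ways: " + (onHeap[2] == mapped.get(2)));
    }
}
```

Both paths return the same bytes. The difference is where they live: the heap copy grows with the file, while the mapped buffer leaves residency to the operating system, which is the behavior the question expected from DocValues.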
Re: DocValues formats hold large byte[][]s even when using MMapDirectory
Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Oct 2, 2013 at 2:37 PM, Steven Schlansker <st...@likeness.com> wrote:
>
> On Oct 2, 2013, at 11:16 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
>
>> In Lucene 4.5 (coming out any day now) we've switched by default to a
>> "mostly on disk" impl for doc values.
>>
>
> Awesome! Looking forward to that then.
>
>> Before that, you can use DiskDocValuesFormat instead.
>>
>> But you'll need to re-index (or create a new index and use
>> IW.addIndexes) to cut over your current index to the DiskDVFormat.
>>
>
> I see a few references scattered on the internet but it's not in my Lucene jars.
It should be in the codecs module/JAR.
> The one reference I saw to it indicated that every patch release of Lucene will require a full reindex when using this, which is a serious bummer.
Right, only the default codec offers back compatibility; but
now that Disk has become the default, it will have back compatibility
going forward.
> So I think I'll hold out for 4.5 and hope that that solves my problem.
Sounds good!
> Thanks for the help!
You're welcome!
Mike McCandless
http://blog.mikemccandless.com
Re: DocValues formats hold large byte[][]s even when using MMapDirectory
Posted by Steven Schlansker <st...@likeness.com>.
On Oct 2, 2013, at 11:16 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
> In Lucene 4.5 (coming out any day now) we've switched by default to a
> "mostly on disk" impl for doc values.
>
Awesome! Looking forward to that then.
> Before that, you can use DiskDocValuesFormat instead.
>
> But you'll need to re-index (or create a new index and use
> IW.addIndexes) to cut over your current index to the DiskDVFormat.
>
I see a few references scattered on the internet but it's not in my Lucene jars. The one reference I saw to it indicated that every patch release of Lucene will require a full reindex when using this, which is a serious bummer.
So I think I'll hold out for 4.5 and hope that that solves my problem.
Thanks for the help!
Re: DocValues formats hold large byte[][]s even when using MMapDirectory
Posted by Michael McCandless <lu...@mikemccandless.com>.
In Lucene 4.5 (coming out any day now) we've switched by default to a
"mostly on disk" impl for doc values.
Before that, you can use DiskDocValuesFormat instead.
But you'll need to re-index (or create a new index and use
IW.addIndexes) to cut over your current index to the DiskDVFormat.
Mike McCandless
http://blog.mikemccandless.com