Posted to general@lucene.apache.org by siddharth teotia <si...@gmail.com> on 2019/11/07 04:27:37 UTC

Memory usage

Hi All

I have some questions about memory usage. I would really appreciate it if
someone could help answer them.

I understand from the docs that during reading/querying, Lucene uses
MMapDirectory (assuming it is supported on the platform). So the Java heap
overhead in this case will come purely from the objects that are
allocated on the query path to process the query and build results.
But the index as a whole will not be loaded into memory, because the files
are memory-mapped. Is my understanding correct? If so, we are better off
not increasing the Java heap, and instead keeping as much memory as possible
available to the file system cache so that mmap can do its job efficiently.

However, are there any index structures that are loaded completely into
memory regardless of whether MMapDirectory is used? If so, are they loaded
onto the Java heap, or is off-heap memory (direct buffers) used in such
cases?

Secondly, on the write path, I think that even though the writer opens an
MMapDirectory, the writes are buffered in memory up to a flush threshold
controlled by IndexWriterConfig. Is this buffering done on the Java heap or
in direct memory?

Thanks a lot for your help
Siddharth

Re: Memory usage

Posted by siddharth teotia <si...@gmail.com>.

Thanks, Stephen. I have asked my questions at solr-user@lucene.apache.org

On Mon, Nov 11, 2019 at 11:27 AM Stephen Bianamara <sb...@panopto.com>
wrote:

> Siddharth -- Part of the confusion here is that this is not the right email
> list to ask. General is about releases, publicity, and things of that
> nature. Technical threads like this are more suited for
> solr-user@lucene.apache.org. Please subscribe there and redirect your
> question there instead.
>
> Best,
> Stephen


-- 
*Best Regards,*
*SIDDHARTH TEOTIA*
*2008C6PS540G*
*BITS PILANI- GOA CAMPUS*

*+91 87911 75932*

Re: Memory usage

Posted by Stephen Bianamara <sb...@panopto.com>.

Siddharth -- Part of the confusion here is that this is not the right email
list to ask. General is about releases, publicity, and things of that
nature. Technical threads like this are more suited for
solr-user@lucene.apache.org. Please subscribe there and redirect your
question there instead.

Best,
Stephen

On Mon, Nov 11, 2019 at 11:18 AM siddharth teotia <si...@gmail.com>
wrote:

> Hi Michael
>
> Can you or someone from the community please help answer my questions?
>
> Thanks
> Siddharth


-- 
Thanks!

Stephen Bianamara
Search Technology - Technical Lead

Re: Memory usage

Posted by siddharth teotia <si...@gmail.com>.

Hi Michael

Can you or someone from the community please help answer my questions?

Thanks
Siddharth

On Thu, Nov 7, 2019 at 7:50 AM siddharth teotia <si...@gmail.com>
wrote:

> Hi Michael
>
> Thanks a lot for your response. A couple more questions:
>
> (1) During indexing, is there any knob to tell the writer to use off-heap
> memory for buffering? I didn't find anything in the docs, so the answer is
> probably no. Just confirming.
>
> (2) In my experiments, I have ingested up to 5 million documents into
> the Lucene index, and the number of segments created was 1. The writer was
> committed and closed after ingesting all the documents, and after that there
> is no need for us to index more, so it is essentially an immutable index.
> I wanted to find the threshold for creating a new segment. Is
> that pretty high? Or, if the writer is reopened, will the next set of
> documents go into the next segment, and so on? The reason for asking
> is to find the total number of files (per index) that will be opened
> during querying. So far, since there was a single segment, only that
> segment's .cfs file was opened.
>
> Thanks
> Siddharth


Re: Memory usage

Posted by siddharth teotia <si...@gmail.com>.

Hi Michael

Thanks a lot for your response. A couple more questions:

(1) During indexing, is there any knob to tell the writer to use off-heap
memory for buffering? I didn't find anything in the docs, so the answer is
probably no. Just confirming.

(2) In my experiments, I have ingested up to 5 million documents into
the Lucene index, and the number of segments created was 1. The writer was
committed and closed after ingesting all the documents, and after that there
is no need for us to index more, so it is essentially an immutable index.
I wanted to find the threshold for creating a new segment. Is
that pretty high? Or, if the writer is reopened, will the next set of
documents go into the next segment, and so on? The reason for asking
is to find the total number of files (per index) that will be opened
during querying. So far, since there was a single segment, only that
segment's .cfs file was opened.
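
One way to check how many files each segment contributes is to list the
index directory and group the file names by segment. The sketch below uses
only the JDK and assumes the usual Lucene naming, where per-segment files
start with an underscore followed by the segment name (for example
`_0.cfs`). `SegmentFileCount` and its methods are hypothetical names made up
for this example, not part of Lucene.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class SegmentFileCount {
    /** Returns the segment prefix of a name such as "_0.cfs" or "_0_1.liv", or null. */
    static String segmentName(String fileName) {
        if (!fileName.startsWith("_")) {
            return null; // e.g. segments_N or write.lock: not a per-segment file
        }
        int end = 1;
        // The segment name ends at the next '.' or '_' after the leading underscore.
        while (end < fileName.length()
                && fileName.charAt(end) != '.'
                && fileName.charAt(end) != '_') {
            end++;
        }
        return fileName.substring(0, end);
    }

    /** Counts the per-segment files found directly inside the index directory. */
    static Map<String, Integer> countFilesPerSegment(Path indexDir) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        try (Stream<Path> files = Files.list(indexDir)) {
            files.map(p -> p.getFileName().toString())
                 .map(SegmentFileCount::segmentName)
                 .filter(name -> name != null)
                 .forEach(name -> counts.merge(name, 1, Integer::sum));
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the index directory is passed as the first argument.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(countFilesPerSegment(indexDir));
    }
}
```

Running it against an index directory prints a map from segment name to
file count, which gives a rough upper bound on the per-segment files a
reader may open.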

Thanks
Siddharth

On Thu, Nov 7, 2019, 6:39 AM Michael McCandless <lu...@mikemccandless.com>
wrote:

> Hi Siddharth,
>
> Your understanding of MMapDirectory is correct -- only give your JVM
> enough heap to not spend too much CPU on GC, and then let the OS use all
> available remaining RAM to cache hot pages from your index.
>
> There are some structures Lucene loads into the JVM heap, but even those
> have recently been moving off-heap (accessed via Directory), such as the
> FSTs used for the terms index and the BKD index (for dimensional points).
> I'm not sure exactly which structures are still on the heap ... maybe the
> live documents bitset?
>
> During indexing, the recently indexed documents are buffered in the JVM
> heap, up to the limit set by IndexWriterConfig.setRAMBufferSizeMB, and then
> they are written to the Directory as new segments.
>
> Mike McCandless
>
> http://blog.mikemccandless.com

Re: Memory usage

Posted by Michael McCandless <lu...@mikemccandless.com>.

Hi Siddharth,

Your understanding of MMapDirectory is correct -- only give your JVM enough
heap to not spend too much CPU on GC, and then let the OS use all available
remaining RAM to cache hot pages from your index.
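
As a rough illustration of why a memory-mapped index adds little to the
Java heap, here is a minimal JDK-only sketch. It is not Lucene code, and the
names `MmapSketch` and `mapReadOnly` are made up for this example. It maps a
temporary file with `FileChannel.map` and shows that the resulting buffer is
a direct buffer, so the file's bytes are read through the mapping rather
than copied into a heap array.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    /** Maps the whole file read-only; the mapping stays valid after the channel is closed. */
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for an index file.
        Path file = Files.createTempFile("mmap-sketch", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, "hello index".getBytes(StandardCharsets.UTF_8));

        MappedByteBuffer buf = mapReadOnly(file);
        // A mapped buffer is a direct buffer: its content lives outside the Java heap.
        System.out.println("direct (off-heap): " + buf.isDirect());
        // Reading a byte touches the mapped page, which the OS loads on demand.
        System.out.println("first byte: " + (char) buf.get(0));
    }
}
```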

There are some structures Lucene loads into the JVM heap, but even those
have recently been moving off-heap (accessed via Directory), such as the
FSTs used for the terms index and the BKD index (for dimensional points).
I'm not sure exactly which structures are still on the heap ... maybe the
live documents bitset?

During indexing, the recently indexed documents are buffered in the JVM
heap, up to the limit set by IndexWriterConfig.setRAMBufferSizeMB, and then
they are written to the Directory as new segments.
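
To make the buffer-then-flush behaviour concrete, here is a toy model that
assumes a simple byte count per document. It is not how IndexWriter accounts
for memory internally, and `FlushThresholdSketch` and its methods are
hypothetical names. In real code the knob is
IndexWriterConfig.setRAMBufferSizeMB, as described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlushThresholdSketch {
    private final long ramBufferBytes;
    private final List<String> buffered = new ArrayList<>();       // documents held on the heap
    private final List<Integer> segmentSizes = new ArrayList<>();  // doc count of each flushed segment
    private long bufferedBytes = 0;

    FlushThresholdSketch(double ramBufferSizeMB) {
        this.ramBufferBytes = (long) (ramBufferSizeMB * 1024 * 1024);
    }

    /** Buffers one document and flushes once the byte threshold is reached. */
    void addDocument(String doc) {
        buffered.add(doc);
        bufferedBytes += doc.getBytes(StandardCharsets.UTF_8).length;
        if (bufferedBytes >= ramBufferBytes) {
            flush();
        }
    }

    /** Writes the buffered documents out as one new "segment" and clears the buffer. */
    void flush() {
        if (buffered.isEmpty()) {
            return;
        }
        segmentSizes.add(buffered.size());
        buffered.clear();
        bufferedBytes = 0;
    }

    List<Integer> segmentSizes() {
        return segmentSizes;
    }

    public static void main(String[] args) {
        FlushThresholdSketch writer = new FlushThresholdSketch(1.0); // 1 MB buffer
        char[] chars = new char[1024];
        Arrays.fill(chars, 'x');
        String doc = new String(chars); // 1 KB per document
        for (int i = 0; i < 2500; i++) {
            writer.addDocument(doc);
        }
        writer.flush(); // final flush, as a commit or close would trigger
        System.out.println(writer.segmentSizes()); // prints [1024, 1024, 452]
    }
}
```

Note that in Lucene itself, background merges may later combine flushed
segments, so the number of segments on disk can be smaller than the number
of flushes.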

Mike McCandless

http://blog.mikemccandless.com

On Wed, Nov 6, 2019 at 11:27 PM siddharth teotia <si...@gmail.com>
wrote:

> Hi All
>
> I have some questions about memory usage. I would really appreciate it if
> someone could help answer them.
>
> I understand from the docs that during reading/querying, Lucene uses
> MMapDirectory (assuming it is supported on the platform). So the Java heap
> overhead in this case will come purely from the objects that are
> allocated on the query path to process the query and build results.
> But the index as a whole will not be loaded into memory, because the files
> are memory-mapped. Is my understanding correct? If so, we are better off
> not increasing the Java heap, and instead keeping as much memory as possible
> available to the file system cache so that mmap can do its job efficiently.
>
> However, are there any index structures that are loaded completely into
> memory regardless of whether MMapDirectory is used? If so, are they loaded
> onto the Java heap, or is off-heap memory (direct buffers) used in such
> cases?
>
> Secondly, on the write path, I think that even though the writer opens an
> MMapDirectory, the writes are buffered in memory up to a flush threshold
> controlled by IndexWriterConfig. Is this buffering done on the Java heap or
> in direct memory?
>
> Thanks a lot for your help
> Siddharth
>