Posted to java-user@lucene.apache.org by Gili Nachum <GI...@il.ibm.com> on 2013/01/23 13:59:04 UTC

MMapDirectory performance - Are searchable field values contiguously stored in FS block?

Hi,

I have a search workload that focuses on two fields in my 1GB index. I get
very good performance when loading the index via MMapDirectory. I attribute
this performance to the operating system's file system (OS FS) cache, which
keeps the most recently used FS blocks resident in RAM.
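
For readers unfamiliar with the mechanism: MMapDirectory reads index files
through memory-mapped buffers, so reads are served from the OS page cache
whenever the pages are already resident. The following is a minimal sketch of
that mechanism using plain java.nio, not Lucene itself; the class and method
names (MmapReadSketch, writeTempFile, readByteViaMmap) are made up for
illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapReadSketch {

    /** Writes data to a temporary file (hypothetical helper for this demo). */
    static Path writeTempFile(byte[] data) {
        try {
            Path tmp = Files.createTempFile("mmap-sketch", ".bin");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, data);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Maps the whole file read-only and returns the byte at the given position. */
    static byte readByteViaMmap(Path file, int position) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping is backed by the OS page cache: pages that are
            // already resident are read without any disk I/O.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return buf.get(position);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeTempFile(new byte[] {10, 20, 30, 40});
        System.out.println(readByteViaMmap(file, 2));
    }
}
```

Which pages stay resident is decided by the OS, not by the application, which
is why the on-disk layout of the hot data matters for the question below.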

I would like to add 50 more fields to the index, increasing its size to
~50GB. A key factor is that these additional fields will be queried very
rarely.
Given this increase in index size, should I expect a lower queries/sec rate
for the original search workload (which doesn't use the new fields)?

I would assume that if the values of each searchable field are stored in a
different set of FS blocks, then the 50 additional fields would make no
difference to the OS FS cache, as it would continue to behave as before,
keeping the most-used FS blocks in RAM.
On the other hand, if values from different fields share the same FS
blocks, then the values of the 2 hot fields will be too scattered across the
FS, rendering the OS cache useless and degrading performance back to being
I/O bound.

Which is the case with Lucene 3.6?

Thanks.
Gili Nachum.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: MMapDirectory performance - Are searchable field values contiguously stored in FS block?

Posted by Michael McCandless <lu...@mikemccandless.com>.

Are the 50 additional, rarely used fields used for searching?  Or for
looking up stored fields?

If they're for searching, then you should see good locality (efficient use
of the OS's IO cache) from the posting lists: each field's postings
are stored in a single chunk of the files, then the next field's
postings, etc.  I.e., the storage is "column stride" (if columns are
fields and rows are documents).

But for stored fields, or term vectors, which are "row stride", you
won't see efficient use of the OS's IO cache.
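
To make the difference concrete, here is a toy model (not Lucene's actual
file format) that counts how many distinct FS blocks are touched when reading
2 hot fields out of 52 across every document. The fixed 64-byte value size,
4096-byte block size, and document count are all assumptions chosen for
illustration; real postings are variable-length, compressed, and split across
segments.

```java
import java.util.BitSet;

public class FieldLayoutSketch {

    static final int BLOCK_SIZE = 4096; // assumed FS block size in bytes

    /** Column stride: all values of field 0, then all values of field 1, etc. */
    static long columnOffset(int doc, int field, int numDocs, int valueSize) {
        return ((long) field * numDocs + doc) * valueSize;
    }

    /** Row stride: all fields of doc 0, then all fields of doc 1, etc. */
    static long rowOffset(int doc, int field, int numFields, int valueSize) {
        return ((long) doc * numFields + field) * valueSize;
    }

    /**
     * Counts the distinct blocks touched when reading the first hotFields
     * fields of every document. Assumes valueSize divides BLOCK_SIZE, so a
     * value never straddles two blocks.
     */
    static int blocksTouched(boolean columnStride, int numDocs, int numFields,
                             int hotFields, int valueSize) {
        BitSet blocks = new BitSet();
        for (int doc = 0; doc < numDocs; doc++) {
            for (int field = 0; field < hotFields; field++) {
                long offset = columnStride
                        ? columnOffset(doc, field, numDocs, valueSize)
                        : rowOffset(doc, field, numFields, valueSize);
                blocks.set((int) (offset / BLOCK_SIZE));
            }
        }
        return blocks.cardinality();
    }

    public static void main(String[] args) {
        int numDocs = 10_000, numFields = 52, hotFields = 2, valueSize = 64;
        System.out.println("column stride blocks touched: "
                + blocksTouched(true, numDocs, numFields, hotFields, valueSize));
        System.out.println("row stride blocks touched:    "
                + blocksTouched(false, numDocs, numFields, hotFields, valueSize));
    }
}
```

In this model the column-stride layout keeps the hot working set to roughly
2/52 of the file, while the row-stride layout touches every block, so the hot
data competes with the cold fields for cache space.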

Mike McCandless

http://blog.mikemccandless.com
