Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2009/12/18 03:38:46 UTC

Discussions elsewhere about fsync, SortWriter memory costs

Greets,

In the JIRA issue LUCENE-2026, Mike McCandless and I have been talking about
how and when to protect Lucy against power-failure-induced index corruption.

  https://issues.apache.org/jira/browse/LUCENE-2026

Over on the Lucene java-user list, we're discussing how to rein in
SortWriter's memory costs:

  http://markmail.org/message/zcgb6ak24di42rev

Marvin Humphrey


Re: SortWriter memory costs

Posted by Nathan Kurz <na...@verse.com>.
On Sun, Dec 27, 2009 at 4:45 PM, Nathan Kurz <na...@verse.com> wrote:
> This would be hard, though, so I'd suggest just sticking with your
> current approach and making your hash slightly more efficient.  Then
> think about how you can handle non-text fields directly --- instead of
> ords and offsets, handle dates, times, ints, floats, ratings etc.
> directly.  You can keep using an ord file for text, but treat this as
> an optimization rather than a requirement.    And think about how to
> make the intersegment merges efficient.

Simplifying upon rereading my poorly organized response:

I think numeric types are likely the main use for sorting.  While
converting strings to ords does allow them to be compared as
ints, this seems like it should be the special case rather than the
one determining the architecture.  I'd love to see a model of how
efficient date sorting based on UNIX timestamps would work first, and
then figure out how to add in free text afterward.  It might turn out
that ords are great, but I think that a direct sorting model might be
simpler and more general.

Nathan Kurz
nate@verse.com

Re: SortWriter memory costs

Posted by Nathan Kurz <na...@verse.com>.
On Sun, Dec 20, 2009 at 7:08 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Sat, Dec 19, 2009 at 12:36:37PM -0800, Nathan Kurz wrote:
>> > Over on the Lucene java-user list, we're discussing how to rein in
>> > SortWriter's memory costs:
>> >
>> >  http://markmail.org/message/zcgb6ak24di42rev
>>
>> I'm not sure that I understand your approach here.  Could you offer
>> some context?
>
> Sure, the easiest way to do that is to explain how SortWriter works now and
> why it consumes too much memory.  I'll use JavaScript syntax for the
> pseudocode.[1]

Thanks for the detailed description.  I'm more confident in my
understanding now, although I'm sure there are still gaps.  The
Javascript is a fine approach, although I think pseudo-C would be fine
too.

> Our problem is the unique_values hash.  For fields with low cardinality -- for
> instance if our "category" field has only a few possible values -- it never
> gets very big.  But for fields with high cardinality -- such as a "product_id"
> field where every value is unique -- it gets huge and consumes a ton of RAM.
>
> That's what we need to solve.

Patient -- Doctor, every time I poke myself in the eye with this stick
I get a sharp pain.
Doctor -- Well then stop poking yourself in the eye with a stick!

Patient -- Every time I try to sort by a text field on a gigantic
index for what should be a numeric field, my memory usage blows up.
Doctor -- Well, I suppose we could wrap the end of the stick with this
dirty rag to see if it blunts the pain...

My dominant impression is that converting text fields to ordinals to
allow easier sorting is an ugly hack that should not be encouraged.
Yes, it works if you're willing to tolerate falling back on alternate
solutions for sharding and segments, able to live with inefficient
joins with text-matching, and are OK with making your indexes
essentially unmodifiable.  But isn't there a better way?

But let's come back to that later...

> When we finish off the segment, the present SortWriter writes three files for
> each sortable text field: ords, offsets, and UTF-8 character data:
>
>   seg_1/sort-2.ord  // <--- The "2" is the field number.
>   seg_1/sort-2.ix
>   seg_1/sort-2.dat

To check my understanding:

'ord' is for ordinal, and conceptually is the same as the term-id in a
sorted lexicon.  The '.ord' file is an array of ordinals indexed by
doc-id (although bit-compressed). Conceptually, looking up ord[doc-id]
would give you the ordinal of the value of 'category' for doc-id.

'ix' is for index, and 'dat' is for data.  The '.ix' file is an array
of offsets into the '.dat' file, so that to look up the text
representation of ordinal 'o', one would start reading at dat[o].
Wrapping one lookup inside the other, one finds the text representation
of 'category' for doc-id using dat[ord[doc-id]].
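
Or, with the offsets spelled out (the reader-side helper read_string_at is
an invented name, just to illustrate the file roles):

  function category_for(doc_id) {
    var o     = ord[doc_id];           // sort-2.ord: one ordinal per doc, bit-compressed
    var start = ix[o];                 // sort-2.ix:  byte offset of ordinal o's text
    return dat.read_string_at(start);  // sort-2.dat: UTF-8 character data
  }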

> First, SortWriter creates an array of unique sorted values by sorting the keys
> in the unique_values hash.

It seems like the main problem would be the overhead of the hash.
Are you trying to use host language hashes, or are you doing something
custom?  My first thought would be to write a simple custom hash that
stores offsets into '.dat' rather than values.   If a string isn't in
the hash, append it to the .dat file (lexicon) and save the offset.
Common strings will thus tend to appear at the start of the file, rare
ones at the end.   Given the fixed width of the offsets, you should be
able to get the overhead of the hash down to something very small.
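
A rough sketch of what that might look like, in the same pseudocode as
Marvin's examples (open addressing with linear probing; resizing is skipped,
and value.hash(), read_string_at(), and the stream handles are invented
names):

  // Slots hold byte offsets into '.dat'; -1 means empty.
  var capacity = 1024;
  var table    = new Array(capacity);
  for (var i = 0; i < capacity; i++) { table[i] = -1; }

  function intern(value) {
    var slot = value.hash() % capacity;
    while (table[slot] != -1) {
      if (read_string_at(dat_file, table[slot]).equals(value)) {
        return table[slot];                 // already in the lexicon
      }
      slot = (slot + 1) % capacity;         // probe the next slot
    }
    var offset = dat_out.tell();
    dat_out.write_bytes(value.get_ptr8(), value.get_size());  // append to '.dat'
    table[slot] = offset;
    return offset;
  }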

> Right now, SortWriter is the only DataWriter component that has a problem with
> RAM footprint -- all the other components (LexWriter, DocWriter,
> PostingsWriter, etc) do a good job of keeping RAM usage under control and
> scale up beyond the practical size for an index on one machine.

This is all per-segment, right?  How many docs are you planning to
target per segment?
How many do you have to get to before the RAM usage is out of control?

100 million seems like a lot, to me.  100 million docs with a plain
text Unique_ID field and each entry in the hash taking up 100 bytes
(the current worst case?) would be 10GB, right at the boundary of
being in control.  If you can get that overhead down to 10 bytes (1 GB), it
seems like you'd be firmly in control again, even if you have multiple
such fields.  And if not, you could always do 1 million docs per
segment and merge them later.

> Yes.  Instead of using a hash to perform the uniquing, we'd use a file.
>
> Think of it this way: we're dispensing with the unique_values hash and using
> only the doc_values array.  At segment finish time, we use that doc_values
> array to create a sorted list of doc_ids.

OK, so you don't worry about duplicating strings in the data file ---
you just write out everything and save the offset.  Then when you are
done, you sort the index file by the values of the stored strings.

This seems like a small win in the case of 'Product ID', and a major
loss for a field like 'Gender' that has only a limited number of
values.  Do you really want to indirectly sort 100,000,000 strings
that have only two (OK, 7 for the Bay Area) distinct values?

The windowing function Toke alludes to would solve most of this, but
once you're doing this it doesn't seem that much harder to go all the
way and do a low-overhead hash.

My overall impression is that this approach would work, but that there
are some tradeoffs and it doesn't really buy you that much overall
efficiency.  Either you fit in RAM or you don't --- once you don't,
even an SSD isn't going to keep performance from diving off the cliff.

>> I'd wonder how you'd merge results between shards or sort the results
>> of a text query.
>
> At search time, we use the mmap'd ords array to perform all sort comparisons
> within a segment -- which is lightning fast.  :)  For comparison across
> segments, we use actual values, which is slower but not that expensive in the
> grand scheme of things, since it scales with the number of segments rather
> than the number of documents.

I'm still not quite understanding your approach.  I just found this
email with GMail by searching for 'marvin', so let's use that as a
base case for ordering.  Making up syntax, I enter 'marvin' in the box
and Google translates it to something like this: "MATCH marvin IN
all-fields ORDER BY date LIMIT 10".

There are several ways to handle this query.  Which one are you aiming for?

1) Incrementally find hits for 'marvin', lookup ord[doc-id] at the
time of each hit, score as 1/ord, and keep the top 10.

2) Find all hits for marvin, sort them all using the mmapped ord
array, return the top 10.

3) Find all hits for marvin, then use this to filter an ord sorted
list of doc-ids, stopping once one has found 10 matches.

It sounds like you are aiming for 2), but this seems like it would be
unwieldy when searching for a common term:  it seems expensive to sort
a list that big.  I'm also not sure how to break ties by using the
score for the text term.   I'm most comfortable with 3) because of
its flexibility, but can see the advantage of 1) for efficiency on
rare terms.

If you were to do something closer to 2), I'd be tempted to just look
up offset[doc-id] and use the string directly to do the sorting.
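
Roughly (collect_all_hits, offset[], and read_string_at are invented names):

  // Sort every hit for 'marvin' by its stored 'date' string, then keep 10.
  var hits = collect_all_hits(matcher);
  hits.sort(function(doc_a, doc_b) {
    var val_a = dat.read_string_at(offset[doc_a]);
    var val_b = dat.read_string_at(offset[doc_b]);
    return val_a.compare_to(val_b);
  });
  var top_ten = hits.slice(0, 10);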

>> What's the main use case you are targeting?
>
> All queries with a SortSpec that sorts on a field and all RangeQueries use our
> mmap'd sort caches.

The explanation up to here is wonderfully clear, but this seems a little
circular as a use case.   Is the Gmail example a reasonable one?   Is
there a better one?

For me, I think a useful case to think about would involve both full
text search and a range query, with results ordered by the value used
in the range query and ties broken by strength of text match.
Something like "MATCH word IN title WHERE rating > 3 ORDER BY rating
DESCENDING".

In almost all the non-contrived cases I can come up with, the ranges
are dates or numerics, not text fields.  I can come up with some cases
where I'd want to order alphabetically, but not that many.

Which brings me back to my opening rant...

Ordinals serve as a minimal monotonic perfect hash for the input
strings.    The actual value of the ordinal is not that useful ---
more commonly, you just want a fast way of comparing two strings, and
comparing two ordinals turns out to be faster.   If the lexicon itself is
ordered, the offsets into the lexicon (data file), while not minimal,
will compare the same.
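
A tiny made-up example:

  // Lexicon written in sorted order: "apple" (ord 0, offset 0),
  // "banana" (ord 1, offset 5), "cherry" (ord 2, offset 11).
  // Comparing offsets orders docs exactly as comparing ords would,
  // just with gaps in the values.
  function compare_docs(doc_a, doc_b) {
    return offset[doc_a] - offset[doc_b];
  }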

Since the cases that interest me most are those where the index can be
modified without being rebuilt, I'd rather not require that the
lexicon itself be ordered, since it's awfully convenient just to add a
new string by appending.   I'm not even sure you need to have the
ordinals precalculated at all --- an ordered list of (doc-id, offset)
pairs seems like it would be sufficient.

To be editable, you'd probably need to keep this in a BTree variant,
or some other page oriented ordered data structure.  The nice part of
this is that it would be entirely pay-as-you-go:  if you add to this
file directly, you'd have minimal memory usage at all steps of the
process, and once you've added the last document, there's no giant
sorting step that needs to happen.

This would be hard, though, so I'd suggest just sticking with your
current approach and making your hash slightly more efficient.  Then
think about how you can handle non-text fields directly --- instead of
ords and offsets, handle dates, times, ints, floats, ratings etc.
directly.  You can keep using an ord file for text, but treat this as
an optimization rather than a requirement.    And think about how to
make the intersegment merges efficient.

One thing I'm becoming more confident of is that handling of segments
and shards should be identical.  Multi-segment merges should not
require direct access to in-memory data structures, rather all data
necessary for aggregation should be presented along with the results.
 The results returned from a single segment, from a shard, or from the
whole shebang should be interchangeable.  If you can do this
efficiently, I think you are pretty much guaranteed to scale to
many-cores as well as many-machines.

Sorry if the ranting was too incoherent,

Nathan Kurz
nate@verse.com

SortWriter memory costs

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sat, Dec 19, 2009 at 12:36:37PM -0800, Nathan Kurz wrote:
> > Over on the Lucene java-user list, we're discussing how to rein in
> > SortWriter's memory costs:
> >
> >  http://markmail.org/message/zcgb6ak24di42rev
> 
> I'm not sure that I understand your approach here.  Could you offer
> some context?  

Sure, the easiest way to do that is to explain how SortWriter works now and
why it consumes too much memory.  I'll use JavaScript syntax for the
pseudocode.[1]

Assume each document in your index has a field, "category", which has been
marked as "sortable".  Assume that "category" has been assigned the field
number 2 for this segment.

As docs get added, SortWriter keeps track of each doc's value for the field
"category" in both a hash and an array.  

The hash stores unique values:

  var category = doc.fetch('category');
  sort_writer.unique_values['category'][category] = -1;

The array records the value for each doc_id:

  var unique_val = sort_writer.unique_values['category'].find_key(category);
  sort_writer.doc_values['category'][doc_id] = unique_val;

Since the "doc_values" array only stores pointers to keys in the
"unique_values" hash, it doesn't grow very fast -- just one pointer per doc.

Our problem is the unique_values hash.  For fields with low cardinality -- for
instance if our "category" field has only a few possible values -- it never
gets very big.  But for fields with high cardinality -- such as a "product_id"
field where every value is unique -- it gets huge and consumes a ton of RAM.

That's what we need to solve.

When we finish off the segment, the present SortWriter writes three files for
each sortable text field: ords, offsets, and UTF-8 character data:

   seg_1/sort-2.ord  // <--- The "2" is the field number.
   seg_1/sort-2.ix
   seg_1/sort-2.dat

First, SortWriter creates an array of unique sorted values by sorting the keys
in the unique_values hash.  

  var sorted = sort_writer.unique_values['category'].keys();
  sorted.sort();

The sort positions are then saved as values in the unique_values hash.

  for (var ord = 0; ord < sorted.length; ord++) {
    var key = sorted[ord];
    sort_writer.unique_values['category'][key] = ord;
  }

The sorted values are written out to disk as two files: offsets, and character
data.

  var ix_out  = sort_writer.folder.open_out('seg_1/sort-2.ix');
  var dat_out = sort_writer.folder.open_out('seg_1/sort-2.dat');
  for (var i = 0; i < sorted.length; i++) {
    var unique_val = sorted[i];
    ix_out.write_i64(dat_out.tell());
    dat_out.write_bytes(unique_val.get_ptr8(), unique_val.get_size());
  }
  ix_out.close();
  dat_out.close();

Next, we create the ordinals array by iterating over the doc_values array and
looking up each ord via the unique_values hash.

  for (var doc_id = 1; doc_id <= doc_max; doc_id++) {
    var key = sort_writer.doc_values['category'][doc_id];
    ords[doc_id] = sort_writer.unique_values['category'][key];
  }

The ords file is simply this ordinals array blasted out to disk after
compressing it to a smaller bit width appropriate to its cardinality. (2
unique values = one bit per doc, 13 unique values = four bits per doc, etc).

  var outstream = sort_writer.folder.open_out('seg_1/sort-2.ord');
  outstream.write_bytes(compressed_ords, compressed_ords_len);
  outstream.close();
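
The width is just the smallest number of bits that can distinguish every
ordinal, i.e. ceil(log2(cardinality)).  As a quick sketch (the helper name is
made up):

  function bits_per_doc(num_unique_values) {
    var bits = 1;
    while ((1 << bits) < num_unique_values) { bits++; }
    return bits;
  }
  // bits_per_doc(2) == 1, bits_per_doc(13) == 4, bits_per_doc(1000) == 10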

Right now, SortWriter is the only DataWriter component that has a problem with
RAM footprint -- all the other components (LexWriter, DocWriter,
PostingsWriter, etc) do a good job of keeping RAM usage under control and
scale up beyond the practical size for an index on one machine.

> The plan is to store one giant file of all the field values, 

Yes.  Instead of using a hash to perform the uniquing, we'd use a file.

Think of it this way: we're dispensing with the unique_values hash and using
only the doc_values array.  At segment finish time, we use that doc_values
array to create a sorted list of doc_ids.

  var sorted_doc_ids = new Array(doc_max);
  for (var i = 0; i < doc_max; i++) { sorted_doc_ids[i] = i + 1; }  // doc ids are 1-based
  sorted_doc_ids.sort(compare_doc_ids_by_doc_values);

However, instead of actually storing CharBuf objects in the doc_values array,
we're going to save memory by storing file offsets pointing at a temporary
file where we've *serialized* those values.  During indexing, that costs us
one 64-bit integer per doc per sortable field, which will scale up OK.
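
The write side of that would look roughly like this, with
write_charbuf_to_stream assumed as the counterpart of the
read_value_into_charbuf helper below:

  // Per doc, per sortable field: serialize the value into the temp file and
  // keep only its 64-bit file offset in the doc_values array.
  var temp_out = sort_writer.folder.open_out('seg_1/sort_cache_writer_temp');

  function record_doc_value(doc_id, category) {
    sort_writer.doc_values['category'][doc_id] = temp_out.tell();
    write_charbuf_to_stream(temp_out, category);
  }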

To perform comparisons during sorting, we deserialize values from the
temporary file two at a time.

  var instream = sort_writer.folder.open_in('seg_1/sort_cache_writer_temp');
  var value_a  = CharBuf.new();
  var value_b  = CharBuf.new();

  function compare_doc_ids_by_doc_values(doc_a, doc_b) {
    var position_a = sort_writer.doc_values['category'][doc_a];
    var position_b = sort_writer.doc_values['category'][doc_b];
    instream.seek(position_a);
    read_value_into_charbuf(instream, value_a);
    instream.seek(position_b);
    read_value_into_charbuf(instream, value_b);
    return value_a.compare_to(value_b);
  }

We can derive the ords array using the sorted_doc_ids array and the temp file:

  var ord        = -1;   // so that the first distinct value gets ord 0
  var ords       = new Array(doc_max + 1);
  var last_value = new CharBuf();
  var this_value = new CharBuf();
  for (var i = 0; i < doc_max; i++) {
    var doc_id = sorted_doc_ids[i];
    var file_position = sort_writer.doc_values['category'][doc_id];
    instream.seek(file_position);
    read_value_into_charbuf(instream, this_value);
    if (!this_value.equals(last_value)) {
      ord++;
      // Swap the buffers rather than alias them, so that the next read
      // doesn't clobber last_value.
      var temp   = last_value;
      last_value = this_value;
      this_value = temp;
    }
    ords[doc_id] = ord;
  }

This proposed algorithm is similar to the one that Toke Eskildsen described on
java-user:

    http://markmail.org/message/tjgh3p6ugjgh5go6

However, it has been adapted for some of Lucy's constraints, e.g. no
read/write files.

> I'd wonder how you'd merge results between shards or sort the results
> of a text query.

At search time, we use the mmap'd ords array to perform all sort comparisons
within a segment -- which is lightning fast.  :)  For comparison across
segments, we use actual values, which is slower but not that expensive in the
grand scheme of things, since it scales with the number of segments rather
than the number of documents.
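
Sketched out, with fetch_value() standing in for an .ix/.dat lookup and the
hit fields invented:

  // Within a segment, compare the mmap'd ords; across segments, fall back
  // to the actual values.
  function compare_hits(hit_a, hit_b) {
    if (hit_a.seg == hit_b.seg) {
      return ords[hit_a.seg][hit_a.doc_id] - ords[hit_b.seg][hit_b.doc_id];
    }
    var val_a = fetch_value(hit_a.seg, hit_a.doc_id);
    var val_b = fetch_value(hit_b.seg, hit_b.doc_id);
    return val_a.compare_to(val_b);
  }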

> What's the main use case you are targeting?

All queries with a SortSpec that sorts on a field and all RangeQueries use our
mmap'd sort caches.

Marvin Humphrey

[1] C isn't expressive enough for code samples, especially since it lacks
    syntax for hash tables and arrays.  Perl, Ruby, and Python all suffice,
    but I don't want to favor one binding language over another.  (You know
    Perl, Mike knows Python, who gets excluded from the conversation blah blah
    blah.)  JavaScript, which we *can't* write bindings for because it lacks
    IO, is sufficiently expressive, spawned the JSON we use in our metadata
    files, and is popular enough that most people should grok it.


Re: fsync

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Dec 23, 2009 at 12:31 AM, Nathan Kurz <na...@verse.com> wrote:

>> Whereas, using the filesystem really requires a file-flat data
>> structure?
>
> I guess it depends on your point of view: it would be hard (but not
> impossible) to do true objects in an mmapped file, but it would be
> very easy to do has-a type relationships using file offsets as
> pointers.  I tend to have a data-centric (rather than
> object-centric) point of view, but from here I don't see any data
> structures that would be significantly more difficult.

Interesting -- I guess if you "made" all the pointers relative (ie,
interpreted them so, when reading them), then you could make arbitrary
structures.

> Do you have a link that explains the FST you refer to?  I'm
> searching, and not finding anything that's a definite match.  "Field
> select table"?

Sorry -- FST = finite state transducer.  It adds optional "outputs"
along each edge, over a finite state machine.  When used for the terms
index, I think it'd be a symmetric letter trie -- ie, both prefixes
and suffixes are shared, with the "middle" part of the FST producing
the outputs (= index information for that term that uniquely crosses
that edge).

>> Ie, "going through the filesystem" and "going through shared
>> memory" are two alternatives for enabling efficient process-only
>> concurrency models.  They have interesting tradeoffs (I'll answer
>> more in 2026), but the fact that one of them is backed by a file by
>> the OS seems like a salient difference.
>
> For me, file backing doesn't seem like a big difference.  Fast
> moving changes will never hit disk, and I presume there is some way
> you can convince the system never to actually write out the slow
> changes (maybe mmap on a RamFS?).

What are fast & slow changes here?  Fast = new segments that get
created but then merged away before moving to stable storage?

> I think the real difference is between sharing between threads and
> sharing between processes --- basically, whether or not you can
> assume that the address space is identical in all the 'sharees'.

Yes, process-only concurrency seems like the big difference.

> I'll mention that, given the New Year, at first I thought 2026 was
> your realistic time estimate rather than a tracking number.

Heh ;)

> I started thinking about how one could do objects with mmap, and
> came up with an approach that doesn't quite answer that question but
> might actually work out well for other problems: you could literally
> compile your index and link it in as a shared library.  Each term
> would be a symbol, and you'd use 'dlsym' to find the associated
> data.
>
> It's possible that you could even use library versioning to handle
> updates, and stuff like RTLD_NEXT to handle multiple
> segments. Perhaps a really bad idea, but I find it an intriguing
> one.  I wonder how fast using libdl would be compared to writing
> your own lookup tables.  I'd have to guess it's fairly efficient.

That is a wild idea!  I wonder how dlsym represents its information...

Mike

Re: fsync

Posted by Nathan Kurz <na...@verse.com>.
On Sun, Dec 20, 2009 at 6:15 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Sun, Dec 20, 2009 at 12:14 AM, Marvin Humphrey
> <ma...@rectangular.com> wrote:
>
>>> I also think that Mike is making too much distinction between
>>> "relying on the file system" and "using shared memory".  I think
>>> one can safely view them as two interfaces to the same underlying
>>> mechanism.
>
> Using the filesystem for sharing vs using shared memory seem quite
> different to me.  EG one could create a rich data structure (say an
> FST) to represent the terms dict in RAM, then share that terms dict
> amongst many processes, right?
>
> Whereas, using the filesystem really requires a file-flat data
> structure?

I guess it depends on your point of view:  it would be hard (but not
impossible) to do true objects in an mmapped file, but it would be
very easy to do has-a type relationships using file offsets as
pointers.  I tend to have a data-centric (rather than object-centric)
point of view, but from here I don't see any data structures that
would be significantly more difficult.

Do you have a link that explains the FST you refer to?  I'm searching,
and not finding anything that's a definite match.  "Field select
table"?

> Ie, "going through the filesystem" and "going through shared memory"
> are two alternatives for enabling efficient process-only concurrency
> models.  They have interesting tradeoffs (I'll answer more in 2026),
> but the fact that one of them is backed by a file by the OS seems like
> a salient difference.

For me, file backing doesn't seem like a big difference.   Fast moving
changes will never hit disk, and I presume there is some way you can
convince the system never to actually write out the slow changes
(maybe mmap on a RamFS?).  I think the real difference is between
sharing between threads and sharing between processes --- basically,
whether or not you can assume that the address space is identical in
all the 'sharees'.

I'll mention that, given the New Year, at first I thought 2026 was
your realistic time estimate rather than a tracking number.

---

I started thinking about how one could do objects with mmap, and came
up with an approach that doesn't quite answer that question but might
actually work out well for other problems:  you could literally
compile your index and link it in as a shared library.
Each term would be a symbol, and you'd use 'dlsym' to find the associated data.

It's possible that you could even use library versioning to handle
updates, and stuff like RTLD_NEXT to handle multiple segments. Perhaps
a really bad idea, but I find it an intriguing one.   I wonder how
fast using libdl would be compared to writing your own lookup tables.
I'd have to guess it's fairly efficient.

Nathan Kurz
nate@verse.com

Re: fsync

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sun, Dec 20, 2009 at 12:14 AM, Marvin Humphrey
<ma...@rectangular.com> wrote:

>> I also think that Mike is making too much distinction between
>> "relying on the file system" and "using shared memory".  I think
>> one can safely view them as two interfaces to the same underlying
>> mechanism.
>
> I agree with that, and it was kind of confusing since Mike had
> previously seemed to suggest that the flush() semantics were a
> "Lucy-ification" of the Lucene model.

I still need to answer on 2026, but this caught my eye first ;)

Using the filesystem for sharing vs using shared memory seem quite
different to me.  EG one could create a rich data structure (say an
FST) to represent the terms dict in RAM, then share that terms dict
amongst many processes, right?

Whereas, using the filesystem really requires a file-flat data
structure?

Ie, "going through the filesystem" and "going through shared memory"
are two alternatives for enabling efficient process-only concurrency
models.  They have interesting tradeoffs (I'll answer more in 2026),
but the fact that one of them is backed by a file by the OS seems like
a salient difference.

Net/net, I think the proposed Lucy flush vs. commit semantics are a good
approach, given Lucy's design constraints.  Just like Lucene's NRT,
Lucy users won't be forced to trade off reopen time for durability.

Mike

fsync

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sat, Dec 19, 2009 at 12:36:37PM -0800, Nathan Kurz wrote:

> I think your approach here makes great sense:  you can't prevent data
> corruption, you just want to reduce the chance of it happening to an
> acceptable level.  

Do you have any objections to the compromise plan we arrived at?

    /** Permanently commit all changes made to the index.  
     *
     * Until either Commit() or Flush() completes, read processes cannot see
     * any of the current changes.
     */
    public void
    Commit(Indexer *self);

    /** A lightweight and technically unsafe variant of Commit() which skips
     * calling fsync() to force the OS to write all changes to disk.  It
     * returns faster than Commit(), but leaves the index vulnerable to
     * corruption in the event of a power failure or OS crash.
     */
    public void
    Flush(Indexer *self);

    /** Perform the expensive setup for Commit() (or Flush()) in advance, so
     * that Commit() completes quickly.  (If Prepare_Commit() is not called
     * explicitly by the user, Commit() will call it internally.)
     * 
     * @param fsync If true, fsync() all changed files.
     */
    public void
    Prepare_Commit(Indexer *self, bool_t fsync);

    /** Restore the index to the state of the last successful Commit(),
     * discarding any changes which may have been Flushed() since then.
     */
    public void
    Rollback(Indexer *self);
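
In the same pseudocode as the SortWriter examples, a typical session might
look like this (the binding-level method names are assumptions, not the final
API):

  var indexer = Indexer.new(folder);
  indexer.add_doc({ category: 'electronics' });

  // Durable path: do the expensive setup (including the fsync) up front,
  // so that Commit() itself returns quickly.
  indexer.prepare_commit(true);
  indexer.commit();

  // Alternative fast path: atomic but not fsync'd, so a power failure or
  // OS crash may lose it.
  indexer.flush();

  // Discard anything flushed since the last successful Commit().
  indexer.rollback();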
    
> Thinking about how you could add an external log file seems like a better
> failsafe than trying to do full 'commits' within the writer, seeing as there
> is no guarantee those commits will actually hit disk.

I think the addition of Flush() and Rollback() will make it easier for top
level apps to manage a transaction log.

Mike's point about "unbounded" corruption is a good one.  Old segment data
which gets merged away can suddenly get destroyed if fsync'ing the new segment
is incomplete when the power goes out.

We guard against that by always keeping around the last snapshot which went
through the complete fsync dance.  Essentially, we'll need to keep around two
snapshots: the last flushed snapshot, and the last committed snapshot.  

By "snapshot", I mean both the snapshot file itself and all the files it
references.

It's true that we can't guarantee that the hard drive obeys the fsync call,
but I do think that offering an fsync'd Commit() as an option is a good idea
-- so long as we also offer the "atomic but not durable" option for those who
know what they're doing.

> I also think that Mike is making too much distinction between "relying
> on the file system" and "using shared memory".  I think one can safely
> view them as two interfaces to the same underlying mechanism.

I agree with that, and it was kind of confusing since Mike had previously
seemed to suggest that the flush() semantics were a "Lucy-ification" of the
Lucene model.  See the first section of my latest reply to him:

    https://issues.apache.org/jira/browse/LUCENE-2026?focusedCommentId=12792939&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12792939

Marvin Humphrey


Re: Discussions elsewhere about fsync, SortWriter memory costs

Posted by Nathan Kurz <na...@verse.com>.
On Thu, Dec 17, 2009 at 6:38 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> In the JIRA issue LUCENE-2026, Mike McCandless and I have been talking about
> how and when to protect Lucy against power-failure-induced index corruption.
>
>  https://issues.apache.org/jira/browse/LUCENE-2026

I think your approach here makes great sense:  you can't prevent data
corruption, you just want to reduce the chance of it happening to an
acceptable level.  Thinking about how you could add an external log
file seems like a better failsafe than trying to do full 'commits'
within the writer, seeing as there is no guarantee those commits will
actually hit disk.

I also think that Mike is making too much distinction between "relying
on the file system" and "using shared memory".  I think one can safely
view them as two interfaces to the same underlying mechanism.

> Over on the Lucene java-user list, we're discussing how to rein in
> SortWriter's memory costs:
>
>  http://markmail.org/message/zcgb6ak24di42rev

I'm not sure that I understand your approach here.  Could you offer
some context?  The plan is to store one giant file of all the field
values, and then simple lists of integer Doc values in sorted order?
I'd wonder how you'd merge results between shards or sort the results
of a text query.   What's the main use case you are targeting?

--nate