Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2009/12/20 06:14:20 UTC

fsync

On Sat, Dec 19, 2009 at 12:36:37PM -0800, Nathan Kurz wrote:

> I think your approach here makes great sense:  you can't prevent data
> corruption, you just want to reduce the chance of it happening to an
> acceptable level.  

Do you have any objections to the compromise plan we arrived at?

    /** Permanently commit all changes made to the index.  
     *
     * Until either Commit() or Flush() completes, read processes cannot see
     * any of the current changes.
     */
    public void
    Commit(Indexer *self);

    /** A lightweight and technically unsafe variant of Commit() which skips
     * calling fsync() to force the OS to write all changes to disk.  It
     * returns faster than Commit(), but leaves the index vulnerable to
     * corruption in the event of a power failure or OS crash.
     */
    public void
    Flush(Indexer *self);

    /** Perform the expensive setup for Commit() (or Flush()) in advance, so
     * that Commit() completes quickly.  (If Prepare_Commit() is not called
     * explicitly by the user, Commit() will call it internally.)
     * 
     * @param fsync If true, fsync() all changed files.
     */
    public void
    Prepare_Commit(Indexer *self, bool_t fsync);

    /** Restore the index to the state of the last successful Commit(),
     * discarding any changes applied via Flush() since then.
     */
    public void
    Rollback(Indexer *self);
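
For illustration, here's how I imagine a top-level app driving those calls
from C.  The binding names (Indexer_Add_Doc(), Indexer_Prepare_Commit(),
etc.) and the Doc type are placeholders for the declarations above, not
settled API:

    /* Sketch only: Indexer_Add_Doc(), Indexer_Prepare_Commit(), and
     * Indexer_Commit() stand in for whatever the real C bindings end up
     * being called. */
    void
    index_batch(Indexer *indexer, Doc **docs, size_t num_docs) {
        for (size_t i = 0; i < num_docs; i++) {
            Indexer_Add_Doc(indexer, docs[i]);
        }
        /* Do the expensive work (and the fsync dance) up front... */
        Indexer_Prepare_Commit(indexer, true);
        /* ...so that the final Commit() returns quickly. */
        Indexer_Commit(indexer);
    }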
    
> Adding an external log file seems like a better failsafe than trying to do
> full 'commits' within the writer, seeing as there is no guarantee those
> commits will actually hit disk.

I think the addition of Flush() and Rollback() will make it easier for top
level apps to manage a transaction log.
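
To sketch the pattern (Log_Append(), Log_Truncate(), and Log_Replay() are
made-up helpers owned by the app, not Lucy, and the Indexer_* names are the
same placeholders as above):

    /* App-level transaction log paired with Flush()/Commit()/Rollback(). */
    void
    add_doc_logged(Indexer *indexer, Log *log, Doc *doc) {
        Log_Append(log, doc);          /* durable record of the change    */
        Indexer_Add_Doc(indexer, doc);
        Indexer_Flush(indexer);        /* fast and atomic, not durable    */
    }

    void
    checkpoint(Indexer *indexer, Log *log) {
        Indexer_Commit(indexer);       /* fsync'd: changes now durable    */
        Log_Truncate(log);             /* old log entries no longer needed */
    }

    void
    recover_after_crash(Indexer *indexer, Log *log) {
        Indexer_Rollback(indexer);     /* back to the last fsync'd commit */
        Log_Replay(log, indexer);      /* re-apply logged changes         */
    }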

Mike's point about "unbounded" corruption is a good one.  Old segment data
which gets merged away can suddenly get destroyed if fsync'ing the new segment
is incomplete when the power goes out.

We guard against that by always keeping around the last snapshot which went
through the complete fsync dance.  Essentially, we'll need to keep around two
snapshots: the last flushed snapshot, and the last committed snapshot.  

By "snapshot", I mean both the snapshot file itself and all the files it
references.
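
In sketch form, the file-retention rule comes down to something like this
(Snapshot_Refers_To() is a made-up helper, not real API):

    /* A file may be deleted only if neither the last committed snapshot
     * nor the last flushed snapshot still references it. */
    bool
    file_is_deletable(Snapshot *last_committed, Snapshot *last_flushed,
                      const char *filename) {
        if (Snapshot_Refers_To(last_committed, filename)) { return false; }
        if (Snapshot_Refers_To(last_flushed, filename))   { return false; }
        return true;
    }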

It's true that we can't guarantee that the hard drive obeys the fsync call,
but I do think that offering an fsync'd Commit() as an option is a good idea
-- so long as we also offer the "atomic but not durable" option for those who
know what they're doing.

> I also think that Mike is making too much distinction between "relying
> on the file system" and "using shared memory".  I think one can safely
> view them as two interfaces to the same underlying mechanism.

I agree with that, and it was kind of confusing since Mike had previously
seemed to suggest that the flush() semantics were a "Lucy-ification" of the
Lucene model.  See the first section of my latest reply to him:

    https://issues.apache.org/jira/browse/LUCENE-2026?focusedCommentId=12792939&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12792939

Marvin Humphrey


Re: fsync

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Dec 23, 2009 at 12:31 AM, Nathan Kurz <na...@verse.com> wrote:

>> Whereas, using the filesystem really requires a file-flat data
>> structure?
>
> I guess it depends on your point of view: it would be hard (but not
> impossible) to do true objects in an mmapped file, but it would be
> very easy to do has-a type relationships using file offsets as
> pointers.  I tend to have a data-centric (rather than
> object-centric) point of view, but from here I don't see any data
> structures that would be significantly more difficult.

Interesting -- I guess if you "made" all the pointers relative (ie,
interpreted them as relative when reading them), then you could build
arbitrary structures.
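
Something like this is what I'm picturing for "relative pointers" -- a
made-up layout, not anything from Lucene or Lucy:

    #include <stdint.h>

    /* Has-a relationships stored as offsets from the start of the mapped
     * region, so the structure works at whatever address each process
     * happens to mmap it. */
    typedef struct {
        uint64_t name_offset;     /* offset of a NUL-terminated field name */
        uint64_t postings_offset; /* offset of this field's postings data  */
    } FieldEntry;

    static inline const void*
    resolve(const void *base, uint64_t offset) {
        /* The "pointer" is interpreted relative to the mapping's base. */
        return (const char*)base + offset;
    }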

> Do you have a link that explains the FST you refer to?  I'm
> searching, and not finding anything that's a definite match.  "Field
> select table"?

Sorry -- FST = finite state transducer.  It adds optional "outputs"
along each edge, over a finite state machine.  When used for the terms
index, I think it'd be a symmetric letter trie -- ie, both prefixes
and suffixes are shared, with the "middle" part of the FST producing
the outputs (= index information for that term that uniquely crosses
that edge).
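
To make the "outputs along each edge" idea concrete, here's a toy
hand-built transducer mapping "cat" -> 5 and "car" -> 7; a real terms-index
FST would be built algorithmically, this just shows the mechanics:

    #include <stddef.h>

    /* Each edge carries an output; the value for a term is the sum of the
     * outputs along its accepting path, so a shared prefix edge carries
     * the shared portion of the output. */
    typedef struct { int from; char label; int to; int output; } Edge;

    static const Edge edges[] = {
        { 0, 'c', 1, 5 },  /* shared prefix edge                 */
        { 1, 'a', 2, 0 },
        { 2, 't', 3, 0 },  /* "cat" => 5 + 0 + 0 = 5             */
        { 2, 'r', 3, 2 },  /* "car" => 5 + 0 + 2 = 7             */
    };
    static const int final_state = 3;

    /* Returns the summed output for an accepted term, or -1 otherwise. */
    static int
    fst_lookup(const char *term) {
        int state = 0, output = 0;
        for (const char *p = term; *p; p++) {
            size_t i;
            for (i = 0; i < sizeof(edges) / sizeof(edges[0]); i++) {
                if (edges[i].from == state && edges[i].label == *p) {
                    state   = edges[i].to;
                    output += edges[i].output;
                    break;
                }
            }
            if (i == sizeof(edges) / sizeof(edges[0])) { return -1; }
        }
        return state == final_state ? output : -1;
    }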

>> Ie, "going through the filesystem" and "going through shared
>> memory" are two alternatives for enabling efficient process-only
>> concurrency models.  They have interesting tradeoffs (I'll answer
>> more in 2026), but the fact that one of them is backed by a file by
>> the OS seems like a salient difference.
>
> For me, file backing doesn't seem like a big difference.  Fast
> moving changes will never hit disk, and I presume there is some way
> you can convince the system never to actually write out the slow
> changes (maybe mmap on a RamFS?).

What are fast & slow changes here?  Fast = new segments that get
created but then merged away before moving to stable storage?

> I think the real difference is between sharing between threads and
> sharing between processes --- basically, whether or not you can
> assume that the address space is identical in all the 'sharees'.

Yes, process-only concurrency seems like the big difference.

> I'll mention that, given the New Year, at first I thought 2026 was
> your realistic time estimate rather than a tracking number.

Heh ;)

> I started thinking about how one could do objects with mmap, and
> came up with an approach that doesn't quite answer that question but
> might actually work out well for other problems: you could literally
> compile your index and link it in as a shared library.  Each term
> would be a symbol, and you'd use 'dlsym' to find the associated
> data.
>
> It's possible that you could even use library versioning to handle
> updates, and stuff like RTLD_NEXT to handle multiple
> segments. Perhaps a really bad idea, but I find it an intriguing
> one.  I wonder how fast using libdl would be compared to writing
> your own lookup tables.  I'd have to guess it's fairly efficient.

That is a wild idea!  I wonder how dlsym represents its information...

Mike

Re: fsync

Posted by Nathan Kurz <na...@verse.com>.
On Sun, Dec 20, 2009 at 6:15 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Sun, Dec 20, 2009 at 12:14 AM, Marvin Humphrey
> <ma...@rectangular.com> wrote:
>
>>> I also think that Mike is making too much distinction between
>>> "relying on the file system" and "using shared memory".  I think
>>> one can safely view them as two interfaces to the same underlying
>>> mechanism.
>
> Using the filesystem for sharing vs using shared memory seem quite
> different to me.  EG one could create a rich data structure (say an
> FST) to represent the terms dict in RAM, then share that terms dict
> amongst many processes, right?
>
> Whereas, using the filesystem really requires a file-flat data
> structure?

I guess it depends on your point of view:  it would be hard (but not
impossible) to do true objects in an mmapped file, but it would be
very easy to do has-a type relationships using file offsets as
pointers.  I tend to have a data-centric (rather than object-centric)
point of view, but from here I don't see any data structures that
would be significantly more difficult.

Do you have a link that explains the FST you refer to?  I'm searching,
and not finding anything that's a definite match.  "Field select
table"?

> Ie, "going through the filesystem" and "going through shared memory"
> are two alternatives for enabling efficient process-only concurrency
> models.  They have interesting tradeoffs (I'll answer more in 2026),
> but the fact that one of them is backed by a file by the OS seems like
> a salient difference.

For me, file backing doesn't seem like a big difference.   Fast moving
changes will never hit disk, and I presume there is some way you can
convince the system never to actually write out the slow changes
(maybe mmap on a RamFS?).  I think the real difference is between
sharing between threads and sharing between processes --- basically,
whether or not you can assume that the address space is identical in
all the 'sharees'.
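
For what it's worth, here's the kind of thing I mean, using POSIX
shm_open() plus mmap(); swapping shm_open() for open() on a regular file
gives you the same kind of MAP_SHARED view, just backed by the filesystem:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Create (or open) a named shared region that other processes can map.
     * The descriptor can be closed once the mapping exists. */
    void*
    map_shared_region(const char *name, size_t size) {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { return NULL; }
        if (ftruncate(fd, (off_t)size) != 0) { close(fd); return NULL; }
        void *region = mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        close(fd);
        return region == MAP_FAILED ? NULL : region;
    }

Each process may get a different base address back from mmap(), which is
why offset-style rather than absolute pointers matter for the cross-process
case.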

I'll mention that, given the New Year, at first I thought 2026 was
your realistic time estimate rather than a tracking number.

---

I started thinking about how one could do objects with mmap, and came
up with an approach that doesn't quite answer that question but might
actually work out well for other problems:  you could literally
compile your index and link it in as a shared library.
Each term would be a symbol, and you'd use 'dlsym' to find the associated data.

It's possible that you could even use library versioning to handle
updates, and stuff like RTLD_NEXT to handle multiple segments. Perhaps
a really bad idea, but I find it an intriguing one.   I wonder how
fast using libdl would be compared to writing your own lookup tables.
I'd have to guess it's fairly efficient.
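
Roughly what I'm imagining, with the term-to-symbol name mangling glossed
over ("term_apache" and the .so path are obviously made up):

    #include <dlfcn.h>
    #include <stdio.h>

    /* Each term's index data is exported as a symbol in the compiled
     * segment; lookup is just dlsym(). */
    typedef struct {
        int  doc_freq;
        long postings_offset;
    } TermData;

    int
    main(void) {
        void *segment = dlopen("./seg_000001.so", RTLD_LAZY);
        if (!segment) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        TermData *term = (TermData*)dlsym(segment, "term_apache");
        if (term) {
            printf("doc_freq=%d postings_offset=%ld\n",
                   term->doc_freq, term->postings_offset);
        }
        dlclose(segment);
        return 0;
    }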

Nathan Kurz
nate@verse.com

Re: fsync

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sun, Dec 20, 2009 at 12:14 AM, Marvin Humphrey
<ma...@rectangular.com> wrote:

>> I also think that Mike is making too much distinction between
>> "relying on the file system" and "using shared memory".  I think
>> one can safely view them as two interfaces to the same underlying
>> mechanism.
>
> I agree with that, and it was kind of confusing since Mike had
> previously seemed to suggest that the flush() semantics were a
> "Lucy-ification" of the Lucene model.

I still need to answer on 2026, but this caught my eye first ;)

Using the filesystem for sharing vs using shared memory seem quite
different to me.  EG one could create a rich data structure (say an
FST) to represent the terms dict in RAM, then share that terms dict
amongst many processes, right?

Whereas, using the filesystem really requires a file-flat data
structure?

Ie, "going through the filesystem" and "going through shared memory"
are two alternatives for enabling efficient process-only concurrency
models.  They have interesting tradeoffs (I'll answer more in 2026),
but the fact that one of them is backed by a file by the OS seems like
a salient difference.

Net/net, I think the proposed Lucy flush vs. commit semantics is a good
approach, given Lucy's design constraints.  Just like Lucene's NRT, Lucy
users won't be forced to trade off reopen time against durability.

Mike