Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2007/01/19 16:10:30 UTC

[jira] Commented: (LUCENE-710) Implement "point in time" searching without relying on filesystem semantics

    [ https://issues.apache.org/jira/browse/LUCENE-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466073 ] 

Michael McCandless commented on LUCENE-710:
-------------------------------------------

There have been some great design discussions / iterations recently
on how to approach this:

    http://www.gossamer-threads.com/lists/lucene/java-dev/44162

    http://www.gossamer-threads.com/lists/lucene/java-dev/44236


I think we've iterated to a good approach now.  Here's the summary:

  * First, add an option to IndexWriter to "commit (write segments_N)
    only on close" vs writing a segments_N every time there is a
    flush, merge, etc., during a single IndexWriter session.

    This means a reader won't see anything a writer has been doing
    until it's closed.

    We would still have an "autoCommit" true/false (default true) to
    keep backwards compatibility.  If true, the IndexWriter writes a
    new segments_N every time it flushes, merges segments, etc.; else
    it only writes one on close.

    We would add an "abort()" to IndexWriter to not commit, clean up
    any temp files created, and roll back.

    "Commit on close" will also address / enable fixes for other
    issues, like preventing readers from refreshing halfway through
    something like "bulk delete then bulk add", preventing readers
    from refreshing during optimize() (thus tying up lots of disk
    space), enabling a write session to be transactional (all or
    none), etc.
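
To make the "commit on close" semantics concrete, here is a minimal
sketch in plain Java.  It is a toy model only: ToyIndex and ToyWriter
are hypothetical names invented for illustration, not Lucene classes,
and a "commit" here just swaps an in-memory list rather than writing a
segments_N file.  It assumes a reader only ever sees the last
committed state.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: ToyIndex and ToyWriter are hypothetical names,
// not Lucene classes.
class CommitOnCloseSketch {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();

        // autoCommit = false: flushes stay invisible until close().
        ToyWriter w1 = new ToyWriter(index, false);
        w1.addAndFlush("doc1");
        w1.addAndFlush("doc2");
        System.out.println("before close: " + index.visibleDocs());
        w1.close();
        System.out.println("after close:  " + index.visibleDocs());

        // abort() discards everything done in this writer session.
        ToyWriter w2 = new ToyWriter(index, false);
        w2.addAndFlush("doc3");
        w2.abort();
        System.out.println("after abort:  " + index.visibleDocs());
    }
}

/** Stands in for the index as readers see it: the last committed state. */
class ToyIndex {
    private List<String> committed = new ArrayList<>();
    private int generation = 0;  // plays the role of N in segments_N

    synchronized void commit(List<String> docs) {
        committed = new ArrayList<>(docs);
        generation++;
    }

    synchronized List<String> visibleDocs() { return new ArrayList<>(committed); }

    synchronized int generation() { return generation; }
}

/** A writer session that commits either on every flush or only on close. */
class ToyWriter {
    private final ToyIndex index;
    private final boolean autoCommit;
    private final List<String> pending;  // uncommitted, in-memory state
    private boolean dirty = false;
    private boolean closed = false;

    ToyWriter(ToyIndex index, boolean autoCommit) {
        this.index = index;
        this.autoCommit = autoCommit;
        this.pending = index.visibleDocs();  // start from the last commit
    }

    void addAndFlush(String doc) {
        ensureOpen();
        pending.add(doc);
        dirty = true;
        if (autoCommit) commit();  // old behavior: commit on every flush
    }

    void close() {
        ensureOpen();
        if (dirty) commit();  // the only commit when autoCommit is false
        closed = true;
    }

    void abort() {
        ensureOpen();
        pending.clear();  // drop uncommitted work; nothing was published
        dirty = false;
        closed = true;
    }

    private void commit() {
        index.commit(pending);
        dirty = false;
    }

    private void ensureOpen() {
        if (closed) throw new IllegalStateException("writer is closed");
    }
}
```

Note that in this sketch abort() can only discard work that has not
yet been committed, so with autoCommit=true there is nothing left to
roll back after a flush.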


  * Second, change how IndexFileDeleter works: have it keep track of
    which commits are still live and which one is pending (as the
    SegmentInfos in IndexWriter, not yet written to disk).

    Allow IndexFileDeleter to be subclassed to implement different
    "deletion policies".

    The base IndexFileDeleter class will use ref counts to figure out
    which individual index files are still referenced by one or more
    "segments_N" commits or by the uncommitted "in-memory"
    SegmentInfos.  Then the policy is invoked on commit (and also on
    init) and can choose which commits (if any) to now remove.

    Add constructors to IndexWriter allowing you to pass in your own
    deleter. The default policy would still be "delete all past
    commits as soon as a new commit is written" (this is how deleting
    happens today).

    For NFS we can then try different policies as discussed on those
    threads above (there were at least 4 proposals).  They all have
    different tradeoffs.  I would open separate issues for these
    policies after this issue is resolved.
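
As a rough illustration of the ref counting plus pluggable policy idea,
here is a self-contained Java sketch.  All names in it (Commit,
DeletionPolicy, KeepLastN, RefCountingDeleter) are hypothetical and
chosen for this example; it is not the actual IndexFileDeleter code.
It assumes each commit is just a generation number plus a set of file
names, and it leaves out the pending "in-memory" SegmentInfos for
brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; none of these types are Lucene's.
class DeleterSketch {
    public static void main(String[] args) {
        // KeepLastN(1) mimics today's "delete all past commits" policy.
        RefCountingDeleter deleter = new RefCountingDeleter(new KeepLastN(1));
        deleter.onCommit(new Commit(1, "_0.cfs"));
        deleter.onCommit(new Commit(2, "_0.cfs", "_1.cfs"));
        deleter.onCommit(new Commit(3, "_2.cfs"));  // e.g. after a merge
        System.out.println("live:    " + deleter.liveFiles());
        System.out.println("deleted: " + deleter.deletedFiles());
    }
}

/** One commit point: its segments_N file plus the files it references. */
final class Commit {
    final int generation;
    final Set<String> files;

    Commit(int generation, String... segmentFiles) {
        this.generation = generation;
        Set<String> all = new LinkedHashSet<>();
        all.add("segments_" + generation);
        all.addAll(Arrays.asList(segmentFiles));
        this.files = Collections.unmodifiableSet(all);
    }
}

/** Decides which of the live commits (oldest first) should be removed. */
interface DeletionPolicy {
    List<Commit> selectForDeletion(List<Commit> commits);
}

/** Keeps the most recent n commits and selects all older ones. */
class KeepLastN implements DeletionPolicy {
    private final int n;

    KeepLastN(int n) { this.n = n; }

    public List<Commit> selectForDeletion(List<Commit> commits) {
        int cut = Math.max(0, commits.size() - n);
        return new ArrayList<>(commits.subList(0, cut));
    }
}

/** Ref counts files across live commits; deletes a file at count zero. */
class RefCountingDeleter {
    private final DeletionPolicy policy;
    private final List<Commit> commits = new ArrayList<>();
    private final Map<String, Integer> refCounts = new HashMap<>();
    private final List<String> deleted = new ArrayList<>();

    RefCountingDeleter(DeletionPolicy policy) { this.policy = policy; }

    void onCommit(Commit commit) {
        commits.add(commit);
        for (String f : commit.files) refCounts.merge(f, 1, Integer::sum);

        // Copy the policy's answer so removing commits cannot disturb it.
        List<Commit> doomed = new ArrayList<>(
            policy.selectForDeletion(Collections.unmodifiableList(commits)));
        for (Commit old : doomed) {
            commits.remove(old);
            for (String f : old.files) decRef(f);
        }
    }

    private void decRef(String file) {
        int count = refCounts.merge(file, -1, Integer::sum);
        if (count == 0) {
            refCounts.remove(file);
            deleted.add(file);  // a real deleter would remove the file here
        }
    }

    Set<String> liveFiles() { return new TreeSet<>(refCounts.keySet()); }

    List<String> deletedFiles() { return new ArrayList<>(deleted); }
}
```

Swapping in KeepLastN(2), or any other DeletionPolicy, changes which
commits survive without touching the ref counting itself, which is the
point of separating the two.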


> Implement "point in time" searching without relying on filesystem semantics
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-710
>                 URL: https://issues.apache.org/jira/browse/LUCENE-710
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> This was touched on in recent discussion on dev list:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/41700#41700
> and then more recently on the user list:
>   http://www.gossamer-threads.com/lists/lucene/java-user/42088
> Lucene's "point in time" searching currently relies on how the
> underlying storage handles deletion of files that are held open for
> reading.
> This is highly variable across filesystems.  For example, UNIX-like
> filesystems usually do "delete on last close", and Windows filesystems
> typically refuse to delete a file open for reading (so Lucene retries
> later).  But NFS just removes the file out from under the reader, and
> for that reason "point in time" searching doesn't work on NFS
> (see LUCENE-673).
> With the lockless commits changes (LUCENE-701), it's quite simple to
> re-implement "point in time searching" so as to not rely on filesystem
> semantics: we can just keep more than the last segments_N file (as
> well as all files they reference).
> This is also in keeping with the design goal of "rely on as little as
> possible from the filesystem".  EG with lockless we no longer re-use
> filenames (don't rely on filesystem cache being coherent) and we no
> longer use file renaming (because on Windows it can fail).  This
> would be another step of not relying on semantics of "deleting open
> files".  The less we require from filesystem the more portable Lucene
> will be!
> Where it gets interesting is what "policy" we would then use for
> removing segments_N files.  The policy now is "remove all but the last
> one".  I think we would keep this policy as the default.  Then you
> could imagine other policies:
>   * Keep the past N days' worth
>   * Keep the last N
>   * Keep only those in active use by a reader somewhere (note: tricky
>     how to reliably figure this out when readers have crashed, etc.)
>   * Keep those "marked" as rollback points by some transaction, or
>     marked explicitly as a "snapshot".
>   * Or, roll your own: the "policy" would be an interface or abstract
>     class and you could make your own implementation.
> I think for this issue we could just create the framework
> (interface/abstract class for "policy" and invoke it from
> IndexFileDeleter) and then implement the current policy (delete all
> but most recent segments_N) as the default policy.
> In separate issue(s) we could then create the above more interesting
> policies.
> I think there are some important advantages to doing this:
>   * "Point in time" searching would work on NFS (it doesn't now
>     because NFS doesn't do "delete on last close"; see LUCENE-673)
>     and any other Directory implementations that don't work
>     currently.
>   * Transactional semantics become a possibility: you can set a
>     snapshot, do a bunch of stuff to your index, and then rollback to
>     the snapshot at a later time.
>   * If a reader crashes or machine gets rebooted, etc, it could choose
>     to re-open the snapshot it had previously been using, whereas now
>     the reader must always switch to the last commit point.
>   * Searchers could search the same snapshot for follow-on actions.
>     Meaning, user does search, then next page, drill down (Solr),
>     drill up, etc.  These are each separate trips to the server and if
>     searcher has been re-opened, user can get inconsistent results (=
>     lost trust).  But with this, one series of search interactions could
>     explicitly stay on the snapshot it had started with.
