Posted to dev@lucene.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2007/01/14 22:36:34 UTC

adding "explicit commits" to Lucene?

Team,

I've been struggling to find a clean solution for LUCENE-710, when I
thought of a simple addition to Lucene ("explicit commits") that I
think would resolve LUCENE-710 and fix a few other outstanding issues
when readers are using a "live" index (one being updated by a writer).

The basic idea is to add an explicit "commit" operation to Lucene.

This is the same nice feature Solr has, just with a different
implementation (in Lucene core, within a single index).  The commit
makes a "point in time" snapshot (term borrowed from Solr!) available
for searching.

The implementation is surprisingly simple (see below) and completely
backwards compatible.

I'd like to get some feedback on the idea/implementation.


Details...: right now, Lucene writes a new segments_N file at various
times: when a writer (or reader that's writing deletes/norms) needs to
flush its pending changes to disk; when a writer merges segments; when
a writer is closed; multiple times during optimize/addIndexes; etc.

These times are not controllable / predictable to the developer using
Lucene.

A new reader always opens the last segments_N written, and, when a
reader uses isCurrent() to check whether it should re-open (the
suggested way), that method always returns false (meaning you should
re-open) if there are any new segments_N files.

So the developer has little control over what state the index is in
when a reader is [re-]opened.

People work around this today by adding logic above Lucene so that the
writer separately communicates to readers when it is a good time to
refresh.  But with "explicit commits", readers could instead look
directly at the index and pick the right segments_N to refresh to.

I'm proposing that we separate the writing of a new segments_N file
into those writes that are done automatically by Lucene (I'll call
these "checkpoints") from meaningful (to the application) commits that
are done explicitly by the developer at known times (I'll call this
"committing a snapshot").  I would add a new boolean mode to
IndexWriter called "autoCommit", and a new public method "commit()" to
IndexWriter and IndexReader (we'd have to rename the current protected
commit() in IndexReader).

When autoCommit is true, every write of a segments_N file will
"commit a snapshot", meaning readers will then use it for searching.
This will be the default, and it is exactly how Lucene behaves today,
so this change is completely backwards compatible.

When autoCommit is false, then when Lucene chooses to save a
segments_N file it's just a "checkpoint": a reader would not open or
re-open to the checkpoint.  This means the developer must then call
IndexWriter.commit() or IndexReader.commit() in order to "commit a
snapshot" at the right time, thereby telling readers that this
segments_N file is a valid one to switch to for searching.


The implementation is very simple (I have an initial coarse prototype
working with all but the last bullet):

   * If a segments_N file is just a checkpoint, it's named
     "segmentsx_N" (note the added 'x'); if it's a snapshot, it's named
     "segments_N".  No other changes to the index format.

   * A reader by default opens the latest snapshot but can optionally
     open a specific N (segments_N) snapshot.

   * A writer by default starts from the most recent "checkpoint" but
     may also take a specific checkpoint or snapshot point N
     (segments_N) to start from (to allow rollback).

   * Change IndexReader.isCurrent() to see if there are any newer
     snapshots but disregard newer checkpoints.

   * When a writer is in autoCommit=false mode, it always writes to the
     next segmentsx_N; else it writes to segments_N.

   * The commit() method would just write to the next segments_N file
     and return the N it had written (in case application needs to
     re-use it later).

   * IndexFileDeleter would need to have a slightly smarter policy when
     autoCommit=false, ie, "don't delete anything referenced by either
     the past N snapshots or if the snapshot was obsoleted less than X
     minutes ago".
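A tiny sketch of the naming rule in the first two bullets (illustrative Java only, not Lucene source; the class and method names are invented): a reader scans the directory listing and opens the highest-numbered segments_N, skipping segmentsx_N checkpoints.

```java
import java.util.List;

public class SnapshotPicker {
    /** Returns the largest N among "segments_N" file names, or -1 if none. */
    public static long latestSnapshot(List<String> files) {
        long best = -1;
        for (String f : files) {
            // "segmentsx_N" checkpoints fail this prefix test and are ignored
            if (f.startsWith("segments_")) {
                try {
                    best = Math.max(best, Long.parseLong(f.substring("segments_".length())));
                } catch (NumberFormatException ignored) {}
            }
        }
        return best;
    }
}
```

A reader in autoCommit-aware mode would simply never consider the checkpoint files, which is what makes the commit explicit.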


I think there are some compelling things this could solve:

   * The "delete then add" problem (really a special but very common
     case of general transactions):

     Right now when you want to update a bunch of documents in a Lucene
     index, it's best to open a reader, do a "batch delete", close the
     reader, open a writer, do a "batch add", close the writer.  This
     is the suggested way.

     The open risk here is that a reader could refresh at any time
     during these operations, and find that a bunch of documents have
     been deleted but not yet added again.

     Whereas, with autoCommit false you could do this entire operation
     (batch delete then batch add) and call commit() at the end, and
     readers would know not to re-open the index until that final
     commit() succeeded.
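Under the proposal, the whole update could be sketched like this (pseudocode against the *proposed* API: the autoCommit constructor flag, IndexWriter.deleteDocuments (LUCENE-565), and commit() do not exist in Lucene today):

```java
// Pseudocode only -- autoCommit, IndexWriter.deleteDocuments and commit()
// are the proposed additions, not current Lucene API.
IndexWriter writer = new IndexWriter(dir, analyzer, /*autoCommit=*/false);
writer.deleteDocuments(termsForOldVersions); // at most a checkpoint: invisible to readers
for (Document doc : updatedDocs) {
    writer.addDocument(doc);                 // still invisible to readers
}
long n = writer.commit();  // writes snapshot segments_N; readers may now refresh
writer.close();
```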

   * The "using too much disk space during optimize" problem:

     This came up on the user's list recently: if you aggressively
     refresh readers while optimize() is running, you can tie up much
     more disk space than you'd expect, because your readers are
     holding open all the [possibly very large] intermediate segments.

     Whereas, if autoCommit is false, the developer calls optimize()
     and then calls commit(), and readers would know not to re-open
     until the optimize was complete.

   * More general transactions:

     It has come up a fair number of times how to make Lucene
     transactional, either by itself ("do the following complex series
     of index operations but if there is any failure, rollback to the
     start, and don't expose result to searcher until all operations
     are done") or as part of a larger transaction eg involving a
     relational database.

     EG, if you want to add a big set of documents to Lucene, but not
     make them searchable until they are all added, or until a specific
     time (eg Monday @ 9 AM), you can't do that easily today but it
     would be simple with explicit commits.

     I believe this change would make transactions work correctly with
     Lucene.

   * LUCENE-710 ("implement point in time searching without relying on
     filesystem semantics"), also known as "getting Lucene to work
     correctly over NFS".

     I think this issue is nearly solved when autoCommit=false, as long
     as we can adopt a shared policy on "when readers refresh" to match
     the new deletion policy (described above).  Basically, as long as
     the deleter and readers are playing by the same "refresh rules"
     and the writer gives the readers enough time to switch/warm, then
     the deleter should never delete something in use by a reader.
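The deletion policy sketched in the IndexFileDeleter bullet might look something like this (illustrative only; SnapshotRetention is an invented name, and the real policy would also handle the time-based "obsoleted less than X minutes ago" part):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SnapshotRetention {
    /**
     * Given the generations of all existing snapshots, return those eligible
     * for deletion, always keeping the most recent `keep` snapshots so that
     * slow readers (e.g. over NFS) can still open them.
     */
    public static List<Long> deletable(List<Long> snapshots, int keep) {
        List<Long> sorted = new ArrayList<>(snapshots);
        Collections.sort(sorted);
        int cut = Math.max(0, sorted.size() - keep);
        return sorted.subList(0, cut);  // everything older than the last `keep`
    }
}
```

As long as readers only ever refresh to one of the retained snapshots, the deleter and readers are playing by the same rules.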



There are also some neat future things made possible:

   * The "support deleteDocuments in IndexWriter" (LUCENE-565) feature
     could have a more efficient implementation (just like Solr) when
     autoCommit is false, because deletes don't need to be flushed
     until commit() is called.  Whereas, now, they must be aggressively
     flushed on each checkpoint.

   * More generally, because "checkpoints" do not need to be usable by
     a reader/searcher, other neat optimizations might be possible.

     EG maybe the merge policy could be improved if it knows that
     certain segments are "just checkpoints" and are not involved in
     searching.

   * I could simplify the approach for my recent addIndexes changes
     (LUCENE-702) to use this, instead of its current approach (wish I
     had thought of this sooner: ugh!).

   * A single index could hold many snapshots, and we could enable a
     reader to explicitly open against an older snapshot.  EG maybe you
     take weekly and monthly snapshots because you sometimes want to
     go back and "run a search on last week's catalog".

Feedback?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: adding "explicit commits" to Lucene?

Posted by robert engels <re...@ix.netcom.com>.
That is true, but you need to use the same techniques as any db. You
need to write a tx log file. This has the semantics that you know if
it has committed, just like a db. You check that it has committed
before writing anything to the actual index. Since Lucene does not
modify any segments, it is trivial to restart if this portion fails:
just delete the uncommitted segments on startup, and replay the tx log.
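The tx-log idea described above can be sketched minimally (illustrative Java; all names are invented, and a real implementation would append to an fsync'd file rather than an in-memory list):

```java
import java.util.ArrayList;
import java.util.List;

public class TxLog {
    // Stands in for a durable, fsync'd log file.
    private final List<String> lines = new ArrayList<>();

    public void append(String op) { lines.add(op); }      // e.g. "delete:42", "add:42"
    public void commit() { lines.add("COMMIT"); }         // marker makes the batch durable

    /** Replay on restart: return only operations belonging to committed batches. */
    public List<String> replay() {
        List<String> committed = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        for (String l : lines) {
            if (l.equals("COMMIT")) { committed.addAll(pending); pending.clear(); }
            else pending.add(l);
        }
        return committed;  // any uncommitted tail in `pending` is discarded
    }
}
```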

As for the ParallelReader, that doesn't make sense to me (though I
admit I don't understand the purpose), since the javadoc states
that all sub-indexes must be updated in the same manner. Where
does the benefit come from then? It seems you are actually performing
more operations (with 2 sub-indexes you are writing twice as many
documents - same amount of field data though). Is there some other
information besides the javadoc that explains the usage/benefit?

Using a federated search where different fields are in different
indexes would be very difficult as you state, and involve long join
lists (and the scoring logic is VERY difficult unless you create a
new "memory index" containing all the results, and then run the
complete query against this).

Putting the documents in different indexes and joining/weighing the  
results is rather easy and works quite well.


On Jan 16, 2007, at 12:38 AM, Chuck Williams wrote:

> robert engels wrote on 01/15/2007 08:11 PM:
>> If that is all you need, I think it is far simpler:
>>
>> If you have an OID, then all that is required is to write the
>> operations (delete this OID, insert this document, etc...) to a
>> separate disk file.
>>
>> Once the file is permanently on disk, it is simple to just keep
>> playing the file back until it succeeds.
> There is no guarantee a given operation will ever succeed so this
> doesn't work.
>>
>> This is what we do in our search server.
>>
>> I am not completely familiar with parallel reader, but in reading the
>> JavaDoc I don't see the benefit - since you have to write the
>> documents to both indexes anyway??? Why is it of any benefit to break
>> the document into multiple parts?
> I'm sure Doug had reasons to write it.  My reason to use it is for fast
> bulk updates, updating one subindex without having to update the others.
>>
>> If you have OIDs available, parallel reader can be accomplished in a
>> far simpler and more efficient manner - we have a completely federated
>> server implementation that was trivial - less than 100 lines of code.
>> We did it simply: create a hash from the OID, store the document
>> into a different index depending on the hash, then run the query
>> across all indexes in parallel, joining the results.
> Lucene has this built in via MultiSearcher and RemoteSearchable.  It is
> a bit more complex due to the necessity to normalize Weights, e.g. to
> ensure the same docFreq's which reflect the union of all indexes are
> used for the search in each.
>
> Federated searching addresses different requirements than
> ParallelReader.  Yes, I agree that ParallelReader could be done using
> UID's, but believe it would be a considerably more expensive
> representation to search.  The method used in federated search to
> distribute the same query to each index is not applicable.   
> Breaking the
> query up into parts that are applied against each parallel index, with
> each query part referencing only the fields in a single parallel  
> index,
> would be a challenge with complex nested queries supporting all of the
> operators, and much less efficient than ParallelReader.  Modifying all
> the primitive Query subclasses to use UID's instead of doc-ids's would
> be an alternative, but would be a lot of work and not nearly as
> efficient as the existing Lucene index representation that sorts
> postings by doc-id.
>
> To illustrate this, consider the simple query, f:a AND g:b, where f and
> g are in two different parallel indexes.  Performing the f and g
> queries separately on the different indexes to get possibly very long
> lists of results and then joining those by UID will be much slower than
> BooleanQuery operating on ParallelReader with doc-id sorted postings.
> The alternative of a UID-based BooleanQuery would have similar
> challenges unless the postings were sorted by UID.  But hey, that's
> permanent doc-ids.
>
> Chuck
>
>>
>> On Jan 15, 2007, at 11:49 PM, Chuck Williams wrote:
>>
>>> My interest is transactions, not making doc-id's permanent.
>>> Specifically, the ability to ensure that a group of adds either all go
>>> into the index or none go into the index, and to ensure that if none go
>>> into the index that the index is not changed in any way.
>>>
>>> I have UID's but they cannot ensure the latter property, i.e. they
>>> cannot ensure side-effect-free rollbacks.
>>>
>>> Yes, if you have no reliance on internal Lucene structures like doc-id's
>>> and segments, then that shouldn't matter.  But many capabilities have
>>> such reliance for good reasons.  E.g., ParallelReader, which is a public
>>> supported class in Lucene, requires doc-id synchronization.  There are
>>> similar good reasons for an application to take advantage of doc-ids.
>>>
>>> Lucene uses doc-id's in many of its API's and so it is not surprising
>>> that many applications rely on them, and I'm sure misuse them not fully
>>> understanding the semantics and uncertainties of doc-id changes due to
>>> merging segments with deletes.
>>>
>>> Applications can use doc-ids for legitimate and beneficial purposes
>>> while remaining semantically valid.  Making such capabilities efficient
>>> and robust in all cases is facilitated by application control over when
>>> doc-id's and segment structure change at a granularity larger than the
>>> single Document.
>>>
>>> If I had a vote it would be +1 on the direction Michael has proposed,
>>> assuming it can be done robustly and without performance penalty.
>>>
>>> Chuck
>>>
>>>
>>> robert engels wrote on 01/15/2007 07:34 PM:
>>>> I honestly think that having a unique OID as an indexed field and
>>>> putting a layer on top of Lucene is the best solution to all of this.
>>>> It makes it almost trivial, and you can implement transaction handling
>>>> in a variety of ways.
>>>>
>>>> Attempting to make the doc ids "permanent" is a tough challenge,
>>>> considering the original design called for them to be "non permanent".
>>>>
>>>> It seems doubtful that you cannot have some sort of primary key any
>>>> way and be this concerned about the transactional nature of Lucene.
>>>>
>>>> I vote -1 on all of this. I think it will detract from the simple and
>>>> efficient storage mechanism that Lucene uses.
>>>>
>>>> On Jan 15, 2007, at 11:19 PM, Chuck Williams wrote:
>>>>
>>>>> Ning Li wrote on 01/15/2007 06:29 PM:
>>>>>> On 1/14/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>>>>>>>   * The "support deleteDocuments in IndexWriter" (LUCENE-565) feature
>>>>>>>     could have a more efficient implementation (just like Solr) when
>>>>>>>     autoCommit is false, because deletes don't need to be flushed
>>>>>>>     until commit() is called.  Whereas, now, they must be
>>>>>>>     aggressively flushed on each checkpoint.
>>>>>>
>>>>>> If a reader can only open snapshots both for search and for
>>>>>> modification, I think another change is needed besides the ones
>>>>>> listed: assume the latest snapshot is segments_5 and the latest
>>>>>> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
>>>>>> snapshot segments_5, performs a few deletes and writes a new
>>>>>> checkpoint segmentsx_8. The summary file segmentsx_8 should include
>>>>>> the 2 new segments which are in segmentsx_7 but not in segments_5.
>>>>>> Such segments to include are easily identifiable only if they are not
>>>>>> merged with segments in the latest snapshot... All these won't be
>>>>>> necessary if a reader always opens the latest checkpoint for
>>>>>> modification, which will also support deletion of non-committed
>>>>>> documents.
>>>>> This problem seems worse.  I don't see how a reader and a writer can
>>>>> independently compute and write checkpoints.  The adds in the writer
>>>>> don't just create new segments, they replace existing ones through
>>>>> merging.  And the merging changes doc-ids by expunging deletes.  It
>>>>> seems that all deletes must be based on the most recent checkpoint, or
>>>>> merging of checkpoints to create the next snapshot will be considerably
>>>>> more complex.
>>>>>
>>>>> Chuck
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>




Re: adding "explicit commits" to Lucene?

Posted by Chuck Williams <ch...@manawiz.com>.
robert engels wrote on 01/15/2007 08:11 PM:
> If that is all you need, I think it is far simpler:
>
> If you have an OID, then all that is required is to write the
> operations (delete this OID, insert this document, etc...) to a
> separate disk file.
>
> Once the file is permanently on disk, it is simple to just keep
> playing the file back until it succeeds.
There is no guarantee a given operation will ever succeed so this
doesn't work.
>
> This is what we do in our search server.
>
> I am not completely familiar with parallel reader, but in reading the
> JavaDoc I don't see the benefit - since you have to write the
> documents to both indexes anyway??? Why is it of any benefit to break
> the document into multiple parts?
I'm sure Doug had reasons to write it.  My reason to use it is for fast
bulk updates, updating one subindex without having to update the others.
>
> If you have OIDs available, parallel reader can be accomplished in a
> far simpler and more efficient manner - we have a completely federated
> server implementation that was trivial - less than 100 lines of code.
> We did it simply: create a hash from the OID, store the document
> into a different index depending on the hash, then run the query
> across all indexes in parallel, joining the results.
Lucene has this built in via MultiSearcher and RemoteSearchable.  It is
a bit more complex due to the necessity to normalize Weights, e.g. to
ensure the same docFreq's which reflect the union of all indexes are
used for the search in each.

Federated searching addresses different requirements than
ParallelReader.  Yes, I agree that ParallelReader could be done using
UID's, but believe it would be a considerably more expensive
representation to search.  The method used in federated search to
distribute the same query to each index is not applicable.  Breaking the
query up into parts that are applied against each parallel index, with
each query part referencing only the fields in a single parallel index,
would be a challenge with complex nested queries supporting all of the
operators, and much less efficient than ParallelReader.  Modifying all
the primitive Query subclasses to use UID's instead of doc-ids would
be an alternative, but would be a lot of work and not nearly as
efficient as the existing Lucene index representation that sorts
postings by doc-id.

To illustrate this, consider the simple query, f:a AND g:b, where f and
g are in two different parallel indexes.  Performing the f and g
queries separately on the different indexes to get possibly very long
lists of results and then joining those by UID will be much slower than
BooleanQuery operating on ParallelReader with doc-id sorted postings. 
The alternative of a UID-based BooleanQuery would have similar
challenges unless the postings were sorted by UID.  But hey, that's
permanent doc-ids.
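The point about doc-id-sorted postings can be illustrated with the classic linear merge-intersection (sketch only, not Lucene's actual BooleanQuery/scorer code):

```java
public class PostingIntersect {
    /** Intersect two posting lists sorted by doc-id in a single linear pass. */
    public static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out[k++] = a[i]; i++; j++; }  // doc matches both terms
            else if (a[i] < b[j]) i++;   // advance the list that is behind
            else j++;
        }
        return java.util.Arrays.copyOf(out, k);
    }
}
```

With UID-joined results from separate indexes, each side must first be fully materialized and sorted (or hashed) by UID before any join, which is the extra cost Chuck describes.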

Chuck

>
> On Jan 15, 2007, at 11:49 PM, Chuck Williams wrote:
>
>> My interest is transactions, not making doc-id's permanent.
>> Specifically, the ability to ensure that a group of adds either all go
>> into the index or none go into the index, and to ensure that if none go
>> into the index that the index is not changed in any way.
>>
>> I have UID's but they cannot ensure the latter property, i.e. they
>> cannot ensure side-effect-free rollbacks.
>>
>> Yes, if you have no reliance on internal Lucene structures like doc-id's
>> and segments, then that shouldn't matter.  But many capabilities have
>> such reliance for good reasons.  E.g., ParallelReader, which is a public
>> supported class in Lucene, requires doc-id synchronization.  There are
>> similar good reasons for an application to take advantage of doc-ids.
>>
>> Lucene uses doc-id's in many of its API's and so it is not surprising
>> that many applications rely on them, and I'm sure misuse them not fully
>> understanding the semantics and uncertainties of doc-id changes due to
>> merging segments with deletes.
>>
>> Applications can use doc-ids for legitimate and beneficial purposes
>> while remaining semantically valid.  Making such capabilities efficient
>> and robust in all cases is facilitated by application control over when
>> doc-id's and segment structure change at a granularity larger than the
>> single Document.
>>
>> If I had a vote it would be +1 on the direction Michael has proposed,
>> assuming it can be done robustly and without performance penalty.
>>
>> Chuck
>>
>>
>> robert engels wrote on 01/15/2007 07:34 PM:
>>> I honestly think that having a unique OID as an indexed field and
>>> putting a layer on top of Lucene is the best solution to all of this.
>>> It makes it almost trivial, and you can implement transaction handling
>>> in a variety of ways.
>>>
>>> Attempting to make the doc ids "permanent" is a tough challenge,
>>> considering the original design called for them to be "non permanent".
>>>
>>> It seems doubtful that you cannot have some sort of primary key any
>>> way and be this concerned about the transactional nature of Lucene.
>>>
>>> I vote -1 on all of this. I think it will detract from the simple and
>>> efficient storage mechanism that Lucene uses.
>>>
>>> On Jan 15, 2007, at 11:19 PM, Chuck Williams wrote:
>>>
>>>> Ning Li wrote on 01/15/2007 06:29 PM:
>>>>> On 1/14/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>>>>>>   * The "support deleteDocuments in IndexWriter" (LUCENE-565)
>>>>>> feature
>>>>>>     could have a more efficient implementation (just like Solr) when
>>>>>>     autoCommit is false, because deletes don't need to be flushed
>>>>>>     until commit() is called.  Whereas, now, they must be
>>>>>> aggressively
>>>>>>     flushed on each checkpoint.
>>>>>
>>>>> If a reader can only open snapshots both for search and for
>>>>> modification, I think another change is needed besides the ones
>>>>> listed: assume the latest snapshot is segments_5 and the latest
>>>>> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
>>>>> snapshot segments_5, performs a few deletes and writes a new
>>>>> checkpoint segmentsx_8. The summary file segmentsx_8 should include
>>>>> the 2 new segments which are in segmentsx_7 but not in segments_5.
>>>>> Such segments to include are easily identifiable only if they are not
>>>>> merged with segments in the latest snapshot... All these won't be
>>>>> necessary if a reader always opens the latest checkpoint for
>>>>> modification, which will also support deletion of non-committed
>>>>> documents.
>>>> This problem seems worse.  I don't see how a reader and a writer can
>>>> independently compute and write checkpoints.  The adds in the writer
>>>> don't just create new segments, they replace existing ones through
>>>> merging.  And the merging changes doc-ids by expunging deletes.  It
>>>> seems that all deletes must be based on the most recent checkpoint, or
>>>> merging of checkpoints to create the next snapshot will be
>>>> considerably
>>>> more complex.
>>>>
>>>> Chuck
>>>>
>>>>
>>>
>>>
>>
>>
>
>




Re: adding "explicit commits" to Lucene?

Posted by robert engels <re...@ix.netcom.com>.
If that is all you need, I think it is far simpler:

If you have an OID, then all that is required is to write the
operations (delete this OID, insert this document, etc...) to a
separate disk file.

Once the file is permanently on disk, it is simple to just keep
playing the file back until it succeeds.

This is what we do in our search server.

I am not completely familiar with parallel reader, but in reading the  
JavaDoc I don't see the benefit - since you have to write the  
documents to both indexes anyway??? Why is it of any benefit to break  
the document into multiple parts?

If you have OIDs available, parallel reader can be accomplished in a
far simpler and more efficient manner - we have a completely federated
server implementation that was trivial - less than 100 lines of code.
We did it simply: create a hash from the OID, store the document
into a different index depending on the hash, then run the query
across all indexes in parallel, joining the results.
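The OID-hash routing described here reduces to something like the following (illustrative sketch; the class and method names are invented, and floorMod guards against negative hashCode values):

```java
public class OidRouter {
    /** Route a document to one of numIndexes sub-indexes by hashing its OID. */
    public static int indexFor(String oid, int numIndexes) {
        // floorMod keeps the result in [0, numIndexes) even for negative hashCodes
        return Math.floorMod(oid.hashCode(), numIndexes);
    }
}
```

A query then fans out to all sub-indexes in parallel and the per-index hit lists are joined.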

On Jan 15, 2007, at 11:49 PM, Chuck Williams wrote:

> My interest is transactions, not making doc-id's permanent.
> Specifically, the ability to ensure that a group of adds either all go
> into the index or none go into the index, and to ensure that if none go
> into the index that the index is not changed in any way.
>
> I have UID's but they cannot ensure the latter property, i.e. they
> cannot ensure side-effect-free rollbacks.
>
> Yes, if you have no reliance on internal Lucene structures like doc-id's
> and segments, then that shouldn't matter.  But many capabilities have
> such reliance for good reasons.  E.g., ParallelReader, which is a public
> supported class in Lucene, requires doc-id synchronization.  There are
> similar good reasons for an application to take advantage of doc-ids.
>
> Lucene uses doc-id's in many of its API's and so it is not surprising
> that many applications rely on them, and I'm sure misuse them not fully
> understanding the semantics and uncertainties of doc-id changes due to
> merging segments with deletes.
>
> Applications can use doc-ids for legitimate and beneficial purposes
> while remaining semantically valid.  Making such capabilities efficient
> and robust in all cases is facilitated by application control over when
> doc-id's and segment structure change at a granularity larger than the
> single Document.
>
> If I had a vote it would be +1 on the direction Michael has proposed,
> assuming it can be done robustly and without performance penalty.
>
> Chuck
>
>
> robert engels wrote on 01/15/2007 07:34 PM:
>> I honestly think that having a unique OID as an indexed field and
>> putting a layer on top of Lucene is the best solution to all of this.
>> It makes it almost trivial, and you can implement transaction handling
>> in a variety of ways.
>>
>> Attempting to make the doc ids "permanent" is a tough challenge,
>> considering the original design called for them to be "non permanent".
>>
>> It seems doubtful that you would not have some sort of primary key
>> anyway and yet be this concerned about the transactional nature of Lucene.
>>
>> I vote -1 on all of this. I think it will detract from the simple and
>> efficient storage mechanism that Lucene uses.
>>
>> On Jan 15, 2007, at 11:19 PM, Chuck Williams wrote:
>>
>>> Ning Li wrote on 01/15/2007 06:29 PM:
>>>> On 1/14/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>>>>>   * The "support deleteDocuments in IndexWriter" (LUCENE-565) feature
>>>>>     could have a more efficient implementation (just like Solr) when
>>>>>     autoCommit is false, because deletes don't need to be flushed
>>>>>     until commit() is called.  Whereas, now, they must be
>>>>>     aggressively flushed on each checkpoint.
>>>>
>>>> If a reader can only open snapshots both for search and for
>>>> modification, I think another change is needed besides the ones
>>>> listed: assume the latest snapshot is segments_5 and the latest
>>>> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
>>>> snapshot segments_5, performs a few deletes and writes a new
>>>> checkpoint segmentsx_8. The summary file segmentsx_8 should include
>>>> the 2 new segments which are in segmentsx_7 but not in segments_5.
>>>> Such segments to include are easily identifiable only if they are not
>>>> merged with segments in the latest snapshot... All these won't be
>>>> necessary if a reader always opens the latest checkpoint for
>>>> modification, which will also support deletion of non-committed
>>>> documents.
>>> This problem seems worse.  I don't see how a reader and a writer can
>>> independently compute and write checkpoints.  The adds in the writer
>>> don't just create new segments, they replace existing ones through
>>> merging.  And the merging changes doc-ids by expunging deletes.  It
>>> seems that all deletes must be based on the most recent checkpoint, or
>>> merging of checkpoints to create the next snapshot will be considerably
>>> more complex.
>>>
>>> Chuck
>>>
>>>
>>
>>
>
>




Re: adding "explicit commits" to Lucene?

Posted by robert engels <re...@ix.netcom.com>.
Yes it is!

This is what I am getting at when I said the design is moving all  
over the place.

The thread is "explicit commits", so I apologize for getting lost.

It just seems that we should design a new high-level class,
Repository, and design that API. It might use Lucene IndexReader and
IndexWriter internally, but those should be made into package-level
interfaces.

That way all of the implementation of all these "needs" can be  
properly hidden, and we can have a coherent API that really  
represents the available functionality.

It may be that certain users don't need point in time searching,  
durable commits, or some of the other features. By hiding this behind  
another layer I think it will be far easier to implement and maintain.




On Jan 16, 2007, at 3:29 PM, Yonik Seeley wrote:

> On 1/16/07, robert engels <re...@ix.netcom.com> wrote:
>> You have the same problem if there is an existing reader open, so
>> what is the difference? You can't remove the segments there either.
>
> The disk space for the segments is currently removed if no one has
> them open... this is quite a bit different than guaranteeing that a
> reader in the future will be able to open an index in the past.
>
> -Yonik
>
>> On Jan 16, 2007, at 3:18 PM, Yonik Seeley wrote:
>>
>> > On 1/16/07, Doug Cutting <cu...@apache.org> wrote:
>> >> Remind me, why do we have to update the segments file except at
>> >> close?
>> >> I'm sure there's a good reason, and that's central to this
>> >> discussion.
>> >
>> > If segments are removed because of a merge, a new reader coming  
>> along
>> > will have problems opening the index if the segments file isn't
>> > updated to reflect that.
>> >
>> > One could keep around all old segments until a close() but that  
>> would
>> > cost disk space.
>> >
>> > -Yonik
>




Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:
> On 1/16/07, robert engels <re...@ix.netcom.com> wrote:
>> You have the same problem if there is an existing reader open, so
>> what is the difference? You can't remove the segments there either.
> 
> The disk space for the segments is currently removed if no one has
> them open... this is quite a bit different than guaranteeing that a
> reader in the future will be able to open an index in the past.

Also remember the original issue here is that NFS (and perhaps other
filesystems where people would naturally expect Lucene to "just work")
doesn't provide this "protection" of not deleting files that are open
for reading.

This was the starting point that led to the idea of "explicit
commits".

If writer and reader can agree that certain segments_N are "valuable"
(commits) and others are just the automatic checkpoints that Lucene
needs to do (flushing, merging) then by agreeing on how deletes will
happen to the "valuable" segments_N files, readers can refresh
accordingly and never have their segments deleted out from under
them.
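In code, the refresh rule described above might look like the following minimal sketch. This is not the actual Lucene API; the file names are illustrative, with "segments_N" standing for commits and "segmentsx_N" for automatic checkpoints (borrowing Ning's notation from earlier in the thread). A refreshing reader ignores the checkpoints and opens the highest-numbered commit:

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: pick the segments_N a refreshing reader should
// open, given a directory listing that mixes commits ("segments_N")
// and automatic checkpoints ("segmentsx_N").
public class CommitPicker {
    // Parse the generation number N from a "segments*_N" file name.
    static long gen(String name) {
        return Long.parseLong(name.substring(name.indexOf('_') + 1));
    }

    // Return the newest commit point, ignoring checkpoints.
    public static String latestCommit(String[] files) {
        return Arrays.stream(files)
            .filter(f -> f.startsWith("segments_"))   // commits only
            .max(Comparator.comparingLong(CommitPicker::gen))
            .orElse(null);
    }

    public static void main(String[] args) {
        String[] files = {"segments_5", "segmentsx_7", "segmentsx_8", "segments_6"};
        System.out.println(latestCommit(files)); // prints segments_6
    }
}
```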

Mike




Re: adding "explicit commits" to Lucene?

Posted by Chuck Williams <ch...@manawiz.com>.
Yonik Seeley wrote on 01/16/2007 11:29 AM:
> On 1/16/07, robert engels <re...@ix.netcom.com> wrote:
>> You have the same problem if there is an existing reader open, so
>> what is the difference? You can't remove the segments there either.
>
> The disk space for the segments is currently removed if no one has
> them open... this is quite a bit different than guaranteeing that a
> reader in the future will be able to open an index in the past.

To me the key benefit of explicit commits is that ongoing adds and their
associated merges update only the segments of the current snapshot.  The
current snapshot can be aborted, falling back to the last checkpoint
without having made any changes to its segments at all.  Once a commit
is done the committed snapshot becomes the new checkpoint.

Lucene does not have this desirable property now even for adding a
single document, since that document may cause a merge with consequences
arbitrarily deep into the index.

For the single-transaction use case it is only necessary that the
segments in the current checkpoint and those in the current snapshot are
maintained.  Revising the current snapshot can delete segments in the
prior snapshot, and committing can delete segments in the prior checkpoint.

Of course support for multiple parallel transactions would be even
better, but is also a huge can of worms as anyone who has spent time
chasing database deadlocks and understanding all the different types of
locks that modern databases use can attest.

The single-transaction case seems straightforward to implement per
Michael's suggestion and enables valuable use cases as the thread has
enumerated.

Chuck




Re: adding "explicit commits" to Lucene?

Posted by Yonik Seeley <yo...@apache.org>.
On 1/16/07, robert engels <re...@ix.netcom.com> wrote:
> You have the same problem if there is an existing reader open, so
> what is the difference? You can't remove the segments there either.

The disk space for the segments is currently removed if no one has
them open... this is quite a bit different than guaranteeing that a
reader in the future will be able to open an index in the past.

-Yonik

> On Jan 16, 2007, at 3:18 PM, Yonik Seeley wrote:
>
> > On 1/16/07, Doug Cutting <cu...@apache.org> wrote:
> >> Remind me, why do we have to update the segments file except at
> >> close?
> >> I'm sure there's a good reason, and that's central to this
> >> discussion.
> >
> > If segments are removed because of a merge, a new reader coming along
> > will have problems opening the index if the segments file isn't
> > updated to reflect that.
> >
> > One could keep around all old segments until a close() but that would
> > cost disk space.
> >
> > -Yonik



Re: adding "explicit commits" to Lucene?

Posted by robert engels <re...@ix.netcom.com>.
You have the same problem if there is an existing reader open, so  
what is the difference? You can't remove the segments there either.

On Jan 16, 2007, at 3:18 PM, Yonik Seeley wrote:

> On 1/16/07, Doug Cutting <cu...@apache.org> wrote:
>> Remind me, why do we have to update the segments file except at  
>> close?
>> I'm sure there's a good reason, and that's central to this  
>> discussion.
>
> If segments are removed because of a merge, a new reader coming along
> will have problems opening the index if the segments file isn't
> updated to reflect that.
>
> One could keep around all old segments until a close() but that would
> cost disk space.
>
> -Yonik
>




Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Doug Cutting wrote:
> Yonik Seeley wrote:
>> One could keep around all old segments until a close() but that would
>> cost disk space.
> 
> One could optimize that so that intermediate segments, created since 
> open, would be deleted.  So, for example, batch indexing starting with 
> an empty index could freely delete segments as they're obsoleted, since 
> no one else should yet reference them.

This is in fact what I did to make addIndexes(*) calls transactional
(so that on disk full or other error, the index reverted to the
starting point) for LUCENE-702.  Intermediate segments are freely
deleted but ones that existed at the start are not.  And, no
segments_N file is written until the end.

But this wouldn't solve the "bulk delete then add" case because the
reader needs to close and sync its state to disk in such a way that
other readers using isCurrent() would know not to refresh.

Another use case is optimize() while readers are using the index.  If
you have readers refreshing during optimize (based on isCurrent()) you
can accidentally tie up tons of disk space, temporarily.  Yet you also
want the writer to "checkpoint" periodically to disk so that if it
crashes you could resume from the last checkpoint rather than roll
back way to the start of your optimize.

Mike



Re: adding "explicit commits" to Lucene?

Posted by Doug Cutting <cu...@apache.org>.
Yonik Seeley wrote:
> One could keep around all old segments until a close() but that would
> cost disk space.

One could optimize that so that intermediate segments, created since 
open, would be deleted.  So, for example, batch indexing starting with 
an empty index could freely delete segments as they're obsoleted, since 
no one else should yet reference them.

Doug



Re: adding "explicit commits" to Lucene?

Posted by Doug Cutting <cu...@apache.org>.
Yonik Seeley wrote:
> If segments are removed because of a merge, a new reader coming along
> will have problems opening the index if the segments file isn't
> updated to reflect that.
> 
> One could keep around all old segments until a close() but that would
> cost disk space.

Won't "explicit commits" have the same disk costs?

Doug



Re: adding "explicit commits" to Lucene?

Posted by Yonik Seeley <yo...@apache.org>.
On 1/16/07, Doug Cutting <cu...@apache.org> wrote:
> Remind me, why do we have to update the segments file except at close?
> I'm sure there's a good reason, and that's central to this discussion.

If segments are removed because of a merge, a new reader coming along
will have problems opening the index if the segments file isn't
updated to reflect that.

One could keep around all old segments until a close() but that would
cost disk space.

-Yonik



Re: adding "explicit commits" to Lucene?

Posted by Andi Vajda <va...@osafoundation.org>.
On Tue, 16 Jan 2007, Doug Cutting wrote:

> Michael McCandless wrote:
>> We could indeed simply tie "close" to mean "commit now", and not add a
>> separate "commit" method.
>> 
>> But what about the "bulk delete then bulk add" case?  Ideally if a
>> reader refreshes by checking "isCurrent()" it shouldn't ever open the
>> index "at a bad time".  Ie, we need a way to open a reader, delete a
>> bunch of docs, close it *without* committing, open a writer, add a
>> bunch of docs, and then do the commit, all so that any readers that
>> are refreshing would know not to open the segments_N that was
>> committed with all the deletes but none of the adds.  This is one use
>> case that explicit commits would address.
>
> One could also implement this with a Directory that permits checkpointing and 
> rollback.  Would that be any simpler?

The Berkeley DB Directory implementation supports ACID transactions, rollback, 
checkpointing, and all this good db stuff. Having an API for explicitly 
committing a Lucene index would make it easier to use Berkeley DB transactions 
with DbDirectory. Currently, because there is no notion of a transaction 
inside the Lucene core, the entire transaction logic used along with 
DbDirectory has to be external and it wraps operations that are too coarse 
or too expensive such as opening and closing readers and writers.

Having an API that explicitly tells the underlying Directory to commit or 
rollback whatever it was doing would be excellent. On the other hand, 
implementing support for this at a higher level than the Directory framework 
would essentially duplicate what a database-based Directory implementation 
already has access to through the underlying database. As with the Lock class, 
it would be great to have it easy to implement a no-op version of this Lucene 
transaction support that defers to the underlying database for the actual
transaction work.

Numerous threads in the past here have ended with 'a Lucene index is not a 
database'....

Andi..



Re: adding "explicit commits" to Lucene?

Posted by Chuck Williams <ch...@manawiz.com>.
Michael McCandless wrote on 01/16/2007 12:09 PM:
> Doug Cutting wrote:
>> Michael McCandless wrote:
>>> We could indeed simply tie "close" to mean "commit now", and not add a
>>> separate "commit" method.
>>>
>>> But what about the "bulk delete then bulk add" case?  Ideally if a
>>> reader refreshes by checking "isCurrent()" it shouldn't ever open the
>>> index "at a bad time".  Ie, we need a way to open a reader, delete a
>>> bunch of docs, close it *without* committing, open a writer, add a
>>> bunch of docs, and then do the commit, all so that any readers that
>>> are refreshing would know not to open the segments_N that was
>>> committed with all the deletes but none of the adds.  This is one use
>>> case that explicit commits would address.

I've found batched deleteAdd update to be a bit more complex than this in
two respects.  First, the index is vulnerable after the deletes and
before the added revisions as an error could cause loss of information. 
My current application must journal everything deleted to account for
this.  The proposed commits would alleviate that need since a failed
deleteAdd batch could be aborted.  Second, updates may need to hold the
revisions to the documents in memory for performance, currency of
simultaneous access, or other reasons.  Memory limits may restrict how
many of these revised documents can be held.  This leads to a
limited-memory-driven requirement to break the deleteAdd batch into
multiple subbatches.  So, it should be possible to implement a set of
deleteAdd batches as a single transaction, not just one batch.  The
original proposal meets this requirement.

>>
>> One could also implement this with a Directory that permits
>> checkpointing and rollback.  Would that be any simpler?
>
> True (I think?).  Maybe we could push the "transactional" behavior
> down lower (into Directory) in Lucene.  Though all implementations of
> Directory would need to implement their own transactional behavior vs
> one implementation at the reader/writer layer?
>
> As long as checkpointing is decoupled from the opening/closing of
> readers and writers then I think this would support this use case.
>
> So basically, the Directory layer would "mimic" the inode model (and
> hard links) that unix filesystems provide?  Or maybe the Directory
> would not make any changes visible to a reader until a writer did a
> "checkpoint"?  But then how would this work across machines (on a
> shared filesystem)?  I'm not sure I see how we could effectively (or
> more simply) push this down into Directory instead of at the
> reader/writer layer.

This seems more complex and less flexible for no benefit.  It's
analogous to a database pushing its transaction model into its file
storage component.  Transactions are a first class concept with
semantics at the index level.  The original proposal at the index level
seems to me to be easy to implement, easy to understand and easy to use.

Chuck




Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Doug Cutting wrote:
> Michael McCandless wrote:
>> We could indeed simply tie "close" to mean "commit now", and not add a
>> separate "commit" method.
>>
>> But what about the "bulk delete then bulk add" case?  Ideally if a
>> reader refreshes by checking "isCurrent()" it shouldn't ever open the
>> index "at a bad time".  Ie, we need a way to open a reader, delete a
>> bunch of docs, close it *without* committing, open a writer, add a
>> bunch of docs, and then do the commit, all so that any readers that
>> are refreshing would know not to open the segments_N that was
>> committed with all the deletes but none of the adds.  This is one use
>> case that explicit commits would address.
> 
> One could also implement this with a Directory that permits 
> checkpointing and rollback.  Would that be any simpler?

True (I think?).  Maybe we could push the "transactional" behavior
down lower (into Directory) in Lucene.  Though all implementations of
Directory would need to implement their own transactional behavior vs
one implementation at the reader/writer layer?

As long as checkpointing is decoupled from the opening/closing of
readers and writers then I think this would support this use case.

So basically, the Directory layer would "mimic" the inode model (and
hard links) that unix filesystems provide?  Or maybe the Directory
would not make any changes visible to a reader until a writer did a
"checkpoint"?  But then how would this work across machines (on a
shared filesystem)?  I'm not sure I see how we could effectively (or
more simply) push this down into Directory instead of at the
reader/writer layer.

Mike



Re: adding "explicit commits" to Lucene?

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jan 16, 2007, at 1:51 PM, Doug Cutting wrote:

> One could also implement this with a Directory that permits  
> checkpointing and rollback.  Would that be any simpler?

FWIW, explicit commits, including deletes from the IndexWriter class,  
come along for the ride with the KinoSearch merge model.  That, plus  
faster indexing and lower, more predictable memory consumption.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: adding "explicit commits" to Lucene?

Posted by Doug Cutting <cu...@apache.org>.
Michael McCandless wrote:
> We could indeed simply tie "close" to mean "commit now", and not add a
> separate "commit" method.
> 
> But what about the "bulk delete then bulk add" case?  Ideally if a
> reader refreshes by checking "isCurrent()" it shouldn't ever open the
> index "at a bad time".  Ie, we need a way to open a reader, delete a
> bunch of docs, close it *without* committing, open a writer, add a
> bunch of docs, and then do the commit, all so that any readers that
> are refreshing would know not to open the segments_N that was
> committed with all the deletes but none of the adds.  This is one use
> case that explicit commits would address.

One could also implement this with a Directory that permits 
checkpointing and rollback.  Would that be any simpler?

Doug



Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Doug Cutting wrote:
> robert engels wrote:
>> I really wish Doug would comment on all of these proposed changes...
> 
> I wish he would too!
> 
> Ideally the segments file would only be updated when one commits, by 
> closing the index, or perhaps by calling a new method.  So, if you 
> abort, all documents added since the last commit would not be indexed.
> 
> The advantage I see of adding a new method would be that one would save 
> the cost of creating a new writer, but creating writers is pretty cheap, 
> so I don't yet see an overwhelming mandate for a new method.
> 
> Remind me, why do we have to update the segments file except at close? 
> I'm sure there's a good reason, and that's central to this discussion.
> 
> It is sometimes nice to keep multiple versions of an index that share 
> much of their state.  Today this is possible using 'cp -lr' and rsync on 
> unix, but it might be better if the APIs directly supported this sort of 
> thing.  I think one can do this with a suitably clever Directory 
> implementation, so perhaps no API changes are required for this either.

We could indeed simply tie "close" to mean "commit now", and not add a
separate "commit" method.

But what about the "bulk delete then bulk add" case?  Ideally if a
reader refreshes by checking "isCurrent()" it shouldn't ever open the
index "at a bad time".  Ie, we need a way to open a reader, delete a
bunch of docs, close it *without* committing, open a writer, add a
bunch of docs, and then do the commit, all so that any readers that
are refreshing would know not to open the segments_N that was
committed with all the deletes but none of the adds.  This is one use
case that explicit commits would address.

Mike



Re: adding "explicit commits" to Lucene?

Posted by Doug Cutting <cu...@apache.org>.
robert engels wrote:
> I really wish Doug would comment on all of these proposed changes...

I wish he would too!

Ideally the segments file would only be updated when one commits, by 
closing the index, or perhaps by calling a new method.  So, if you 
abort, all documents added since the last commit would not be indexed.

The advantage I see of adding a new method would be that one would save 
the cost of creating a new writer, but creating writers is pretty cheap, 
so I don't yet see an overwhelming mandate for a new method.

Remind me, why do we have to update the segments file except at close? 
I'm sure there's a good reason, and that's central to this discussion.

It is sometimes nice to keep multiple versions of an index that share 
much of their state.  Today this is possible using 'cp -lr' and rsync on 
unix, but it might be better if the APIs directly supported this sort of 
thing.  I think one can do this with a suitably clever Directory 
implementation, so perhaps no API changes are required for this either.
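The 'cp -lr' trick above can be approximated in Java with hard links. A hypothetical sketch, not part of any Lucene API; it assumes a local filesystem with hard-link support, and uses a made-up segment file name purely for illustration:

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of a hard-link snapshot of an index directory: every file in
// src is hard-linked into dst, so both directories share the same
// on-disk segment data without copying, like 'cp -lr' on unix.
public class HardLinkSnapshot {
    public static void snapshot(Path src, Path dst) throws IOException {
        Files.createDirectories(dst);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    // createLink(newLink, existingFile): no data is copied
                    Files.createLink(dst.resolve(f.getFileName()), f);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("index");
        Files.write(src.resolve("_0.cfs"), new byte[]{1, 2, 3}); // fake segment file
        Path dst = src.resolveSibling(src.getFileName() + ".snapshot");
        snapshot(src, dst);
        System.out.println(Files.size(dst.resolve("_0.cfs"))); // prints 3
    }
}
```

A writer could keep merging in src while a reader searches dst; the linked files stay readable until the last link is removed.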

Doug



Re: adding "explicit commits" to Lucene?

Posted by robert engels <re...@ix.netcom.com>.
I really wish Doug would comment on all of these proposed changes...

It seems that after you account for all of the constraints (e.g.
IndexReader must be the current snapshot...) you are going to end up
right back where you started.

I propose that this work should be done in some sort of facade or
"server facade"; this mucking with the core Lucene classes is making
the API needlessly complex. It seems to me that we are making many of
the currently deterministic operations (except in the case of critical
disk failures, or locking failures) non-deterministic (i.e. maybe my
delete-document call will fail).

I think we should all step back and decide what it is we are exactly
trying to do. Maybe this has already been done and someone can point
me to the appropriate documentation?

1. improve search performance?
2. improve indexing performance?
3. improve durability of index changes?
4. persistent search results?
5. improve concurrency and determinism of searching while indexing?

It seems that a lot of the current proposed patches are attempting to  
solve one or more of these problems, but there does not seem to be a  
general coherent approach. There also does not appear to be any list  
of constraints governing what will be considered a valid approach.

It is almost "well I need this little feature for something I am  
doing, so I propose ..."

It may be that to solve all of these "properly" requires Lucene 3.0  
with a completely different API and infrastructure.

Just my thoughts.



On Jan 16, 2007, at 2:23 PM, Doron Cohen wrote:

> Michael McCandless <lu...@mikemccandless.com> wrote on 16/01/2007
> 12:13:47:
>
>> Ning Li wrote:
>>> On 1/16/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>>>> Good catch Ning!  And, I agree, when a reader plans to make
>>>> modifications to the index, I think the best solution is to require
>>>> that the reader has opened most recent "segments*_N" (be that a
>>>> snapshot or a checkpoint).  Really a reader is actually a  
>>>> "writer" in
>>>> this context.  This means we need a way to open a reader against  
>>>> the
>>>> most recent checkpoint as well (I will add that).
>>>>
>>>> This is very much consistent with how a reader now checks if it is
>>>> still current when someone first tries to change a del/norm: if  
>>>> it's
>>>> not still current (ie, another writer has written a new segments_N
>>>> file) then an IOException is raised with "IndexReader out of  
>>>> date and
>>>> no longer valid for delete, undelete, or setNorm operations".  I  
>>>> think
>>>> with explicit commits that same requirement & check would apply.
>>>
>>> This means a reader can open a checkpoint for search. But the  
>>> purpose
>>> of "explicit commits" is that only snapshots are opened for search,
>>> not checkpoints. Can we just trust applications won't open a
>>> checkpoint for search? Or should we explicitly guard against it?
>>
>> Ahh good point.
>>
>> I think I'll add "openForWriting(*)" static methods to IndexReader.
>> These will acquire the write lock, and will open the latest
>> segments*_N (commit or checkpoint).  This way you can't open a
>> checkpoint unless there are no others writers on the index.
>>
>> We could go further and have IndexSearcher not accept an IndexReader
>> opened against a checkpoint, but I'm included not to check for
>> (prevent) this, for starters.  I'd rather not preclude possibly
>> interesting future use cases too early.
>
> Is this blocking applications that first perform a search, in order to
> decide which docs to delete by docid?
>
> Two other options in
> http://article.gmane.org/gmane.comp.jakarta.lucene.devel/16581 ...?
>
>>
>> Mike
>>
>
>




Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Chuck Williams wrote:
> I don't see how to do commits without at least some new methods.
> There needs to be some way to roll back changes rather than committing
> them.  If the commit action is IndexWriter.close() (even if just an
> interface) the user still needs another method to roll back.

Right, I think we'd add an "IndexWriter.abort()" method?  Once you call
that your IndexWriter has freed its state, released write lock,
removed any files it had created, etc, but has not committed any
changes to the index.
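As a rough illustration of the proposed commit()/abort() semantics, here is an in-memory stand-in (not the actual IndexWriter API; all names are invented for the sketch): pending changes become searchable only on commit(), and abort() discards them, leaving the last committed state untouched.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of transactional index semantics: readers only ever see
// the committed state; abort() rolls back to the last commit.
public class TransactionalIndex {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();

    public void addDocument(String doc) { pending.add(doc); }

    public void commit() {            // publish pending changes
        committed.addAll(pending);
        pending.clear();
    }

    public void abort() {             // discard uncommitted changes
        pending.clear();
    }

    public List<String> search() {    // readers see only committed docs
        return new ArrayList<>(committed);
    }

    public static void main(String[] args) {
        TransactionalIndex idx = new TransactionalIndex();
        idx.addDocument("a");
        idx.commit();
        idx.addDocument("b");
        idx.abort();                  // "b" is rolled back
        System.out.println(idx.search()); // prints [a]
    }
}
```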

> There are reasons to close an IndexWriter other than committing changes,
> such as to flush all the ram segments to disk to free memory or save
> state.  We now have IndexWriter.flushRamSegments() for this case, but
> are there others?

One case I thought of is allowing optimize to "resume" from however far
it had gotten if the first one hit disk full.

I think the feeling here (from the feedback I've heard so far) is
there aren't enough such cases to warrant separating commit from close
and the small change to the index format?

> As was already pointed out to delete documents you have to find them,
> which may require a reader accessing the current snapshot rather than
> the current checkpoint.  There needs to be some way to specify this
> distinction.

I think we'd just add a package protected IndexReader.open that takes
SegmentInfos directly?  IndexWriter would then pass in its
(uncommitted) SegmentInfos.  Or, if IndexWriter opens SegmentReaders
directly (I think the patch for LUCENE-565 does currently) then this
would not be necessary.

Mike



Re: adding "explicit commits" to Lucene?

Posted by Chuck Williams <ch...@manawiz.com>.
I don't see how to do commits without at least some new methods.

There needs to be some way to roll back changes rather than committing
them.  If the commit action is IndexWriter.close() (even if just an
interface) the user still needs another method to roll back.

There are reasons to close an IndexWriter other than committing changes,
such as to flush all the ram segments to disk to free memory or save
state.  We now have IndexWriter.flushRamSegments() for this case, but
are there others?

As was already pointed out to delete documents you have to find them,
which may require a reader accessing the current snapshot rather than
the current checkpoint.  There needs to be some way to specify this
distinction.

Chuck



Yonik Seeley wrote on 01/17/2007 06:48 AM:
> On 1/17/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>> If this approach works well we could at some point deprecate the
>> delete* methods on IndexReader and make package protected versions
>> that IndexWriter calls.
>
> If we do API changes in the future, it would be nice to make the
> search side more efficient w.r.t. deleted documents... at least remove
> the synchronization for isDeleted for read-only readers, and perhaps
> even have a subclass that is a no-op for isDeleted for read-only
> readers.
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search
> server
>




Re: adding "explicit commits" to Lucene?

Posted by Yonik Seeley <yo...@apache.org>.
On 1/17/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> If this approach works well we could at some point deprecate the
> delete* methods on IndexReader and make package protected versions
> that IndexWriter calls.

If we do API changes in the future, it would be nice to make the
search side more efficient w.r.t. deleted documents... at least remove
the synchronization for isDeleted for read-only readers, and perhaps
even have a subclass that is a no-op for isDeleted for read-only
readers.
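
A rough sketch of what I mean (class names invented; the real
SegmentReader is of course much bigger): a read-only reader holds an
immutable deletions snapshot and needs no lock, and a segment known to
have no deletions gets a constant-false isDeleted.

```java
import java.util.BitSet;

class SegmentReader {
    private final BitSet dels = new BitSet();
    // Writable readers synchronize because deletions may arrive concurrently.
    public synchronized boolean isDeleted(int doc) { return dels.get(doc); }
    public synchronized void delete(int doc) { dels.set(doc); }
}

class ReadOnlySegmentReader extends SegmentReader {
    private final BitSet dels;
    ReadOnlySegmentReader(BitSet dels) { this.dels = dels; }
    // Immutable snapshot: no lock needed.
    @Override public boolean isDeleted(int doc) { return dels.get(doc); }
    @Override public void delete(int doc) {
        throw new UnsupportedOperationException("read-only");
    }
}

class NoDeletionsSegmentReader extends ReadOnlySegmentReader {
    NoDeletionsSegmentReader() { super(new BitSet()); }
    // Segment has no deletions at all: isDeleted is a no-op.
    @Override public boolean isDeleted(int doc) { return false; }
}

public class ReadOnlyReaderDemo {
    public static void main(String[] args) {
        BitSet dels = new BitSet();
        dels.set(3);
        SegmentReader r = new ReadOnlySegmentReader(dels);
        System.out.println(r.isDeleted(3) + " " + r.isDeleted(4));       // true false
        System.out.println(new NoDeletionsSegmentReader().isDeleted(3)); // false
    }
}
```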

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



Re: jruby anyone?

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi.
This is off topic but I think Rubens would make for a nice name :-)
(I like his art).
Lukas

On 1/19/07, Steven Rowe <sa...@syr.edu> wrote:
>
> Steven Parkes wrote:
> > 3) my luke-like app (luki? lucky? juki? ???)
>
> Don't forget there is already Lucli in the sandbox.
>
> How about: ruke, rube, ruben, rubene, lucene-in-the-sky-with-ruby :)
>

Re: adding "explicit commits" to Lucene?

Posted by robert engels <re...@ix.netcom.com>.
If that is the case, then I think it is far easier and more efficient
to use OIDs to track which documents should be deleted, and which
should be added, at commit time.

It seems that people using 'transactional' code probably have OIDs.

Then you just create an ordered list of operations: delete OID A,
insert document B (which always deletes B's OID first), insert
document D, etc.

It is easy to write this to a tx log file as the requests come in,
and then play it back until it completes.
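
A toy rendering of such a log and its replay (the Op class and OIDs
here are invented for illustration):

```java
import java.util.*;

public class TxLogDemo {
    static class Op {
        final boolean isDelete; final String oid; final String doc;
        Op(boolean isDelete, String oid, String doc) {
            this.isDelete = isDelete; this.oid = oid; this.doc = doc;
        }
    }

    // Replay the ordered log: an insert always deletes its OID first,
    // so the last operation on an OID wins.
    static Map<String, String> replay(List<Op> log) {
        Map<String, String> live = new LinkedHashMap<>();
        for (Op op : log) {
            live.remove(op.oid);                    // delete the OID first, always
            if (!op.isDelete) live.put(op.oid, op.doc);
        }
        return live;
    }

    public static void main(String[] args) {
        List<Op> log = Arrays.asList(
            new Op(true,  "A", null),                   // delete OID A
            new Op(false, "A", "new doc with Term A"),  // insert (deletes A first)
            new Op(true,  "A", null));                  // delete OID A again
        System.out.println(replay(log));                // {}
    }
}
```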

This also puts you in a better position to use the KS sort pool  
model, and perform a single segment write.


On Jan 17, 2007, at 2:11 PM, Marvin Humphrey wrote:

>
> On Jan 17, 2007, at 12:00 PM, robert engels wrote:
>
>> Under this new scenario, what is the result of this:
>>
>> I open the IndexWriter.
>>
>> I delete all documents with Term A.
>> I add a new document with Term A.
>> I delete all documents with Term A.
>>
>> Is the new document correctly removed?
>
> Not in KS right now.
>
> Hmm.
>
> I think what you'd have to do is save all the terms, then run a  
> deletion op against the new segment after it's written.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>




Re: adding "explicit commits" to Lucene?

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jan 17, 2007, at 12:00 PM, robert engels wrote:

> Under this new scenario, what is the result of this:
>
> I open the IndexWriter.
>
> I delete all documents with Term A.
> I add a new document with Term A.
> I delete all documents with Term A.
>
> Is the new document correctly removed?

Not in KS right now.

Hmm.

I think what you'd have to do is save all the terms, then run a  
deletion op against the new segment after it's written.
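
Roughly like this (toy code, not KinoSearch internals): save each
delete term together with how many docs had been added when it
arrived, then run the saved deletions against the new segment once
it's written.

```java
import java.util.*;

public class SavedDeleteTermsDemo {
    static List<String> newDocs = new ArrayList<>();
    // Each saved delete records its term and the doc count at delete time.
    static List<String[]> savedDeletes = new ArrayList<>();

    static void addDocument(String doc) { newDocs.add(doc); }

    static void deleteByTerm(String term) {
        savedDeletes.add(new String[] { term, String.valueOf(newDocs.size()) });
    }

    // After the new segment is "written", replay the saved deletions
    // against it: a delete only hits docs added before it was issued.
    static List<String> writtenSegment() {
        boolean[] deleted = new boolean[newDocs.size()];
        for (String[] d : savedDeletes) {
            int upto = Integer.parseInt(d[1]);
            for (int i = 0; i < upto; i++)
                if (newDocs.get(i).contains(d[0])) deleted[i] = true;
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < newDocs.size(); i++)
            if (!deleted[i]) result.add(newDocs.get(i));
        return result;
    }

    public static void main(String[] args) {
        deleteByTerm("A");                      // deletes nothing new yet
        addDocument("new doc with Term A");
        deleteByTerm("A");                      // must remove the doc just added
        System.out.println(writtenSegment());   // []
    }
}
```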

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:
> On 1/17/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>> Let's call the new approach "commit on close".  The idea (Doug's
>> suggestion from yesterday) is very simple: an IndexWriter should never
>> write a new segments_N file until close() is called.  So a reader
>> never sees anything the writer is doing until the writer is closed.
> 
> Would that remain optional?

Yes.  For backwards compatibility we would by default initialize
IndexWriter with autoCommit "true" (the behavior now).

Mike



Re: adding "explicit commits" to Lucene?

Posted by Yonik Seeley <yo...@apache.org>.
On 1/17/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> Let's call the new approach "commit on close".  The idea (Doug's
> suggestion from yesterday) is very simple: an IndexWriter should never
> write a new segments_N file until close() is called.  So a reader
> never sees anything the writer is doing until the writer is closed.

Would that remain optional?

-Yonik



Re: jruby anyone?

Posted by Steven Rowe <sa...@syr.edu>.
Steven Parkes wrote:
> 3) my luke-like app (luki? lucky? juki? ???)

Don't forget there is already Lucli in the sandbox.

How about: ruke, rube, ruben, rubene, lucene-in-the-sky-with-ruby :)



jruby anyone?

Posted by Steven Parkes <st...@esseff.org>.
I've been playing with using jruby to script, drive, and otherwise
front-end Lucene activities. I wondered originally about using it to
develop a domain specific language for exploring Lucene performance.
Now, with Doron's stuff, that may not really be necessary (and, besides,
throwing jruby into the mix might skew performance, though if it's just
the outer scripting being done in ruby, it might be okay.)

Anyway, coming up to speed on jruby, I also looked at rails. I've
started on a luke-like jruby-on-rails application and, well, at least it
doesn't suck. Very quick to prototype things (though in the spirit of
full-disclosure, that's coming from someone who's a big proponent of
dynamically-typed languages and willfully incompetent when it comes to
GUI toolkits).

It includes 

1) jruby wrappers for the Lucene stuff (at least what I've needed so
far)
2) rails models for some of the Lucene stuff
3) my luke-like app (luki? lucky? juki? ???)

It's still fairly early. The web app is pretty crude, but functional for
the basics. (One of the things I'm wondering is whether this platform (once
one gets the platform running) is easier for people to add functionality
to.) 

I've been trying to make the jruby bindings rubiesque, at least what I
think is rubiesque. This means to me, more like the putative grand
unified index interface talked about here on dev.

I'm still thinking through the web/rails model. Rails normally uses an
rdbms backend and maintains a connection to the db. Right now, I'm not
keeping a connection to the index. Rails also reifies tables as classes,
so I've been playing with what that means for Lucene.

This is just diagnostic stuff; performance isn't really an issue.
What's done in ruby is mostly UI stuff and once you're in java, it runs
at speed. I did come across one place where this wasn't the case. To do
term ordering by doc freq, you need to get all the terms and sort them.
This is fairly fast in java, for say 80K terms. Making 80K very
lightweight calls from ruby to java is not a good idea at this point. So
I factored out the necessary sorting code into a java util method and
call it from ruby. Doesn't change the api. Integrates nicely. The jruby
guys have done a good job.
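
The factored-out helper is along these lines (class and method names
made up for illustration): do the whole sort in Java and hand Ruby
back one sorted list, instead of making 80K fine-grained calls.

```java
import java.util.*;

public class TermSortUtil {
    // Sort term -> docFreq pairs by descending doc freq, entirely in Java.
    public static List<Map.Entry<String, Integer>> sortByDocFreq(
            Map<String, Integer> termFreqs) {
        List<Map.Entry<String, Integer>> terms =
            new ArrayList<>(termFreqs.entrySet());
        terms.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        return terms;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        freqs.put("lucene", 120);
        freqs.put("ruby", 45);
        freqs.put("rails", 80);
        System.out.println(sortByDocFreq(freqs).get(0).getKey());  // lucene
    }
}
```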

My goal is to get this to the point where I can generate a war file that
can be loaded into any web container. So far, I've just been running it
under ruby's webrick container. And it does require bleeding edge
(trunk) jruby and rails.

Like I said, it's early, but would others be interested in looking at it?
Maybe contrib/jruby?



Re: adding "explicit commits" to Lucene?

Posted by Ning Li <ni...@gmail.com>.
On 1/17/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> robert engels wrote:
> > Under this new scenario, what is the result of this:
> >
> > I open the IndexWriter.
> >
> > I delete all documents with Term A.
> > I add a new document with Term A.
> > I delete all documents with Term A.
> >
> > Is the new document correctly removed?
>
> This is actually a question about the proposed patch for LUCENE-565
> "supporting deleteDocuments in IndexWriter".  Remember: this is
> separate from "commit on close".  We can do either one first.  Anyway,
> the answer to your question is "yes" for the most recent patch.  See:

The answer is "yes" to all patch versions submitted, including the
most recent. :-)

Cheers,
Ning



Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
robert engels wrote:
> Under this new scenario, what is the result of this:
> 
> I open the IndexWriter.
> 
> I delete all documents with Term A.
> I add a new document with Term A.
> I delete all documents with Term A.
> 
> Is the new document correctly removed?

This is actually a question about the proposed patch for LUCENE-565
"supporting deleteDocuments in IndexWriter".  Remember: this is
separate from "commit on close".  We can do either one first.  Anyway,
the answer to your question is "yes" for the most recent patch.  See:

     http://issues.apache.org/jira/browse/LUCENE-565#action_12459506

Solr has similar careful logic to handle this interleaved add/delete
case.

> Also, do we have any documentation that describes the new file 
> format/naming conventions similar to the original Lucene file format 
> documentation?
 >
> Is someone going to do this before the code changes? Seems like it would 
> be easier to review than scanning the code or 1000 emails.

Actually, the index file format is unchanged.

Let's call the new approach "commit on close".  The idea (Doug's
suggestion from yesterday) is very simple: an IndexWriter should never
write a new segments_N file until close() is called.  So a reader
never sees anything the writer is doing until the writer is closed.
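
To make the semantics concrete, here's a toy model (not the real
classes) where the writer only publishes its segments on close():

```java
import java.util.*;

public class CommitOnCloseDemo {
    // Stand-in for a Directory holding the last committed segments_N.
    static class Directory { List<String> committed = new ArrayList<>(); }

    static class IndexWriter {
        private final Directory dir;
        private final List<String> pending;
        IndexWriter(Directory dir) {
            this.dir = dir;
            this.pending = new ArrayList<>(dir.committed);  // start from last commit
        }
        void addSegment(String s) { pending.add(s); }
        // close() is the only commit point: the new segments become visible here.
        void close() { dir.committed = new ArrayList<>(pending); }
    }

    static class IndexReader {
        final List<String> snapshot;
        IndexReader(Directory dir) { snapshot = new ArrayList<>(dir.committed); }
    }

    public static void main(String[] args) {
        Directory dir = new Directory();
        IndexWriter w = new IndexWriter(dir);
        w.addSegment("_0");
        System.out.println(new IndexReader(dir).snapshot);  // [] -- nothing visible yet
        w.close();
        System.out.println(new IndexReader(dir).snapshot);  // [_0]
    }
}
```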

Mike



Re: adding "explicit commits" to Lucene?

Posted by robert engels <re...@ix.netcom.com>.
Under this new scenario, what is the result of this:

I open the IndexWriter.

I delete all documents with Term A.
I add a new document with Term A.
I delete all documents with Term A.

Is the new document correctly removed?

Also, do we have any documentation that describes the new file format/ 
naming conventions similar to the original Lucene file format  
documentation?

Is someone going to do this before the code changes? Seems like it  
would be easier to review than scanning the code or 1000 emails.

Thanks.

On Jan 17, 2007, at 1:52 PM, Marvin Humphrey wrote:

>
> On Jan 17, 2007, at 6:30 AM, Michael McCandless wrote:
>
>> In fact I think we could just continue to use IndexReader to actually
>> perform the deletions (like the patch(es) in LUCENE-565 and also like
>> Solr I believe)?
>
> +1  (: advisory vote :)
>
> KinoSearch, too.
>
>> It's just that IndexWriter is the one opening the IndexReader (or
>> SegmentReaders) in order to flush the deletes, and it's doing so
>> "within" its session (so that deletes & re-adds are committed into a
>> single new segments_N).
>
> 'zactly.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>




Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
robert engels wrote:
> In thinking about this some more, I think Doug's idea of doing this in a 
> directory implementation has a lot of merit.
> 
> It probably requires a few new methods to the Directory class, but in 
> all (almost?) cases they should be able to be no-ops.
> 
> I think the end result is that the code might be a lot cleaner, easier 
> to understand in the simple case, and easier to maintain backwards 
> compatibility, especially for those Lucene users that already have 
> layers around the low-level classes.
> 
> Something akin to: you open the reader using a directory created at a
> point in time, and it manages what version of files the Reader sees.
> 
> I think it would also need commit() and rollback(), statements for those 
> directories that support transactions.

I agree this is a neat idea.  But how would it actually work without
relying on specifics of filesystem behaviour when deleting open files?
Especially over NFS?

BTW, I do think we need to support NFS.  It's ubiquitous (the default
remote filesystem) for Unix and so it's the obvious way to allow
readers to share an index.  Our users automatically expect this should
work, and they're right.  Likely performance will not be "stellar" but
for many apps that doesn't matter.

The point of LUCENE-710 (that started this discussion) is to fix
Lucene to not rely on the particular filesystem details of "deleting
open files" because NFS does not protect such files.

Doug's first suggestion ("commit on close") seems better to me.

First, we wouldn't have to make duplicated code in each Directory,
just once in IndexWriter.  Second, it's a small change (no index format
change).  Third, it can work over NFS so long as we relax how
aggressively IndexFileDeleter removes prior commits and we can ensure
readers correspondingly refresh.

Mike



Re: adding "explicit commits" to Lucene?

Posted by robert engels <re...@ix.netcom.com>.
In thinking about this some more, I think Doug's idea of doing this  
in a directory implementation has a lot of merit.

It probably requires a few new methods to the Directory class, but in  
all (almost?) cases they should be able to be no-ops.

I think the end result is that the code might be a lot cleaner,  
easier to understand in the simple case, and easier to maintain  
backwards compatibility, especially for those Lucene users that  
already have layers around the low-level classes.

Something akin to: you open the reader using a directory created at a
point in time, and it manages what version of files the Reader sees.

I think it would also need commit() and rollback() statements, for
those directories that support transactions.


On Jan 17, 2007, at 1:52 PM, Marvin Humphrey wrote:

>
> On Jan 17, 2007, at 6:30 AM, Michael McCandless wrote:
>
>> In fact I think we could just continue to use IndexReader to actually
>> perform the deletions (like the patch(es) in LUCENE-565 and also like
>> Solr I believe)?
>
> +1  (: advisory vote :)
>
> KinoSearch, too.
>
>> It's just that IndexWriter is the one opening the IndexReader (or
>> SegmentReaders) in order to flush the deletes, and it's doing so
>> "within" its session (so that deletes & re-adds are committed into a
>> single new segments_N).
>
> 'zactly.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>




Re: adding "explicit commits" to Lucene?

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jan 17, 2007, at 6:30 AM, Michael McCandless wrote:

> In fact I think we could just continue to use IndexReader to actually
> perform the deletions (like the patch(es) in LUCENE-565 and also like
> Solr I believe)?

+1  (: advisory vote :)

KinoSearch, too.

> It's just that IndexWriter is the one opening the IndexReader (or
> SegmentReaders) in order to flush the deletes, and it's doing so
> "within" its session (so that deletes & re-adds are committed into a
> single new segments_N).

'zactly.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
In fact I think we could just continue to use IndexReader to actually
perform the deletions (like the patch(es) in LUCENE-565 and also like
Solr I believe)?

It's just that IndexWriter is the one opening the IndexReader (or
SegmentReaders) in order to flush the deletes, and it's doing so
"within" its session (so that deletes & re-adds are committed into a
single new segments_N).

Ie what we want to avoid is forcing the developer to open/close the
IndexReader in order to do deletes because that "close" is really a
"commit and close".

If this approach works well we could at some point deprecate the
delete* methods on IndexReader and make package protected versions
that IndexWriter calls.
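
A toy model of that arrangement (invented classes, not the real API):
the writer opens a reader internally to apply deletes "within" its
session, and nothing is published until close().

```java
import java.util.*;

public class WriterDeletesDemo {
    static class Index { List<String> committedDocs = new ArrayList<>(); }

    // Package-protected in the proposal: only the writer opens it.
    static class IndexReader {
        final List<String> docs;
        IndexReader(List<String> docs) { this.docs = docs; }
        void deleteByTerm(String term) { docs.removeIf(d -> d.contains(term)); }
    }

    static class IndexWriter {
        private final Index index;
        private final List<String> working;
        IndexWriter(Index index) {
            this.index = index;
            this.working = new ArrayList<>(index.committedDocs);
        }
        void deleteDocuments(String term) {
            // Reader opened "within" the writer's session, not by the user.
            new IndexReader(working).deleteByTerm(term);
        }
        void addDocument(String doc) { working.add(doc); }
        // Deletes & re-adds are committed together in a single step.
        void close() { index.committedDocs = working; }
    }

    public static void main(String[] args) {
        Index index = new Index();
        index.committedDocs.add("old doc with A");
        IndexWriter w = new IndexWriter(index);
        w.deleteDocuments("A");
        w.addDocument("new doc with A");
        System.out.println(index.committedDocs);  // [old doc with A] -- unchanged
        w.close();
        System.out.println(index.committedDocs);  // [new doc with A]
    }
}
```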

Mike

Grant Ingersoll wrote:
> Perhaps the answer would be to add some interfaces, such as (interface 
> names are just for argument purposes):
> 
> IndexWriter -- contains signatures of write methods
> 
> DocumentDeleter - contains signatures of deletion methods
> 
> IndexReader -signatures of read methods
> 
> If this were how users interacted w/ Lucene, then the implementation 
> could be in the same implementation that unified the working with the 
> index.
> 
> I think we may want to seriously consider, for 3.0, a revision of the 
> front facing methods such as writing and reading.
> 
> 
> 
> On Jan 17, 2007, at 7:46 AM, Nadav Har'El wrote:
> 
>> On Wed, Jan 17, 2007, Michael McCandless wrote about "Re: adding 
>> "explicit commits" to Lucene?":
>>> Perhaps instead of a single "grand unified Index" class in the future,
>>> we aim to move all index write methods into IndexWriter?  This
>>> is then simple for users to use: if you want to change anything about
>>> the index, use an IndexWriter; if you want to do searches, use an
>>> IndexReader.  If we aim for this as our eventual goal, LUCENE-565 in
>>> fact gets us quite a bit closer.
>>
>> It's important to remember, that whatever class knows how to do 
>> deletions,
>> this class will need to replicate much of the IndexReader functionality.
>> Why? Because just like an IndexReader, the deleting class needs to 
>> know how
>> to find documents matching a term, and like an IndexReader (and unlike an
>> IndexWriter) it may need to open all segments, not just the one
>> segment that
>> is being written.
>>
>> So perhaps a "grand unified Index" does make sense, instead of 
>> repeating the
>> same code and/or functionality in both IndexReader and IndexWriter.
>>
>>
>> --Nadav Har'El                        |    Wednesday, Jan 17 2007, 27 
>> Tevet 5767
>> nyh@math.technion.ac.il             
>> |-----------------------------------------
>> Phone +972-523-790466, ICQ 13349191 |I had a lovely evening. 
>> Unfortunately,
>> http://nadav.harel.org.il           |this wasn't it. - Groucho Marx
>>
> 
> --------------------------
> Grant Ingersoll
> Center for Natural Language Processing
> http://www.cnlp.org
> 
> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
> 
> 
> 




Re: adding "explicit commits" to Lucene?

Posted by Grant Ingersoll <gs...@apache.org>.
Perhaps the answer would be to add some interfaces, such as  
(interface names are just for argument purposes):

IndexWriter -- contains signatures of write methods

DocumentDeleter - contains signatures of deletion methods

IndexReader -signatures of read methods

If this were how users interacted w/ Lucene, then the implementation  
could be in the same implementation that unified the working with the  
index.

I think we may want to seriously consider, for 3.0, a revision of the  
front facing methods such as writing and reading.
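
For argument's sake, a minimal sketch of that split (everything here
is invented, with the interface names shortened): callers program
against narrow role interfaces while one implementation can stand
behind all three.

```java
import java.util.*;

interface Writer { void addDocument(String doc); }

interface Deleter { int deleteDocuments(String term); }

interface Reader { List<String> search(String term); }

// A single "unified index" class can implement all three roles.
class UnifiedIndex implements Writer, Deleter, Reader {
    private final List<String> docs = new ArrayList<>();

    public void addDocument(String doc) { docs.add(doc); }

    public int deleteDocuments(String term) {
        int before = docs.size();
        docs.removeIf(d -> d.contains(term));
        return before - docs.size();  // number of docs deleted
    }

    public List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (String d : docs) if (d.contains(term)) hits.add(d);
        return hits;
    }
}

public class RoleInterfacesDemo {
    public static void main(String[] args) {
        UnifiedIndex index = new UnifiedIndex();
        Writer w = index; Deleter del = index; Reader r = index;
        w.addDocument("doc with A");
        w.addDocument("doc with B");
        del.deleteDocuments("A");
        System.out.println(r.search("B"));  // [doc with B]
    }
}
```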



On Jan 17, 2007, at 7:46 AM, Nadav Har'El wrote:

> On Wed, Jan 17, 2007, Michael McCandless wrote about "Re: adding  
> "explicit commits" to Lucene?":
>> Perhaps instead of a single "grand unified Index" class in the  
>> future,
>> we aim to move all index write methods into IndexWriter?  This
>> is then simple for users to use: if you want to change anything about
>> the index, use an IndexWriter; if you want to do searches, use an
>> IndexReader.  If we aim for this as our eventual goal, LUCENE-565 in
>> fact gets us quite a bit closer.
>
> It's important to remember, that whatever class knows how to do  
> deletions,
> this class will need to replicate much of the IndexReader  
> functionality.
> Why? Because just like an IndexReader, the deleting class needs to  
> know how
> to find documents matching a term, and like an IndexReader (and  
> unlike an
> IndexWriter) it may need to open all segments, not just the one
> segment that
> is being written.
>
> So perhaps a "grand unified Index" does make sense, instead of  
> repeating the
> same code and/or functionality in both IndexReader and IndexWriter.
>
>
> -- 
> Nadav Har'El                        |    Wednesday, Jan 17 2007, 27  
> Tevet 5767
> nyh@math.technion.ac.il              
> |-----------------------------------------
> Phone +972-523-790466, ICQ 13349191 |I had a lovely evening.  
> Unfortunately,
> http://nadav.harel.org.il           |this wasn't it. - Groucho Marx
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ 
LuceneFAQ





Re: adding "explicit commits" to Lucene?

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Wed, Jan 17, 2007, Michael McCandless wrote about "Re: adding "explicit commits" to Lucene?":
> Perhaps instead of a single "grand unified Index" class in the future,
> we aim to move all index write methods into IndexWriter?  This
> is then simple for users to use: if you want to change anything about
> the index, use an IndexWriter; if you want to do searches, use an
> IndexReader.  If we aim for this as our eventual goal, LUCENE-565 in
> fact gets us quite a bit closer.

It's important to remember that whatever class knows how to do deletions,
this class will need to replicate much of the IndexReader functionality.
Why? Because just like an IndexReader, the deleting class needs to know how
to find documents matching a term, and like an IndexReader (and unlike an
IndexWriter) it may need to open all segments, not just the one segment that
is being written.

So perhaps a "grand unified Index" does make sense, instead of repeating the
same code and/or functionality in both IndexReader and IndexWriter.


-- 
Nadav Har'El                        |    Wednesday, Jan 17 2007, 27 Tevet 5767
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I had a lovely evening. Unfortunately,
http://nadav.harel.org.il           |this wasn't it. - Groucho Marx



Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Grant Ingersoll wrote:
> 
> On Jan 16, 2007, at 3:55 PM, Michael McCandless wrote:
> 
>> Doron Cohen wrote:
>>> Michael McCandless <lu...@mikemccandless.com> wrote on 16/01/2007
>>> 12:13:47:
>>>> Ning Li wrote:
>> Re those 2 ideas: I do agree the whole division of certain kinds of
>> index changes into a reader and other ones into a writer, is confusing
>> to our users.  I think our ideal eventual solution is a single "grand
>> unified" Index class that efficiently does all things that IndexWriter
>> and IndexReader do today.  (I think this is closest to your 2nd option
>> in that link).
>>
> 
> I'm all for efficiency, but I'd be wary of a grand unifying class
> combining Reader/Writer operations.  As you've already hit the nail on
> the head, the reader/writers are too confusing; combining them without
> significant refactoring and simplification sounds like it would just
> make matters worse, since they both already have a fair number of
> methods on them.  To me, we need to better define what a Reader does
> and what a Writer does and make the appropriate changes to the APIs.

I agree: it's not clear how best to improve the confusion, short term
or long term.

I think the confusion indeed stems from the fact that IndexReader is
used to make certain changes (deletes, norms) and IndexWriter is used
for others (addDocuments, optimize, addIndices), and the fact that the
nearly ubiquitous use case of reindexing changed documents requires one
to open an IndexReader, do deletes, close (commit), then open an
IndexWriter, do adds, and close (commit).

Perhaps instead of a single "grand unified Index" class in the future,
we aim to move all index write methods into IndexWriter?  This
is then simple for users to use: if you want to change anything about
the index, use an IndexWriter; if you want to do searches, use an
IndexReader.  If we aim for this as our eventual goal, LUCENE-565 in
fact gets us quite a bit closer.

Furthermore, if we did that, we can indeed simplify "explicit commits"
to mean "nothing is visible to readers until you close your writer".
This is the "commit only on close" idea that Doug suggested.  With
this new approach there would be no added commit() method.  And, a
writer would not write any segments_N (checkpoint) files until it is
closed.

I like this new implementation for "explicit commits" better (thanks
everyone for the feedback!).  I will work through this / think about
it more.

Mike



Re: adding "explicit commits" to Lucene?

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 16, 2007, at 3:55 PM, Michael McCandless wrote:

> Doron Cohen wrote:
>> Michael McCandless <lu...@mikemccandless.com> wrote on 16/01/2007
>> 12:13:47:
>>> Ning Li wrote:
> Re those 2 ideas: I do agree the whole division of certain kinds of
> index changes into a reader and other ones into a writer, is confusing
> to our users.  I think our ideal eventual solution is a single "grand
> unified" Index class that efficiently does all things that IndexWriter
> and IndexReader do today.  (I think this is closest to your 2nd option
> in that link).
>

I'm all for efficiency, but I'd be wary of a grand unifying class
combining Reader/Writer operations.  As you've already hit the nail on
the head, the reader/writers are too confusing; combining them without
significant refactoring and simplification sounds like it would just
make matters worse, since they both already have a fair number of
methods on them.  To me, we need to better define what a Reader does
and what a Writer does and make the appropriate changes to the APIs.



--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ 
LuceneFAQ





Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Doron Cohen wrote:
> Michael McCandless <lu...@mikemccandless.com> wrote on 16/01/2007
> 12:13:47:
> 
>> Ning Li wrote:
>>> On 1/16/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>>>> Good catch Ning!  And, I agree, when a reader plans to make
>>>> modifications to the index, I think the best solution is to require
>>>> that the reader has opened most recent "segments*_N" (be that a
>>>> snapshot or a checkpoint).  Really a reader is actually a "writer" in
>>>> this context.  This means we need a way to open a reader against the
>>>> most recent checkpoint as well (I will add that).
>>>>
>>>> This is very much consistent with how a reader now checks if it is
>>>> still current when someone first tries to change a del/norm: if it's
>>>> not still current (ie, another writer has written a new segments_N
>>>> file) then an IOException is raised with "IndexReader out of date and
>>>> no longer valid for delete, undelete, or setNorm operations".  I think
>>>> with explicit commits that same requirement & check would apply.
>>> This means a reader can open a checkpoint for search. But the purpose
>>> of "explicit commits" is that only snapshots are opened for search,
>>> not checkpoints. Can we just trust applications won't open a
>>> checkpoint for search? Or should we explicitly guard against it?
>> Ahh good point.
>>
>> I think I'll add "openForWriting(*)" static methods to IndexReader.
>> These will acquire the write lock, and will open the latest
>> segments*_N (commit or checkpoint).  This way you can't open a
>> checkpoint unless there are no other writers on the index.
>>
>> We could go further and have IndexSearcher not accept an IndexReader
>> opened against a checkpoint, but I'm inclined not to check for
>> (prevent) this, for starters.  I'd rather not preclude possibly
>> interesting future use cases too early.
> 
> Is this blocking applications that first perform a search, in order to
> decide which docs to delete by docid?

I don't think we're preventing this use case, even if we decide to
guard against "searching on a checkpoint" (which I think we shouldn't
do just yet).

If you do an explicit commit from your writer, close it, then open a
reader, you can run searches and delete the resulting docids.  This is
in fact Solr's approach today (a commit is forced if you do a
deleteByQuery).
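The search-then-delete-by-docid flow described above can be modeled with a small, self-contained sketch.  This is a toy (hypothetical `ToyIndex`/`ToyReader` classes, not the real Lucene API): a writer buffers adds, `commit()` publishes them as the new point-in-time snapshot, and a reader opened afterwards can search that snapshot and delete by docid.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "explicit commits": readers see only the last snapshot.
class ToyIndex {
    private final List<String> committed = new ArrayList<>(); // last snapshot
    private final List<String> pending = new ArrayList<>();   // uncommitted adds

    void addDocument(String doc) { pending.add(doc); }

    // Explicit commit: publish pending changes as the new snapshot.
    void commit() { committed.addAll(pending); pending.clear(); }

    // A reader gets a point-in-time copy of the committed snapshot.
    ToyReader openReader() { return new ToyReader(new ArrayList<>(committed)); }
}

class ToyReader {
    private final List<String> docs;
    ToyReader(List<String> docs) { this.docs = docs; }

    // "Search": return docids of docs containing the term.
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++)
            if (docs.get(i) != null && docs.get(i).contains(term)) hits.add(i);
        return hits;
    }

    void deleteDocument(int docid) { docs.set(docid, null); }

    int numDocs() {
        int n = 0;
        for (String d : docs) if (d != null) n++;
        return n;
    }
}
```

Documents added but not yet committed are invisible to new readers, which is the point-in-time behavior the proposal gives searchers.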

> Two other options in
> http://article.gmane.org/gmane.comp.jakarta.lucene.devel/16581 ...?

Re those 2 ideas: I do agree the whole division of certain kinds of
index changes into a reader and other ones into a writer, is confusing
to our users.  I think our ideal eventual solution is a single "grand
unified" Index class that efficiently does all things that IndexWriter
and IndexReader do today.  (I think this is closest to your 2nd option
in that link).

I think the "support deleteDocuments in IndexWriter" (LUCENE-565) is
an awesome first step.  But I think these steps are separate from
enabling explicit commits.  Explicit commits should allow LUCENE-565
to have a more efficient implementation, but we should still work
through them separately.

Mike



Re: adding "explicit commits" to Lucene?

Posted by Doron Cohen <DO...@il.ibm.com>.
Michael McCandless <lu...@mikemccandless.com> wrote on 16/01/2007
12:13:47:

> Ning Li wrote:
> > On 1/16/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> >> Good catch Ning!  And, I agree, when a reader plans to make
> >> modifications to the index, I think the best solution is to require
> >> that the reader has opened most recent "segments*_N" (be that a
> >> snapshot or a checkpoint).  Really a reader is actually a "writer" in
> >> this context.  This means we need a way to open a reader against the
> >> most recent checkpoint as well (I will add that).
> >>
> >> This is very much consistent with how a reader now checks if it is
> >> still current when someone first tries to change a del/norm: if it's
> >> not still current (ie, another writer has written a new segments_N
> >> file) then an IOException is raised with "IndexReader out of date and
> >> no longer valid for delete, undelete, or setNorm operations".  I think
> >> with explicit commits that same requirement & check would apply.
> >
> > This means a reader can open a checkpoint for search. But the purpose
> > of "explicit commits" is that only snapshots are opened for search,
> > not checkpoints. Can we just trust applications won't open a
> > checkpoint for search? Or should we explicitly guard against it?
>
> Ahh good point.
>
> I think I'll add "openForWriting(*)" static methods to IndexReader.
> These will acquire the write lock, and will open the latest
> segments*_N (commit or checkpoint).  This way you can't open a
> checkpoint unless there are no other writers on the index.
>
> We could go further and have IndexSearcher not accept an IndexReader
> opened against a checkpoint, but I'm inclined not to check for
> (prevent) this, for starters.  I'd rather not preclude possibly
> interesting future use cases too early.

Is this blocking applications that first perform a search, in order to
decide which docs to delete by docid?

Two other options in
http://article.gmane.org/gmane.comp.jakarta.lucene.devel/16581 ...?

>
> Mike
>




Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Ning Li wrote:
> On 1/16/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>> Good catch Ning!  And, I agree, when a reader plans to make
>> modifications to the index, I think the best solution is to require
>> that the reader has opened most recent "segments*_N" (be that a
>> snapshot or a checkpoint).  Really a reader is actually a "writer" in
>> this context.  This means we need a way to open a reader against the
>> most recent checkpoint as well (I will add that).
>>
>> This is very much consistent with how a reader now checks if it is
>> still current when someone first tries to change a del/norm: if it's
>> not still current (ie, another writer has written a new segments_N
>> file) then an IOException is raised with "IndexReader out of date and
>> no longer valid for delete, undelete, or setNorm operations".  I think
>> with explicit commits that same requirement & check would apply.
> 
> This means a reader can open a checkpoint for search. But the purpose
> of "explicit commits" is that only snapshots are opened for search,
> not checkpoints. Can we just trust applications won't open a
> checkpoint for search? Or should we explicitly guard against it?

Ahh good point.

I think I'll add "openForWriting(*)" static methods to IndexReader.
These will acquire the write lock, and will open the latest
segments*_N (commit or checkpoint).  This way you can't open a
checkpoint unless there are no other writers on the index.

We could go further and have IndexSearcher not accept an IndexReader
opened against a checkpoint, but I'm inclined not to check for
(prevent) this, for starters.  I'd rather not preclude possibly
interesting future use cases too early.

Mike



Re: adding "explicit commits" to Lucene?

Posted by Ning Li <ni...@gmail.com>.
On 1/16/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> Good catch Ning!  And, I agree, when a reader plans to make
> modifications to the index, I think the best solution is to require
> that the reader has opened most recent "segments*_N" (be that a
> snapshot or a checkpoint).  Really a reader is actually a "writer" in
> this context.  This means we need a way to open a reader against the
> most recent checkpoint as well (I will add that).
>
> This is very much consistent with how a reader now checks if it is
> still current when someone first tries to change a del/norm: if it's
> not still current (ie, another writer has written a new segments_N
> file) then an IOException is raised with "IndexReader out of date and
> no longer valid for delete, undelete, or setNorm operations".  I think
> with explicit commits that same requirement & check would apply.

This means a reader can open a checkpoint for search. But the purpose
of "explicit commits" is that only snapshots are opened for search,
not checkpoints. Can we just trust applications won't open a
checkpoint for search? Or should we explicitly guard against it?

Cheers,
Ning



Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
OK, catching up here and trying to merge threads together otherwise
I'm going to lose my mind!:

Chuck Williams wrote:
 >
 > Ning Li wrote:
 >>
 >> If a reader can only open snapshots both for search and for
 >> modification, I think another change is needed besides the ones
 >> listed: assume the latest snapshot is segments_5 and the latest
 >> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
 >> snapshot segments_5, performs a few deletes and writes a new
 >> checkpoint segmentsx_8. The summary file segmentsx_8 should include
 >> the 2 new segments which are in segmentsx_7 but not in segments_5.
 >> Such segments to include are easily identifiable only if they are not
 >> merged with segments in the latest snapshot... All these won't be
 >> necessary if a reader always opens the latest checkpoint for
 >> modification, which will also support deletion of non-committed
 >> documents.
 >>
 > This problem seems worse.  I don't see how a reader and a writer can
 > independently compute and write checkpoints.  The adds in the writer
 > don't just create new segments, they replace existing ones through
 > merging.  And the merging changes doc-ids by expunging deletes.  It
 > seems that all deletes must be based on the most recent checkpoint, or
 > merging of checkpoints to create the next snapshot will be considerably
 > more complex.

Good catch Ning!  And, I agree, when a reader plans to make
modifications to the index, I think the best solution is to require
that the reader has opened most recent "segments*_N" (be that a
snapshot or a checkpoint).  Really a reader is actually a "writer" in
this context.  This means we need a way to open a reader against the
most recent checkpoint as well (I will add that).

This is very much consistent with how a reader now checks if it is
still current when someone first tries to change a del/norm: if it's
not still current (ie, another writer has written a new segments_N
file) then an IOException is raised with "IndexReader out of date and
no longer valid for delete, undelete, or setNorm operations".  I think
with explicit commits that same requirement & check would apply.



Chuck Williams wrote:

 > My interest is transactions, not making doc-id's permanent.
 > Specifically, the ability to ensure that a group of adds either all go
 > into the index or none go into the index, and to ensure that if none go
 > into the index that the index is not changed in any way.

Right, I see "explicit commits" as a very simple implementation to
provide a powerful base functionality to Lucene.  This base
functionality can indeed enable or make easier/more performant many
neat things above it (the permanent docids discussion, Chuck's highly
performant ParallelWriter, delayed flushing of pending deletes, etc)
but I'd like to keep a clean separation and focus first only on making
the most minimal yet self-contained "explicit commits" work and then
separately build out on top of it.  Progress not perfection!


Doron Cohen wrote:

 > As a database application, to my understanding the (newly suggested)
 > transaction support in Lucene is single tx. I can't see how multiple
 > tx can be done within Lucene (and I don't think it should be
 > done). Even if it was possible, I think indexing would become very
 > inefficient. I think the motivation for adding (some) tx support is
 > different, and tx support would be minimal, definitely not multiple
 > tx.

Ning Li wrote:

 > Lastly, hopefully the term "transaction" won't cause any confusion
 > since this "explicit commit" is much simpler than database
 > transaction where a database can guarantee the ACID properties for
 > each of multiple concurrent transactions.

I agree "explicit commits" is in fact a reduced version of the more
general ACID transactions that relational DBs provide.  I really don't
want to call it "transactions" for this reason: that label would
automatically oversell the capability, then only to later disappoint
our users.  Always best to "under promise and over deliver" and the
label "transactions" would do just the reverse.  But yes explicit
commits is basically a "single transaction".



 > If I had a vote it would be +1 on the direction Michael has proposed,
 > assuming it can be done robustly and without performance penalty.

I don't anticipate any performance issues.  The implementation is so
amazingly trivial!  The only index format change is a new name for
those segments_N files that were just the automatic checkpoints that
Lucene does.  Otherwise the index format is unchanged.  And then
additional logic for a reader/writer to decide which one of these to
read/write.
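That reader/writer decision logic can be sketched in a few lines.  This is illustrative only (not real Lucene code, and real Lucene encodes generations differently): using the naming from this thread, searchers pick the highest-generation snapshot (segments_N), while writers must open the highest-generation summary file of either kind (checkpoints, segmentsx_N, included).

```java
import java.util.List;

// Illustrative helper for choosing among summary files in a directory.
class SegmentsFiles {
    // Parse N out of "segments_N" or "segmentsx_N" (decimal here for clarity).
    static long generation(String fileName) {
        return Long.parseLong(fileName.substring(fileName.indexOf('_') + 1));
    }

    // Searchers open only committed snapshots (segments_N).
    static String latestSnapshot(List<String> files) {
        String best = null;
        for (String f : files)
            if (f.startsWith("segments_")
                    && (best == null || generation(f) > generation(best)))
                best = f;
        return best;
    }

    // Writers (and readers opened for writing) must open the most recent
    // summary file, whether snapshot or checkpoint.
    static String latestCheckpointOrSnapshot(List<String> files) {
        String best = null;
        for (String f : files)
            if ((f.startsWith("segments_") || f.startsWith("segmentsx_"))
                    && (best == null || generation(f) > generation(best)))
                best = f;
        return best;
    }
}
```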

The only really "interesting" change is to the IndexFileDeleter: it
now must be more careful in how it figures out which index files are
safe to delete (this is the part I'm working on now).  I will
definitely test performance (with the new benchmarking suite!)  but I
don't expect any changes for the better or worse with just "explicit
commits".

The things that then become possible once you have explicit commits
should give us good potential performance improvements, error
recoverability, etc. in the future.  But that's the future and I'm
focusing on "now" :)

Mike



Re: adding "explicit commits" to Lucene?

Posted by Doron Cohen <DO...@il.ibm.com>.
The problem Ning pointed out seems to stem from the two roles of
IndexReader:
(1) reading (read only) the Index for searching and for inspecting its
content;
(2) modifying the index by deleting documents;

This is further complicated by the fact that often a reader is used for
search and then returned docs are deleted by docid.

Perhaps one possibility is to define DocumentDeleter as a subclass of
IndexReader. It would always open the top most generation. It
would (as today) fail to delete if it is not the top most generation. It
would support search, but would be recommended to be used only for update
purposes. Mmmm...  It is becoming too complex I'm afraid.

So a better (?) option: (1) add to IndexWriter deleteByTerm() (and
deleteByQuery()) (like NewIndexModifier..) - these deletion methods would
then be performed on top most generation - same as addDocument(); (2)
IndexReader delete() methods would fail (as today) if it is not top most
generation - so it would only work when all previous changes were committed
(which is always true if an application is using (the default) auto
commit).

One comment about permanent IDs (PIDs) - I think that Lucene's choice to
not maintain PIDs on behalf of applications is the right way to go. For
efficiency, even if PIDs were maintained by Lucene, internal changing IDs
would exist and low level operations would use those IDs. But in addition
Lucene would need to maintain the mapping between the two - IDs and PIDs -
and notify an application adding a doc what PID was assigned to it, etc.
Seems better to leave this for applications.

Doron

Chuck Williams <ch...@manawiz.com> wrote on 15/01/2007 21:49:05:

> My interest is transactions, not making doc-id's permanent.
> Specifically, the ability to ensure that a group of adds either all go
> into the index or none go into the index, and to ensure that if none go
> into the index that the index is not changed in any way.
>
> I have UID's but they cannot ensure the latter property, i.e. they
> cannot ensure side-effect-free rollbacks.
>
> Yes, if you have no reliance on internal Lucene structures like doc-id's
> and segments, then that shouldn't matter.  But many capabilities have
> such reliance for good reasons.  E.g., ParallelReader, which is a public
> supported class in Lucene, requires doc-id synchronization.  There are
> similar good reasons for an application to take advantage of doc-ids.
>
> Lucene uses doc-id's in many of its API's and so it is not surprising
> that many applications rely on them, and I'm sure misuse them not fully
> understanding the semantics and uncertainties of doc-id changes due to
> merging segments with deletes.
>
> Applications can use doc-ids for legitimate and beneficial purposes
> while remaining semantically valid.  Making such capabilities efficient
> and robust in all cases is facilitated by application control over when
> doc-id's and segment structure change at a granularity larger than the
> single Document.
>
> If I had a vote it would be +1 on the direction Michael has proposed,
> assuming it can be done robustly and without performance penalty.
>
> Chuck
>
>
> robert engels wrote on 01/15/2007 07:34 PM:
> > I honestly think that having a unique OID as an indexed field and
> > putting a layer on top of Lucene is the best solution to all of this.
> > It makes it almost trivial, and you can implement transaction handling
> > in a variety of ways.
> >
> > Attempting to make the doc ids "permanent" is a tough challenge,
> > considering the orignal design called for them to be "non permanent".
> > [sic: original]
> >
> > It seems doubtful that you cannot have some sort of primary key any
> > way and be this concerned about the transactional nature of Lucene.
> >
> > I vote -1 on all of this. I think it will detract from the simple and
> > efficient storage mechanism that Lucene uses.
> >
> > On Jan 15, 2007, at 11:19 PM, Chuck Williams wrote:
> >
> >> Ning Li wrote on 01/15/2007 06:29 PM:
> >>> On 1/14/07, Michael McCandless <lu...@mikemccandless.com> wrote:
> >>>>   * The "support deleteDocuments in IndexWriter" (LUCENE-565) feature
> >>>>     could have a more efficient implementation (just like Solr) when
> >>>>     autoCommit is false, because deletes don't need to be flushed
> >>>>     until commit() is called.  Whereas, now, they must be aggressively
> >>>>     flushed on each checkpoint.
> >>>
> >>> If a reader can only open snapshots both for search and for
> >>> modification, I think another change is needed besides the ones
> >>> listed: assume the latest snapshot is segments_5 and the latest
> >>> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
> >>> snapshot segments_5, performs a few deletes and writes a new
> >>> checkpoint segmentsx_8. The summary file segmentsx_8 should include
> >>> the 2 new segments which are in segmentsx_7 but not in segments_5.
> >>> Such segments to include are easily identifiable only if they are not
> >>> merged with segments in the latest snapshot... All these won't be
> >>> necessary if a reader always opens the latest checkpoint for
> >>> modification, which will also support deletion of non-committed
> >>> documents.
> >> This problem seems worse.  I don't see how a reader and a writer can
> >> independently compute and write checkpoints.  The adds in the writer
> >> don't just create new segments, they replace existing ones through
> >> merging.  And the merging changes doc-ids by expunging deletes.  It
> >> seems that all deletes must be based on the most recent checkpoint, or
> >> merging of checkpoints to create the next snapshot will be considerably
> >> more complex.
> >>
> >> Chuck
> >>
> >>
> >>
> >
> >
> >
>
>
>




Re: adding "explicit commits" to Lucene?

Posted by Chuck Williams <ch...@manawiz.com>.
My interest is transactions, not making doc-id's permanent. 
Specifically, the ability to ensure that a group of adds either all go
into the index or none go into the index, and to ensure that if none go
into the index that the index is not changed in any way.

I have UID's but they cannot ensure the latter property, i.e. they
cannot ensure side-effect-free rollbacks.

Yes, if you have no reliance on internal Lucene structures like doc-id's
and segments, then that shouldn't matter.  But many capabilities have
such reliance for good reasons.  E.g., ParallelReader, which is a public
supported class in Lucene, requires doc-id synchronization.  There are
similar good reasons for an application to take advantage of doc-ids.

Lucene uses doc-id's in many of its API's and so it is not surprising
that many applications rely on them, and I'm sure misuse them not fully
understanding the semantics and uncertainties of doc-id changes due to
merging segments with deletes.

Applications can use doc-ids for legitimate and beneficial purposes
while remaining semantically valid.  Making such capabilities efficient
and robust in all cases is facilitated by application control over when
doc-id's and segment structure change at a granularity larger than the
single Document.

If I had a vote it would be +1 on the direction Michael has proposed,
assuming it can be done robustly and without performance penalty.

Chuck


robert engels wrote on 01/15/2007 07:34 PM:
> I honestly think that having a unique OID as an indexed field and
> putting a layer on top of Lucene is the best solution to all of this.
> It makes it almost trivial, and you can implement transaction handling
> in a variety of ways.
>
> Attempting to make the doc ids "permanent" is a tough challenge,
> considering the original design called for them to be "non permanent".
>
> It seems doubtful that you cannot have some sort of primary key any
> way and be this concerned about the transactional nature of Lucene.
>
> I vote -1 on all of this. I think it will detract from the simple and
> efficient storage mechanism that Lucene uses.
>
> On Jan 15, 2007, at 11:19 PM, Chuck Williams wrote:
>
>> Ning Li wrote on 01/15/2007 06:29 PM:
>>> On 1/14/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>>>>   * The "support deleteDocuments in IndexWriter" (LUCENE-565) feature
>>>>     could have a more efficient implementation (just like Solr) when
>>>>     autoCommit is false, because deletes don't need to be flushed
>>>>     until commit() is called.  Whereas, now, they must be aggressively
>>>>     flushed on each checkpoint.
>>>
>>> If a reader can only open snapshots both for search and for
>>> modification, I think another change is needed besides the ones
>>> listed: assume the latest snapshot is segments_5 and the latest
>>> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
>>> snapshot segments_5, performs a few deletes and writes a new
>>> checkpoint segmentsx_8. The summary file segmentsx_8 should include
>>> the 2 new segments which are in segmentsx_7 but not in segments_5.
>>> Such segments to include are easily identifiable only if they are not
>>> merged with segments in the latest snapshot... All these won't be
>>> necessary if a reader always opens the latest checkpoint for
>>> modification, which will also support deletion of non-committed
>>> documents.
>> This problem seems worse.  I don't see how a reader and a writer can
>> independently compute and write checkpoints.  The adds in the writer
>> don't just create new segments, they replace existing ones through
>> merging.  And the merging changes doc-ids by expunging deletes.  It
>> seems that all deletes must be based on the most recent checkpoint, or
>> merging of checkpoints to create the next snapshot will be considerably
>> more complex.
>>
>> Chuck
>>
>>
>>
>
>
>




Re: adding "explicit commits" to Lucene?

Posted by robert engels <re...@ix.netcom.com>.
I honestly think that having a unique OID as an indexed field and  
putting a layer on top of Lucene is the best solution to all of this.
It makes it almost trivial, and you can implement transaction  
handling in a variety of ways.

Attempting to make the doc ids "permanent" is a tough challenge,  
considering the original design called for them to be "non permanent".

It seems doubtful that you cannot have some sort of primary key any  
way and be this concerned about the transactional nature of Lucene.

I vote -1 on all of this. I think it will detract from the simple and  
efficient storage mechanism that Lucene uses.

On Jan 15, 2007, at 11:19 PM, Chuck Williams wrote:

> Ning Li wrote on 01/15/2007 06:29 PM:
>> On 1/14/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>>>   * The "support deleteDocuments in IndexWriter" (LUCENE-565)  
>>> feature
>>>     could have a more efficient implementation (just like Solr) when
>>>     autoCommit is false, because deletes don't need to be flushed
>>>     until commit() is called.  Whereas, now, they must be  
>>> aggressively
>>>     flushed on each checkpoint.
>>
>> If a reader can only open snapshots both for search and for
>> modification, I think another change is needed besides the ones
>> listed: assume the latest snapshot is segments_5 and the latest
>> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
>> snapshot segments_5, performs a few deletes and writes a new
>> checkpoint segmentsx_8. The summary file segmentsx_8 should include
>> the 2 new segments which are in segmentsx_7 but not in segments_5.
>> Such segments to include are easily identifiable only if they are not
>> merged with segments in the latest snapshot... All these won't be
>> necessary if a reader always opens the latest checkpoint for
>> modification, which will also support deletion of non-committed
>> documents.
> This problem seems worse.  I don't see how a reader and a writer can
> independently compute and write checkpoints.  The adds in the writer
> don't just create new segments, they replace existing ones through
> merging.  And the merging changes doc-ids by expunging deletes.  It
> seems that all deletes must be based on the most recent checkpoint, or
> merging of checkpoints to create the next snapshot will be  
> considerably
> more complex.
>
> Chuck
>
>
>




Re: adding "explicit commits" to Lucene?

Posted by Chuck Williams <ch...@manawiz.com>.
Ning Li wrote on 01/15/2007 06:29 PM:
> On 1/14/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>>   * The "support deleteDocuments in IndexWriter" (LUCENE-565) feature
>>     could have a more efficient implementation (just like Solr) when
>>     autoCommit is false, because deletes don't need to be flushed
>>     until commit() is called.  Whereas, now, they must be aggressively
>>     flushed on each checkpoint.
>
> If a reader can only open snapshots both for search and for
> modification, I think another change is needed besides the ones
> listed: assume the latest snapshot is segments_5 and the latest
> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
> snapshot segments_5, performs a few deletes and writes a new
> checkpoint segmentsx_8. The summary file segmentsx_8 should include
> the 2 new segments which are in segmentsx_7 but not in segments_5.
> Such segments to include are easily identifiable only if they are not
> merged with segments in the latest snapshot... All these won't be
> necessary if a reader always opens the latest checkpoint for
> modification, which will also support deletion of non-committed
> documents.
This problem seems worse.  I don't see how a reader and a writer can
independently compute and write checkpoints.  The adds in the writer
don't just create new segments, they replace existing ones through
merging.  And the merging changes doc-ids by expunging deletes.  It
seems that all deletes must be based on the most recent checkpoint, or
merging of checkpoints to create the next snapshot will be considerably
more complex.

Chuck




Re: adding "explicit commits" to Lucene?

Posted by Ning Li <ni...@gmail.com>.
On 1/14/07, Michael McCandless <lu...@mikemccandless.com> wrote:
>   * The "support deleteDocuments in IndexWriter" (LUCENE-565) feature
>     could have a more efficient implementation (just like Solr) when
>     autoCommit is false, because deletes don't need to be flushed
>     until commit() is called.  Whereas, now, they must be aggressively
>     flushed on each checkpoint.

The idea of adding "explicit commits" is good! And in time - I was
just about to submit a latest patch for LUCENE-565. With this feature,
the frequency of reader open/close on large old segments could be
reduced when autoCommit is false.

Based on your proposal, however, an application wouldn't be able to
delete any documents that have not been committed since a reader
always opens a snapshot (segments_N), but not a checkpoint
(segmentsx_N). This functionality will be supported by LUCENE-565, but
I wonder if it should be supported in general. So maybe a reader can
open the latest checkpoint for modification, but only snapshots for
search...

If a reader can only open snapshots both for search and for
modification, I think another change is needed besides the ones
listed: assume the latest snapshot is segments_5 and the latest
checkpoint is segmentsx_7 with 2 new segments, then a reader opens
snapshot segments_5, performs a few deletes and writes a new
checkpoint segmentsx_8. The summary file segmentsx_8 should include
the 2 new segments which are in segmentsx_7 but not in segments_5.
Such segments to include are easily identifiable only if they are not
merged with segments in the latest snapshot... All these won't be
necessary if a reader always opens the latest checkpoint for
modification, which will also support deletion of non-committed
documents.
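The bookkeeping in that example reduces to simple set arithmetic.  The sketch below is hypothetical (toy segment names, not real Lucene summary-file code): a reader that opened snapshot segments_5 and writes checkpoint segmentsx_8 must union in the segments that appeared only in the later checkpoint segmentsx_7, which, as noted, is only well-defined if none of those segments were merged away.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of the summary-file union in Ning's example.
class CheckpointUnion {
    // snapshot  = segments listed in segments_5 (what the reader opened)
    // checkpoint = segments listed in segmentsx_7 (the later checkpoint)
    // result    = what segmentsx_8 must list
    static Set<String> nextCheckpoint(Set<String> snapshot,
                                      Set<String> checkpoint) {
        Set<String> result = new LinkedHashSet<>(snapshot);
        for (String seg : checkpoint)
            if (!snapshot.contains(seg))
                result.add(seg);   // carry over the "2 new segments"
        return result;
    }
}
```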

Lastly, hopefully the term "transaction" won't cause any confusion
since this "explicit commit" is much simpler than database transaction
where a database can guarantee the ACID properties for each of
multiple concurrent transactions.

Cheers,
Ning



Re: adding "explicit commits" to Lucene?

Posted by robert engels <re...@ix.netcom.com>.
Actually, my comment below was not quite accurate. It only matters on  
multiple CPU machines if you are writing everything to a memory index  
first.

If writing to a filesystem, then multiple threads on a single  
processor would allow more documents to be inverted while the disk  
writes were occurring, as long as both COULD be done concurrently.


On Jan 15, 2007, at 12:28 PM, robert engels wrote:

> I looked at doing a similar thing with the parallel 'inverting'.
>
> I then decided that it will only make a difference on a multiple  
> CPU machine, so I put it on the back burner.
>
> But if you have code already done...
>
> On Jan 15, 2007, at 12:24 PM, Chuck Williams wrote:
>
>> robert engels wrote on 01/15/2007 08:01 AM:
>>> Is your parallel adding code available?
>>>
>> There is an early version in LUCENE-600, but without the enhancements
>> described.  I didn't update that version because it didn't capture  
>> any
>> interest and requires Java 1.5 and so it seems will not be committed.
>>
>> I could update jira with the new version, but would have to create a
>> clean patch that applies against the lucene head.  My local copy is
>> diverged due to a number of uncommitted patches and so patches  
>> generated
>> from it contain other stuff.
>>
>> My use case for parallel subindexes is as an enabler for fast bulk
>> updates.  Only the subindexes containing changing fields need to be
>> updated, so long as the update algorithm does not change doc-ids.   
>> Even
>> though this requires rewriting entire segments using techniques  
>> similar
>> to those used in merging (but not purging deleted docs), I'm still
>> getting 30x (when many fields changed) to many hundreds-x (when  
>> only a
>> few fields changing) faster update performance than the batched
>> delete-add method on very large indexes (millions of documents,  
>> some very
>> large).
>>
>> Chuck
>>
>>
>>
>




Re: adding "explicit commits" to Lucene?

Posted by robert engels <re...@ix.netcom.com>.
I looked at doing a similar thing with the parallel 'inverting'.

I then decided that it will only make a difference on a multiple CPU  
machine, so I put it on the back burner.

But if you have code already done...

On Jan 15, 2007, at 12:24 PM, Chuck Williams wrote:

> robert engels wrote on 01/15/2007 08:01 AM:
>> Is your parallel adding code available?
>>
> There is an early version in LUCENE-600, but without the enhancements
> described.  I didn't update that version because it didn't capture any
> interest and requires Java 1.5 and so it seems will not be committed.
>
> I could update jira with the new version, but would have to create a
> clean patch that applies against the lucene head.  My local copy is
> diverged due to a number of uncommitted patches and so patches  
> generated
> from it contain other stuff.
>
> My use case for parallel subindexes is as an enabler for fast bulk
> updates.  Only the subindexes containing changing fields need to be
> updated, so long as the update algorithm does not change doc-ids.   
> Even
> though this requires rewriting entire segments using techniques  
> similar
> to those used in merging (but not purging deleted docs), I'm still
> getting 30x (when many fields changed) to many hundreds-x (when only a
> few fields changing) faster update performance than the batched
> delete-add method on very large indexes (millions of documents, some  
> very
> large).
>
> Chuck
>
>
>




Re: adding "explicit commits" to Lucene?

Posted by Chuck Williams <ch...@manawiz.com>.
robert engels wrote on 01/15/2007 08:01 AM:
> Is your parallel adding code available?
>
There is an early version in LUCENE-600, but without the enhancements
described.  I didn't update that version because it didn't capture any
interest and requires Java 1.5 and so it seems will not be committed.

I could update jira with the new version, but would have to create a
clean patch that applies against the lucene head.  My local copy is
diverged due to a number of uncommitted patches and so patches generated
from it contain other stuff.

My use case for parallel subindexes is as an enabler for fast bulk
updates.  Only the subindexes containing changing fields need to be
updated, so long as the update algorithm does not change doc-ids.  Even
though this requires rewriting entire segments using techniques similar
to those used in merging (but not purging deleted docs), I'm still
getting 30x (when many fields changed) to many hundreds-x (when only a
few fields changing) faster update performance than the batched
delete-add method on very large indexes (millions of documents, some very
large).

Chuck




Re: adding "explicit commits" to Lucene?

Posted by robert engels <re...@ix.netcom.com>.
Is your parallel adding code available?

On Jan 15, 2007, at 11:54 AM, Chuck Williams wrote:

>
> Michael McCandless wrote on 01/15/2007 01:49 AM:
>> Chuck,
>>
>>> Possibly related, one of the ways I improved concurrency in
>>> ParallelWriter was to break up IndexWriter.addDocument() into one  
>>> method
>>> to invert the document and create a RAMSegment and a second  
>>> method that
>>> takes the RAMSegment and merges it into the index.  This allows
>>> inversions to be processed in parallel, while merging is already a
>>> critical section.  (Side thought:  I've been wondering how hard  
>>> it would
>>> be to make merging not a critical section).  I had thought of the  
>>> method
>>> to take the RAMSegment and merge it to be the "commit" part of
>>> addDocument().
>>
>>> Your notion of commit is much better and more flexible, but  
>>> perhaps you
>>> could include this inversion/merge separation as well?
>>
>> I'm a little confused on what this would mean?  Do you mean  
>> opening up
>> separate public methods: one to invert (and get a segment back) and
>> one to append (and possibly merge) a segment to the index (keeping  
>> the
>> existing addDocument that would then just call these two)?  How would
>> this buy you more concurrency (since the current method indeed only
>> synchronizes around the merge part)?  Oh: would you behind the scenes
>> take each "single doc" segment and pre-merge them privately,
>> concurrently, possibly up to many levels, and then finally
>> add the merged segment into the index?  Ie, the beginnings of
>> "concurrent merge" described above?
>>
>> Actually couldn't we do this change today (ie without waiting for
>> explicit commits)?  It seems like a separable change.
>
> Yes, I've already made this change so it is independent, creating
> invertDocument(), addInvertedDocument() and abortInvertedDocument().
> This enables more concurrency in ParallelWriter because there are no
> synchronization restrictions at all on calling invertDocument().
> addInvertedDocument() has a synchronization requirement:  it can be
> called in parallel for each subdocument corresponding to the same
> document, but not for subdocuments corresponding to different  
> documents
> as this could break the required parallel subindex doc-id
> correspondence.  Because addDocument() (which is just
> addInvertedDocument(invertDocument())) contains the call to
> addInvertedDocument() it has the same synchronization requirement,
> preventing the extra parallelism in the invertDocument() calls.
>
> It seemed to me that this could be related to your explicit-commits
> idea since it also breaks up writes into an uncommitted local portion
> and committed portion.
>
> Hope you put your explicit commits idea together soon!
>
> Chuck
>
>
>




Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Chuck Williams wrote:
> Michael McCandless wrote on 01/15/2007 01:49 AM:
>> Chuck,
>>
>>> Possibly related, one of the ways I improved concurrency in
>>> ParallelWriter was to break up IndexWriter.addDocument() into one method
>>> to invert the document and create a RAMSegment and a second method that
>>> takes the RAMSegment and merges it into the index.  This allows
>>> inversions to be processed in parallel, while merging is already a
>>> critical section.  (Side thought:  I've been wondering how hard it would
>>> be to make merging not a critical section).  I had thought of the method
>>> to take the RAMSegment and merge it to be the "commit" part of
>>> addDocument().
 >>>
>>> Your notion of commit is much better and more flexible, but perhaps you
>>> could include this inversion/merge separation as well?
 >>
>> I'm a little confused on what this would mean?  Do you mean opening up
>> separate public methods: one to invert (and get a segment back) and
>> one to append (and possibly merge) a segment to the index (keeping the
>> existing addDocument that would then just call these two)?  How would
>> this buy you more concurrency (since the current method indeed only
>> synchronizes around the merge part)?  Oh: would you behind the scenes
>> take each "single doc" segment and pre-merge them privately,
>> concurrently, possibly up to many levels, and then finally
>> add the merged segment into the index?  Ie, the beginnings of
>> "concurrent merge" described above?
>>
>> Actually couldn't we do this change today (ie without waiting for
>> explicit commits)?  It seems like a separable change.
> 
> Yes, I've already made this change so it is independent, creating
> invertDocument(), addInvertedDocument() and abortInvertedDocument(). 
> This enables more concurrency in ParallelWriter because there are no
> synchronization restrictions at all on calling invertDocument(). 
> addInvertedDocument() has a synchronization requirement:  it can be
> called in parallel for each subdocument corresponding to the same
> document, but not for subdocuments corresponding to different documents
> as this could break the required parallel subindex doc-id
> correspondence.  Because addDocument() (which is just
> addInvertedDocument(invertDocument())) contains the call to
> addInvertedDocument() it has the same synchronization requirement,
> preventing the extra parallelism in the invertDocument() calls.
> 
> It seemed to me that this could be related to your explicit-commits
> idea since it also breaks up writes into an uncommitted local portion
> and committed portion.

Ahh I think I see: you needed to tease out that fine detail on what
synchronization is actually required (the fact that sub-documents can
be done entirely in parallel, but cross-documents cannot).  And the
sub-documents indeed give you excellent concurrency (if you make lots
of sub-documents) on boxes that have the CPU resources to allocate.
This is a neat change, but I think it's separate from explicit commits,
so we should keep them decoupled at this point.

Mike



Re: adding "explicit commits" to Lucene?

Posted by Chuck Williams <ch...@manawiz.com>.
Michael McCandless wrote on 01/15/2007 01:49 AM:
> Chuck,
>
>> Possibly related, one of the ways I improved concurrency in
>> ParallelWriter was to break up IndexWriter.addDocument() into one method
>> to invert the document and create a RAMSegment and a second method that
>> takes the RAMSegment and merges it into the index.  This allows
>> inversions to be processed in parallel, while merging is already a
>> critical section.  (Side thought:  I've been wondering how hard it would
>> be to make merging not a critical section).  I had thought of the method
>> to take the RAMSegment and merge it to be the "commit" part of
>> addDocument().
>
>> Your notion of commit is much better and more flexible, but perhaps you
>> could include this inversion/merge separation as well?
>
> I'm a little confused on what this would mean?  Do you mean opening up
> separate public methods: one to invert (and get a segment back) and
> one to append (and possibly merge) a segment to the index (keeping the
> existing addDocument that would then just call these two)?  How would
> this buy you more concurrency (since the current method indeed only
> synchronizes around the merge part)?  Oh: would you behind the scenes
> take each "single doc" segment and pre-merge them privately,
> concurrently, possibly up to many levels, and then finally
> add the merged segment into the index?  Ie, the beginnings of
> "concurrent merge" described above?
>
> Actually couldn't we do this change today (ie without waiting for
> explicit commits)?  It seems like a separable change.

Yes, I've already made this change so it is independent, creating
invertDocument(), addInvertedDocument() and abortInvertedDocument(). 
This enables more concurrency in ParallelWriter because there are no
synchronization restrictions at all on calling invertDocument(). 
addInvertedDocument() has a synchronization requirement:  it can be
called in parallel for each subdocument corresponding to the same
document, but not for subdocuments corresponding to different documents
as this could break the required parallel subindex doc-id
correspondence.  Because addDocument() (which is just
addInvertedDocument(invertDocument())) contains the call to
addInvertedDocument() it has the same synchronization requirement,
preventing the extra parallelism in the invertDocument() calls.
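The invert/append split described above can be modeled with plain JDK
concurrency: inversion runs freely in parallel, while the append that
assigns doc ids is the only critical section. This is a toy sketch,
not the real patch; InvertedDoc and toyInvert are hypothetical
stand-ins, not Lucene API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SplitAddDocument {
    // Stand-in for a one-document RAMSegment produced by inversion.
    record InvertedDoc(String text, int tokenCount) {}

    private final List<InvertedDoc> index = new ArrayList<>();

    // "invertDocument": expensive, needs no synchronization at all.
    static InvertedDoc toyInvert(String text) {
        return new InvertedDoc(text, text.split("\\s+").length);
    }

    // "addInvertedDocument": the only critical section; assigns the next doc id.
    synchronized int addInvertedDocument(InvertedDoc doc) {
        index.add(doc);
        return index.size() - 1;
    }

    public static void main(String[] args) throws Exception {
        SplitAddDocument w = new SplitAddDocument();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> ids = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            final String text = "document number " + i;
            // invert concurrently; only the append is serialized
            ids.add(pool.submit(() -> w.addInvertedDocument(toyInvert(text))));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(w.index.size());  // 100
    }
}
```

The parallel-index constraint discussed above is the extra wrinkle this
sketch omits: appends for subdocuments of the same document may run in
parallel, but appends across different documents must stay ordered.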

It seemed to me that this could be related to your explicit-commits
idea since it also breaks up writes into an uncommitted local portion
and committed portion.

Hope you put your explicit commits idea together soon!

Chuck




Re: adding "explicit commits" to Lucene?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Chuck,

> This seems to me to be a great idea, especially the ability to support
> index transactions.
> 
> ParallelWriter (original implementation in LUCENE-600 -- I have a much
> better one now) provides a companion writer to ParallelReader.  It takes
> a Document, breaks it up into subdocuments associated with parallel
> indexes that partition the fields, and writes those subdocuments into
> their respective parallel indexes.  ParallelReader requires that the
> parallel indexes remain doc-id synchronized, which severely limits the
> opportunity for concurrent writing due to the possibility of the reader
> reopening when the indexes are out of sync (more Documents in one than
> another) and due to errors writing some subdocument(s) of a set when the
> others succeed.
> 
> The new version of ParallelWriter, not in jira yet, provides more
> concurrency and provides better error recovery than the version there
> now, but it is still limited in possible concurrency and in the worst case
> (when other recovery options fail) may have to fully optimize the
> indexes to back out the case where only a subset of the subdocuments
> derived from a given document fail to write.  The root cause for the
> horrible error recovery case is the uncontrollable and unrevertable
> merging that may arise from adding a single document.
> 
> I believe what you propose would provide the foundation to fully solve
> these problems efficiently, yielding much more concurrency and
> guaranteeing efficient error recovery in ParallelWriter.  Also it would
> simplify some other cases where transactional integrity is essential in
> my current app.  So this really sounds great.

Neat!!  This sounds like a perfect fit: with explicit commits in the
index you should be able to greatly simplify ParallelWriter because
you're safe knowing readers would never open an "update in progress"
(ie a checkpoint segmentsx_N), and if you hit any error, you can
easily re-open your ParallelWriter against the last committed snapshot
(segments_N).  Ie your error recovery becomes trivial and correct.

I had not thought of this use case.  I think there are lots of
important use cases lurking out there that are enabled once we
have explicit commits.

> Possibly related, one of the ways I improved concurrency in
> ParallelWriter was to break up IndexWriter.addDocument() into one method
> to invert the document and create a RAMSegment and a second method that
> takes the RAMSegment and merges it into the index.  This allows
> inversions to be processed in parallel, while merging is already a
> critical section.  (Side thought:  I've been wondering how hard it would
> be to make merging not a critical section).  I had thought of the method
> to take the RAMSegment and merge it to be the "commit" part of
> addDocument().

Re side thought:

I think this may be another use case enabled by explicit commits: you
could imagine separate threads building up / merging their own private
set of segments and then merely adding them into the primary index.
What explicit commits can buy you is the fact that all these "private
segments" need not be made searchable until a commit() is called.  So
in-between commits there should be alot of room for concurrency in
merging segments.
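A rough stdlib-only model of that idea - private per-thread merging
with visibility deferred until commit() - might look like this (all
names here are illustrative, not Lucene API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PrivateMerges {
    // Segments added since the last commit; not yet visible to searchers.
    private final List<String> pending = Collections.synchronizedList(new ArrayList<>());
    // The snapshot searchers see; replaced atomically by commit().
    private volatile List<String> searchable = List.of();

    void addMergedSegment(String segment) { pending.add(segment); }
    void commit() { searchable = List.copyOf(pending); }   // publish snapshot
    int searchableCount() { return searchable.size(); }

    public static void main(String[] args) throws InterruptedException {
        PrivateMerges index = new PrivateMerges();
        ExecutorService pool = Executors.newFixedThreadPool(3);
        for (int t = 0; t < 3; t++) {
            final int id = t;
            pool.submit(() -> {
                // each thread merges its many tiny "one-doc" segments privately...
                String merged = String.join("+", "d" + id + "a", "d" + id + "b");
                index.addMergedSegment(merged);   // ...and adds only the result
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(index.searchableCount());  // 0 - not yet visible
        index.commit();
        System.out.println(index.searchableCount());  // 3
    }
}
```

Because nothing between commits is searchable, the threads never need
to coordinate on which intermediate merge state a reader might see.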

> Your notion of commit is much better and more flexible, but perhaps you
> could include this inversion/merge separation as well?

I'm a little confused on what this would mean?  Do you mean opening up
separate public methods: one to invert (and get a segment back) and
one to append (and possibly merge) a segment to the index (keeping the
existing addDocument that would then just call these two)?  How would
this buy you more concurrency (since the current method indeed only
synchronizes around the merge part)?  Oh: would you behind the scenes
take each "single doc" segment and pre-merge them privately,
concurrently, possibly up to many levels, and then finally
add the merged segment into the index?  Ie, the beginnings of
"concurrent merge" described above?

Actually couldn't we do this change today (ie without waiting for
explicit commits)?  It seems like a separable change.

Mike



Re: adding "explicit commits" to Lucene?

Posted by Ning Li <ni...@gmail.com>.
On 1/16/07, Yonik Seeley <yo...@apache.org> wrote:
> On 1/15/07, Chuck Williams <ch...@manawiz.com> wrote:
> > (Side thought:  I've been wondering how hard it would
> > be to make merging not a critical section).
>
> It would be very nice if segment merging didn't block the addition of
> new documents... it really doesn't need to.  I don't think it would be
> too hard to fix, but I haven't had the time to tackle it.

I've had one working for a while now. It's based on LUCENE-565.
Segment merging does not block addition or deletion of documents.

Cheers,
Ning



Re: adding "explicit commits" to Lucene?

Posted by Yonik Seeley <yo...@apache.org>.
On 1/15/07, Chuck Williams <ch...@manawiz.com> wrote:
> (Side thought:  I've been wondering how hard it would
> be to make merging not a critical section).

It would be very nice if segment merging didn't block the addition of
new documents... it really doesn't need to.  I don't think it would be
too hard to fix, but I haven't had the time to tackle it.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



Re: adding "explicit commits" to Lucene?

Posted by Chuck Williams <ch...@manawiz.com>.
Michael,

This seems to me to be a great idea, especially the ability to support
index transactions.

ParallelWriter (original implementation in LUCENE-600 -- I have a much
better one now) provides a companion writer to ParallelReader.  It takes
a Document, breaks it up into subdocuments associated with parallel
indexes that partition the fields, and writes those subdocuments into
their respective parallel indexes.  ParallelReader requires that the
parallel indexes remain doc-id synchronized, which severely limits the
opportunity for concurrent writing due to the possibility of the reader
reopening when the indexes are out of sync (more Documents in one than
another) and due to errors writing some subdocument(s) of a set when the
others succeed.

The new version of ParallelWriter, not in jira yet, provides more
concurrency and provides better error recovery than the version there
now, but it is still limited in possible concurrency and in the worst case
(when other recovery options fail) may have to fully optimize the
indexes to back out the case where only a subset of the subdocuments
derived from a given document fail to write.  The root cause for the
horrible error recovery case is the uncontrollable and unrevertable
merging that may arise from adding a single document.

I believe what you propose would provide the foundation to fully solve
these problems efficiently, yielding much more concurrency and
guaranteeing efficient error recovery in ParallelWriter.  Also it would
simplify some other cases where transactional integrity is essential in
my current app.  So this really sounds great.

Possibly related, one of the ways I improved concurrency in
ParallelWriter was to break up IndexWriter.addDocument() into one method
to invert the document and create a RAMSegment and a second method that
takes the RAMSegment and merges it into the index.  This allows
inversions to be processed in parallel, while merging is already a
critical section.  (Side thought:  I've been wondering how hard it would
be to make merging not a critical section).  I had thought of the method
to take the RAMSegment and merge it to be the "commit" part of
addDocument().

Your notion of commit is much better and more flexible, but perhaps you
could include this inversion/merge separation as well?

Chuck


Michael McCandless wrote on 01/14/2007 11:36 AM:
> Team,
>
> I've been struggling to find a clean solution for LUCENE-710, when I
> thought of a simple addition to Lucene ("explicit commits") that would
> I think resolve LUCENE-710 and would fix a few other outstanding
> issues when readers are using a "live" index (being updated by a
> writer).
>
> The basic idea is to add an explicit "commit" operation to Lucene.
>
> This is the same nice feature Solr has, but just a different
> implementation (in Lucene core, in a single index, instead).  The
> commit makes a "point in time" snapshot (term borrowed from Solr!)
> available for searching.
>
> The implementation is surprisingly simple (see below) and completely
> backwards compatible.
>
> I'd like to get some feedback on the idea/implementation.
>
>
> Details...: right now, Lucene writes a new segments_N file at various
> times: when a writer (or reader that's writing deletes/norms) needs to
> flush its pending changes to disk; when a writer merges segments; when
> a writer is closed; multiple times during optimize/addIndexes; etc.
>
> These times are not controllable / predictable to the developer using
> Lucene.
>
> A new reader always opens the last segments_N written, and, when a
> reader uses isCurrent() to check whether it should re-open (the
> suggested way), that method always returns false (meaning you should
> re-open) if there are any new segments_N files.
>
> So it's somewhat uncontrollable to the developer what state the index
> is in when you [re-]open a reader.
>
> People work around this today by adding logic above Lucene so that the
> writer separately communicates to readers when is a good time to
> refresh.  But with "explicit commits", readers could instead look
> directly at the index and pick the right segments_N to refresh to.
>
> I'm proposing that we separate the writing of a new segments_N file
> into those writes that are done automatically by Lucene (I'll call
> these "checkpoints") from meaningful (to the application) commits that
> are done explicitly by the developer at known times (I'll call this
> "committing a snapshot").  I would add a new boolean mode to
> IndexWriter called "autoCommit", and a new public method "commit()" to
> IndexWriter and IndexReader (we'd have to rename the current protected
> commit() in IndexReader).
>
> When autoCommit is true, this means every write of a segments_N file
> will be "commit a snapshot", meaning readers will then use it for
> searching.  This will be the default and this is exactly how Lucene
> behaves today.  So this change is completely backwards compatible.
>
> When autoCommit is false, then when Lucene chooses to save a
> segments_N file it's just a "checkpoint": a reader would not open or
> re-open to the checkpoint.  This means the developer must then call
> IndexWriter.commit() or IndexReader.commit() in order to "commit a
> snapshot" at the right time, thereby telling readers that this
> segments_N file is a valid one to switch to for searching.
>
>
> The implementation is very simple (I have an initial coarse prototype
> working with all but the last bullet):
>
>   * If a segments_N file is just a checkpoint, it's named
>     "segmentsx_N" (note the added 'x'); if it's a snapshot, it's named
>     "segments_N".  No other changes to the index format.
>
>   * A reader by default opens the latest snapshot but can optionally
>     open a specific N (segments_N) snapshot.
>
>   * A writer by default starts from the most recent "checkpoint" but
>     may also take a specific checkpoint or snapshot point N
>     (segments_N) to start from (to allow rollback).
>
>   * Change IndexReader.isCurrent() to see if there are any newer
>     snapshots but disregard newer checkpoints.
>
>   * When a writer is in autoCommit=false mode, it always writes to the
>     next segmentsx_N; else it writes to segments_N.
>
>   * The commit() method would just write to the next segments_N file
>     and return the N it had written (in case application needs to
>     re-use it later).
>
>   * IndexFileDeleter would need to have a slightly smarter policy when
>     autoCommit=false, ie, "don't delete anything referenced by either
>     the past N snapshots or if the snapshot was obsoleted less than X
>     minutes ago".
>
>
> I think there are some compelling things this could solve:
>
>   * The "delete then add" problem (really a special but very common
>     case of general transactions):
>
>     Right now when you want to update a bunch of documents in a Lucene
>     index, it's best to open a reader, do a "batch delete", close the
>     reader, open a writer, do a "batch add", close the writer.  This
>     is the suggested way.
>
>     The open risk here is that a reader could refresh at any time
>     during these operations, and find that a bunch of documents have
>     been deleted but not yet added again.
>
>     Whereas, with autoCommit false you could do this entire operation
>     (batch delete then batch add), and then call the final commit() in
>     the end, and readers would know not to re-open the index until
>     that final commit() succeeded.
>
>   * The "using too much disk space during optimize" problem:
>
>     This came up on the user's list recently: if you aggressively
>     refresh readers while optimize() is running, you can tie up much
>     more disk space than you'd expect, because your readers are
>     holding open all the [possibly very large] intermediate segments.
>
>     Whereas, if autoCommit is false, then developer calls optimize()
>     and then calls commit(), the readers would know not to re-open
>     until optimize was complete.
>
>   * More general transactions:
>
>     It has come up a fair number of times how to make Lucene
>     transactional, either by itself ("do the following complex series
>     of index operations but if there is any failure, rollback to the
>     start, and don't expose result to searcher until all operations
>     are done") or as part of a larger transaction eg involving a
>     relational database.
>
>     EG, if you want to add a big set of documents to Lucene, but not
>     make them searchable until they are all added, or until a specific
>     time (eg Monday @ 9 AM), you can't do that easily today but it
>     would be simple with explicit commits.
>
>     I believe this change would make transactions work correctly with
>     Lucene.
>
>   * LUCENE-710 ("implement point in time searching without relying on
>     filesystem semantics"), also known as "getting Lucene to work
>     correctly over NFS".
>
>     I think this issue is nearly solved when autoCommit=false, as long
>     as we can adopt a shared policy on "when readers refresh" to match
>     the new deletion policy (described above).  Basically, as long as
>     the deleter and readers are playing by the same "refresh rules"
>     and the writer gives the readers enough time to switch/warm, then
>     the deleter should never delete something in use by a reader.
>
>
>
> There are also some neat future things made possible:
>
>   * The "support deleteDocuments in IndexWriter" (LUCENE-565) feature
>     could have a more efficient implementation (just like Solr) when
>     autoCommit is false, because deletes don't need to be flushed
>     until commit() is called.  Whereas, now, they must be aggressively
>     flushed on each checkpoint.
>
>   * More generally, because "checkpoints" do not need to be usable by
>     a reader/searcher, other neat optimizations might be possible.
>
>     EG maybe the merge policy could be improved if it knows that
>     certain segments are "just checkpoints" and are not involved in
>     searching.
>
>   * I could simplify the approach for my recent addIndexes changes
>     (LUCENE-702) to use this, instead of it's current approach (wish I
>     had thought of this sooner: ugh!.).
>
>   * A single index could hold many snapshots, and, we could enable a
>     reader to explicitly open against an older snapshot.  EG maybe you
>     take weekly and a monthly snapshot because you sometimes want to
>     go back and "run a search on last week's catalog".
>
> Feedback?
>
> Mike
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: adding "explicit commits" to Lucene?

Posted by robert engels <re...@ix.netcom.com>.
I think that you will find a much larger performance decrease in doing
things this way - if the external resource is a db, or any
network-accessed resource.

When even just a single document is changed in the Lucene index you
could have MILLIONS of changes to internal doc ids (if, say, an early
document was deleted).

It seems far better to store the external id in the Lucene index.

You will find that the performance penalty of looking up the Lucene
document by the external id (vs. the internal doc #) is far less than
the performance penalty of updating every document in the external
index when the Lucene index is merged.

The only case where I can see this being of any benefit is if the
Lucene index RARELY if EVER changes - anything else, and you will have
big problems.

Now, if Lucene is changed to support point in time searching
(basically never delete any index files), you might be able to do what
you propose: just create a Directory exposing only the segments up to
that point in time.

Sounds VERY messy to me.
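The doc-id-shift concern above can be seen in a toy model (illustrative names only, nothing here is Lucene API): deleting an early document and compacting the index renumbers every later document, while a stored external ID remains a stable handle.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: internal doc numbers are just positions in a segment, so
// compacting away an early document shifts every later number, while a
// lookup keyed on the stored external ID is unaffected.
public class DocIdShift {

    // Squeeze out one deleted slot, as a segment merge would.
    public static List<String> compact(List<String> docs, int deletedSlot) {
        List<String> out = new ArrayList<>(docs);
        out.remove(deletedSlot);
        return out;
    }

    public static void main(String[] args) {
        List<String> extIds = List.of("a", "b", "c", "d"); // docs 0..3
        List<String> after = compact(extIds, 0);           // delete doc 0
        // Every surviving document's internal number moved down by one...
        System.out.println("doc # of 'd': " + after.indexOf("d")); // was 3, now 2
        // ...but a lookup by external ID still finds the right document.
        System.out.println("found 'c': " + after.contains("c"));
    }
}
```

With millions of documents, that single early delete would invalidate millions of externally stored doc numbers, which is the argument for keying the external system on the external ID instead.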

On Jan 15, 2007, at 3:12 PM, Doron Cohen wrote:

> Also related is the request, made several times on the list, to be able
> to control when docids change, for applications that need to maintain
> some mapping between external IDs and Lucene docs but for performance
> reasons cannot afford to rely only on storing external (DB) IDs in a
> Lucene field. For instance, see the recent discussion "Making document
> numbers persistent" in java-user.
>
> So, an application-controlled commit would allow an application not to
> "experience" document numbering changes - no docid changes would affect
> the application until a commit is issued. So the application would be
> able to call optimize and then issue a commit, thereby exposing docid
> changes.
>
> One disadvantage of controlling id changes like this is that search
> results could lag far behind index updates, unless optimize is called.
>
> Therefore - though that's another issue, of course - I am wondering if
> there might be interest in allowing applications to control whether
> deleted docs are allowed to be removed/squeezed out or not.




Re: adding "explicit commits" to Lucene?

Posted by Doron Cohen <DO...@il.ibm.com>.
Also related is the request, made several times on the list, to be able to
control when docids change, for applications that need to maintain
some mapping between external IDs and Lucene docs but for performance
reasons cannot afford to rely only on storing external (DB) IDs in a
Lucene field. For instance, see the recent discussion "Making document
numbers persistent" in java-user.

So, an application-controlled commit would allow an application not to
"experience" document numbering changes - no docid changes would affect the
application until a commit is issued. So the application would be able to
call optimize and then issue a commit, thereby exposing docid changes.

One disadvantage of controlling id changes like this is that search results
could lag far behind index updates, unless optimize is called.

Therefore - though that's another issue, of course - I am wondering if there
might be interest in allowing applications to control whether deleted docs
are allowed to be removed/squeezed out or not.
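In application code, the stability described above might look like this (a toy model; the class and helper are hypothetical, not Lucene API): the externalId -> docId map only needs rebuilding once, at the commit that exposes the squeezed-out numbering, rather than after every background merge.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model: if docids only change when the application issues a commit,
// the app can rebuild its externalId -> docId map exactly once per commit
// instead of chasing every merge Lucene performs internally.
public class DocIdMap {

    // Rebuild the mapping from the post-commit ordering of live documents.
    public static Map<String, Integer> remap(List<String> liveExtIds) {
        Map<String, Integer> m = new LinkedHashMap<>();
        for (int i = 0; i < liveExtIds.size(); i++) {
            m.put(liveExtIds.get(i), i);
        }
        return m;
    }

    public static void main(String[] args) {
        // Before optimize+commit: "b" is deleted but its slot not yet reclaimed.
        Map<String, Integer> before = remap(List.of("a", "b", "c"));
        // After optimize+commit: "b" is squeezed out and "c" shifts down.
        Map<String, Integer> after = remap(List.of("a", "c"));
        System.out.println("c before: " + before.get("c")); // 2
        System.out.println("c after:  " + after.get("c"));  // 1
    }
}
```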

Michael McCandless <lu...@mikemccandless.com> wrote on 14/01/2007
13:36:34:

> Team,
>
> I've been struggling to find a clean solution for LUCENE-710, when I
> thought of a simple addition to Lucene ("explicit commits") that would
> I think resolve LUCENE-710 and would fix a few other outstanding
> issues when readers are using a "live" index (being updated by a
> writer).
>
> The basic idea is to add an explicit "commit" operation to Lucene.
>
> This is the same nice feature Solr has, but just a different
> implementation (in Lucene core, in a single index, instead).  The
> commit makes a "point in time" snapshot (term borrowed from Solr!)
> available for searching.
>
> The implementation is surprisingly simple (see below) and completely
> backwards compatible.
>
> I'd like to get some feedback on the idea/implementation.
>
>
> Details...: right now, Lucene writes a new segments_N file at various
> times: when a writer (or reader that's writing deletes/norms) needs to
> flush its pending changes to disk; when a writer merges segments; when
> a writer is closed; multiple times during optimize/addIndexes; etc.
>
> These times are not controllable / predictable to the developer using
> Lucene.
>
> A new reader always opens the last segments_N written, and, when a
> reader uses isCurrent() to check whether it should re-open (the
> suggested way), that method always returns false (meaning you should
> re-open) if there are any new segments_N files.
>
> So it's somewhat uncontrollable to the developer what state the index
> is in when you [re-]open a reader.
>
> People work around this today by adding logic above Lucene so that the
> writer separately communicates to readers when is a good time to
> refresh.  But with "explicit commits", readers could instead look
> directly at the index and pick the right segments_N to refresh to.
>
> I'm proposing that we separate the writing of a new segments_N file
> into those writes that are done automatically by Lucene (I'll call
> these "checkpoints") from meaningful (to the application) commits that
> are done explicitly by the developer at known times (I'll call this
> "committing a snapshot").  I would add a new boolean mode to
> IndexWriter called "autoCommit", and a new public method "commit()" to
> IndexWriter and IndexReader (we'd have to rename the current protected
> commit() in IndexReader).
>
> When autoCommit is true, this means every write of a segments_N file
> will be "commit a snapshot", meaning readers will then use it for
> searching.  This will be the default and this is exactly how Lucene
> behaves today.  So this change is completely backwards compatible.
>
> When autoCommit is false, then when Lucene chooses to save a
> segments_N file it's just a "checkpoint": a reader would not open or
> re-open to the checkpoint.  This means the developer must then call
> IndexWriter.commit() or IndexReader.commit() in order to "commit a
> snapshot" at the right time, thereby telling readers that this
> segments_N file is a valid one to switch to for searching.
>
>
> The implementation is very simple (I have an initial coarse prototype
> working with all but the last bullet):
>
>    * If a segments_N file is just a checkpoint, it's named
>      "segmentsx_N" (note the added 'x'); if it's a snapshot, it's named
>      "segments_N".  No other changes to the index format.
>
>    * A reader by default opens the latest snapshot but can optionally
>      open a specific N (segments_N) snapshot.
>
>    * A writer by default starts from the most recent "checkpoint" but
>      may also take a specific checkpoint or snapshot point N
>      (segments_N) to start from (to allow rollback).
>
>    * Change IndexReader.isCurrent() to see if there are any newer
>      snapshots but disregard newer checkpoints.
>
>    * When a writer is in autoCommit=false mode, it always writes to the
>      next segmentsx_N; else it writes to segments_N.
>
>    * The commit() method would just write to the next segments_N file
>      and return the N it had written (in case application needs to
>      re-use it later).
>
>    * IndexFileDeleter would need to have a slightly smarter policy when
>      autoCommit=false, ie, "don't delete anything referenced by either
>      the past N snapshots or if the snapshot was obsoleted less than X
>      minutes ago".
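The reader-side selection rule those bullets imply can be sketched as follows (hypothetical code modeling the proposed "segments_N" vs "segmentsx_N" naming; parsing the suffix as a decimal is a simplification):

```java
import java.util.List;

// Sketch of the proposed naming: "segments_N" is a committed snapshot,
// "segmentsx_N" is only a checkpoint. A reader opening the index picks
// the highest-N snapshot and ignores checkpoints entirely.
public class SnapshotPicker {

    public static String latestSnapshot(List<String> files) {
        String best = null;
        long bestGen = -1;
        for (String f : files) {
            // "segmentsx_N" does not match the "segments_" prefix, so
            // checkpoints are invisible to readers.
            if (f.startsWith("segments_")) {
                long gen = Long.parseLong(f.substring("segments_".length()));
                if (gen > bestGen) {
                    bestGen = gen;
                    best = f;
                }
            }
        }
        return best; // null if no snapshot has been committed yet
    }

    public static void main(String[] args) {
        // Checkpoints 4 and 5 exist, but the last *snapshot* is generation 3.
        List<String> files =
            List.of("segments_2", "segments_3", "segmentsx_4", "segmentsx_5");
        System.out.println(latestSnapshot(files)); // segments_3
    }
}
```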
>
>
> I think there are some compelling things this could solve:
>
>    * The "delete then add" problem (really a special but very common
>      case of general transactions):
>
>      Right now when you want to update a bunch of documents in a Lucene
>      index, it's best to open a reader, do a "batch delete", close the
>      reader, open a writer, do a "batch add", close the writer.  This
>      is the suggested way.
>
>      The open risk here is that a reader could refresh at any time
>      during these operations, and find that a bunch of documents have
>      been deleted but not yet added again.
>
>      Whereas, with autoCommit false you could do this entire operation
>      (batch delete then batch add), and then call the final commit() in
>      the end, and readers would know not to re-open the index until
>      that final commit() succeeded.
>
>    * The "using too much disk space during optimize" problem:
>
>      This came up on the user's list recently: if you aggressively
>      refresh readers while optimize() is running, you can tie up much
>      more disk space than you'd expect, because your readers are
>      holding open all the [possibly very large] intermediate segments.
>
>      Whereas, if autoCommit is false, then developer calls optimize()
>      and then calls commit(), the readers would know not to re-open
>      until optimize was complete.
>
>    * More general transactions:
>
>      It has come up a fair number of times how to make Lucene
>      transactional, either by itself ("do the following complex series
>      of index operations but if there is any failure, rollback to the
>      start, and don't expose result to searcher until all operations
>      are done") or as part of a larger transaction eg involving a
>      relational database.
>
>      EG, if you want to add a big set of documents to Lucene, but not
>      make them searchable until they are all added, or until a specific
>      time (eg Monday @ 9 AM), you can't do that easily today but it
>      would be simple with explicit commits.
>
>      I believe this change would make transactions work correctly with
>      Lucene.
>
>    * LUCENE-710 ("implement point in time searching without relying on
>      filesystem semantics"), also known as "getting Lucene to work
>      correctly over NFS".
>
>      I think this issue is nearly solved when autoCommit=false, as long
>      as we can adopt a shared policy on "when readers refresh" to match
>      the new deletion policy (described above).  Basically, as long as
>      the deleter and readers are playing by the same "refresh rules"
>      and the writer gives the readers enough time to switch/warm, then
>      the deleter should never delete something in use by a reader.
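The visibility rule behind the "delete then add" and optimize cases above can be modeled in a few lines (purely illustrative classes, not Lucene's): checkpoints mutate only the writer's pending state, and readers see nothing new until commit() publishes a snapshot.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of autoCommit=false: the writer's checkpoints touch only
// "pending" state; readers always open the last committed snapshot, so a
// refresh mid-update never observes the half-done index.
public class CommitModel {
    private List<String> committed = List.of();             // what readers see
    private final List<String> pending = new ArrayList<>(); // writer's state

    public void batchDelete() { pending.clear(); }          // checkpoint only
    public void batchAdd(List<String> docs) { pending.addAll(docs); }
    public void commit() { committed = List.copyOf(pending); } // publish snapshot
    public List<String> openReader() { return committed; }

    public static void main(String[] args) {
        CommitModel idx = new CommitModel();
        idx.batchAdd(List.of("old1", "old2"));
        idx.commit();

        idx.batchDelete();                    // start the update...
        System.out.println(idx.openReader()); // still the old snapshot
        idx.batchAdd(List.of("new1", "new2"));
        idx.commit();                         // ...and only now does it show
        System.out.println(idx.openReader());
    }
}
```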
>
>
>
> There are also some neat future things made possible:
>
>    * The "support deleteDocuments in IndexWriter" (LUCENE-565) feature
>      could have a more efficient implementation (just like Solr) when
>      autoCommit is false, because deletes don't need to be flushed
>      until commit() is called.  Whereas, now, they must be aggressively
>      flushed on each checkpoint.
>
>    * More generally, because "checkpoints" do not need to be usable by
>      a reader/searcher, other neat optimizations might be possible.
>
>      EG maybe the merge policy could be improved if it knows that
>      certain segments are "just checkpoints" and are not involved in
>      searching.
>
>    * I could simplify the approach for my recent addIndexes changes
>      (LUCENE-702) to use this, instead of its current approach (wish I
>      had thought of this sooner: ugh!).
>
>    * A single index could hold many snapshots, and, we could enable a
>      reader to explicitly open against an older snapshot.  EG maybe you
>      take weekly and a monthly snapshot because you sometimes want to
>      go back and "run a search on last week's catalog".
>
> Feedback?
>
> Mike
>

