Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/01/29 15:37:43 UTC

[jira] Created: (LUCENE-2897) apply delete-by-Term and docID immediately to newly flushed segments

apply delete-by-Term and docID immediately to newly flushed segments
--------------------------------------------------------------------

                 Key: LUCENE-2897
                 URL: https://issues.apache.org/jira/browse/LUCENE-2897
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: 3.2, 4.0


Spinoff from LUCENE-2324.

When we flush deletes today, we keep them as buffered Term/Query/docIDs that need to be deleted.  But for a newly flushed segment (i.e. fresh out of the DocumentsWriterPerThread, DWPT), this is silly, because during flush we visit all terms and we know their docIDs.  So it's more efficient to apply the deletes (for this one segment) at that time.

We still must buffer deletes for all prior segments, but these deletes don't need to map to a docIDUpto anymore; i.e. we just need a Set.
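
To make the idea concrete, here is a minimal sketch in plain Java. It is not the code from the patch: the class name, the method name, and the use of a Map for postings and a BitSet for deleted docs are all assumptions made for illustration. It walks the buffered terms once, as a flush would, and marks each doc hit by a buffered delete-by-Term, honoring a docIDUpto so that docs added after the delete call survive.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from Lucene.
public class FlushDeletesSketch {

    /**
     * Walks the in-memory postings once, in sorted term order, and marks
     * every doc hit by a buffered delete-by-term.
     *
     * postings:    term -> docIDs containing that term
     * deleteTerms: term -> docIDUpto; the delete only covers docIDs below
     *              docIDUpto, so docs added after the delete call survive
     */
    static BitSet applyDeletesDuringFlush(Map<String, int[]> postings,
                                          Map<String, Integer> deleteTerms,
                                          int maxDoc) {
        BitSet deleted = new BitSet(maxDoc);
        for (Map.Entry<String, int[]> e : new TreeMap<>(postings).entrySet()) {
            // A real flush would also write this term's postings here.
            Integer docIDUpto = deleteTerms.get(e.getKey());
            if (docIDUpto == null) {
                continue; // no buffered delete for this term
            }
            for (int docID : e.getValue()) {
                if (docID < docIDUpto) {
                    deleted.set(docID);
                }
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new TreeMap<>();
        postings.put("id:1", new int[] {0, 2}); // doc 2 re-adds id:1 after the delete
        postings.put("id:2", new int[] {1});
        postings.put("body:lucene", new int[] {0, 1, 2});

        // The delete of id:1 was issued when two docs had been buffered.
        Map<String, Integer> deleteTerms = Map.of("id:1", 2);

        System.out.println(applyDeletesDuringFlush(postings, deleteTerms, 3));
    }
}
```

In this sketch the docIDUpto is only needed for the segment being flushed. For prior segments, every doc predates the delete, so the key set of deleteTerms alone would be enough, which is the "we just need a Set" point above.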

This issue should wait until LUCENE-1076 is in since that issue cuts over buffered deletes to a transactional stream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2897) apply delete-by-Term and docID immediately to newly flushed segments

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2897:
---------------------------------------

    Fix Version/s:     (was: 3.1)
                   3.2

I changed my mind!  Pushing this to 3.2 now.



[jira] Updated: (LUCENE-2897) apply delete-by-Term and docID immediately to newly flushed segments

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2897:
---------------------------------------

    Fix Version/s:     (was: 3.2)
                   3.1

I think we can do this for 3.1.



[jira] Updated: (LUCENE-2897) apply delete-by-Term and docID immediately to newly flushed segments

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2897:
---------------------------------------

    Attachment: LUCENE-2897.patch

Initial patch.

The approach is nice and simple ;)  All tests pass, but I put a bunch of nocommits in there, and I stopped short of the stuff I'll have to redo after committing out-of-order (ooo) merges.



[jira] [Resolved] (LUCENE-2897) apply delete-by-Term and docID immediately to newly flushed segments

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-2897.
----------------------------------------

    Resolution: Fixed



[jira] Updated: (LUCENE-2897) apply delete-by-Term and docID immediately to newly flushed segments

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2897:
---------------------------------------

    Attachment: LUCENE-2897.patch

Patch -- I think it's ready to commit!



[jira] Commented: (LUCENE-2897) apply delete-by-Term and docID immediately to newly flushed segments

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988472#action_12988472 ] 

Michael McCandless commented on LUCENE-2897:
--------------------------------------------

bq. I had to read this a few times, yes it's very elegant as we're skipping the postings that otherwise would be deleted immediately after flush, and we're reusing the terms map already in DWPT.

Well... I think we can't [easily] skip writing the postings, because it could result in non-deterministic behavior (I put a comment on this in the patch).

If we did the flush w/ 2 passes (first pass to mark all del docs and 2nd to flush) then we could skip writing postings of docs that were deleted.  But I suspect that's too much cost on flush.

With a single pass, we'd end up writing some postings for the doc, but not all, depending on the order in which its terms arrived vs its deleted terms.
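
A small sketch of that single-pass problem, in plain Java. All names here are hypothetical and this is not the patch's code; it assumes terms are visited in sorted order and, for brevity, ignores docIDUpto. A posting is skipped only if its doc is already known to be deleted when the term is visited, so a doc's postings for terms visited before its delete term still get written.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from Lucene.
public class SinglePassSkipSketch {

    /**
     * Single pass over terms in sorted order. A posting is "written" unless
     * its doc is already known to be deleted when the term is visited.
     * Returns the written postings as "term->docID" strings.
     */
    static List<String> flushSkippingKnownDeletes(TreeMap<String, int[]> postings,
                                                  Set<String> deleteTerms,
                                                  int maxDoc) {
        BitSet deleted = new BitSet(maxDoc);
        List<String> written = new ArrayList<>();
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            boolean isDeleteTerm = deleteTerms.contains(e.getKey());
            for (int docID : e.getValue()) {
                if (isDeleteTerm) {
                    deleted.set(docID); // the delete is discovered only now
                }
                if (!deleted.get(docID)) {
                    written.add(e.getKey() + "->" + docID);
                }
            }
        }
        return written;
    }

    public static void main(String[] args) {
        // Doc 0 contains three terms and is deleted by its ID term.
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("apple", new int[] {0});
        postings.put("id:1", new int[] {0});
        postings.put("zebra", new int[] {0});

        // "apple" is visited before "id:1", so its posting for doc 0 is
        // written; "id:1" and "zebra" are skipped.
        System.out.println(flushSkippingKnownDeletes(postings, Set.of("id:1"), 1));
    }
}
```

The deleted doc ends up with some of its postings on disk and some not, purely as a function of where the delete term falls in the visit order. Marking all deleted docs in a first pass would avoid that, at the flush cost mentioned above.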

I mean, in practice, an app is gonna delete against an ID field (typically), so if we "knew" that down deep here in Luceneland we could do the first pass only against that one field...

Also, merge is still going to have to apply del docs, since e.g. stored fields have already been written for the deleted docs.



[jira] Commented: (LUCENE-2897) apply delete-by-Term and docID immediately to newly flushed segments

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988474#action_12988474 ] 

Michael Busch commented on LUCENE-2897:
---------------------------------------

Yeah this is nice.  I was thinking we'd switch to live deletes with RT, because then we can also handle delete-by-query like this.

So we still have to buffer the delete queries per DWPT, but this solves the updateDocument() problem.



[jira] Commented: (LUCENE-2897) apply delete-by-Term and docID immediately to newly flushed segments

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988481#action_12988481 ] 

Michael McCandless commented on LUCENE-2897:
--------------------------------------------

bq. Instead we're building the deleted docs BV as we're flushing.

Right, that's what the patch does.  Seems to work great :)

bq. I was thinking we'd switch to live deletes with RT, because then we can also handle delete-by-query like this.

Right, though for RT we need to apply the delete immediately (vs this patch which applies on flush).

And delete-by-Query is still buffered...



[jira] Commented: (LUCENE-2897) apply delete-by-Term and docID immediately to newly flushed segments

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988473#action_12988473 ] 

Jason Rutherglen commented on LUCENE-2897:
------------------------------------------

bq. Well... I think we can't [easily] skip writing the postings, because could result in non-deterministic behavior (I put a comment on this in the patch).

Instead we're building the deleted docs BV as we're flushing.
