Posted to dev@lucene.apache.org by "Shai Erera (JIRA)" <ji...@apache.org> on 2009/06/19 23:02:07 UTC

[jira] Commented: (LUCENE-1705) Add deleteAllDocuments() method to IndexWriter

    [ https://issues.apache.org/jira/browse/LUCENE-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722006#action_12722006 ] 

Shai Erera commented on LUCENE-1705:
------------------------------------

My search app has such a scenario, and currently we just delete all the documents matching certain criteria (something similar to the MatchAllDocsQuery approach above). But I actually think that's the wrong approach. If you want to delete all the documents from the index, you'd better create a new one. The main reason is that if your index has, say, 10M documents, a deleteAll() will keep those 10M in the index, marked as deleted, and when you re-index, the index size will be doubled. Worse still, the deleted documents may belong to segments which will not be merged/optimized right away (depending on your mergeFactor setting), and will therefore stick around for a long time (until you call optimize() or expungeDeletes()).

But creating a new IndexWriter right away, overwriting the current index, is not so smart either, because your users will be left with no search results until the new index has accumulated enough documents. Therefore I think the solution for such a scenario should be:
# Call writer.rollback() - abort all current operations, cancel everything until the last commit.
# Create a new IndexWriter in a new directory and re-index everything.
# In the meantime, all your search operations go against the current index, which you know is not going to change until the other one is rebuilt. You can therefore also optimize things by opening a single IndexReader and stopping any accounting your code may do - just leave it open.
# When re-indexing has completed, sync all your code and:
#* Define your workDir to be the new index dir. That way new searches can begin right away on the new index.
#* Safely delete the old index dir (you probably need to do something here to ensure no readers are still open against this dir, etc.).
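
To make the last step concrete, here is a minimal sketch of the switch-over in plain JDK code, with no Lucene calls. IndexDirSwitcher and its method names are hypothetical and made up for illustration; it only tracks which directory new searches should use, and it leaves the "no readers are still open" check to the caller.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.atomic.AtomicReference;
import java.util.stream.Stream;

/** Hypothetical helper: tracks which directory new searches should use. */
class IndexDirSwitcher {
    private final AtomicReference<Path> workDir;

    IndexDirSwitcher(Path initialDir) {
        this.workDir = new AtomicReference<>(initialDir);
    }

    /** Directory that new searches should open readers against. */
    Path currentDir() {
        return workDir.get();
    }

    /** Atomically point new searches at the rebuilt index; returns the old dir. */
    Path switchTo(Path rebuiltDir) {
        return workDir.getAndSet(rebuiltDir);
    }

    /** Delete an old index dir. The caller must ensure no reader still has it open. */
    static void deleteDir(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so files are deleted before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

The application would call switchTo(newDir) once re-indexing has committed, and call deleteDir() on the returned path only after the last reader on the old directory has been closed.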

That's a high-level description and I realize it may have some holes here and there, but you get the point.

If we were to create a deleteAll() method, I'd expect it to work that way. I.e., the solution you proposed above (write a new segments file referencing no segments) would prevent all searches until something new is actually re-indexed, right?

I have to admit though, that I don't have an idea yet on how it can be done inside Lucene, such that new readers will see the old segments, while when I finish re-indexing and call commit, the previous segments will just be deleted.

A wild shot (and then I'll go to sleep on it) - how about if you re-index everything, not committing during that time at all. Readers that are open against the current directory will see all the documents, EXCEPT the new ones you're adding (same for new readers that you may open). When you're done re-indexing, you call a commitNewOnly, which creates an empty segments file and then calls commit. That way, assuming you're using KeepOnlyLastCommitDeletionPolicy, once the existing readers close, any new reader that is opened will see the new segments only, and the next time you commit, the old segments will be deleted.
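
The visibility rules I have in mind can be shown with a toy model. This is not Lucene code: ToyIndex is a hypothetical class, segments are just names, and commitNewOnly() is the method proposed above, which does not exist in IndexWriter. It only demonstrates that a reader is a point-in-time snapshot, so old readers keep seeing the old segments while readers opened after the commit see only the new ones.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model only, not Lucene code. Segments are represented by their names. */
class ToyIndex {
    // Segment list published by the last commit; replaced wholesale, never mutated.
    private volatile List<String> committed;
    // Segments written since the last commit; invisible to readers.
    private final List<String> pending = new ArrayList<>();

    ToyIndex(List<String> initialSegments) {
        this.committed = Collections.unmodifiableList(new ArrayList<>(initialSegments));
    }

    /** A reader is a point-in-time snapshot of the committed segment list. */
    List<String> openReader() {
        return committed;
    }

    /** Re-indexing adds segments without making them visible to readers. */
    void addSegment(String name) {
        pending.add(name);
    }

    /** Hypothetical commitNewOnly(): publish only the segments added since the last commit. */
    void commitNewOnly() {
        committed = Collections.unmodifiableList(new ArrayList<>(pending));
        pending.clear();
    }
}
```

What the model leaves out is the hard part: actually removing the old segment files once no reader references them, which is the job of the deletion policy.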

That will move the deleteAll() method to the application side, since it knows when it can safely delete all the current segments. If you don't have such a requirement (keeping an index for searches until re-indexing is complete), then I think you can safely close() the index and re-create it?

> Add deleteAllDocuments() method to IndexWriter
> ----------------------------------------------
>
>                 Key: LUCENE-1705
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1705
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Tim Smith
>
> Ideally, there would be a deleteAllDocuments() or clear() method on the IndexWriter
> This method should have the same performance and characteristics as:
> * currentWriter.close()
> * currentWriter = new IndexWriter(..., create=true,...)
> This would greatly optimize a delete all documents case. Using deleteDocuments(new MatchAllDocsQuery()) could be expensive given a large existing index.
> IndexWriter.deleteAllDocuments() should have the same semantics as a commit(), as far as index visibility goes (new IndexReader opening would get the empty index)
> I see this was previously asked for in LUCENE-932, however it would be nice to finally see this added such that the IndexWriter would not need to be closed to perform the "clear" as this seems to be the general recommendation for working with an IndexWriter now
> deleteAllDocuments() method should:
> * abort any background merges (they are pointless once a deleteAll has been received)
> * write new segments file referencing no segments
> This method would remove one of the final reasons i would ever need to close an IndexWriter and reopen a new one 
