Posted to dev@lucene.apache.org by "Ivan Stojanovic (Created) (JIRA)" <ji...@apache.org> on 2012/03/02 13:34:57 UTC

[jira] [Created] (LUCENE-3838) IndexWriter.maybeMerge() removes deleted documents from index (Lucene 3.1.0 to 3.5.0)

IndexWriter.maybeMerge() removes deleted documents from index (Lucene 3.1.0 to 3.5.0)
-------------------------------------------------------------------------------------

                 Key: LUCENE-3838
                 URL: https://issues.apache.org/jira/browse/LUCENE-3838
             Project: Lucene - Java
          Issue Type: Bug
          Components: core/index
    Affects Versions: 3.5, 3.4, 3.3, 3.2, 3.1
         Environment: Windows, Linux, OSX
            Reporter: Ivan Stojanovic
            Priority: Blocker


My company uses Lucene for high-performance, heavily loaded farms of translation repositories with hundreds of simultaneous add/delete/update/search/retrieve threads. To support this complex architecture, among other things and tricks used here, I rely on docIds remaining unchanged until I explicitly ask for that (using IndexWriter.optimize() / IndexWriter.forceMerge()).

LogMergePolicy is used to get this behavior.

This worked fine until we upgraded Lucene from 3.0.2 to 3.5.0. Before version 3.1.0, a merge triggered by IndexWriter.addDocument() did not expunge deleted documents, so docIds stayed unchanged and some critical jobs were possible without impact on index size. IndexWriter.optimize() did the actual removal of deleted documents.

From Lucene 3.1.0 on, IndexWriter.maybeMerge() does the same thing as IndexWriter.forceMerge() with respect to deleted documents; there is no difference. This leads to unpredictable internal index structure changes during simple document add (and possibly delete) operations, at an undefined point in time. I looked into the Lucene source code and can definitely confirm this.

This issue makes our Lucene client code totally unusable.

Solution steps:

1) Add a flag somewhere that controls whether deleted documents are removed in maybeMerge(). Note that this is only half of what we need here.
2) Make forceMerge() always remove deleted documents, regardless of whether maybeMerge() removes them. Alternatively, forceMerge() could take an extra parameter telling it whether deleted documents should be removed from the index.

Sample JUnit code that reproduces this issue is included below.



import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogDocMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class TempTest {

    private Analyzer _analyzer = new KeywordAnalyzer();

    @Test
    public void testIndex() throws Exception {
        File indexDir = new File("sample-index");
        if (indexDir.exists()) {
            // File.delete() alone does not remove a non-empty directory,
            // so delete any leftover index files first.
            if (indexDir.isDirectory()) {
                for (File f : indexDir.listFiles()) {
                    f.delete();
                }
            }
            indexDir.delete();
        }

        FSDirectory index = FSDirectory.open(indexDir);

        Document doc;

        // Create a fresh index with three documents (docIds 0, 1, 2).
        IndexWriter writer = createWriter(index, true);
        try {
            doc = new Document();
            doc.add(new Field("field", "text0", Field.Store.YES,
                    Field.Index.ANALYZED));
            writer.addDocument(doc);

            doc = new Document();
            doc.add(new Field("field", "text1", Field.Store.YES,
                    Field.Index.ANALYZED));
            writer.addDocument(doc);

            doc = new Document();
            doc.add(new Field("field", "text2", Field.Store.YES,
                    Field.Index.ANALYZED));
            writer.addDocument(doc);

            writer.commit();
        } finally {
            writer.close();
        }

        // Mark docId 1 as deleted (but do not expunge it).
        IndexReader reader = IndexReader.open(index, false);
        try {
            reader.deleteDocument(1);
        } finally {
            reader.close();
        }

        // Keep adding documents; the natural merges this triggers
        // expunge the deleted document and renumber docIds.
        writer = createWriter(index, false);
        try {
            for (int i = 3; i < 100; i++) {
                doc = new Document();
                doc.add(new Field("field", "text" + i, Field.Store.YES,
                        Field.Index.ANALYZED));
                writer.addDocument(doc);

                writer.commit();
            }
        } finally {
            writer.close();
        }

        boolean deleted;
        String text;

        reader = IndexReader.open(index, true);
        try {
            deleted = reader.isDeleted(1);
            text = reader.document(1).get("field");
        } finally {
            reader.close();
        }

        assertTrue(deleted); // This line breaks
        assertEquals("text1", text);
    }

    private MergePolicy createEngineMergePolicy() {
        LogDocMergePolicy mergePolicy = new LogDocMergePolicy();

        mergePolicy.setCalibrateSizeByDeletes(false);
        mergePolicy.setUseCompoundFile(true);
        mergePolicy.setNoCFSRatio(1.0);

        return mergePolicy;
    }

    private IndexWriter createWriter(Directory index, boolean create)
            throws Exception {
        IndexWriterConfig iwConfig = new IndexWriterConfig(Version.LUCENE_35,
                _analyzer);

        iwConfig.setOpenMode(create ? IndexWriterConfig.OpenMode.CREATE
                : IndexWriterConfig.OpenMode.APPEND);
        iwConfig.setMergePolicy(createEngineMergePolicy());
        iwConfig.setMergeScheduler(new ConcurrentMergeScheduler());

        return new IndexWriter(index, iwConfig);
    }

}

[jira] [Commented] (LUCENE-3838) IndexWriter.maybeMerge() removes deleted documents from index (Lucene 3.1.0 to 3.5.0)

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224511#comment-13224511 ] 

Michael McCandless commented on LUCENE-3838:
--------------------------------------------

bq. of course this happens in version 3.1.0 as stated in the title (in parentheses).

Sorry, what I meant was Lucene has, forever, removed deleted docs
during natural merges (as well as forced merges); I'm not sure why
in 3.0.2 you're seeing otherwise...

bq. Actually, it has never been stated that this is an internal implementation detail (if I can remember correctly).

I don't know whether this is officially documented anywhere, but it
comes up every so often on the lists and the answer is always "don't
rely on Lucene's docID"... or, rather "rely on docID at your own risk".

bq. Anyway, we already have an ID field but we can't rely on it for long running operations.

I didn't fully understand why you can't use your ID field for long
running operations...

bq. I only don't understand if we will have to wait for a Lucene 4.0 release for a custom codec implementation 

You'd have to use Lucene trunk (not yet released) to work with codecs.

bq. If I need to implement it for trunk, can you please give me a starting point to begin from?

Start here I think?

  http://wiki.apache.org/lucene-java/HowToContribute

E.g., get a trunk checkout, browse trunk's javadocs, look at the test cases, etc.

bq. Also, can this approach differentiate between maybeMerge() and forceMerge()?

Maybe... e.g., a MergePolicy/MergeScheduler knows whether a given merge
request was "forced" or not.

                
[jira] [Commented] (LUCENE-3838) IndexWriter.maybeMerge() removes deleted documents from index (Lucene 3.1.0 to 3.5.0)

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221842#comment-13221842 ] 

Michael McCandless commented on LUCENE-3838:
--------------------------------------------

Lucene's maybeMerge, even in 3.1.0, will merge away deleted documents; I'm not sure why you don't see that happening.

Really, when Lucene reclaims deletions and renumbers its documents is an internal implementation detail.  Applications should not rely on this behavior.  Can you add your own ID field to the index?  Or, alternatively, never delete documents, but instead use a filter in the application to skip them.  Or, in 4.0 (trunk), you could perhaps make a custom codec that "pretends" there are no deletions when merging runs...
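
To make the ID-field option concrete, a rough sketch against the stock 3.5
API (illustration only; the field and class names here are made up): store a
stable application key on each document and resolve the current docID from
that key right before you need it, rather than holding docIDs across merges.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

public class AppIdLookup {

    // Index time: attach a stable application key alongside the payload field.
    public static Document withAppId(String appId, String text) {
        Document doc = new Document();
        doc.add(new Field("appId", appId,
                Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
        doc.add(new Field("field", text,
                Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }

    // Search time: resolve the current docID for a key; valid only for this reader.
    public static int currentDocId(Directory index, String appId) throws Exception {
        IndexReader reader = IndexReader.open(index, true);
        try {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("appId", appId)), 1);
            return hits.totalHits == 0 ? -1 : hits.scoreDocs[0].doc;
        } finally {
            reader.close();
        }
    }
}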
                
[jira] [Commented] (LUCENE-3838) IndexWriter.maybeMerge() removes deleted documents from index (Lucene 3.1.0 to 3.5.0)

Posted by "Ivan Stojanovic (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222347#comment-13222347 ] 

Ivan Stojanovic commented on LUCENE-3838:
-----------------------------------------

Hi Michael,

of course this happens in version 3.1.0 as stated in the title (in parentheses).

Actually, it has never been stated that this is an internal implementation detail (if I remember correctly). I'm quite sure we are not the only ones relying on this behavior. Also, this backward-compatibility break wasn't mentioned in the 3.1.0 change log.

Anyway, we already have an ID field, but we can't rely on it for long-running operations. Suppose an index export is in progress while a bunch of add/delete/search operations are running. Or worse, suppose a batch delete (driven by filter criteria) is in progress at the same time. I should mention that we use only one searcher per index, and we work with farms of indexes of 3-5 million documents each. I can't even imagine using more than one searcher per index here; one searcher per index also gives us the best performance, which is our top concern. One more thing: when an admin performs optimization, the index is locked so that no one can access it, to avoid disk overuse.

We also have a deletes filter :)
It is an in-RAM filter that buffers deletes in a BitSet and occasionally flushes this buffer to the index (deleting the documents marked as deleted). This gives us lightning performance for both deleting documents and searching, via a custom Collector that wraps this filter. If we used an application-level filter to skip documents, search would slow down significantly because of the round trip to the application filter for every document retrieved from the index; if we did that, our ultra-fast Lucene-driven application would lose its point.
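
Roughly, such a wrapping collector looks like this (simplified sketch
against the 3.5 API, not our production code; the names are made up): hits
whose global docIDs are marked in the in-memory BitSet of buffered deletes
are simply not forwarded to the real collector.

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Simplified sketch of the wrapping collector: documents whose global
// docIDs are marked in the in-RAM BitSet of buffered deletes are not
// forwarded to the delegate collector.
public class BufferedDeletesCollector extends Collector {

    private final Collector delegate;     // e.g. a TopScoreDocCollector
    private final BitSet bufferedDeletes; // global docIDs marked as deleted
    private int docBase;                  // doc base of the current segment reader

    public BufferedDeletesCollector(Collector delegate, BitSet bufferedDeletes) {
        this.delegate = delegate;
        this.bufferedDeletes = bufferedDeletes;
    }

    @Override
    public void setScorer(Scorer scorer) throws IOException {
        delegate.setScorer(scorer);
    }

    @Override
    public void collect(int doc) throws IOException {
        if (!bufferedDeletes.get(docBase + doc)) {
            delegate.collect(doc);
        }
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) throws IOException {
        this.docBase = docBase;
        delegate.setNextReader(reader, docBase);
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return delegate.acceptsDocsOutOfOrder();
    }
}

Of course this also relies on docIds staying stable between the moment a
delete is buffered and the moment a search runs, which is exactly what the
new merge behavior breaks.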

The suggestion of a custom codec sounds very promising. I just don't understand whether we will have to wait for a Lucene 4.0 release for a custom codec implementation (with an API that allows it, perhaps) or whether I need to implement it against Lucene trunk. If I need to implement it for trunk, can you please give me a starting point to begin from? I must say I haven't dived deep into Lucene's merge functionality. Also, can this approach differentiate between maybeMerge() and forceMerge()? We need to keep document removal in forceMerge(), of course.

Greatest regards,
Ivan
                
[jira] [Updated] (LUCENE-3838) IndexWriter.maybeMerge() removes deleted documents from index (Lucene 3.1.0 to 3.5.0)

Posted by "Ivan Stojanovic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Stojanovic updated LUCENE-3838:
------------------------------------

    Attachment: TempTest.java

Sorry that the test added in the description is not formatted correctly; TempTest.java is also attached.
                