Posted to dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2014/12/17 12:00:15 UTC

[jira] [Comment Edited] (SOLR-6640) ChaosMonkeySafeLeaderTest failure with CorruptIndexException

    [ https://issues.apache.org/jira/browse/SOLR-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249708#comment-14249708 ] 

Shalin Shekhar Mangar edited comment on SOLR-6640 at 12/17/14 10:59 AM:
------------------------------------------------------------------------

I am looking at this failure too and I see another bug. I was wondering why the replica had these writes in the first place considering that recovery on startup had not completed.

# RecoveryStrategy publishes the state of the replica as 'recovering' before it sets the update log to buffering mode, which is why the leader sends updates to this replica that affect the index.
# The test itself doesn't wait for a steady state, e.g. by calling waitForRecovery or waitForThingsToLevelOut, before starting the indexing threads. This is probably a good thing because that's what has helped us find this problem.
# Shouldn't the peersync also be done while the update log is set to buffering mode?
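
To make the ordering problem in the first point concrete, here is a minimal sketch. It is a toy model and not Solr's code: the Replica class, the leaderSends and run helpers, the initial "down" state, and the rule that the leader forwards to any replica not published as "down" are all assumptions made for illustration. It only shows that an update arriving between "publish recovering" and "start buffering" lands in the index instead of the buffer.

```java
import java.util.ArrayList;
import java.util.List;

public class RecoveryOrderingSketch {

    /** Hypothetical stand-in for a replica; not one of Solr's classes. */
    static class Replica {
        String publishedState = "down";  // state the leader can see (assumed initial value)
        boolean buffering = false;       // whether the update log is in buffering mode
        final List<String> index = new ArrayList<>();     // updates applied to the index
        final List<String> buffered = new ArrayList<>();  // updates held back in the log

        // Invoked when the leader forwards an update to this replica.
        void receive(String update) {
            if (buffering) {
                buffered.add(update);
            } else {
                index.add(update);
            }
        }
    }

    /** Assumed leader rule: forward to every replica that is not published as "down". */
    static void leaderSends(Replica r, String update) {
        if (!r.publishedState.equals("down")) {
            r.receive(update);
        }
    }

    /** Runs the two startup steps in either order, with one update arriving in between. */
    static Replica run(boolean bufferFirst) {
        Replica r = new Replica();
        if (bufferFirst) {
            r.buffering = true;               // buffer first ...
            r.publishedState = "recovering";  // ... then publish
            leaderSends(r, "doc1");
        } else {
            r.publishedState = "recovering";  // publish first (the order described above)
            leaderSends(r, "doc1");           // update arrives inside the window
            r.buffering = true;               // buffering starts too late for doc1
        }
        leaderSends(r, "doc2");
        return r;
    }

    public static void main(String[] args) {
        Replica publishFirst = run(false);
        System.out.println("publish first: index=" + publishFirst.index
                + " buffered=" + publishFirst.buffered);
        Replica bufferFirst = run(true);
        System.out.println("buffer first:  index=" + bufferFirst.index
                + " buffered=" + bufferFirst.buffered);
    }
}
```

In this model, publishing first lets doc1 reach the index while recovery is still pending, whereas buffering first keeps both updates in the buffer.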

{quote}
So it's these files which are not getting removed when we do IW.rollback that were causing the problem - 
_0.cfe _0.cfs _0.si _0_1.liv _1.fdt _1.fdx
I am yet to figure out whether these files should have been removed by IW.rollback() or not?
{quote}

These files hang around because an IndexReader that was opened from the IndexWriter by soft commit(s) is still open.
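
A rough way to picture this is reference counting on index files. The sketch below is only an illustrative model under that assumption, not Lucene's implementation: the class RefCountedFilesSketch and its incRef, decRef and filesOnDisk methods are hypothetical names, and the file names are just two of those listed in the quote. It shows a file staying on disk after the writer lets go of it because a reader still holds a reference.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class RefCountedFilesSketch {
    // Number of holders (writer, readers) per file; hypothetical bookkeeping.
    private final Map<String, Integer> refCounts = new HashMap<>();
    private final Set<String> onDisk = new TreeSet<>();

    /** A holder starts using a file; the file exists on disk from then on. */
    void incRef(String file) {
        onDisk.add(file);
        refCounts.merge(file, 1, Integer::sum);
    }

    /** A holder releases a file; it is deleted only when nobody holds it anymore. */
    void decRef(String file) {
        if (!refCounts.containsKey(file)) {
            throw new IllegalStateException("not referenced: " + file);
        }
        int remaining = refCounts.merge(file, -1, Integer::sum);
        if (remaining == 0) {
            refCounts.remove(file);
            onDisk.remove(file);
        }
    }

    /** Returns a sorted copy of the files currently on disk. */
    Set<String> filesOnDisk() {
        return new TreeSet<>(onDisk);
    }

    public static void main(String[] args) {
        RefCountedFilesSketch files = new RefCountedFilesSketch();
        String[] names = {"_0.cfs", "_1.fdt"};

        for (String n : names) files.incRef(n);  // writer references its files
        for (String n : names) files.incRef(n);  // reader opened by a soft commit
        for (String n : names) files.decRef(n);  // writer rolls back and releases them
        System.out.println("after rollback:     " + files.filesOnDisk());

        for (String n : names) files.decRef(n);  // reader is finally closed
        System.out.println("after reader close: " + files.filesOnDisk());
    }
}
```

Whether the real files go away once the reader is closed is an assumption of this model; the point is only that the writer's rollback alone does not remove them while a reader is open.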



> ChaosMonkeySafeLeaderTest failure with CorruptIndexException
> ------------------------------------------------------------
>
>                 Key: SOLR-6640
>                 URL: https://issues.apache.org/jira/browse/SOLR-6640
>             Project: Solr
>          Issue Type: Bug
>          Components: replication (java)
>    Affects Versions: 5.0
>            Reporter: Shalin Shekhar Mangar
>             Fix For: 5.0
>
>         Attachments: Lucene-Solr-5.x-Linux-64bit-jdk1.8.0_20-Build-11333.txt, SOLR-6640.patch, SOLR-6640.patch
>
>
> Test failure found on jenkins:
> http://jenkins.thetaphi.de/job/Lucene-Solr-5.x-Linux/11333/
> {code}
> 1 tests failed.
> REGRESSION:  org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.testDistribSearch
> Error Message:
> shard2 is not consistent.  Got 62 from http://127.0.0.1:57436/collection1lastClient and got 24 from http://127.0.0.1:53065/collection1
> Stack Trace:
> java.lang.AssertionError: shard2 is not consistent.  Got 62 from http://127.0.0.1:57436/collection1lastClient and got 24 from http://127.0.0.1:53065/collection1
>         at __randomizedtesting.SeedInfo.seed([F4B371D421E391CD:7555FFCC56BCF1F1]:0)
>         at org.junit.Assert.fail(Assert.java:93)
>         at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1255)
>         at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1234)
>         at org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.doTest(ChaosMonkeySafeLeaderTest.java:162)
>         at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:869)
> {code}
> Cause of inconsistency is:
> {code}
> Caused by: org.apache.lucene.index.CorruptIndexException: file mismatch, expected segment id=yhq3vokoe1den2av9jbd3yp8, got=yhq3vokoe1den2av9jbd3yp7 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/ssd/jenkins/workspace/Lucene-Solr-5.x-Linux/solr/build/solr-core/test/J0/temp/solr.cloud.ChaosMonkeySafeLeaderTest-F4B371D421E391CD-001/tempDir-001/jetty3/index/_1_2.liv")))
>    [junit4]   2> 		at org.apache.lucene.codecs.CodecUtil.checkSegmentHeader(CodecUtil.java:259)
>    [junit4]   2> 		at org.apache.lucene.codecs.lucene50.Lucene50LiveDocsFormat.readLiveDocs(Lucene50LiveDocsFormat.java:88)
>    [junit4]   2> 		at org.apache.lucene.codecs.asserting.AssertingLiveDocsFormat.readLiveDocs(AssertingLiveDocsFormat.java:64)
>    [junit4]   2> 		at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:102)
> {code}


