
Posted to dev@lucene.apache.org by Mark Miller <ma...@gmail.com> on 2010/07/31 14:41:43 UTC

Solr Replication Test Case Failure


This looks like it might actually be a real issue: at a rough guess, it
fails about once every 20 runs.

    [junit] Testsuite: org.apache.solr.handler.TestReplicationHandler
    [junit] Testcase: testReplicateAfterWrite2Slave(org.apache.solr.handler.TestReplicationHandler): FAILED
    [junit] expected:<1> but was:<0>
    [junit] junit.framework.AssertionFailedError: expected:<1> but was:<0>
    [junit]     at org.apache.solr.handler.TestReplicationHandler.testReplicateAfterWrite2Slave(TestReplicationHandler.java:464)
    [junit]
    [junit]
    [junit] Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 343.909 sec

At first I tried extending the wait, but that obviously doesn't help: in
this case the test failed after running for 343 seconds, and I've seen
runs take as long as 968 seconds.

- Mark
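
A rough way to reason about "once every 20 runs": if each run fails independently with probability p (an assumption; real intermittent failures are often clustered or machine-dependent), the chance of at least one failure in n runs is 1 - (1 - p)^n. The sketch below computes that, plus the number of consecutive clean runs needed before a fix looks convincing. The class name and the 95% threshold are illustrative choices.

```java
/** Back-of-the-envelope math for an intermittent test failure (sketch). */
final class FlakyRunMath {
    private FlakyRunMath() {}

    /** Probability of at least one failure in n runs, assuming each run fails independently with probability p. */
    static double probAtLeastOneFailure(double p, int n) {
        return 1.0 - Math.pow(1.0 - p, n);
    }

    /** Smallest n such that probAtLeastOneFailure(p, n) >= confidence. */
    static int runsNeeded(double p, double confidence) {
        // Solve 1 - (1 - p)^n >= c  =>  n >= ln(1 - c) / ln(1 - p)
        return (int) Math.ceil(Math.log(1.0 - confidence) / Math.log(1.0 - p));
    }

    public static void main(String[] args) {
        double p = 1.0 / 20; // the rough "once every 20 runs" estimate
        System.out.printf("P(>=1 failure in 20 runs) = %.2f%n", probAtLeastOneFailure(p, 20));
        System.out.printf("clean runs for 95%% confidence = %d%n", runsNeeded(p, 0.95));
    }
}
```

With p = 1/20 this gives roughly a 64% chance of seeing a failure within 20 runs, and 59 consecutive passing runs before a clean streak that long would happen less than 5% of the time if the failure rate were still 1/20. Since one report in this thread needed over 200 runs to hit the failure, the per-machine rate may be much lower, which pushes the required number of runs up accordingly.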


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Solr Replication Test Case Failure

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Sun, Aug 1, 2010 at 4:18 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> On Sun, Aug 1, 2010 at 3:42 PM, Mark Miller <ma...@gmail.com> wrote:
>> I hit the same test issue about an hour ago.
>
> It took me over 200 runs, but I just hit it too.

OK, I've figured out the problem, just not the best way to fix it yet.
I'm reusing this issue:
https://issues.apache.org/jira/browse/SOLR-1469

-Yonik
http://www.lucidimagination.com


Re: Solr Replication Test Case Failure

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Sun, Aug 1, 2010 at 3:42 PM, Mark Miller <ma...@gmail.com> wrote:
> I hit the same test issue about an hour ago.

It took me over 200 runs, but I just hit it too.

-Yonik
http://www.lucidimagination.com


Re: Solr Replication Test Case Failure

Posted by Mark Miller <ma...@gmail.com>.

I hit the same test issue about an hour ago.


Re: Solr Replication Test Case Failure

Posted by Mark Miller <ma...@gmail.com>.

On 8/1/10 11:31 AM, Yonik Seeley wrote:
> Mark, do you hit this when running this test alone (i.e. ant test
> -Dtestcase=TestReplicationHandler in a loop)?

I have not tried that yet. I originally hit the problem just running
ant test in a loop, and I saw it about once every 10-20 runs. I haven't
seen the Jetty unresponsive problem once in about 500 runs of the tests
on my Ubuntu machine.

- Mark
lucidimagination.com


Re: Solr Replication Test Case Failure

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Sat, Jul 31, 2010 at 9:06 PM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Sat, Jul 31, 2010 at 12:38 PM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
>> FYI, I'm now running this in a loop on my ubuntu box, without the
>> retry-loop, trying to replicate a failure.
>
> FYI, I've hit 3 failures so far... all of the form "Connection
> refused"/"Jetty/Solr unresponsive", which may be related to SOLR-2019

OK, I've applied SOLR-2019 to try to eliminate any errors caused by a
slow Jetty start.
I've also just committed a change to remove the retry loop (which
should not be needed).
Mark, do you hit this when running this test alone (i.e. ant test
-Dtestcase=TestReplicationHandler in a loop)?

-Yonik
http://www.lucidimagination.com
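
Running one test class repeatedly until it fails, as asked in this message, is easy to script. A minimal Java sketch follows. The ant command line is the one from the message; the class name LoopUntilFailure, the 500-run cap, and the assumption that it is launched from the Solr source directory (where build.xml lives) are illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical helper: repeat a test run until it fails (sketch). */
final class LoopUntilFailure {
    private LoopUntilFailure() {}

    /** Calls runOnce(i) for i = 1..maxRuns; returns the first run that fails, or -1 if all pass. */
    static int firstFailure(IntPredicate runOnce, int maxRuns) {
        for (int i = 1; i <= maxRuns; i++) {
            if (!runOnce.test(i)) {
                return i;
            }
        }
        return -1;
    }

    /** Runs an external command with inherited stdout/stderr; true if it exits with status 0. */
    static boolean runCommand(List<String> command) {
        try {
            Process p = new ProcessBuilder(command).inheritIO().start();
            return p.waitFor() == 0;
        } catch (IOException e) {
            // Failing to start the command is a setup problem, not a test failure.
            throw new UncheckedIOException("could not start " + command, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + command, e);
        }
    }

    public static void main(String[] args) {
        List<String> cmd = List.of("ant", "test", "-Dtestcase=TestReplicationHandler");
        int failedAt = firstFailure(i -> runCommand(cmd), 500); // cap is arbitrary
        System.out.println(failedAt < 0 ? "no failure in 500 runs" : "first failure on run " + failedAt);
    }
}
```

Keeping the loop logic (firstFailure) separate from process handling makes it easy to reuse with a different command, such as the full ant test target that originally exposed the failure.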


Re: Solr Replication Test Case Failure

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Sat, Jul 31, 2010 at 12:38 PM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> FYI, I'm now running this in a loop on my ubuntu box, without the
> retry-loop, trying to replicate a failure.

FYI, I've hit 3 failures so far... all of the form "Connection
refused"/"Jetty/Solr unresponsive", which may be related to SOLR-2019.

-Yonik
http://www.lucidimagination.com
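
"Connection refused" usually means nothing is listening on the port yet. One generic guard against a slow Jetty start is to poll the port before sending requests. This is a hypothetical helper, not a description of what SOLR-2019 changed; the host, port, and timeouts are placeholders.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: poll until a TCP port accepts connections (sketch). */
final class WaitForPort {
    private WaitForPort() {}

    /** Returns true once host:port accepts a connection, false if timeoutMs elapses first. */
    static boolean waitForPort(String host, int port, long timeoutMs) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, port), 250); // 250 ms per attempt
                return true; // something is listening
            } catch (IOException e) {
                // e.g. "Connection refused": the server is not up yet, try again
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            Thread.sleep(50); // brief pause between attempts
        }
    }
}
```

A port that accepts connections only shows that Jetty is listening; Solr itself may still be loading cores, so a stricter check would also issue a request and inspect the response.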


> -Yonik
> http://www.lucidimagination.com
>
> On Sat, Jul 31, 2010 at 11:52 AM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
>> OK, can you try to reproduce now?
>> Since the comments indicated that all the commits were to bump up the
>> index version number, I kept them all and just inserted an additional
>> commit in the query retry loop.
>>
>> But actually... there may still be a bug somewhere (even if this fixes
>> the test failures).
>> Each commit should wait for a new searcher to be registered before
>> returning... hence it should be impossible for overlapping warming
>> searchers to be responsible for the failure.  Hence when the test
>> fails, either the doc add, or the commit is failing.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>>
>>
>> On Sat, Jul 31, 2010 at 11:35 AM, Yonik Seeley
>> <yo...@lucidimagination.com> wrote:
>>> Do the logs give any hints?
>>> Downside of only logging SEVERE is that it's much harder to
>>> investigate the cause of any intermittent failures that do happen.
>>>
>>> Looking at this test code, you shouldn't have to wait at all.  The
>>> test disables replication, indexes docs to the slave, commits (and
>>> waits for a new searcher to be registered), and then queries the
>>> slave.
>>>
>>> We should just remove that wait loop.
>>>
>>> Oh... i just figured it out while writing this I think...
>>>
>>>    index(slaveClient, "id", 551, "name", "name = " + 551);
>>>    slaveClient.commit(true, true);
>>>    index(slaveClient, "id", 552, "name", "name = " + 552);
>>>    slaveClient.commit(true, true);
>>>    index(slaveClient, "id", 553, "name", "name = " + 553);
>>>    slaveClient.commit(true, true);
>>>    index(slaveClient, "id", 554, "name", "name = " + 554);
>>>    slaveClient.commit(true, true);
>>>    index(slaveClient, "id", 555, "name", "name = " + 555);
>>>    slaveClient.commit(true, true);
>>>
>>> I bet that last commit can fail due to max warming searchers.
>>> I'll fix.
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>> On Sat, Jul 31, 2010 at 8:41 AM, Mark Miller <ma...@gmail.com> wrote:
>>>>
>>>>
>>>>  This looks like it might actually be an issue - it fails once every 20
>>>> runs or so as a guess.
>>>>
>>>>   [junit] Testsuite: org.apache.solr.handler.TestReplicationHandler
>>>>    [junit] Testcase:
>>>> testReplicateAfterWrite2Slave(org.apache.solr.handler.TestReplicationHandler):
>>>> FAILED
>>>>    [junit] expected:<1> but was:<0>
>>>>    [junit] junit.framework.AssertionFailedError: expected:<1> but was:<0>
>>>>    [junit]     at
>>>> org.apache.solr.handler.TestReplicationHandler.testReplicateAfterWrite2Slave(TestReplicationHandler.java:464)
>>>>    [junit]
>>>>    [junit]
>>>>    [junit] Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 343.909 sec
>>>>
>>>> At first I tried to extend the wait for it, but that's obviously no help
>>>> - in this case the test failed after running for 343 seconds. I've seen it as high as 968 seconds.
>>>>
>>>> - Mark
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Solr Replication Test Case Failure

Posted by Mark Miller <ma...@gmail.com>.

I'm still running the tests non-stop here as well; I'll ping the list if
I see it again.

- Mark

On 7/31/10 12:38 PM, Yonik Seeley wrote:
> FYI, I'm now running this in a loop on my ubuntu box, without the
> retry-loop, trying to replicate a failure.

Re: Solr Replication Test Case Failure

Posted by Yonik Seeley <yo...@lucidimagination.com>.

FYI, I'm now running this in a loop on my Ubuntu box, without the
retry loop, trying to reproduce a failure.

-Yonik
http://www.lucidimagination.com


Re: Solr Replication Test Case Failure

Posted by Yonik Seeley <yo...@lucidimagination.com>.

OK, can you try to reproduce now?
Since the comments indicated that all the commits were to bump up the
index version number, I kept them all and just inserted an additional
commit in the query retry loop.
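
For illustration, a query retry loop with a commit inside it might look roughly like the hypothetical helper below. The real test code is not shown in this thread, so the names and parameters are assumptions. In SolrJ, the commit argument could be a lambda wrapping slaveClient.commit(true, true) and numFound one wrapping slaveClient.query(new SolrQuery("*:*")).getResults().getNumFound(); both calls throw checked exceptions that such lambdas would have to handle.

```java
import java.util.function.LongSupplier;

/** Hypothetical shape of a query retry loop with a commit inside it (sketch). */
final class RetryWithCommit {
    private RetryWithCommit() {}

    /**
     * Up to maxTries times: issue a commit, then read the hit count.
     * Returns the last count seen; the caller then asserts on it.
     */
    static long countAfterRetries(Runnable commit, LongSupplier numFound,
                                  long expected, int maxTries, long sleepMs)
            throws InterruptedException {
        long found = -1;
        for (int attempt = 0; attempt < maxTries; attempt++) {
            commit.run();                 // the extra commit inside the loop
            found = numFound.getAsLong(); // e.g. hit count of a *:* query on the slave
            if (found == expected) {
                break;
            }
            Thread.sleep(sleepMs);
        }
        return found;
    }
}
```

A loop like this can mask a failed add or commit by simply retrying, which is one reason to prefer a single commit followed by one direct assertion.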

But actually... there may still be a bug somewhere (even if this fixes
the test failures).
Each commit should wait for a new searcher to be registered before
returning, so overlapping warming searchers should not be able to cause
the failure. Hence, when the test fails, either the document add or the
commit itself must be failing.
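
The argument that waiting commits cannot overlap can be illustrated with a toy model of the max-warming-searchers limit. This is not Solr's implementation: it only counts searchers that have been opened but not yet registered, and it treats waitSearcher (the second argument of commit(true, true) in the test, as far as I understand the SolrJ API of this era) as "block until the new searcher is registered". The class and method names are hypothetical.

```java
/** Toy model (not Solr code) of a limit on concurrently warming searchers. */
final class WarmingModel {
    private final int maxWarmingSearchers;
    private int warming = 0; // searchers opened but not yet registered

    WarmingModel(int maxWarmingSearchers) {
        this.maxWarmingSearchers = maxWarmingSearchers;
    }

    /** Simulates a commit that opens a new searcher; fails if the limit would be exceeded. */
    void commit(boolean waitSearcher) {
        if (warming + 1 > maxWarmingSearchers) {
            throw new IllegalStateException("too many warming searchers (limit " + maxWarmingSearchers + ")");
        }
        warming++;
        if (waitSearcher) {
            finishWarming(); // caller blocks until the new searcher is registered
        }
    }

    /** Background warming of one searcher completes and it is registered. */
    void finishWarming() {
        if (warming > 0) {
            warming--;
        }
    }

    int warmingCount() {
        return warming;
    }
}
```

In the model, five back-to-back commits with waitSearcher=true, mirroring the five commit(true, true) calls in the test, never leave more than one searcher warming, so even a limit of 1 is never exceeded. Only commits that return before warming finishes can stack up, which is why a failure in the real test points at the add or the commit itself.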

-Yonik
http://www.lucidimagination.com



On Sat, Jul 31, 2010 at 11:35 AM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> Do the logs give any hints?
> Downside of only logging SEVERE is that it's much harder to
> investigate the cause of any intermittent failures that do happen.
>
> Looking at this test code, you shouldn't have to wait at all.  The
> test disables replication, indexes docs to the slave, commits (and
> waits for a new searcher to be registered), and then queries the
> slave.
>
> We should just remove that wait loop.
>
> Oh... i just figured it out while writing this I think...
>
>    index(slaveClient, "id", 551, "name", "name = " + 551);
>    slaveClient.commit(true, true);
>    index(slaveClient, "id", 552, "name", "name = " + 552);
>    slaveClient.commit(true, true);
>    index(slaveClient, "id", 553, "name", "name = " + 553);
>    slaveClient.commit(true, true);
>    index(slaveClient, "id", 554, "name", "name = " + 554);
>    slaveClient.commit(true, true);
>    index(slaveClient, "id", 555, "name", "name = " + 555);
>    slaveClient.commit(true, true);
>
> I bet that last commit can fail due to max warming searchers.
> I'll fix.
>
> -Yonik
> http://www.lucidimagination.com
>
> On Sat, Jul 31, 2010 at 8:41 AM, Mark Miller <ma...@gmail.com> wrote:
>>
>>
>>  This looks like it might actually be an issue - it fails once every 20
>> runs or so as a guess.
>>
>>   [junit] Testsuite: org.apache.solr.handler.TestReplicationHandler
>>    [junit] Testcase:
>> testReplicateAfterWrite2Slave(org.apache.solr.handler.TestReplicationHandler):
>> FAILED
>>    [junit] expected:<1> but was:<0>
>>    [junit] junit.framework.AssertionFailedError: expected:<1> but was:<0>
>>    [junit]     at
>> org.apache.solr.handler.TestReplicationHandler.testReplicateAfterWrite2Slave(TestReplicationHandler.java:464)
>>    [junit]
>>    [junit]
>>    [junit] Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 343.909 sec
>>
>> At first I tried to extend the wait for it, but that's obviously no help
>> - in this case the test failed after running for 343 seconds. I've seen it as high as 968 seconds.
>>
>> - Mark
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Solr Replication Test Case Failure

Posted by Yonik Seeley <yo...@lucidimagination.com>.

Do the logs give any hints?
The downside of logging only at SEVERE level is that it's much harder to
investigate the cause of any intermittent failures that do happen.

Looking at this test code, you shouldn't have to wait at all.  The
test disables replication, indexes docs to the slave, commits (and
waits for a new searcher to be registered), and then queries the
slave.

We should just remove that wait loop.

Oh... I think I just figured it out while writing this:

    index(slaveClient, "id", 551, "name", "name = " + 551);
    slaveClient.commit(true, true);
    index(slaveClient, "id", 552, "name", "name = " + 552);
    slaveClient.commit(true, true);
    index(slaveClient, "id", 553, "name", "name = " + 553);
    slaveClient.commit(true, true);
    index(slaveClient, "id", 554, "name", "name = " + 554);
    slaveClient.commit(true, true);
    index(slaveClient, "id", 555, "name", "name = " + 555);
    slaveClient.commit(true, true);

I bet that last commit can fail because it exceeds the maximum number of
warming searchers (maxWarmingSearchers). I'll fix it.

-Yonik
http://www.lucidimagination.com
