Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2016/06/06 17:53:21 UTC
[jira] [Commented] (SOLR-9189) explosion of timeout related failures in jenkins the past few days

    [ https://issues.apache.org/jira/browse/SOLR-9189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316873#comment-15316873 ] 

Hoss Man commented on SOLR-9189:
--------------------------------


My initial gut paranoia while skimming the jenkins emails this morning was to assume that this might be because of SOLR-5776 -- the hypothesis being: "The increased randomized use of ssl (factoring in tests.nightly / tests.multiplier) is causing more tests to slow down due to the crypto calculations"

... but that hypothesis seemed weak once I started looking at the logs -- there is a "Randomized ssl" line as part of the logs for every SolrTestCaseJ4 subclass showing whether ssl is being used or not...

* http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/834/
** 25 test failures
** only 7 of those were using ssl
* https://builds.apache.org/job/Lucene-Solr-NightlyTests-master/1034/
** 44 test failures
** only 17 of those were using ssl

...even if we assume every test failure where ssl was in use was directly caused by ssl, that still leaves 18 and 27 failures (respectively) in tests that were not using ssl at all -- a really high increase in the number of failed tests in those two runs.
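
To make the arithmetic behind that point explicit, here is a minimal sketch that subtracts the ssl-using failures from the totals for the two runs listed above. The class and method names are hypothetical and exist only for illustration; the counts are the ones quoted from the jenkins runs.

```java
public class SslFailureTally {

    /**
     * Failures that cannot be blamed directly on ssl:
     * total failures minus the failures in tests that ran with ssl enabled.
     * (Hypothetical helper, not part of the Solr test framework.)
     */
    public static int nonSslFailures(int totalFailures, int sslFailures) {
        if (sslFailures < 0 || sslFailures > totalFailures) {
            throw new IllegalArgumentException("sslFailures must be between 0 and totalFailures");
        }
        return totalFailures - sslFailures;
    }

    public static void main(String[] args) {
        // Counts quoted above for the two jenkins runs.
        System.out.println("Lucene-Solr-6.x-Linux #834: "
                + nonSslFailures(25, 7) + " of 25 failures did not use ssl");
        System.out.println("Lucene-Solr-NightlyTests-master #1034: "
                + nonSslFailures(44, 17) + " of 44 failures did not use ssl");
    }
}
```

Even under the most ssl-unfriendly reading of the data, the majority of failures in both runs (18 of 25, and 27 of 44) came from tests that were not using ssl themselves.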

So my amended (paranoid) hypothesis is "The increased randomized use of ssl (factoring in tests.nightly / tests.multiplier) is causing more tests to slow down due to the crypto calculations *EVEN IN OTHER TESTS AT THE SAME TIME DUE TO CPU STARVATION*"

I'm going to commit a blanket disable of all SSL randomization _on master_ ASAP to test this hypothesis.

Part of me feels like this is an overkill reaction, and that a more rational response would simply be to undo the "increased odds of using ssl" portion of SOLR-5776 -- but I'd really like to get a definitive understanding of whether SSL usage is really having such a seriously pronounced effect on other tests in the same jenkins run -- OR -- *is it just a red herring, and some other recent change has caused serious timeout issues?*



> explosion of timeout related failures in jenkins the past few days
> ------------------------------------------------------------------
>
>                 Key: SOLR-9189
>                 URL: https://issues.apache.org/jira/browse/SOLR-9189
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>            Priority: Critical
>
> In the past few days, something has gone seriously wonky with our jenkins tests -- causing a serious explosion in the number of test failures -- notably due to various sorts of timeouts...
> * "Unable to create core ... Timed out getting coreNodeName for ..."
> * "msg=SolrCore is loading,code=503"
> * "Timeout occured while waiting response from server"
> * "No registered leader was found after waiting for 30000ms"
> * "Unable to create core ... Caused by: Timed out getting shard id for core: ..."


