Posted to issues@lucene.apache.org by "Hoss Man (Jira)" <ji...@apache.org> on 2019/09/19 00:24:00 UTC

[jira] [Commented] (SOLR-13778) Windows JDK SSL Test Failure trend: SSLException: Software caused connection abort: recv failed

    [ https://issues.apache.org/jira/browse/SOLR-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932942#comment-16932942 ] 

Hoss Man commented on SOLR-13778:
---------------------------------

Here's a full example of what one of these stack traces tends to look like...
{noformat}
...
   [junit4]    > Caused by: javax.net.ssl.SSLException: Software caused connection abort: recv failed
   [junit4]    >        at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:127)
   [junit4]    >        at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:320)
   [junit4]    >        at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:263)
   [junit4]    >        at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:258)
   [junit4]    >        at java.base/sun.security.ssl.SSLSocketImpl.handleException(SSLSocketImpl.java:1342)
   [junit4]    >        at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:844)
   [junit4]    >        at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
   [junit4]    >        at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
   [junit4]    >        at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
   [junit4]    >        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
   [junit4]    >        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
   [junit4]    >        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
   [junit4]    >        at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
   [junit4]    >        at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
   [junit4]    >        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
   [junit4]    >        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
   [junit4]    >        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
   [junit4]    >        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
   [junit4]    >        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
   [junit4]    >        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
   [junit4]    >        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
   [junit4]    >        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
   [junit4]    >        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
   [junit4]    >        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
   [junit4]    >        ... 46 more
   [junit4]    >        Suppressed: java.net.SocketException: Software caused connection abort: socket write error
   [junit4]    >                at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
   [junit4]    >                at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
   [junit4]    >                at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
   [junit4]    >                at java.base/sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
   [junit4]    >                at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:351)
   [junit4]    >                ... 68 more
   [junit4]    > Caused by: java.net.SocketException: Software caused connection abort: recv failed
   [junit4]    >        at java.base/java.net.SocketInputStream.socketRead0(Native Method)
   [junit4]    >        at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
   [junit4]    >        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
   [junit4]    >        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
   [junit4]    >        at java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:448)
   [junit4]    >        at java.base/sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:68)
   [junit4]    >        at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1132)
   [junit4]    >        at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:828)
   [junit4]    >        ... 64 more
{noformat}
Although it's not obvious from the public view of my reports, grepping all the available logs (available for all builds with any test failures in the past 7 days) shows these types of Exceptions *only* happen on Uwe's jenkins server, and only on Windows builds (Uwe's jenkins server being the only one running tests on Windows)...
{noformat}
$ zgrep -c 'javax.net.ssl.SSLException: Software caused connection abort: recv failed' */*/*/jenkins.log.txt.gz | grep -v ':0$'
thetaphi/Lucene-Solr-8.x-Windows/447/jenkins.log.txt.gz:5
thetaphi/Lucene-Solr-8.x-Windows/453/jenkins.log.txt.gz:14
thetaphi/Lucene-Solr-master-Windows/8135/jenkins.log.txt.gz:4
thetaphi/Lucene-Solr-master-Windows/8136/jenkins.log.txt.gz:2
thetaphi/Lucene-Solr-master-Windows/8137/jenkins.log.txt.gz:6
thetaphi/Lucene-Solr-master-Windows/8139/jenkins.log.txt.gz:12
thetaphi/Lucene-Solr-master-Windows/8142/jenkins.log.txt.gz:5
thetaphi/Lucene-Solr-master-Windows/8143/jenkins.log.txt.gz:6
thetaphi/Lucene-Solr-master-Windows/8146/jenkins.log.txt.gz:6
thetaphi/Lucene-Solr-master-Windows/8148/jenkins.log.txt.gz:6
{noformat}
This is in spite of the fact that Uwe's jenkins server runs Linux jobs almost 4 times as often as it runs tests on Windows...
{noformat}
$ find thetaphi/ -mindepth 2 -maxdepth 2 | perl -ple 's{.*-(.*?-.*?)/\d+}{$1}' | sort | uniq -c
    132 8.x-Linux
     36 8.x-MacOSX
     36 8.x-Solaris
     37 8.x-Windows
    136 master-Linux
     36 master-MacOSX
     38 master-Windows
$ find thetaphi/ -mindepth 2 -maxdepth 2 | perl -ple 's{.*-(.*?)/\d+}{$1}' | sort | uniq -c
    268 Linux
     72 MacOSX
     36 Solaris
     75 Windows
{noformat}
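For anyone who finds the perl substitutions above hard to read, here is a rough Python sketch of the same grouping step. The sample directory names are taken from the zgrep output earlier in this comment; the function and variable names are made up for illustration, and the real run walked the whole {{thetaphi/}} tree rather than a hard-coded list.

```python
import re
from collections import Counter

# A few build directories, as they appear in the zgrep output above.
SAMPLE_DIRS = [
    "thetaphi/Lucene-Solr-8.x-Windows/447",
    "thetaphi/Lucene-Solr-8.x-Windows/453",
    "thetaphi/Lucene-Solr-master-Windows/8135",
]

# Same patterns as the two perl substitutions: keep "branch-OS" or just "OS".
BRANCH_AND_OS = re.compile(r".*-(.*?-.*?)/\d+")
OS_ONLY = re.compile(r".*-(.*?)/\d+")


def count_builds(dirs, pattern):
    """Reduce each build directory to its captured group and count duplicates."""
    return Counter(pattern.sub(r"\1", d) for d in dirs)


# Linux vs Windows build counts reported above (268 and 75).
linux_to_windows = 268 / 75

if __name__ == "__main__":
    print(count_builds(SAMPLE_DIRS, BRANCH_AND_OS))
    print(count_builds(SAMPLE_DIRS, OS_ONLY))
    print(f"Linux builds per Windows build: {linux_to_windows:.2f}")
```

The ratio works out to a bit over 3.5, which is where "almost 4 times" comes from.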
It's also worth noting that a cursory glance indicates that the root cause (SocketException) does not seem to occur in any build, regardless of server/OS, *except* when running as part of an SSL Socket (on the above-mentioned thetaphi Windows jobs)...
{noformat}
$ zgrep 'java.net.SocketException: Software caused connection abort: recv failed' */*/*/jenkins.log.txt.gz | wc -l
66
$ zgrep 'javax.net.ssl.SSLException: Software caused connection abort: recv failed' */*/*/jenkins.log.txt.gz | wc -l
66
$ zgrep 'javax.net.ssl.SSLException: Software caused connection abort: recv failed' thetaphi/*Windows/*/jenkins.log.txt.gz | wc -l
66
{noformat}

----
My best guess as to the root cause of this bug in the JVM is [https://bugs.openjdk.java.net/browse/JDK-8209333] since it's one of the few SSL related bugs i can find that meets all of the following criteria:
 * mentions 'Software caused connection abort: recv failed'
 * says it "Affects Version/s: 11"
 * does not explicitly say it's fixed in 11.0.4
 ** Indicates "Fix Version/s: 12; Resolved In Build: b25" w/backports for 12.0.1-master and 13-b01
 ** i still don't understand openjdk's jira conventions very well, but the only commit linked to the issue is in '.../jdk/jdk12/rev/8a61a04c456c' so i'm pretty sure that "b25" corresponds to a build of jdk12.
 * mentions Windows
 ** not explicitly, but the test case attached by the reporter uses Windows file paths, indicating he was seeing the issue when running tests on Windows JVMs

It's also worth noting that the person who filed the issue indicated that the problem & testcase was specific to '“setNeedClientAuth” is false.' ... but there are no comments from any openjdk devs on this issue (other than committing a fix), let alone on the specific nature of when this bug manifests, so it's possible that the original reporter only ever tested with clientAuth=false and that the bug is more general than that. I'm not sure, but AFAICT the resulting commit doesn't seem to be in any sort of clientAuth specific code paths.

In our own test logs, we do see some of these failures occurring even when {{clientAuth==true}} ...

{noformat}
$ zgrep -l 'javax.net.ssl.SSLException: Software caused connection abort: recv failed' thetaphi/*Windows/*/jenkins.log.txt.gz | xargs zgrep 'Randomized ssl (true)' | perl -ple 's/.*(clientAuth\s+\(.*?\)).*/$1/' | sort | uniq -c
     28 clientAuth (false)
      2 clientAuth (true)
{noformat}

2 things to remember when considering this bit of information:
* by default our build system only records log output for tests that fail, so there's no way to know exactly how many tests _passed_ using SSL with a particular clientAuth value
* the (default) odds of using a particular clientAuth value vary based on {{tests.nightly}} and {{tests.multiplier}}
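As a rough illustration of what the zgrep + perl pipeline above is doing, here is a minimal Python sketch of the same tally. The sample lines are an assumption: only the {{Randomized ssl (...)}} and {{clientAuth (...)}} fragments come from the commands above, the rest of each line is elided, and the names are hypothetical.

```python
import re
from collections import Counter

# Assumed shape of the log lines; only the "Randomized ssl (...)" and
# "clientAuth (...)" fragments are taken from the commands in this comment.
SAMPLE_LINES = [
    "... Randomized ssl (true) and clientAuth (false) via: ...",
    "... Randomized ssl (true) and clientAuth (true) via: ...",
    "... Randomized ssl (false) and clientAuth (false) via: ...",
    "... Randomized ssl (true) and clientAuth (false) via: ...",
]

CLIENT_AUTH = re.compile(r"clientAuth\s+\((.*?)\)")


def tally_client_auth(lines):
    """Count clientAuth values, considering only lines where SSL was enabled."""
    counts = Counter()
    for line in lines:
        if "Randomized ssl (true)" not in line:
            continue  # same filter as the zgrep step
        match = CLIENT_AUTH.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for value, count in sorted(tally_client_auth(SAMPLE_LINES).items()):
        print(f"{count:7d} clientAuth ({value})")
```

Unlike the perl one-liner, this sketch silently drops SSL lines that have no clientAuth fragment instead of passing them through unchanged.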

We can however look at *all* failures that occur on Uwe's jenkins jobs, and what their clientAuth values were, to try and see if there is any obvious correlation/difference...

{noformat}
$ zgrep 'Randomized ssl (true)' thetaphi/*/*/jenkins.log.txt.gz | perl -ple 's/.*(clientAuth\s+\(.*?\)).*/$1/' | sort | uniq -c
     56 clientAuth (false)
      8 clientAuth (true)
$ zgrep 'Randomized ssl (true)' thetaphi/*Windows/*/jenkins.log.txt.gz | perl -ple 's/.*(clientAuth\s+\(.*?\)).*/$1/' | sort | uniq -c
     53 clientAuth (false)
      4 clientAuth (true)
{noformat}

So that tells us:
* of all the failures on Uwe's jenkins machine where we know whether SSL was used or not, there were 64 failures where SSL was used
* of those 64, 12% (8) used clientAuth
* of those 64, 89% (57) happened on windows
** of those 57, 53% (30) were due to this SSL exception, and only 2 had clientAuth==true

(Perhaps the bug is just more likely to happen when clients send a larger amount of data, and maybe that generally happens more when clientAuth is in use?)
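To make the arithmetic behind these numbers easy to re-check, here is a small Python sketch that recomputes the shares from the raw counts in the outputs above. The dictionary and function names are made up for illustration, and with counts this small the percentages should be read loosely.

```python
def percent(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Counts copied from the zgrep/uniq outputs in this comment.
ssl_failures = {"false": 56, "true": 8}            # all thetaphi jobs
windows_ssl_failures = {"false": 53, "true": 4}    # Windows jobs only
windows_ssl_exceptions = {"false": 28, "true": 2}  # logs with this SSLException

total_ssl = sum(ssl_failures.values())
total_windows = sum(windows_ssl_failures.values())
total_exceptions = sum(windows_ssl_exceptions.values())

summary = {
    "clientAuth share of all SSL failures": percent(ssl_failures["true"], total_ssl),
    "Windows share of all SSL failures": percent(total_windows, total_ssl),
    "SSLException share of Windows SSL failures": percent(total_exceptions, total_windows),
    "clientAuth share of Windows SSL failures": percent(windows_ssl_failures["true"], total_windows),
    "clientAuth share of SSLException failures": percent(windows_ssl_exceptions["true"], total_exceptions),
}

if __name__ == "__main__":
    for label, value in summary.items():
        print(f"{label}: {value}%")
```

The last two entries put the clientAuth share among the SSLException failures next to the clientAuth share among all Windows SSL failures, which is the comparison the tallies above were meant to support.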

> Windows JDK SSL Test Failure trend: SSLException: Software caused connection abort: recv failed
> -----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13778
>                 URL: https://issues.apache.org/jira/browse/SOLR-13778
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>
> Now that Uwe's jenkins build has been correctly reporting its build results for my [automated reports|http://fucit.org/solr-jenkins-reports/failure-report.html] to pick up, I've noticed a pattern of failures that indicate a definite problem with using SSL on Windows (even with java 11.0.4)
>  The symptomatic stack traces all contain...
> {noformat}
> ...
>    [junit4]    > Caused by: javax.net.ssl.SSLException: Software caused connection abort: recv failed
>    [junit4]    >        at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:127)
> ...
>    [junit4]    > Caused by: java.net.SocketException: Software caused connection abort: recv failed
>    [junit4]    >        at java.base/java.net.SocketInputStream.socketRead0(Native Method)
> ...
> {noformat}
> I suspect this may be related to [https://bugs.openjdk.java.net/browse/JDK-8209333] but i have no concrete evidence to back this up.
> I'll post some details of my analysis in comments...


