Posted to reviews@spark.apache.org by sarutak <gi...@git.apache.org> on 2014/08/18 21:39:01 UTC

[GitHub] spark pull request: [SPARK-3106] Suppress unwilling CancelledKeyEx...

GitHub user sarutak opened a pull request:

    https://github.com/apache/spark/pull/2019

    [SPARK-3106] Suppress unwilling CancelledKeyException, ClosedChannelException and error messages caused by SendingConnection

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sarutak/spark SPARK-3106

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2019.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2019
    
----
commit 698a47e175559437bfaa1f23fa8158a2b9b68fad
Author: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Date:   2014-08-18T19:37:14Z

    Modified read behavior of SendingConnection

----




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16631901
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
     
       val DEFAULT_INTEREST = SelectionKey.OP_READ
     
    +  var alreadyReading = false
    +
       override def registerInterest() {
         // Registering read too - does not really help in most cases, but for some
         // it does - so let us keep it for now.
    -    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
    +    changeConnectionKeyInterest(
    +      SelectionKey.OP_WRITE | (if (!alreadyReading) {
    +        alreadyReading = true
    +        DEFAULT_INTEREST
    +      } else { 0 }))
    --- End diff --
    
    According to the JavaDoc, SocketChannel#read blocks the thread that calls it.
    I think only one thread per channel needs to wait on channel.read() to detect a disconnection. Why does OP_READ need to be re-registered?




[GitHub] spark pull request: (WIP)[SPARK-3106] Suppress unwilling Cancelled...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52546951
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18776/consoleFull) for   PR 2019 at commit [`48ae3c6`](https://github.com/apache/spark/commit/48ae3c635307b657760545b085135931d8a30b88).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52900626
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19045/consoleFull) for   PR 2019 at commit [`2d7f444`](https://github.com/apache/spark/commit/2d7f444045a036f98dc09b1a15c10bb6efe8e671).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `In multiclass classification, all `$2^`
      * `public final class JavaDecisionTree `





[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52667415
  
    @arahuja I found a code path that leads to the situation you mention. I'll fix it soon.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16631699
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -118,14 +118,33 @@ abstract class Connection(val channel: SocketChannel, val selector: Selector,
       }
     
       def close() {
    -    closed = true
    -    val k = key()
    -    if (k != null) {
    -      k.cancel()
    +    synchronized {
    +      /**
    +       * We should avoid executing the closing sequence
    +       * twice from the same thread.
    +       * Otherwise the call to connectionsById.get() in
    +       * ConnectionManager#removeConnection() can fail the 2nd time.
    +       */
    +      if (!closed) {
    +        disposeSasl()
    +
    +        /**
    +          * callOnCloseCallback() should be invoked
    +          * before k.cancel() and channel.close()
    +          * so that key() does not return null.
    +          * If key() returns null before callOnCloseCallback(),
    +          * we cannot remove the entry from connectionsByKey in
    +          * ConnectionManager and end up with a CancelledKeyException.
    +          */
    +        callOnCloseCallback()
    +        val k = key()
    +        if (k != null) {
    +          k.cancel()
    +        }
    +        channel.close()
    +        closed = true
    +      }
    --- End diff --
    
    The way to handle this is to make closed an AtomicBoolean and do a getAndSet.
    If the result of getAndSet is false, which means closed was false on invocation, only then do the actual logic of close from earlier: it is a bug that all invocations of close were trying to do the same thing.
    
    Essentially :
    a) Change 
    ```var closed = false```
    to
    ```var closed = new AtomicBoolean(false)```
    
    b) Change close() to
    ```
    def close() {
      val prev = closed.getAndSet(true)
      if (! prev) {
        closeImpl()
      }
    }
    ```
    
    Where closeImpl is a private method containing the logic from earlier close (except for the closed variable update).
    
    
    This will ensure that failures in closeImpl still result in the connection being marked as closed, and repeated invocations will not cause the same code to be executed and other failures to surface (like a missing id in the map, etc.).
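
    For illustration, a minimal self-contained sketch of this pattern (the class name and surrounding structure here are stand-ins invented for this discussion, not the exact Spark code):

    ```scala
    import java.util.concurrent.atomic.AtomicBoolean

    abstract class ClosableConnection {
      private val closed = new AtomicBoolean(false)

      def isClosed: Boolean = closed.get()

      // getAndSet(true) returns the previous value, so only the first
      // caller observes false and runs the actual close logic; later
      // callers are no-ops, and a failure inside closeImpl still leaves
      // the connection marked as closed.
      def close(): Unit = {
        if (!closed.getAndSet(true)) {
          closeImpl()
        }
      }

      // Logic from the earlier close(), minus the flag update.
      protected def closeImpl(): Unit
    }
    ```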




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16727953
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -280,42 +280,46 @@ private[spark] class ConnectionManager(
             }
     
             while(!keyInterestChangeRequests.isEmpty) {
    +          // We expect that a key interested in OP_ACCEPT does not change its interest
               val (key, ops) = keyInterestChangeRequests.dequeue()
    -
               try {
    -            if (key.isValid) {
    +            key.synchronized {
                   val connection = connectionsByKey.getOrElse(key, null)
    -              if (connection != null) {
    -                val lastOps = key.interestOps()
    -                key.interestOps(ops)
    -
    -                // hot loop - prevent materialization of string if trace not enabled.
    -                if (isTraceEnabled()) {
    -                  def intToOpStr(op: Int): String = {
    -                    val opStrs = ArrayBuffer[String]()
    -                    if ((op & SelectionKey.OP_READ) != 0) opStrs += "READ"
    -                    if ((op & SelectionKey.OP_WRITE) != 0) opStrs += "WRITE"
    -                    if ((op & SelectionKey.OP_CONNECT) != 0) opStrs += "CONNECT"
    -                    if ((op & SelectionKey.OP_ACCEPT) != 0) opStrs += "ACCEPT"
    -                    if (opStrs.size > 0) opStrs.reduceLeft(_ + " | " + _) else " "
    +              if (connection != null && !connection.isClosed) {
    +                if (key.isValid) {
    +                  if (connection != null) {
    --- End diff --
    
    Ah, the double "connection != null" check is redundant, as you mentioned.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-53410681
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19208/consoleFull) for   PR 2019 at commit [`4eee6c9`](https://github.com/apache/spark/commit/4eee6c9402efb862b015ea3a9203ebafb21592bc).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52894136
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19043/consoleFull) for   PR 2019 at commit [`5f91c8d`](https://github.com/apache/spark/commit/5f91c8d445428a5695f87690d78bbd5739dd1e91).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52895490
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19045/consoleFull) for   PR 2019 at commit [`2d7f444`](https://github.com/apache/spark/commit/2d7f444045a036f98dc09b1a15c10bb6efe8e671).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-53148302
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19100/consoleFull) for   PR 2019 at commit [`814692c`](https://github.com/apache/spark/commit/814692c5f467aa787838e85a4bfbbc8f1cd97bae).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16731768
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -280,42 +280,46 @@ private[spark] class ConnectionManager(
             }
     
             while(!keyInterestChangeRequests.isEmpty) {
    +          // We expect that a key interested in OP_ACCEPT does not change its interest
               val (key, ops) = keyInterestChangeRequests.dequeue()
    -
               try {
    -            if (key.isValid) {
    +            key.synchronized {
    --- End diff --
    
    Why do you say they are fatal? They're just normal things. Nothing is broken when those exceptions happen, because the code is handling them. The only thing that happens is that ugly messages are printed to the logs.
    
    As Mridul mentioned, I think it's better to handle these situations than to avoid them, because it's pretty tricky to avoid them. The code already seems to handle these exceptions properly, and your solution to avoiding them altogether may have performance side-effects.
    
    So unless something is actually broken (which I don't think it is), why try to avoid the exceptions in the first place?




[GitHub] spark pull request: (WIP)[SPARK-3106] Suppress unwilling Cancelled...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52544435
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18775/consoleFull) for   PR 2019 at commit [`698a47e`](https://github.com/apache/spark/commit/698a47e175559437bfaa1f23fa8158a2b9b68fad).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52898460
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19043/consoleFull) for   PR 2019 at commit [`5f91c8d`](https://github.com/apache/spark/commit/5f91c8d445428a5695f87690d78bbd5739dd1e91).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16731303
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -280,42 +280,46 @@ private[spark] class ConnectionManager(
             }
     
             while(!keyInterestChangeRequests.isEmpty) {
    +          // We expect that a key interested in OP_ACCEPT does not change its interest
               val (key, ops) = keyInterestChangeRequests.dequeue()
    -
               try {
    -            if (key.isValid) {
    +            key.synchronized {
    --- End diff --
    
    A compromise solution may be to remove the logging, but the situations that throw CancelledKeyException, ClosedChannelException, or AsynchronousCloseException even after the race condition is resolved are fatal. I think we should keep the logging logic to detect such fatal situations.
    I want the error message to appear only when we actually need it.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16730942
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -368,32 +378,38 @@ private[spark] class ConnectionManager(
                 val key = selectedKeys.next
                 selectedKeys.remove()
                 try {
    -              if (key.isValid) {
    -                if (key.isAcceptable) {
    -                  acceptConnection(key)
    -                } else
    -                if (key.isConnectable) {
    -                  triggerConnect(key)
    -                } else
    -                if (key.isReadable) {
    -                  triggerRead(key)
    -                } else
    -                if (key.isWritable) {
    -                  triggerWrite(key)
    +              key.synchronized {
    +                val connection = connectionsByKey.getOrElse(key, null)
    +                if (key.channel.isInstanceOf[ServerSocketChannel] ||
    --- End diff --
    
    The reason for checking ServerSocketChannel is the same as above.
    The same goes for key.synchronized.




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52893814
  
    @arahuja I've updated the patch. Can you test with the new PR?




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16632326
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -118,14 +118,33 @@ abstract class Connection(val channel: SocketChannel, val selector: Selector,
       }
     
       def close() {
    -    closed = true
    -    val k = key()
    -    if (k != null) {
    -      k.cancel()
    +    synchronized {
    +      /**
    +       * We should avoid executing the closing sequence
    +       * twice from the same thread.
    +       * Otherwise the call to connectionsById.get() in
    +       * ConnectionManager#removeConnection() can fail the 2nd time.
    +       */
    +      if (!closed) {
    +        disposeSasl()
    +
    +        /**
    +          * callOnCloseCallback() should be invoked
    +          * before k.cancel() and channel.close()
    +          * so that key() does not return null.
    +          * If key() returns null before callOnCloseCallback(),
    +          * we cannot remove the entry from connectionsByKey in
    +          * ConnectionManager and end up with a CancelledKeyException.
    +          */
    +        callOnCloseCallback()
    +        val k = key()
    +        if (k != null) {
    +          k.cancel()
    +        }
    +        channel.close()
    +        closed = true
    +      }
    --- End diff --
    
    O.K. Connection#close just marks the connection as closed, and a failure during closing is not recovered, right?
    If so, using an AtomicBoolean is reasonable.




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52635149
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18842/consoleFull) for   PR 2019 at commit [`ce07ae5`](https://github.com/apache/spark/commit/ce07ae58453715acbf646c6cfbadff9a106980c3).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16732756
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -280,42 +280,46 @@ private[spark] class ConnectionManager(
             }
     
             while(!keyInterestChangeRequests.isEmpty) {
    +          // Expect key interested in OP_ACCEPT is not change its interest
               val (key, ops) = keyInterestChangeRequests.dequeue()
    -
               try {
    -            if (key.isValid) {
    +            key.synchronized {
    --- End diff --
    
    If the remote host shuts down suddenly by accident, the connection should be closed. I don't think that's normal.
    
    Hmm, I think ignoring / hiding those ugly messages, or lowering their severity, is one possible compromise.





[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52899199
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19047/consoleFull) for   PR 2019 at commit [`2d7f444`](https://github.com/apache/spark/commit/2d7f444045a036f98dc09b1a15c10bb6efe8e671).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-53147147
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19100/consoleFull) for   PR 2019 at commit [`814692c`](https://github.com/apache/spark/commit/814692c5f467aa787838e85a4bfbbc8f1cd97bae).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16631910
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -118,14 +118,33 @@ abstract class Connection(val channel: SocketChannel, val selector: Selector,
       }
     
       def close() {
    -    closed = true
    -    val k = key()
    -    if (k != null) {
    -      k.cancel()
    +    synchronized {
    +      /**
    +       * We should avoid executing the closing sequence
    +       * twice from the same thread.
    +       * Otherwise the call to connectionsById.get() in
    +       * ConnectionManager#removeConnection() can fail the 2nd time.
    +       */
    +      if (!closed) {
    +        disposeSasl()
    +
    +        /**
    +          * callOnCloseCallback() should be invoked
    +          * before k.cancel() and channel.close()
    +          * so that key() does not return null.
    +          * If key() returns null before callOnCloseCallback(),
    +          * we cannot remove the entry from connectionsByKey in
    +          * ConnectionManager and end up with a CancelledKeyException.
    +          */
    +        callOnCloseCallback()
    +        val k = key()
    +        if (k != null) {
    +          k.cancel()
    +        }
    +        channel.close()
    +        closed = true
    +      }
    --- End diff --
    
    I think the whole closing sequence should be executed atomically, so using only an AtomicBoolean is insufficient.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16631998
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
     
       val DEFAULT_INTEREST = SelectionKey.OP_READ
     
    +  var alreadyReading = false
    +
       override def registerInterest() {
         // Registering read too - does not really help in most cases, but for some
         // it does - so let us keep it for now.
    -    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
    +    changeConnectionKeyInterest(
    +      SelectionKey.OP_WRITE | (if (!alreadyReading) {
    +        alreadyReading = true
    +        DEFAULT_INTEREST
    +      } else { 0 }))
    --- End diff --
    
    We use non-blocking IO.
    Please take a look at the ConnectionManager and Connection classes in detail to get a better understanding of the codebase; there are a lot of resources online about how to use NIO in non-blocking mode in a multithreaded application.
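
    For readers following along, a bare-bones example of the non-blocking setup being described (standard Java NIO usage, not Spark-specific code; the address is a placeholder):

    ```scala
    import java.net.InetSocketAddress
    import java.nio.channels.{SelectionKey, Selector, SocketChannel}

    // Standard non-blocking NIO setup: the channel is registered with a
    // selector and read/write readiness is delivered as events, so no
    // thread ever blocks inside read() or write().
    val selector = Selector.open()
    val channel = SocketChannel.open()
    channel.configureBlocking(false)
    channel.connect(new InetSocketAddress("localhost", 9999))  // placeholder address
    channel.register(selector, SelectionKey.OP_CONNECT)
    ```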




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by arahuja <gi...@git.apache.org>.
Github user arahuja commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52664737
  
    I tested this patch on branch-1.0 and still see those exceptions in the logs; I'm curious whether you expected this to work there as well, or only on YARN?
    
    Exceptions:
    
    14/08/19 12:39:42 WARN SendingConnection: Error writing in connection to ConnectionManagerId(demeter-csmaz11-4.demeter.hpc.mssm.edu,35328)
    java.nio.channels.AsynchronousCloseException
            at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
            at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
            at org.apache.spark.network.SendingConnection.write(Connection.scala:380)
            at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:744)
    
    14/08/19 12:37:25 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(demeter-csmau08-20.demeter.hpc.mssm.edu,53302)
    14/08/19 12:37:25 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@24644e7f
    java.nio.channels.CancelledKeyException
            at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:287)
            at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)





[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52627997
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18842/consoleFull) for   PR 2019 at commit [`ce07ae5`](https://github.com/apache/spark/commit/ce07ae58453715acbf646c6cfbadff9a106980c3).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16733264
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -280,42 +280,46 @@ private[spark] class ConnectionManager(
             }
     
             while(!keyInterestChangeRequests.isEmpty) {
    +          // We expect that a key interested in OP_ACCEPT does not change its interest
               val (key, ops) = keyInterestChangeRequests.dequeue()
    -
               try {
    -            if (key.isValid) {
    +            key.synchronized {
    --- End diff --
    
    It is normal. In a distributed system, hosts will go down and you have to handle that.
    
    Now, about logging that it happened. It's useful to log that a host went down. But is this the place to do it? These exceptions are too generic for that - CancelledKeyExceptions can happen for other reasons too. So the code that is notified that a host went down and didn't expect that to happen should log it instead.
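
    A rough sketch of that separation of concerns (the logDebug parameter is a hypothetical stand-in, and the real ConnectionManager does more bookkeeping around this call):

    ```scala
    import java.nio.channels.{CancelledKeyException, SelectionKey}

    // A key can be cancelled for several reasons (local close, remote
    // close, shutdown), so this handler only logs quietly; the code that
    // tracks remote hosts decides whether a disconnect is unexpected
    // enough to warrant an error-level message.
    def changeInterest(key: SelectionKey, ops: Int, logDebug: String => Unit): Unit = {
      try {
        key.interestOps(ops)
      } catch {
        case _: CancelledKeyException => logDebug(s"Key already cancelled: $key")
      }
    }
    ```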




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52933110
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19055/consoleFull) for   PR 2019 at commit [`22bae6f`](https://github.com/apache/spark/commit/22bae6f9ea91db99e9ca723de22a8c3bab1d5e22).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52933277
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19055/consoleFull) for   PR 2019 at commit [`22bae6f`](https://github.com/apache/spark/commit/22bae6f9ea91db99e9ca723de22a8c3bab1d5e22).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-53192139
  
    One of the issues I'd like to resolve in this PR is the mis-detection that occurs when a SendingConnection is closed by its corresponding ReceivingConnection in removeConnection.
    If a SendingConnection closes itself and then invokes removeConnection, the corresponding ReceivingConnection fails to close the SendingConnection because sendingConnectionOpt.isDefined is false.
    
    
              if (!sendingConnectionOpt.isDefined) {
                logError(s"Corresponding SendingConnection to ${remoteConnectionManagerId} not found")
                return
              }
    
    Actually, this situation is not an error.
    
    Can we remove the logic for closing the SendingConnection? It's expected to be closed by itself or by ConnectionManager#stop, right?




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52665868
  
    Hi @arahuja, I tested on Hadoop 2.x with YARN.
    It seemed that the exceptions you mentioned calmed down.
    
    Before my change, I saw those exceptions on the drivers. Where did you see them?




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16728124
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -334,17 +338,23 @@ private[spark] class ConnectionManager(
                   while (allKeys.hasNext) {
                     val key = allKeys.next()
                     try {
    -                  if (! key.isValid) {
    -                    logInfo("Key not valid ? " + key)
    -                    throw new CancelledKeyException()
    +                  key.synchronized {
    +                    val connection = connectionsByKey.getOrElse(key, null)
    +                    if (key.channel.isInstanceOf[ServerSocketChannel] ||
    +                      connection != null && !connection.isClosed) {
    +                      if (!key.isValid) {
    --- End diff --
    
    Merge with the previous condition? Also, the check looks weird (why check for ServerSocketChannel?).




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16630105
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
     
       val DEFAULT_INTEREST = SelectionKey.OP_READ
     
    +  var alreadyReading = false
    +
       override def registerInterest() {
         // Registering read too - does not really help in most cases, but for some
         // it does - so let us keep it for now.
    -    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
    +    changeConnectionKeyInterest(
    +      SelectionKey.OP_WRITE | (if (!alreadyReading) {
    +        alreadyReading = true
    +        DEFAULT_INTEREST
    +      } else { 0 }))
    --- End diff --
    
    What is the intent behind this change?
    Probably there is a misunderstanding of why DEFAULT_INTEREST is registered: it is not to actually read from this socket, but to register for read events so that a close is detected.
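
    To make the mechanism concrete, a small sketch of how a read event signals a remote close in non-blocking NIO (generic NIO behavior, not the exact Spark code):

    ```scala
    import java.nio.ByteBuffer
    import java.nio.channels.SocketChannel

    // On a send-only channel nothing is expected to arrive, so when the
    // selector fires OP_READ, a read() returning -1 (end-of-stream)
    // means the peer closed the connection.
    def remoteClosed(channel: SocketChannel): Boolean = {
      val buf = ByteBuffer.allocate(1)
      channel.read(buf) == -1
    }
    ```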




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52621410
  
    This change can resolve the ClosedChannelException and CancelledKeyException being thrown, as well as the warning messages "Corresponding SendingConnectionManagerId not found" and "All connections not cleaned up" that we have seen recently.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak closed the pull request at:

    https://github.com/apache/spark/pull/2019




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-53388879
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19203/consoleFull) for   PR 2019 at commit [`855c207`](https://github.com/apache/spark/commit/855c2076c34801c8228c8e3fa3e5dee30c82e853).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16630107
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
     
       val DEFAULT_INTEREST = SelectionKey.OP_READ
     
    +  var alreadyReading = false
    +
       override def registerInterest() {
         // Registering read too - does not really help in most cases, but for some
         // it does - so let us keep it for now.
    -    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
    +    changeConnectionKeyInterest(
    +      SelectionKey.OP_WRITE | (if (!alreadyReading) {
    +        alreadyReading = true
    +        DEFAULT_INTEREST
    +      } else { 0 }))
       }
     
       override def unregisterInterest() {
    -    changeConnectionKeyInterest(DEFAULT_INTEREST)
    +    changeConnectionKeyInterest(0)
    --- End diff --
    
    Incorrect change, please see above.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16631703
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
     
       val DEFAULT_INTEREST = SelectionKey.OP_READ
     
    +  var alreadyReading = false
    +
       override def registerInterest() {
         // Registering read too - does not really help in most cases, but for some
         // it does - so let us keep it for now.
    -    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
    +    changeConnectionKeyInterest(
    +      SelectionKey.OP_WRITE | (if (!alreadyReading) {
    +        alreadyReading = true
    +        DEFAULT_INTEREST
    +      } else { 0 }))
    --- End diff --
    
    There is no blocking read - read events are never fired for a SendingConnection unless the socket was closed from underneath us.
    That is why we always re-register OP_READ irrespective of whether we are registering for OP_WRITE or not.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16728852
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -334,17 +338,23 @@ private[spark] class ConnectionManager(
                   while (allKeys.hasNext) {
                     val key = allKeys.next()
                     try {
    -                  if (! key.isValid) {
    -                    logInfo("Key not valid ? " + key)
    -                    throw new CancelledKeyException()
    +                  key.synchronized {
    +                    val connection = connectionsByKey.getOrElse(key, null)
    +                    if (key.channel.isInstanceOf[ServerSocketChannel] ||
    +                      connection != null && !connection.isClosed) {
    +                      if (!key.isValid) {
    --- End diff --
    
    I don't think we can merge the two conditions, because L343-344 checks whether the connection was closed by another thread (which is not always an abnormal case; it sometimes happens), while L345 checks whether the channel was accidentally closed. Usually, if the channel is closed, connection should be null; I'd like to detect that irregular case.
    
    And I check for ServerSocketChannel because when the key belongs to a ServerSocketChannel, connection is always null, and without the check we would enter an infinite loop.




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52904034
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19047/consoleFull) for   PR 2019 at commit [`2d7f444`](https://github.com/apache/spark/commit/2d7f444045a036f98dc09b1a15c10bb6efe8e671).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52621462
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18839/consoleFull) for   PR 2019 at commit [`e1b580e`](https://github.com/apache/spark/commit/e1b580e10a591255a3c76941933111961fe3a572).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52951855
  
    **Tests timed out** after a configured wait of `120m`.




[GitHub] spark pull request: (WIP)[SPARK-3106] Suppress unwilling Cancelled...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52544364
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18775/consoleFull) for   PR 2019 at commit [`698a47e`](https://github.com/apache/spark/commit/698a47e175559437bfaa1f23fa8158a2b9b68fad).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-53386362
  
    In this PR, I want to resolve the following issues.
    
    (1) Race condition between a thread invoking ConnectionManager#stop and threads invoking Connection#close
    
    In this case, if a thread invoking ConnectionManager#stop evaluates "connectionsByKey -= connection.key" in ConnectionManager#removeConnection() after a thread invoking Connection#close evaluates k.cancel or channel.close in Connection#close(), the warning message "All connections not cleaned up" appears, because by the time "connectionsByKey -= connection.key" is evaluated, the key is already null.
    
    (2) Race condition between a thread invoking SendingConnection#close and a thread invoking SendingConnection#close after invoking ReceivingConnection#close
    
    In this case, if a thread invoking ReceivingConnection#close evaluates "!sendingConnectionOpt.isDefined" in ConnectionManager#removeConnection after a thread invoking SendingConnection#close evaluates connectionsById -= "sendingConnectionManagerId" in ConnectionManager#removeConnection, "!sendingConnectionOpt.isDefined" is true and the error message "Corresponding SendingConnection to ${remoteConnectionManagerId} not found" appears.
    
    (4) Race condition between a thread invoking ConnectionManager#run and threads invoking Connection#close
    
    In this case, if a thread invoking ConnectionManager#run evaluates "!key.isValid" after threads invoking Connection#close evaluate key.cancel, "!key.isValid" is true and an error message related to CancelledKeyException appears.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-53149673
  
    Handling TCP/IP events is by definition async, particularly when state changes can happen orthogonally to the state within Java variables.
    So there is only so much you can do to reduce the exceptions you see in the logs - the important point is not to prevent issues (which is not possible if you want to write performant, robust code), but to detect them and ensure they are handled properly.
    
    Given that, the changes here look fragile: we can revisit this PR when they are addressed, since I think there is value in some of these.
    (For example, make closed an AtomicBoolean and do a getAndSet, performing the expensive close only if the previous value was false; and so on.)
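    A minimal sketch of that AtomicBoolean idea, assuming only that the cleanup sits behind close() (the class and field names are illustrative):
    
        import java.nio.channels.SocketChannel
        import java.util.concurrent.atomic.AtomicBoolean
        
        class CloseOnce(channel: SocketChannel) {
          private val closed = new AtomicBoolean(false)
        
          def close() {
            // getAndSet is atomic: only the first caller sees false and runs
            // the expensive cleanup; repeated invocations become no-ops.
            if (!closed.getAndSet(true)) {
              channel.close()
            }
          }
        
          def isClosed: Boolean = closed.get()
        }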




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-53384128
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19203/consoleFull) for   PR 2019 at commit [`855c207`](https://github.com/apache/spark/commit/855c2076c34801c8228c8e3fa3e5dee30c82e853).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16726926
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -280,42 +280,46 @@ private[spark] class ConnectionManager(
             }
     
             while(!keyInterestChangeRequests.isEmpty) {
    +          // Expect key interested in OP_ACCEPT is not change its interest
    --- End diff --
    
    Not sure I understand what the comment is trying to say.




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52626545
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18839/consoleFull) for   PR 2019 at commit [`e1b580e`](https://github.com/apache/spark/commit/e1b580e10a591255a3c76941933111961fe3a572).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16632025
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -118,14 +118,33 @@ abstract class Connection(val channel: SocketChannel, val selector: Selector,
       }
     
       def close() {
    -    closed = true
    -    val k = key()
    -    if (k != null) {
    -      k.cancel()
    +    synchronized {
    +      /**
    +       * We should avoid executing closing sequence
    +       * double by a same thread.
    +       * Otherwise we can fail to call connectionsById.get() in
    +       * ConnectionManager#removeConnection() at the 2nd time
    +       */
    +      if (!closed) {
    +        disposeSasl()
    +
    +        /**
    +          * callOnCloseCallback() should be invoked
    +          * before k.cancel() and channel.close()
    +          * to avoid key() returns null.
    +          * If key() returns null before callOnCloseCallback(),
    +          * We cannot remove entry from connectionsByKey in ConnectionManager
    +          * and end up being threw CancelledKeyException.
    +          */
    +        callOnCloseCallback()
    +        val k = key()
    +        if (k != null) {
    +          k.cancel()
    +        }
    +        channel.close()
    +        closed = true
    +      }
    --- End diff --
    
    I think you are misunderstanding the intent of what close is supposed to do for the Connection classes. It is supposed to mirror the normal expectation of close on streams - barring the bug I mentioned above.
    
    In a nutshell, it is supposed to mark the connection as closed (so that repeated invocations of the method are idempotent), and clean up if required. Take a look at how close is implemented in general in various JDK IO classes.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16728270
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -280,42 +280,46 @@ private[spark] class ConnectionManager(
             }
     
             while(!keyInterestChangeRequests.isEmpty) {
    +          // Expect key interested in OP_ACCEPT is not change its interest
               val (key, ops) = keyInterestChangeRequests.dequeue()
    -
               try {
    -            if (key.isValid) {
    +            key.synchronized {
    --- End diff --
    
    Another thing to keep in mind is that this is a hot loop (as described in the comment below). Even without contention, I think `synchronized` adds a memory fence and this may impact performance. @rxin do you have some shuffle benchmarks which can check performance?




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16732261
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -334,17 +338,23 @@ private[spark] class ConnectionManager(
                   while (allKeys.hasNext) {
                     val key = allKeys.next()
                     try {
    -                  if (! key.isValid) {
    -                    logInfo("Key not valid ? " + key)
    -                    throw new CancelledKeyException()
    +                  key.synchronized {
    +                    val connection = connectionsByKey.getOrElse(key, null)
    +                    if (key.channel.isInstanceOf[ServerSocketChannel] ||
    +                      connection != null && !connection.isClosed) {
    +                      if (!key.isValid) {
    --- End diff --
    
    You can still merge the conditions. Your code is racy with the connection being closed regardless of whether you have one or two ifs here.
    
    You can get rid of the weird check by doing:
    
        if ((connection == null || !connection.isClosed) && !key.isValid) { throw ... }
    
    But this kinda feeds back into the "why avoid exceptions when they're handled properly" discussion.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16631373
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -118,14 +118,33 @@ abstract class Connection(val channel: SocketChannel, val selector: Selector,
       }
     
       def close() {
    -    closed = true
    -    val k = key()
    -    if (k != null) {
    -      k.cancel()
    +    synchronized {
    +      /**
    +       * We should avoid executing closing sequence
    +       * double by a same thread.
    +       * Otherwise we can fail to call connectionsById.get() in
    +       * ConnectionManager#removeConnection() at the 2nd time
    +       */
    +      if (!closed) {
    +        disposeSasl()
    +
    +        /**
    +          * callOnCloseCallback() should be invoked
    +          * before k.cancel() and channel.close()
    +          * to avoid key() returns null.
    +          * If key() returns null before callOnCloseCallback(),
    +          * We cannot remove entry from connectionsByKey in ConnectionManager
    +          * and end up being threw CancelledKeyException.
    +          */
    +        callOnCloseCallback()
    +        val k = key()
    +        if (k != null) {
    +          k.cancel()
    +        }
    +        channel.close()
    +        closed = true
    +      }
    --- End diff --
    
    SendingConnection#close is called from 3 threads on the same instance.
    For example, the 1st thread of handle-read-write-executor calls ReceivingConnection#close -> SendingConnection#close, the 2nd thread of handle-read-write-executor calls SendingConnection#close, and the 3rd thread, connection-manager-thread, calls ConnectionManager#run -> SendingConnection#close.
    
    I think that if an exception is thrown from any of the methods in close(), the connection should not be marked as closed, because one of those threads is expected to close the resources even if another thread fails to close them.
    
    And the synchronized block is to protect SendingConnection#close from being called by the 3 threads concurrently.
    It can be one of the following situations.
    (1) One thread of handle-read-write-executor evaluates key.cancel in SendingConnection#close
    (2) Then, connection-manager-thread calls removeConnection via callOnCloseCallback and evaluates "connectionsByKey -= connection.key". This should fail because connection.key is null at this time.
    
    After (2) above, connection-manager-thread checks connectionsByKey.size != 0 in ConnectionManager#stop, but the size can never reach 0, and we get the log message "All connections not cleaned up".




[GitHub] spark pull request: [SPARK-3106] *Race Condition Issue* Fix the or...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52898610
  
    Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16727939
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -280,42 +280,46 @@ private[spark] class ConnectionManager(
             }
     
             while(!keyInterestChangeRequests.isEmpty) {
    +          // Expect key interested in OP_ACCEPT is not change its interest
               val (key, ops) = keyInterestChangeRequests.dequeue()
    -
               try {
    -            if (key.isValid) {
    +            key.synchronized {
    --- End diff --
    
    I'm a little concerned that this isn't really adding much benefit.
    
    Connection.close() doesn't synchronize on anything, so it can still be called while this lock is held, as far as I understand the code. So you could still execute this code all the way to L291 and then get a CancelledKeyException the same way. So you're maybe narrowing the conditions in which you'd get the exception, but it's still possible to get it.
    
    I guess this will be similar to Mridul's comment, but: if the issue is that we're logging these exceptions, how about just demoting them? If we're comfortable that the code is doing the right thing and handling the errors appropriately, then the log messages are not that interesting.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16731252
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -280,42 +280,46 @@ private[spark] class ConnectionManager(
             }
     
             while(!keyInterestChangeRequests.isEmpty) {
    +          // Expect key interested in OP_ACCEPT is not change its interest
               val (key, ops) = keyInterestChangeRequests.dequeue()
    -
               try {
    -            if (key.isValid) {
    +            key.synchronized {
    --- End diff --
    
    Sure, but again, all that is doing is avoiding an exception. An exception that is handled properly by the code.
    
    So, what is better:
    - synchronize on every key to avoid throwing an exception
    - don't synchronize and let exceptions be handled
    
    Given that this is a hot loop, and during normal operation keys are not being cancelled left and right, I think the second approach is better. So I'd just turn the log messages into `logDebug`, since they're expected.
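    As a hedged sketch of that second approach (the standalone method and injected logger below are stand-ins, not the actual ConnectionManager code):
    
        import java.nio.channels.{CancelledKeyException, SelectionKey}
        
        // Apply one dequeued interest-change request; a concurrent close()
        // may cancel the key at any point, so treat that as expected.
        def applyInterestChange(key: SelectionKey, ops: Int,
            logDebug: String => Unit) {
          try {
            if (!key.isValid) {
              throw new CancelledKeyException()
            }
            key.interestOps(ops)
          } catch {
            case _: CancelledKeyException =>
              // Expected during normal close races - keep the log quiet.
              logDebug("Key cancelled by a concurrent close, ignoring: " + key)
          }
        }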




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16631095
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
     
       val DEFAULT_INTEREST = SelectionKey.OP_READ
     
    +  var alreadyReading = false
    +
       override def registerInterest() {
         // Registering read too - does not really help in most cases, but for some
         // it does - so let us keep it for now.
    -    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
    +    changeConnectionKeyInterest(
    +      SelectionKey.OP_WRITE | (if (!alreadyReading) {
    +        alreadyReading = true
    +        DEFAULT_INTEREST
    +      } else { 0 }))
    --- End diff --
    
    I understand that registering DEFAULT_INTEREST (OP_READ) is meant to detect the connection being closed by the remote host.
    But once a thread is blocked in channel.read() in SendingConnection#read, DEFAULT_INTEREST is not needed.
    
    In addition, because OP_READ is never unregistered for a SendingConnection, 2 threads can be active on the same SendingConnection, and while one of them cancels its key, the other can evaluate key.isValid in ConnectionManager#run.




[GitHub] spark pull request: (WIP)[SPARK-3106] Suppress unwilling Cancelled...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52553813
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18776/consoleFull) for   PR 2019 at commit [`48ae3c6`](https://github.com/apache/spark/commit/48ae3c635307b657760545b085135931d8a30b88).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16630100
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -118,14 +118,33 @@ abstract class Connection(val channel: SocketChannel, val selector: Selector,
       }
     
       def close() {
    -    closed = true
    -    val k = key()
    -    if (k != null) {
    -      k.cancel()
    +    synchronized {
    +      /**
    +       * We should avoid executing closing sequence
    +       * double by a same thread.
    +       * Otherwise we can fail to call connectionsById.get() in
    +       * ConnectionManager#removeConnection() at the 2nd time
    +       */
    +      if (!closed) {
    +        disposeSasl()
    +
    +        /**
    +          * callOnCloseCallback() should be invoked
    +          * before k.cancel() and channel.close()
    +          * to avoid key() returns null.
    +          * If key() returns null before callOnCloseCallback(),
    +          * We cannot remove entry from connectionsByKey in ConnectionManager
    +          * and end up being threw CancelledKeyException.
    +          */
    +        callOnCloseCallback()
    +        val k = key()
    +        if (k != null) {
    +          k.cancel()
    +        }
    +        channel.close()
    +        closed = true
    +      }
    --- End diff --
    
    This is an incorrect change.
    Any of those methods can throw an exception - leaving Connection.closed as false.
    
    What is the point of the synchronized, btw? None of the other methods are protected by this lock.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16736248
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -334,17 +338,23 @@ private[spark] class ConnectionManager(
                   while (allKeys.hasNext) {
                     val key = allKeys.next()
                     try {
    -                  if (! key.isValid) {
    -                    logInfo("Key not valid ? " + key)
    -                    throw new CancelledKeyException()
    +                  key.synchronized {
    +                    val connection = connectionsByKey.getOrElse(key, null)
    +                    if (key.channel.isInstanceOf[ServerSocketChannel] ||
    +                      connection != null && !connection.isClosed) {
    +                      if (!key.isValid) {
    --- End diff --
    
    I separated !key.isValid from the rest intentionally.
    When the connection is null or closed, !key.isValid is true because the connection was closed by another thread calling Connection#close, and that is not a case we should log.
    
    If key.isValid is false even though the connection is not closed, it's an abnormal case.
    
    Anyway, we (including other committers/contributors) should discuss how we treat this issue.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16728263
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -368,32 +378,38 @@ private[spark] class ConnectionManager(
                 val key = selectedKeys.next
                 selectedKeys.remove()
                 try {
    -              if (key.isValid) {
    -                if (key.isAcceptable) {
    -                  acceptConnection(key)
    -                } else
    -                if (key.isConnectable) {
    -                  triggerConnect(key)
    -                } else
    -                if (key.isReadable) {
    -                  triggerRead(key)
    -                } else
    -                if (key.isWritable) {
    -                  triggerWrite(key)
    +              key.synchronized {
    +                val connection = connectionsByKey.getOrElse(key, null)
    +                if (key.channel.isInstanceOf[ServerSocketChannel] ||
    --- End diff --
    
    Same weird check for ServerSocketChannel.
    
    Also same issue with `key.synchronized` not really protecting against the underlying connection object being closed.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-52934582
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19057/consoleFull) for   PR 2019 at commit [`21d5898`](https://github.com/apache/spark/commit/21d58986ee5a6f386a98518a7d313b73b6aa1133).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16632233
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
     
       val DEFAULT_INTEREST = SelectionKey.OP_READ
     
    +  var alreadyReading = false
    +
       override def registerInterest() {
         // Registering read too - does not really help in most cases, but for some
         // it does - so let us keep it for now.
    -    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
    +    changeConnectionKeyInterest(
    +      SelectionKey.OP_WRITE | (if (!alreadyReading) {
    +        alreadyReading = true
    +        DEFAULT_INTEREST
    +      } else { 0 }))
    --- End diff --
    
    Yes, we use blocking IO.
    And I had a misunderstanding. SocketChannel#read blocks any other thread that tries to call SocketChannel#read on the same instance, so unregistering OP_READ is wrong.
    So we should resolve the race condition another way, because a thread registered for OP_READ can call SocketChannel#close in SendingConnection#read while a thread registered for OP_WRITE calls SocketChannel#write in SendingConnection#write.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16632362
  
    --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
    @@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
     
       val DEFAULT_INTEREST = SelectionKey.OP_READ
     
    +  var alreadyReading = false
    +
       override def registerInterest() {
         // Registering read too - does not really help in most cases, but for some
         // it does - so let us keep it for now.
    -    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
    +    changeConnectionKeyInterest(
    +      SelectionKey.OP_WRITE | (if (!alreadyReading) {
    +        alreadyReading = true
    +        DEFAULT_INTEREST
    +      } else { 0 }))
    --- End diff --
    
    Please keep the following in mind while trying to find a solution:
    
    1) All invocations of register for a write connection will have OP_READ set (so there won't be a case where OP_READ is not set).
    OP_WRITE may or may not be set based on whether we have outstanding data to write or not.
    This is to ensure the TCP stack alerts us in case a remote close is detected (via keep alive, etc).
    
    2) Only a single thread per socket will process it at a given point in time, we ensure this: and marking for re-registration happens within this (not the actual registration - that always happens in the selector thread).
    
    So we won't have the case of conflicting re-registration requests: we ensure this.
    At worst, we can have:
    a) OP_READ (because we finished the write), wakeup selector
    b) before the selector thread woke up, we want to re-register with OP_WRITE | OP_READ again (since some other thread wanted to write data).
    We process registration requests in order - and so (b) will take precedence over (a).
    
    We handle the reverse case of some thread wanting to write while a write is going on and finishes fully (resulting in (a)) by use of resetForceReregister.
    This code path is complicated since it handles a lot of edge cases.
    
    3) No thread calls register on the selector - only the selector thread can (not ensuring this actually causes deadlocks): hence why we have registration request queues for new and existing sockets (a rough sketch follows after this comment).
    
    4) A close can happen because of an explicit close by spark, a close due to socket errors on our own side, a close due to network issues, or a close by the remote side.
    There is only so much we can do to distinguish these.
    We detect remote close via (1) (note, it is not guaranteed to be reported immediately - and sometimes can take a prolonged time) and local close is handled gracefully anyway.
    
    
    Given all this, I am not sure what the MT issues seen are, or their causes; it can be quite involved at times - the one main issue I see is that repeated invocations of close (and there can be repeated invocations, as you rightly pointed out) seem to attempt to clean up the state repeatedly.
    This is incorrect - it should do it once and only once; repeated invocations are legal, but the actual close implementation code should be executed once.
    Of course, exceptions while executing it are fine and unrecoverable, and we have to live with them (as with socket/stream close throwing an exception).
    
    To alleviate this, I proposed the AtomicBoolean change.
    I might obviously be missing other things since it has been a while since I looked at these classes, so a fresh pair of eyes is definitely welcome!
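    To make point (3) concrete, here is a rough, illustrative sketch of that queue-based protocol (the names are mine; the real code keeps this state inside ConnectionManager):
    
        import java.nio.channels.{SelectionKey, Selector}
        import scala.collection.mutable.SynchronizedQueue
        
        class InterestChangeQueue(selector: Selector) {
          private val requests = new SynchronizedQueue[(SelectionKey, Int)]
        
          // May be called from any thread: enqueue the desired ops and
          // wake the selector so it applies them promptly.
          def request(key: SelectionKey, ops: Int) {
            requests += ((key, ops))
            selector.wakeup()
          }
        
          // Called only from the selector thread before select(), so
          // key.interestOps is never invoked from more than one thread.
          // Requests are applied in FIFO order, so a later OP_WRITE | OP_READ
          // request takes precedence over an earlier OP_READ-only one.
          def applyAll() {
            while (!requests.isEmpty) {
              val (key, ops) = requests.dequeue()
              if (key.isValid) {
                key.interestOps(ops)
              }
            }
          }
        }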




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2019#issuecomment-53405037
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19208/consoleFull) for   PR 2019 at commit [`4eee6c9`](https://github.com/apache/spark/commit/4eee6c9402efb862b015ea3a9203ebafb21592bc).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16727606
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -280,42 +280,46 @@ private[spark] class ConnectionManager(
             }
     
             while(!keyInterestChangeRequests.isEmpty) {
    +          // Expect key interested in OP_ACCEPT is not change its interest
               val (key, ops) = keyInterestChangeRequests.dequeue()
    -
               try {
    -            if (key.isValid) {
    +            key.synchronized {
                   val connection = connectionsByKey.getOrElse(key, null)
    -              if (connection != null) {
    -                val lastOps = key.interestOps()
    -                key.interestOps(ops)
    -
    -                // hot loop - prevent materialization of string if trace not enabled.
    -                if (isTraceEnabled()) {
    -                  def intToOpStr(op: Int): String = {
    -                    val opStrs = ArrayBuffer[String]()
    -                    if ((op & SelectionKey.OP_READ) != 0) opStrs += "READ"
    -                    if ((op & SelectionKey.OP_WRITE) != 0) opStrs += "WRITE"
    -                    if ((op & SelectionKey.OP_CONNECT) != 0) opStrs += "CONNECT"
    -                    if ((op & SelectionKey.OP_ACCEPT) != 0) opStrs += "ACCEPT"
    -                    if (opStrs.size > 0) opStrs.reduceLeft(_ + " | " + _) else " "
    +              if (connection != null && !connection.isClosed) {
    +                if (key.isValid) {
    +                  if (connection != null) {
    --- End diff --
    
    This check is redundant.




[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16730782
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -280,42 +280,46 @@ private[spark] class ConnectionManager(
             }
     
             while(!keyInterestChangeRequests.isEmpty) {
    +          // Expect key interested in OP_ACCEPT is not change its interest
               val (key, ops) = keyInterestChangeRequests.dequeue()
    -
               try {
    -            if (key.isValid) {
    +            key.synchronized {
    --- End diff --
    
    Yes, Connection#close synchronizes partially, but all we need to protect is the key cancellation. If a thread (Thread-A) enters L286, no thread can cancel the key in Connection#close() until Thread-A finishes the key-related operations.





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

Posted by sarutak <gi...@git.apache.org>.
Github user sarutak commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2019#discussion_r16727729
  
    --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
    @@ -280,42 +280,46 @@ private[spark] class ConnectionManager(
             }
     
             while(!keyInterestChangeRequests.isEmpty) {
    +          // Expect key interested in OP_ACCEPT is not change its interest
    --- End diff --
    
    If a key for OP_ACCEPT enters this loop, connectionsByKey.getOrElse(key, null) will return null, so this logic ignores OP_ACCEPT. I'll refine the comment.

