You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/11/08 16:45:59 UTC

[GitHub] [pulsar-client-go] bschofield opened a new pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

bschofield opened a new pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663


   Fixes #662
   
   ### Motivation
   
   As described in #662, there appears to be a potential race condition in connection.go function _waitUntilReady()_: the `cond.Broadcast()` can occur before the `cond.Wait()`.
   
   ### Modifications
   
   Function `waitUntilReady()` was previously holding the global `c.Lock()` on the connection. From my reading of the code, this mutex is intended to protect the `cnx` variable. I think that the use of `c.Lock()` in `waitUntilReady()` was probably just a typo.
   
   Instead, I think the author probably intended to grab the lock associated with the `sync.Cond`, i.e. `c.cond.L.Lock()`. This looks like the correct thing to do when using a `sync.Cond`. I think there should be a corresponding lock around the `cond.Broadcast()` also. See https://stackoverflow.com/questions/36857167/how-to-correctly-use-sync-cond/42772799#42772799 for more details.
   
   ### Verifying this change
   
   I am unsure if this change is covered by existing tests. It fixes a rare race condition, so I think it would be quite difficult to test for.
   
   I have deployed this change on my own production system, and it doesn't obviously break anything. I report back if I see any issues that might be related to it.
   
   ### Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API: no
     - The schema: no
     - The default values of configurations: no
     - The wire protocol: no
   
   ### Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963949062


   @BewareMyPower: I tried this modification of your code, which I think is exactly the one you suggested:
   
   ```go
   package main
   
   import (
   	"fmt"
   	"sync"
   	"time"
   )
   
   func main() {
   	m := sync.Mutex{}
   	c := sync.NewCond(&m)
   	flag := false
   
   	defer m.Unlock()
   	m.Lock()
   
   	go func() {
   		// No locks here
   		fmt.Println("Before Broadcast")
   		flag = true
   		c.Broadcast()
   		fmt.Println("After Broadcast")
   	}()
   
   	// Sleep for 2 seconds to make sure c.Broadcast() is called before c.Wait()
   	time.Sleep(time.Duration(2) * time.Second)
   	fmt.Println("Before Wait")
   	for {
   		c.Wait()
   		if flag {
   			break
   		}
   	}
   	fmt.Printf("After Wait, flag is %v\n", flag)
   }
   ```
   
   When I run that, this happens:
   ```
   Before Broadcast
   After Broadcast
   Before Wait
   fatal error: all goroutines are asleep - deadlock!
   
   goroutine 1 [sync.Cond.Wait]:
   sync.runtime_notifyListWait(0xc0000b8050, 0x0)
   	/home/ben/sdk/go1.17.1/src/runtime/sema.go:513 +0x13d
   sync.(*Cond).Wait(0x4b4360)
   	/home/ben/sdk/go1.17.1/src/sync/cond.go:56 +0x8c
   main.main()
   	/home/ben/tmp/foo.go:29 +0x191
   ```
   So, I think that test case (which you very kindly provided) confirms that we agree that it really is necessary to call `cond.Broadcast()`  after `cond.Wait()`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] BewareMyPower commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
BewareMyPower commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963969328


   > @BewareMyPower:
   > 
   > > Therefore, I think a better solution is [...]
   > 
   > With that solution, the bug that I am experiencing in production could still occur:
   > 
   > [goroutine A] calls `c.getState()` and sees that it is not set to `connectionReady` [goroutine B] changes the state to `connectionReady` [goroutine B] sends a `cond.Broadcast()`, which goes nowhere because no goroutine is waiting. [goroutine A] calls `cond.Wait()`, which never completes
   > 
   > To avoid this, it is necessary to hold the lock before calling `c.getState()`.
   
   It's right. I also realized this problem just now. Here we should lock the whole `changeState`. I'm just wondering whether there should be another lock for this. But `c.Lock()` might be used in more places in future. (If `c.Lock()` and `c.cond.L.Lock()` are the same)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] cckellogg merged pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
cckellogg merged pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on a change in pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield commented on a change in pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#discussion_r745398040



##########
File path: pulsar/internal/connection.go
##########
@@ -327,8 +327,8 @@ func (c *connection) doHandshake() bool {
 }
 
 func (c *connection) waitUntilReady() error {
-	c.Lock()
-	defer c.Unlock()
+	c.cond.L.Lock()

Review comment:
       They're two different mutexes. See https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963931326 for an explanation.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on a change in pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield commented on a change in pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#discussion_r745414797



##########
File path: pulsar/internal/connection.go
##########
@@ -327,8 +327,8 @@ func (c *connection) doHandshake() bool {
 }
 
 func (c *connection) waitUntilReady() error {
-	c.Lock()
-	defer c.Unlock()
+	c.cond.L.Lock()

Review comment:
       I agree that `waitUntilReady()` and `changeState()` should share the same lock. I also agree that `waitUntilReady()` and `changeState()` should share the same lock. I think the correct lock to use is `c.cond.L`, for the reasons given in the main discussion.

##########
File path: pulsar/internal/connection.go
##########
@@ -327,8 +327,8 @@ func (c *connection) doHandshake() bool {
 }
 
 func (c *connection) waitUntilReady() error {
-	c.Lock()
-	defer c.Unlock()
+	c.cond.L.Lock()

Review comment:
       I agree that `waitUntilReady()` and `changeState()` should share the same lock.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield edited a comment on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963433544


   @cckellogg @BewareMyPower I think this relates to your discussion in https://github.com/apache/pulsar-client-go/pull/631/files#r723350028, would appreciate it if you could take a look.
   
   I think you correctly removed the use of `c.Lock()`, but I think `c.cond.L.Lock()` should be used instead. Although it is absolutely true that the `sync.Cond` documentation says you aren't _required_ to hold the lock when broadcasting, if you don't then I think you can end up with the race observed in #662. Thanks in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963965511


   @BewareMyPower:
   > Therefore, I think a better solution is [...]
   
   With that solution, the bug that I am experiencing in production could still occur:
   
   [goroutine A] calls `c.getState()` and sees that it is not set to `connectionReady`
   [goroutine B] changes the state to `connectionReady`
   [goroutine B] sends a `cond.Broadcast()`, which goes nowhere because no goroutine is waiting.
   [goroutine A] calls `cond.Wait()`, which never completes
   
   To avoid this, it is necessary to hold the lock before calling `c.getState()`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] BewareMyPower edited a comment on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
BewareMyPower edited a comment on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963732361


   https://github.com/apache/pulsar-client-go/blob/fe3b7c4e445b3de42974ca692574229ad9099a45/pulsar/internal/connection.go#L212
   
   And see https://pkg.go.dev/sync@go1.17.1#NewCond
   
   > NewCond returns a new Cond with Locker l.
   
   ```go
   // NewCond returns a new Cond with Locker l.
   func NewCond(l Locker) *Cond {
       return &Cond{L: l}
   }
   ```
   
   So I think `c.Lock()` is the same as `c.cond.L.Lock()`.
   
   Back to your issue, how could you ensure `c.cond.Wait()` must be called before `c.cond.BroadCast()` is called? And I think it's not necessary. Please see my code snippet.
   
   ```go
   package main
   
   import (
   	"fmt"
   	"sync"
   	"time"
   )
   
   func main() {
   	m := sync.Mutex{}
   	c := sync.NewCond(&m)
   	flag := false
   
   	defer m.Unlock()
   	m.Lock()
   
   	go func() {
   		// No locks here
   		fmt.Println("Before Broadcast")
   		flag = true
   		c.Broadcast()
   		fmt.Println("After Broadcast")
   	}()
   
   	// Sleep for 2 seconds to make sure c.Broadcast() is called before c.Wait()
   	time.Sleep(time.Duration(2) * time.Second)
   	fmt.Println("Before Wait")
   	for !flag {
   		c.Wait()
   	}
   	fmt.Printf("After Wait, flag is %v\n", flag)
   }
   ```
   
   I called `time.Sleep` to make sure `c.Wait()` is called after `c.Broadcast()`. And the output is
   
   ```
   Before Broadcast
   After Broadcast
   Before Wait
   After Wait, flag is true
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] BewareMyPower edited a comment on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
BewareMyPower edited a comment on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963966453


   > we agree that it really is necessary to call cond.Broadcast() after cond.Wait()?
   
   Yeah, I agree.
   
   > If changeState gets called before waitUntilReady, then the broadcast happens first and all is well.
   >
   > if waitUntilReady gets called before changeState, then waitUntilReady will block on the cond.Wait(), whilst still holding the lock. 
   
   It makes sense. We should make the combination of `setState` and `c.Broadcast()` an atomic operation.
   
   > however you should avoid calling defer inside the for loop because a new defer call will be pushed onto the stack for each loop invocation.
   
   Thanks for pointing it out. I'm not so familiar with Go, just being used to C++ object's lifetime rule, which is a little different.
   
   ----
   
   The only question I'm still confused is, could you explain following code and its output?
   
   ```go
   package main
   
   import (
   	"fmt"
   	"sync"
   )
   
   type connection struct {
   	sync.Mutex
   	cond *sync.Cond
   }
   
   func NewConnection() *connection {
   	cnx := &connection{}
   	cnx.cond = sync.NewCond(cnx)
   	return cnx
   }
   
   func main() {
   	cnx := NewConnection()
   	fmt.Printf("%v\n%v\n", cnx, cnx.cond.L)
   
   	cnx.cond.L.Lock()
   	cnx.Lock()
   	cnx.Unlock()
   	cnx.cond.L.Unlock()
   }
   ```
   
   ```
   &{{0 0} 0xc000104040}
   &{{0 0} 0xc000104040}
   fatal error: all goroutines are asleep - deadlock!
   
   goroutine 1 [semacquire]:
   sync.runtime_SemacquireMutex(0xc000010244, 0x0, 0x1)
   	/usr/local/go/src/runtime/sema.go:71 +0x47
   sync.(*Mutex).lockSlow(0xc000010240)
   	/usr/local/go/src/sync/mutex.go:138 +0x105
   sync.(*Mutex).Lock(...)
   	/usr/local/go/src/sync/mutex.go:81
   main.main()
   	main.go:24 +0x175
   ```
   
   It looks like `cnx.Lock()` and `cnx.cond.L.Lock()` are the same.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] BewareMyPower commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
BewareMyPower commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963751704


   The current `waitUntilReady` implementation is
   
   ```go
   func (c *connection) waitUntilReady() error {
   	c.Lock()
   	defer c.Unlock()
   	c.cond.L.Lock()
   	defer c.cond.L.Unlock()
   
   	for c.getState() != connectionReady {
   		c.log.Debugf("Wait until connection is ready state=%s", c.getState().String())
                    // 1. `c.getState()` is called again
   		if c.getState() == connectionClosed {
   			return errors.New("connection error")
   		}
                   // 2. Even if the 2nd call of `c.getState()` returns connectionReady, `c.cond.Wait()` will still be called
   		c.cond.Wait()
   	}
   	return nil
   }
   ```
   
   See my comments for my concern. `getState()` and `setState()` are both atomic. But if we the combination of two `getState()` calls is not atomic.
   
   Therefore, I think a better solution is
   
   ```go
   func (c *connection) waitUntilReady() error {
   	for {
   		switch state := c.getState(); state {
   		case connectionReady:
   			return nil
   		case connectionClosed:
   			return errors.New("connection error")
   		default:
   			c.log.Debugf("Wait until connection is ready state=%s", state.String())
   			// Here we use c.cond.L.Lock() because we might not use c.Lock as the cond's internal Locker in future
   			c.cond.L.Lock()
   			defer c.cond.L.Unlock()
   			c.cond.Wait()
   		}
   	}
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield edited a comment on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963433544


   @cckellogg @BewareMyPower I think this relates to your discussion in https://github.com/apache/pulsar-client-go/pull/631/files#r723350028, would appreciate it if you could take a look.
   
   I think you correctly removed the use of `c.Lock()`, but I think `c.cond.L.Lock()` should be used instead. Although it is absolutely true that the `sync.Cond` documentation says you aren't _required_ to hold the lock when broadcasting, if you don't then it looks like you can end up with the race observed in #662. Thanks in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield edited a comment on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963954888


   @BewareMyPower:
   > Therefore, I think a better solution is [...]
   
   I think that fix also looks fine (and maybe better), however you should avoid calling `defer` inside the for loop because a new defer call will be pushed onto the stack for each loop invocation. So it should be:
   
   ```go
   func (c *connection) waitUntilReady() error {
   	for {
   		switch state := c.getState(); state {
   		case connectionReady:
   			return nil
   		case connectionClosed:
   			return errors.New("connection error")
   		default:
   			c.log.Debugf("Wait until connection is ready state=%s", state.String())
   			// Here we use c.cond.L.Lock() because we are concerned about race conditions between the switch statement above, and the cond.Wait() below
   			c.cond.L.Lock()
   			c.cond.Wait()
   			c.cond.L.Unlock()
   		}
   	}
   }
   ```
   
   I also edited your comment to make it clear why this lock is being used. I definitely agree that a comment is required, this is a really complicated situation and not made clearer by the slightly strange way that `sync.Cond` works.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield edited a comment on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963945418


   @BewareMyPower:
   > However, for the code here, how could you ensure that when `changeState` is called, the lock has already been acquired in `waitUntilReady`?
   
   It doesn't matter whether `changeState()` is called first, or `waitUntilReady()` is called first. Using the lock prevents the race condition that can occur in the gap between the test in `for c.getState() != connectionReady` and the call to `c.cond.Wait()` occurring.
   
   If `changeState` gets called before `waitUntilReady`, then the broadcast happens first and all is well.
   
   if `waitUntilReady` gets called before `changeState`, then `waitUntilReady` will block on the `cond.Wait()`, whilst still holding the lock. However, `cond.Wait()` call has a clever internal release of the lock which I think should still allow the `changeState` to happen:
   
   ```go
   func (c *Cond) Wait() {
   	c.checker.check()
   	t := runtime_notifyListAdd(&c.notify)
   	c.L.Unlock()
   	runtime_notifyListWait(&c.notify, t)
   	c.L.Lock()
   }
   ```
   
   (note that the `c.L` here is what we are calling `c.cond.L`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on a change in pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield commented on a change in pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#discussion_r745439440



##########
File path: pulsar/internal/connection.go
##########
@@ -327,8 +327,8 @@ func (c *connection) doHandshake() bool {
 }
 
 func (c *connection) waitUntilReady() error {
-	c.Lock()
-	defer c.Unlock()
+	c.cond.L.Lock()

Review comment:
       Nope, I was wrong! See https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963966453 which proves that `c.Lock()` and `c.cond.L.Lock()` are the same thing.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield edited a comment on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963973621


   > The only question I'm still confused is, could you explain following code and its output?
   
   Ag! I'm so sorry @BewareMyPower, you're 100% correct: `c.Lock()` and `c.cond.L.Lock()` absolutely are the same lock. I had totally missed that `sync.NewCond` took a `Locker` as an argument. Thanks for clarifying that, and my apologies.
   
   >  I'm just wondering whether there should be another lock for this.
   
   I think that's up to you guys -- it depends what the ultimate intent of the lock is. Is it to protect against all state changes in the connection, or just some of them?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] cckellogg commented on a change in pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
cckellogg commented on a change in pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#discussion_r745270617



##########
File path: pulsar/internal/connection.go
##########
@@ -327,8 +327,8 @@ func (c *connection) doHandshake() bool {
 }
 
 func (c *connection) waitUntilReady() error {
-	c.Lock()
-	defer c.Unlock()
+	c.cond.L.Lock()

Review comment:
       Curious what's the difference between doing `c.Lock()` vs `c.cond.L.Lock()`? Same for the changeState function? From looking at the code they should share the same lock?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963945418


   @BewareMyPower:
   > However, for the code here, how could you ensure that when `changeState` is called, the lock has already been acquired in `waitUntilReady`?
   
   That's the point of using the lock: It doesn't matter whether changeState() is called first, or waitUntilReady() is called first. Using the lock prevents the race condition that can occur in the gap between the test in `for c.getState() != connectionReady` and the call to `c.cond.Wait()` occurring.
   
   If `changeState` gets called before `waitUntilReady`, then the broadcast happens first and all is well.
   
   if `waitUntilReady` gets called before `changeState`, then `changeState` will block on the `cond.Wait()`, whilst still holding the lock. However, `cond.Wait()` call has a clever internal release of the lock which I think should still allow the `changeState` to happen:
   
   ```go
   func (c *Cond) Wait() {
   	c.checker.check()
   	t := runtime_notifyListAdd(&c.notify)
   	c.L.Unlock()
   	runtime_notifyListWait(&c.notify, t)
   	c.L.Lock()
   }
   ```
   
   (note that the `c.L` here is what we are calling `c.cond.L`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield edited a comment on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963945418


   @BewareMyPower:
   > However, for the code here, how could you ensure that when `changeState` is called, the lock has already been acquired in `waitUntilReady`?
   
   That's the point of using the lock: It doesn't matter whether `changeState()` is called first, or `waitUntilReady()` is called first. Using the lock prevents the race condition that can occur in the gap between the test in `for c.getState() != connectionReady` and the call to `c.cond.Wait()` occurring.
   
   If `changeState` gets called before `waitUntilReady`, then the broadcast happens first and all is well.
   
   if `waitUntilReady` gets called before `changeState`, then `waitUntilReady` will block on the `cond.Wait()`, whilst still holding the lock. However, `cond.Wait()` call has a clever internal release of the lock which I think should still allow the `changeState` to happen:
   
   ```go
   func (c *Cond) Wait() {
   	c.checker.check()
   	t := runtime_notifyListAdd(&c.notify)
   	c.L.Unlock()
   	runtime_notifyListWait(&c.notify, t)
   	c.L.Lock()
   }
   ```
   
   (note that the `c.L` here is what we are calling `c.cond.L`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on a change in pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield commented on a change in pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#discussion_r745414797



##########
File path: pulsar/internal/connection.go
##########
@@ -327,8 +327,8 @@ func (c *connection) doHandshake() bool {
 }
 
 func (c *connection) waitUntilReady() error {
-	c.Lock()
-	defer c.Unlock()
+	c.cond.L.Lock()

Review comment:
       I agree that `waitUntilReady()` and `changeState()` should share the same lock. I think the correct lock to use is `c.cond.L`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] BewareMyPower commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
BewareMyPower commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963976311


   > I think that's up to you guys -- it depends what the ultimate intent of the lock is. Is it to protect against all state changes in the connection, or just some of them.
   
   Yeah, it won't block this PR. Just say it because I've seen the similar case in C++ client that a common mutex is used everywhere.
   
   For this PR, I think you can change `c.cond.L.` back to `c.`, the it will LGTM.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963984362


   OK, I made that change and added a comment so hopefully nobody removes the lock in future. Thanks again for looking at this, I really appreciate it!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963433544


   @cckellogg @BewareMyPower I think this relates to your discussion in https://github.com/apache/pulsar-client-go/pull/631/files#r723350028, would appreciate it if you could take a look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] BewareMyPower commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
BewareMyPower commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963732361


   https://github.com/apache/pulsar-client-go/blob/fe3b7c4e445b3de42974ca692574229ad9099a45/pulsar/internal/connection.go#L212
   
   And see https://pkg.go.dev/sync@go1.17.1#NewCond
   
   > NewCond returns a new Cond with Locker l.
   
   So I think `c.Lock()` is the same as `c.cond.L.Lock()`.
   
   Back to your issue, how could you ensure `c.cond.Wait()` must be called before `c.cond.BroadCast()` is called? And I think it's not necessary. Please see my code snippet.
   
   ```go
   package main
   
   import (
   	"fmt"
   	"sync"
   	"time"
   )
   
   func main() {
   	m := sync.Mutex{}
   	c := sync.NewCond(&m)
   	flag := false
   
   	defer m.Unlock()
   	m.Lock()
   
   	go func() {
   		// No locks here
   		fmt.Println("Before Broadcast")
   		flag = true
   		c.Broadcast()
   		fmt.Println("After Broadcast")
   	}()
   
   	// Sleep for 2 seconds to make sure c.Broadcast() is called before c.Wait()
   	time.Sleep(time.Duration(2) * time.Second)
   	fmt.Println("Before Wait")
   	for !flag {
   		c.Wait()
   	}
   	fmt.Printf("After Wait, flag is %v\n", flag)
   }
   ```
   
   I called `time.Sleep` to make sure `c.Wait()` is called after `c.Broadcast()`. And the output is
   
   ```
   Before Broadcast
   After Broadcast
   Before Wait
   After Wait, flag is true
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield edited a comment on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963433544


   @cckellogg @BewareMyPower I think this relates to your discussion in https://github.com/apache/pulsar-client-go/pull/631/files#r723350028, would appreciate it if you could take a look.
   
   You correctly removed the use of `c.Lock()`, but I think `c.cond.L.Lock()` should be used instead. And although it is absolutely true that the `sync.Cond` documentation says you aren't _required_ to hold the `c.cond.L.Lock()` when broadcasting, if you don't then it looks like you can end up with the race observed in #662. Thanks in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] BewareMyPower commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
BewareMyPower commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963735350


   Sorry I gave a wrong example. The `c.Wait()` is not called because `flag` is already true. It's right if my example was modified from
   
   ```go
   	for !flag {
   		c.Wait()
   	}
   ```
   
   to
   
   ```go
   	for {
   		c.Wait()
   		if flag {
   			break
   		}
   	}
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963931326


   Thanks for taking a look at this, @BewareMyPower and @cckellogg!
   
   > So I think c.Lock() is the same as c.cond.L.Lock().
   
   I'm pretty sure that's not the case. If we take a look at the definition of the connection type:
   
   ```go
   
   type connection struct {
   	sync.Mutex
   	cond              *sync.Cond
   [...]
   ```
   
   ...then we can see that `connection` struct is built out of a `sync.Mutex` using struct composition. That means that calls to `c.Lock()` are calling the `sync.Mutex` which is part of the `connection` struct itself.
   
   If we next take a look at the definition of `sync.Cond` we see:
   
   ```go
   type Cond struct {
   	noCopy noCopy
   
   	// L is held while observing or changing the condition
   	L Locker
   [...]
   ```
   
   (where `Locker` is go's interface type for a `sync.Mutex`). So these are two separate locks: they are not the same.
   
   Another way we can convince ourselves of this is by noting that it is forbidden to copy a `sync.Mutex`, you have to take pointers to it. Since the struct composition on `connection` does not use a pointer type, there would be no way to alias the mutex inside `cond` to the mutex on `connection`. They must be different things.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield edited a comment on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963945418


   @BewareMyPower:
   > However, for the code here, how could you ensure that when `changeState` is called, the lock has already been acquired in `waitUntilReady`?
   
   That's the point of using the lock: It doesn't matter whether `changeState()` is called first, or `waitUntilReady()` is called first. Using the lock prevents the race condition that can occur in the gap between the test in `for c.getState() != connectionReady` and the call to `c.cond.Wait()` occurring.
   
   If `changeState` gets called before `waitUntilReady`, then the broadcast happens first and all is well.
   
   if `waitUntilReady` gets called before `changeState`, then `changeState` will block on the `cond.Wait()`, whilst still holding the lock. However, `cond.Wait()` call has a clever internal release of the lock which I think should still allow the `changeState` to happen:
   
   ```go
   func (c *Cond) Wait() {
   	c.checker.check()
   	t := runtime_notifyListAdd(&c.notify)
   	c.L.Unlock()
   	runtime_notifyListWait(&c.notify, t)
   	c.L.Lock()
   }
   ```
   
   (note that the `c.L` here is what we are calling `c.cond.L`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963954888


   @BewareMyPower:
   > Therefore, I think a better solution is [...]
   
   I think that fix also looks fine (and maybe better), however you should avoid calling `defer` inside the for loop because a new defer call will be pushed onto the stack for each loop invocation. So it should be:
   
   ```go
   func (c *connection) waitUntilReady() error {
   	for {
   		switch state := c.getState(); state {
   		case connectionReady:
   			return nil
   		case connectionClosed:
   			return errors.New("connection error")
   		default:
   			c.log.Debugf("Wait until connection is ready state=%s", state.String())
   			// Here we use c.cond.L.Lock() because we are concerned about race conditions between the switch statement above, and the `cond.Wait()` below
   			c.cond.L.Lock()
   			c.cond.Wait()
   			c.cond.L.Unlock()
   		}
   	}
   }
   ```
   
   I also edited your comment to make it clear why this lock is being used. I definitely agree that a comment is required, this is a really complicated situation and not made clearer by the slightly strange way that `sync.Cond` works.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield removed a comment on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield removed a comment on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963954888


   @BewareMyPower:
   > Therefore, I think a better solution is [...]
   
   I think that fix also looks fine (and maybe better), however you should avoid calling `defer` inside the for loop because a new defer call will be pushed onto the stack for each loop invocation. So it should be:
   
   ```go
   func (c *connection) waitUntilReady() error {
   	for {
   		switch state := c.getState(); state {
   		case connectionReady:
   			return nil
   		case connectionClosed:
   			return errors.New("connection error")
   		default:
   			c.log.Debugf("Wait until connection is ready state=%s", state.String())
   			// Here we use c.cond.L.Lock() because we are concerned about race conditions between the switch statement above, and the cond.Wait() below
   			c.cond.L.Lock()
   			c.cond.Wait()
   			c.cond.L.Unlock()
   		}
   	}
   }
   ```
   
   I also edited your comment to make it clear why this lock is being used. I definitely agree that a comment is required, this is a really complicated situation and not made clearer by the slightly strange way that `sync.Cond` works.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
bschofield commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963973621


   > The only question I'm still confused is, could you explain following code and its output?
   
   Ag! I'm so sorry @BewareMyPower, you're 100% correct: `c.Lock()` and `c.cond.L.Lock()` absolutely are the same lock. I had totally missed that `sync.NewCond` took a `Locker` as an argument. Thanks for clarifying that, and my apologies.
   
   >  I'm just wondering whether there should be another lock for this.
   
   I think that's up to you guys -- it depends what the ultimate intent of the lock is. Is it to protect against all state changes in the connection, or just some of them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] BewareMyPower edited a comment on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
BewareMyPower edited a comment on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963976311


   > I think that's up to you guys -- it depends what the ultimate intent of the lock is. Is it to protect against all state changes in the connection, or just some of them.
   
   Yeah, it won't block this PR. Just say it because I've seen the similar case in C++ client that a common mutex is used everywhere.
   
   For this PR, I think you can change `c.cond.L.` back to `c.`, then it will LGTM.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] BewareMyPower commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
BewareMyPower commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963745752


   I agree that we should call `c.Broadcast()` after `c.Wait()`, but the code here is not like the example in https://stackoverflow.com/questions/36857167/how-to-correctly-use-sync-cond/42772799#42772799.
   
   ```go
       m := &sync.Mutex{}
       c := sync.NewCond(m)
       m.Lock()
       go func() {
           m.Lock() // Wait for c.Wait()
           c.Broadcast()
           m.Unlock()
       }()
       c.Wait() // Unlocks m, waits, then locks m again
       m.Unlock()
   ```
   
   Because in that example, the `m.Lock()` has already been called before the goroutine is executed. So `c.Broadcast()` won't be called before `c.Wait()` releases the mutex internally. However, for the code here, how could you ensure that when `changeState` is called, the lock has already been acquired in `waitUntilReady`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] BewareMyPower commented on pull request #663: [Issue 662] Fix race in connection.go waitUntilReady()

Posted by GitBox <gi...@apache.org>.
BewareMyPower commented on pull request #663:
URL: https://github.com/apache/pulsar-client-go/pull/663#issuecomment-963966453


   > we agree that it really is necessary to call cond.Broadcast() after cond.Wait()?
   
   Yeah, I agree.
   
   > If changeState gets called before waitUntilReady, then the broadcast happens first and all is well.
   >
   > if waitUntilReady gets called before changeState, then waitUntilReady will block on the cond.Wait(), whilst still holding the lock. 
   
   It makes sense. We should make the combination of `setState` and `c.Broadcast()` an atomic operation.
   
   > however you should avoid calling defer inside the for loop because a new defer call will be pushed onto the stack for each loop invocation.
   
   Thanks for pointing it out. I'm not so familiar with Go, just being used to C++ object's lifetime rule, which is a little different.
   
   ----
   
   The only question I'm still confused is, could you explain following code and its output?
   
   ```go
   package main
   
   import (
   	"fmt"
   	"sync"
   )
   
   type connection struct {
   	sync.Mutex
   	cond *sync.Cond
   }
   
   func NewConnection() *connection {
   	cnx := &connection{}
   	cnx.cond = sync.NewCond(cnx)
   	return cnx
   }
   
   func main() {
   	cnx := NewConnection()
   	fmt.Printf("%v\n%v\n", cnx, cnx.cond.L)
   
   	cnx.cond.L.Lock()
   	cnx.Lock()
   	cnx.Unlock()
   	cnx.cond.Wait()
   }
   ```
   
   ```
   &{{0 0} 0xc000104040}
   &{{0 0} 0xc000104040}
   fatal error: all goroutines are asleep - deadlock!
   
   goroutine 1 [semacquire]:
   sync.runtime_SemacquireMutex(0xc000010244, 0x0, 0x1)
   	/usr/local/go/src/runtime/sema.go:71 +0x47
   sync.(*Mutex).lockSlow(0xc000010240)
   	/usr/local/go/src/sync/mutex.go:138 +0x105
   sync.(*Mutex).Lock(...)
   	/usr/local/go/src/sync/mutex.go:81
   main.main()
   	main.go:24 +0x175
   ```
   
   It looks like `cnx.Lock()` and `cnx.cond.L.Lock()` are the same.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org