You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@curator.apache.org by Jordan Zimmerman <jo...@jordanzimmerman.com> on 2016/02/05 03:23:19 UTC

PLEASE REVIEW - Major re-work of Watcher wrappers

Devs,

In trying to fix the bad log message "Failed to find watcher” (which turns out to be a ZK client issue), I realize that the NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still working on getting all tests to pass but I’d appreciate more sets of eyes on this change. Please review carefully if you can.

https://github.com/apache/curator/pull/131

-Jordan

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Scott Blum <dr...@gmail.com>.
That totally fixed it!  Thanks, Jordan!

https://github.com/apache/curator/pull/132

On Mon, Feb 8, 2016 at 12:08 PM, Jordan Zimmerman <
jordan@jordanzimmerman.com> wrote:

> Well, you could add back an alternate version KillSession that works as it
> used it. I’m OK with that.
>
> -JZ
>
>
> On Feb 8, 2016, at 12:06 PM, Scott Blum <dr...@gmail.com> wrote:
>
> Okay, I took a look, and I couldn't see how to use
> timing.forSessionSleep() in this test correctly.  It's not the test harness
> I'm trying to block.  What I'm trying to test involves seeing how TreeCache
> reacts to a *real* session loss.  The fact that KillSession isn't a real
> session loss anymore is therefore impeding my ability to test what I was
> trying to test before.
>
> Let me explain.  The following manual test does what I want:
>
> 1) Connect a TreeCache to a remote ZK server through an ssh tunnel.
> 2) Follow the node creation steps in testKilledSession()
> 3) Kill the tunnel, wait for session loss, restore the tunnel.
>
> So I guess my question is, is there any way to "really" kill the session,
> server side?  Or should I formulate this test completely differently?
>
> Another way to look at this is that I can't synchronize my test steps to
> TreeCache's internal operations effectively.
>
>
> On Sun, Feb 7, 2016 at 10:36 PM, Scott Blum <dr...@gmail.com> wrote:
>
>> Oooh... let me try that.
>>
>> On Sun, Feb 7, 2016 at 8:29 PM, Jordan Zimmerman <
>> jordan@jordanzimmerman.com> wrote:
>>
>>> FYI - I added a new method to Timing to help with this:
>>>
>>> forSessionSleep()
>>>
>>> The new version of KillSession inserts an eventOfDeath directly into the
>>> client. So, no, it’s not a real session kill like it used to be. But, now
>>> it’s more reliable. Use timing.forSessionSleep() to wait for session
>>> timeout.
>>>
>>> -JZ
>>>
>>>
>>> On Feb 7, 2016, at 8:21 PM, Scott Blum <dr...@gmail.com> wrote:
>>>
>>> Can you describe the change then? Because kill session doesn't seem to
>>> now ensure that ephemeral nodes bound the the killed session disappear in a
>>> timely manner
>>> On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com>
>>> wrote:
>>>
>>>> Are we using a new zookeeper?
>>>>
>>>>
>>>> In Curator 3.0 with “new” connection state handling which simulates a
>>>> Session timeout. However, all tests are run twice. First with the old
>>>> handling and then with the new.
>>>>
>>>> Or did something change with our implementation of KillSession.kill()?
>>>>
>>>>
>>>> KillSession did change though. A change I made to ZK got added in 3.5
>>>> and we now use that.
>>>>
>>>> -Jordan
>>>>
>>>> On Feb 7, 2016, at 1:55 PM, Scott Blum <dr...@gmail.com> wrote:
>>>>
>>>> Are we using a new zookeeper?  Or did something change with our
>>>> implementation of KillSession.kill()?
>>>>
>>>> Or maybe there's a timing issue with Curator's ConnectionStateManager
>>>> State change: LOST?  I don't understand how we could get a LOST event
>>>> without the ephemeral node attached to that session having disappeared?
>>>>
>>>> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman <
>>>> jordan@jordanzimmerman.com> wrote:
>>>>
>>>>> I don’t know if anything changed in ZooKeeper itself. I know that the
>>>>> connection states changed in Curator, but Curator now tests both the old
>>>>> mode and the new mode and they both fail here.
>>>>>
>>>>> -JZ
>>>>>
>>>>> On Feb 7, 2016, at 1:06 AM, Scott Blum <dr...@gmail.com> wrote:
>>>>>
>>>>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0
>>>>> branch is that the ephemeral node /test/me created in
>>>>> testKilledSession() really isn't disappearing when it should.
>>>>>
>>>>> After the session loss and the reconnect, /test still shows 2 children
>>>>> [foo, me] and /test/me still returns a node.
>>>>>
>>>>> Any idea why the timing here would have changed?
>>>>>
>>>>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <dr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> https://issues.apache.org/jira/browse/CURATOR-302
>>>>>>
>>>>>> I need to trace through what's really going on under the hood rather
>>>>>> than band-aid the test.  Should be able to in next couple of days.
>>>>>>
>>>>>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <
>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>
>>>>>>> Any update on this? I think you should create a Jira for it.
>>>>>>>
>>>>>>> -Jordan
>>>>>>>
>>>>>>>
>>>>>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <dr...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> BTW, this test passes on master... so it's some kind of 3.0 vs.
>>>>>>> master issue.  I think I'm going to just have to dump in a ton of log
>>>>>>> messages and see what differs.
>>>>>>>
>>>>>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <
>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>
>>>>>>>> OK - please create a new Issue in Jira for this.
>>>>>>>>
>>>>>>>> -Jordan
>>>>>>>>
>>>>>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dr...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have
>>>>>>>> been broken for a while.  Maybe I'll have to git bisect...
>>>>>>>>
>>>>>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dr...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall.
>>>>>>>>> I think there is a legit bug/race in TreeCache, and the following patch
>>>>>>>>> *should* remedy:
>>>>>>>>>
>>>>>>>>> diff --git
>>>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>>> index df4403c..a4a022b 100644
>>>>>>>>> ---
>>>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>>> +++
>>>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>>>>>>>          void wasDeleted() throws Exception
>>>>>>>>>          {
>>>>>>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>>>>>>> -
>>>>>>>>>  client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>>>>>>              ConcurrentMap<String, TreeNode> childMap =
>>>>>>>>> children.getAndSet(null);
>>>>>>>>>              if ( childMap != null )
>>>>>>>>>              {
>>>>>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>>>>>>>          case RECONNECTED:
>>>>>>>>>              try
>>>>>>>>>              {
>>>>>>>>> +                outstandingOps.incrementAndGet();
>>>>>>>>>                  root.wasReconnected();
>>>>>>>>>
>>>>>>>>>  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>>>>>>>> +                {
>>>>>>>>> +                    if ( isInitialized.compareAndSet(false, true)
>>>>>>>>> )
>>>>>>>>> +                    {
>>>>>>>>> +
>>>>>>>>>  publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>> +                    }
>>>>>>>>> +                }
>>>>>>>>>              }
>>>>>>>>>              catch ( Exception e )
>>>>>>>>>              {
>>>>>>>>>
>>>>>>>>> That should guarantee that the initialized event gets deferred
>>>>>>>>> until all outstanding refreshes finish.. but it's not.  Something seems to
>>>>>>>>> have changed under the hood in how background events are getting sent to
>>>>>>>>> TreeCache, and I don't really understand it yet.  And running the debugger
>>>>>>>>> seems to affect the timing, like something racy is going on. :(
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dragonsinth@gmail.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Ok, that is kind of weird.  I'll take a look.
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <
>>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> No, sorry. The last few lines of the test currently are:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>>
>>>>>>>>>>> This fails. But, if I switch them it works:
>>>>>>>>>>>
>>>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>>
>>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>>>>>>
>>>>>>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dr...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> So you end up with 2 initialized events?
>>>>>>>>>>>
>>>>>>>>>>> You mean this?
>>>>>>>>>>>
>>>>>>>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED)
>>>>>>>>>>> ;
>>>>>>>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
>>>>>>>>>>> "/test/me", "data".getBytes());
>>>>>>>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>>
>>>>>>>>>>> Seems weird if there are two, but I can help look.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
>>>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Scott,
>>>>>>>>>>>>
>>>>>>>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>>>>>>>>>
>>>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
>>>>>>>>>>>> "/test/me", "data".getBytes());
>>>>>>>>>>>>
>>>>>>>>>>>> However, if I change the two asserts to:
>>>>>>>>>>>>
>>>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
>>>>>>>>>>>> "/test/me", "data".getBytes());
>>>>>>>>>>>>
>>>>>>>>>>>> it works. Does that make any sense?
>>>>>>>>>>>>
>>>>>>>>>>>> -Jordan
>>>>>>>>>>>>
>>>>>>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <
>>>>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Devs,
>>>>>>>>>>>> >
>>>>>>>>>>>> > In trying to fix the bad log message "Failed to find watcher”
>>>>>>>>>>>> (which turns out to be a ZK client issue), I realize that the
>>>>>>>>>>>> NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still
>>>>>>>>>>>> working on getting all tests to pass but I’d appreciate more sets of eyes
>>>>>>>>>>>> on this change. Please review carefully if you can.
>>>>>>>>>>>> >
>>>>>>>>>>>> > https://github.com/apache/curator/pull/131
>>>>>>>>>>>> >
>>>>>>>>>>>> > -Jordan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
Well, you could add back an alternate version KillSession that works as it used it. I’m OK with that.

-JZ

> On Feb 8, 2016, at 12:06 PM, Scott Blum <dr...@gmail.com> wrote:
> 
> Okay, I took a look, and I couldn't see how to use timing.forSessionSleep() in this test correctly.  It's not the test harness I'm trying to block.  What I'm trying to test involves seeing how TreeCache reacts to a real session loss.  The fact that KillSession isn't a real session loss anymore is therefore impeding my ability to test what I was trying to test before.
> 
> Let me explain.  The following manual test does what I want:
> 
> 1) Connect a TreeCache to a remote ZK server through an ssh tunnel.
> 2) Follow the node creation steps in testKilledSession()
> 3) Kill the tunnel, wait for session loss, restore the tunnel.
> 
> So I guess my question is, is there any way to "really" kill the session, server side?  Or should I formulate this test completely differently?
> 
> Another way to look at this is that I can't synchronize my test steps to TreeCache's internal operations effectively.
> 
> 
> On Sun, Feb 7, 2016 at 10:36 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
> Oooh... let me try that.
> 
> On Sun, Feb 7, 2016 at 8:29 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
> FYI - I added a new method to Timing to help with this:
> forSessionSleep()
> The new version of KillSession inserts an eventOfDeath directly into the client. So, no, it’s not a real session kill like it used to be. But, now it’s more reliable. Use timing.forSessionSleep() to wait for session timeout.
> 
> -JZ
> 
> 
>> On Feb 7, 2016, at 8:21 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Can you describe the change then? Because kill session doesn't seem to now ensure that ephemeral nodes bound the the killed session disappear in a timely manner
>> 
>> On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>> Are we using a new zookeeper?
>> 
>> 
>> In Curator 3.0 with “new” connection state handling which simulates a Session timeout. However, all tests are run twice. First with the old handling and then with the new.
>> 
>>> Or did something change with our implementation of KillSession.kill()?
>> 
>> KillSession did change though. A change I made to ZK got added in 3.5 and we now use that.
>> 
>> -Jordan
>> 
>>> On Feb 7, 2016, at 1:55 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Are we using a new zookeeper?  Or did something change with our implementation of KillSession.kill()?
>>> 
>>> Or maybe there's a timing issue with Curator's ConnectionStateManager State change: LOST?  I don't understand how we could get a LOST event without the ephemeral node attached to that session having disappeared?
>>> 
>>> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>> I don’t know if anything changed in ZooKeeper itself. I know that the connection states changed in Curator, but Curator now tests both the old mode and the new mode and they both fail here.
>>> 
>>> -JZ
>>> 
>>>> On Feb 7, 2016, at 1:06 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0 branch is that the ephemeral node /test/me created in testKilledSession() really isn't disappearing when it should.
>>>> 
>>>> After the session loss and the reconnect, /test still shows 2 children [foo, me] and /test/me still returns a node.
>>>> 
>>>> Any idea why the timing here would have changed?
>>>> 
>>>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>> https://issues.apache.org/jira/browse/CURATOR-302 <https://issues.apache.org/jira/browse/CURATOR-302>
>>>> 
>>>> I need to trace through what's really going on under the hood rather than band-aid the test.  Should be able to in next couple of days.
>>>> 
>>>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>> Any update on this? I think you should create a Jira for it.
>>>> 
>>>> -Jordan
>>>> 
>>>> 
>>>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> BTW, this test passes on master... so it's some kind of 3.0 vs. master issue.  I think I'm going to just have to dump in a ton of log messages and see what differs.
>>>>> 
>>>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>>> OK - please create a new Issue in Jira for this.
>>>>> 
>>>>> -Jordan
>>>>> 
>>>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been broken for a while.  Maybe I'll have to git bisect...
>>>>>> 
>>>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I think there is a legit bug/race in TreeCache, and the following patch *should* remedy:
>>>>>> 
>>>>>> diff --git a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>> index df4403c..a4a022b 100644
>>>>>> --- a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>> +++ b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>>>>          void wasDeleted() throws Exception
>>>>>>          {
>>>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>>>> -            client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>>>              ConcurrentMap<String, TreeNode> childMap = children.getAndSet(null);
>>>>>>              if ( childMap != null )
>>>>>>              {
>>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>>>>          case RECONNECTED:
>>>>>>              try
>>>>>>              {
>>>>>> +                outstandingOps.incrementAndGet();
>>>>>>                  root.wasReconnected();
>>>>>>                  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>>>>> +                {
>>>>>> +                    if ( isInitialized.compareAndSet(false, true) )
>>>>>> +                    {
>>>>>> +                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>> +                    }
>>>>>> +                }
>>>>>>              }
>>>>>>              catch ( Exception e )
>>>>>>              {
>>>>>> 
>>>>>> That should guarantee that the initialized event gets deferred until all outstanding refreshes finish.. but it's not.  Something seems to have changed under the hood in how background events are getting sent to TreeCache, and I don't really understand it yet.  And running the debugger seems to affect the timing, like something racy is going on. :(
>>>>>> 
>>>>>> 
>>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>>>> Ok, that is kind of weird.  I'll take a look.
>>>>>> 
>>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>>>> No, sorry. The last few lines of the test currently are:
>>>>>> 
>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>> 
>>>>>> This fails. But, if I switch them it works:
>>>>>> 
>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>> 
>>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>>>>> 
>>>>>>> So you end up with 2 initialized events?
>>>>>>> 
>>>>>>> You mean this?
>>>>>>> 
>>>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>> 
>>>>>>> Seems weird if there are two, but I can help look.
>>>>>>> 
>>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>>>>> Hey Scott,
>>>>>>> 
>>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>>>> 
>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>> 
>>>>>>> However, if I change the two asserts to:
>>>>>>> 
>>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>> 
>>>>>>> it works. Does that make any sense?
>>>>>>> 
>>>>>>> -Jordan
>>>>>>> 
>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>>>>> >
>>>>>>> > Devs,
>>>>>>> >
>>>>>>> > In trying to fix the bad log message "Failed to find watcher” (which turns out to be a ZK client issue), I realize that the NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still working on getting all tests to pass but I’d appreciate more sets of eyes on this change. Please review carefully if you can.
>>>>>>> >
>>>>>>> > https://github.com/apache/curator/pull/131 <https://github.com/apache/curator/pull/131>
>>>>>>> >
>>>>>>> > -Jordan
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 
> 


Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Scott Blum <dr...@gmail.com>.
Okay, I took a look, and I couldn't see how to use timing.forSessionSleep()
in this test correctly.  It's not the test harness I'm trying to block.
What I'm trying to test involves seeing how TreeCache reacts to a *real*
session loss.  The fact that KillSession isn't a real session loss anymore
is therefore impeding my ability to test what I was trying to test before.

Let me explain.  The following manual test does what I want:

1) Connect a TreeCache to a remote ZK server through an ssh tunnel.
2) Follow the node creation steps in testKilledSession()
3) Kill the tunnel, wait for session loss, restore the tunnel.

So I guess my question is, is there any way to "really" kill the session,
server side?  Or should I formulate this test completely differently?

Another way to look at this is that I can't synchronize my test steps to
TreeCache's internal operations effectively.


On Sun, Feb 7, 2016 at 10:36 PM, Scott Blum <dr...@gmail.com> wrote:

> Oooh... let me try that.
>
> On Sun, Feb 7, 2016 at 8:29 PM, Jordan Zimmerman <
> jordan@jordanzimmerman.com> wrote:
>
>> FYI - I added a new method to Timing to help with this:
>>
>> forSessionSleep()
>>
>> The new version of KillSession inserts an eventOfDeath directly into the
>> client. So, no, it’s not a real session kill like it used to be. But, now
>> it’s more reliable. Use timing.forSessionSleep() to wait for session
>> timeout.
>>
>> -JZ
>>
>>
>> On Feb 7, 2016, at 8:21 PM, Scott Blum <dr...@gmail.com> wrote:
>>
>> Can you describe the change then? Because kill session doesn't seem to
>> now ensure that ephemeral nodes bound the the killed session disappear in a
>> timely manner
>> On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com>
>> wrote:
>>
>>> Are we using a new zookeeper?
>>>
>>>
>>> In Curator 3.0 with “new” connection state handling which simulates a
>>> Session timeout. However, all tests are run twice. First with the old
>>> handling and then with the new.
>>>
>>> Or did something change with our implementation of KillSession.kill()?
>>>
>>>
>>> KillSession did change though. A change I made to ZK got added in 3.5
>>> and we now use that.
>>>
>>> -Jordan
>>>
>>> On Feb 7, 2016, at 1:55 PM, Scott Blum <dr...@gmail.com> wrote:
>>>
>>> Are we using a new zookeeper?  Or did something change with our
>>> implementation of KillSession.kill()?
>>>
>>> Or maybe there's a timing issue with Curator's ConnectionStateManager
>>> State change: LOST?  I don't understand how we could get a LOST event
>>> without the ephemeral node attached to that session having disappeared?
>>>
>>> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman <
>>> jordan@jordanzimmerman.com> wrote:
>>>
>>>> I don’t know if anything changed in ZooKeeper itself. I know that the
>>>> connection states changed in Curator, but Curator now tests both the old
>>>> mode and the new mode and they both fail here.
>>>>
>>>> -JZ
>>>>
>>>> On Feb 7, 2016, at 1:06 AM, Scott Blum <dr...@gmail.com> wrote:
>>>>
>>>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0
>>>> branch is that the ephemeral node /test/me created in
>>>> testKilledSession() really isn't disappearing when it should.
>>>>
>>>> After the session loss and the reconnect, /test still shows 2 children
>>>> [foo, me] and /test/me still returns a node.
>>>>
>>>> Any idea why the timing here would have changed?
>>>>
>>>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <dr...@gmail.com>
>>>> wrote:
>>>>
>>>>> https://issues.apache.org/jira/browse/CURATOR-302
>>>>>
>>>>> I need to trace through what's really going on under the hood rather
>>>>> than band-aid the test.  Should be able to in next couple of days.
>>>>>
>>>>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <
>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>
>>>>>> Any update on this? I think you should create a Jira for it.
>>>>>>
>>>>>> -Jordan
>>>>>>
>>>>>>
>>>>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <dr...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> BTW, this test passes on master... so it's some kind of 3.0 vs.
>>>>>> master issue.  I think I'm going to just have to dump in a ton of log
>>>>>> messages and see what differs.
>>>>>>
>>>>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <
>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>
>>>>>>> OK - please create a new Issue in Jira for this.
>>>>>>>
>>>>>>> -Jordan
>>>>>>>
>>>>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dr...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have
>>>>>>> been broken for a while.  Maybe I'll have to git bisect...
>>>>>>>
>>>>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dr...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I
>>>>>>>> think there is a legit bug/race in TreeCache, and the following patch
>>>>>>>> *should* remedy:
>>>>>>>>
>>>>>>>> diff --git
>>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>> index df4403c..a4a022b 100644
>>>>>>>> ---
>>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>> +++
>>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>>>>>>          void wasDeleted() throws Exception
>>>>>>>>          {
>>>>>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>>>>>> -
>>>>>>>>  client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>>>>>              ConcurrentMap<String, TreeNode> childMap =
>>>>>>>> children.getAndSet(null);
>>>>>>>>              if ( childMap != null )
>>>>>>>>              {
>>>>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>>>>>>          case RECONNECTED:
>>>>>>>>              try
>>>>>>>>              {
>>>>>>>> +                outstandingOps.incrementAndGet();
>>>>>>>>                  root.wasReconnected();
>>>>>>>>
>>>>>>>>  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>>>>>>> +                {
>>>>>>>> +                    if ( isInitialized.compareAndSet(false, true) )
>>>>>>>> +                    {
>>>>>>>> +
>>>>>>>>  publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>> +                    }
>>>>>>>> +                }
>>>>>>>>              }
>>>>>>>>              catch ( Exception e )
>>>>>>>>              {
>>>>>>>>
>>>>>>>> That should guarantee that the initialized event gets deferred
>>>>>>>> until all outstanding refreshes finish.. but it's not.  Something seems to
>>>>>>>> have changed under the hood in how background events are getting sent to
>>>>>>>> TreeCache, and I don't really understand it yet.  And running the debugger
>>>>>>>> seems to affect the timing, like something racy is going on. :(
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dr...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Ok, that is kind of weird.  I'll take a look.
>>>>>>>>>
>>>>>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <
>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>>
>>>>>>>>>> No, sorry. The last few lines of the test currently are:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>
>>>>>>>>>> This fails. But, if I switch them it works:
>>>>>>>>>>
>>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>
>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>>>>>
>>>>>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dr...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> So you end up with 2 initialized events?
>>>>>>>>>>
>>>>>>>>>> You mean this?
>>>>>>>>>>
>>>>>>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
>>>>>>>>>> "/test/me", "data".getBytes());
>>>>>>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>
>>>>>>>>>> Seems weird if there are two, but I can help look.
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
>>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Scott,
>>>>>>>>>>>
>>>>>>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>>>>>>>>
>>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
>>>>>>>>>>> "/test/me", "data".getBytes());
>>>>>>>>>>>
>>>>>>>>>>> However, if I change the two asserts to:
>>>>>>>>>>>
>>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
>>>>>>>>>>> "/test/me", "data".getBytes());
>>>>>>>>>>>
>>>>>>>>>>> it works. Does that make any sense?
>>>>>>>>>>>
>>>>>>>>>>> -Jordan
>>>>>>>>>>>
>>>>>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <
>>>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Devs,
>>>>>>>>>>> >
>>>>>>>>>>> > In trying to fix the bad log message "Failed to find watcher”
>>>>>>>>>>> (which turns out to be a ZK client issue), I realize that the
>>>>>>>>>>> NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still
>>>>>>>>>>> working on getting all tests to pass but I’d appreciate more sets of eyes
>>>>>>>>>>> on this change. Please review carefully if you can.
>>>>>>>>>>> >
>>>>>>>>>>> > https://github.com/apache/curator/pull/131
>>>>>>>>>>> >
>>>>>>>>>>> > -Jordan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Scott Blum <dr...@gmail.com>.
Oooh... let me try that.

On Sun, Feb 7, 2016 at 8:29 PM, Jordan Zimmerman <jordan@jordanzimmerman.com
> wrote:

> FYI - I added a new method to Timing to help with this:
>
> forSessionSleep()
>
> The new version of KillSession inserts an eventOfDeath directly into the
> client. So, no, it’s not a real session kill like it used to be. But, now
> it’s more reliable. Use timing.forSessionSleep() to wait for session
> timeout.
>
> -JZ
>
>
> On Feb 7, 2016, at 8:21 PM, Scott Blum <dr...@gmail.com> wrote:
>
> Can you describe the change then? Because kill session doesn't seem to now
> ensure that ephemeral nodes bound the the killed session disappear in a
> timely manner
> On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com>
> wrote:
>
>> Are we using a new zookeeper?
>>
>>
>> In Curator 3.0 with “new” connection state handling which simulates a
>> Session timeout. However, all tests are run twice. First with the old
>> handling and then with the new.
>>
>> Or did something change with our implementation of KillSession.kill()?
>>
>>
>> KillSession did change though. A change I made to ZK got added in 3.5 and
>> we now use that.
>>
>> -Jordan
>>
>> On Feb 7, 2016, at 1:55 PM, Scott Blum <dr...@gmail.com> wrote:
>>
>> Are we using a new zookeeper?  Or did something change with our
>> implementation of KillSession.kill()?
>>
>> Or maybe there's a timing issue with Curator's ConnectionStateManager
>> State change: LOST?  I don't understand how we could get a LOST event
>> without the ephemeral node attached to that session having disappeared?
>>
>> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman <
>> jordan@jordanzimmerman.com> wrote:
>>
>>> I don’t know if anything changed in ZooKeeper itself. I know that the
>>> connection states changed in Curator, but Curator now tests both the old
>>> mode and the new mode and they both fail here.
>>>
>>> -JZ
>>>
>>> On Feb 7, 2016, at 1:06 AM, Scott Blum <dr...@gmail.com> wrote:
>>>
>>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0
>>> branch is that the ephemeral node /test/me created in
>>> testKilledSession() really isn't disappearing when it should.
>>>
>>> After the session loss and the reconnect, /test still shows 2 children
>>> [foo, me] and /test/me still returns a node.
>>>
>>> Any idea why the timing here would have changed?
>>>
>>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <dr...@gmail.com>
>>> wrote:
>>>
>>>> https://issues.apache.org/jira/browse/CURATOR-302
>>>>
>>>> I need to trace through what's really going on under the hood rather
>>>> than band-aid the test.  Should be able to in next couple of days.
>>>>
>>>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <
>>>> jordan@jordanzimmerman.com> wrote:
>>>>
>>>>> Any update on this? I think you should create a Jira for it.
>>>>>
>>>>> -Jordan
>>>>>
>>>>>
>>>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <dr...@gmail.com> wrote:
>>>>>
>>>>> BTW, this test passes on master... so it's some kind of 3.0 vs. master
>>>>> issue.  I think I'm going to just have to dump in a ton of log messages and
>>>>> see what differs.
>>>>>
>>>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <
>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>
>>>>>> OK - please create a new Issue in Jira for this.
>>>>>>
>>>>>> -Jordan
>>>>>>
>>>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dr...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have
>>>>>> been broken for a while.  Maybe I'll have to git bisect...
>>>>>>
>>>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dr...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I
>>>>>>> think there is a legit bug/race in TreeCache, and the following patch
>>>>>>> *should* remedy:
>>>>>>>
>>>>>>> diff --git
>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>> index df4403c..a4a022b 100644
>>>>>>> ---
>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>> +++
>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>>>>>          void wasDeleted() throws Exception
>>>>>>>          {
>>>>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>>>>> -
>>>>>>>  client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>>>>              ConcurrentMap<String, TreeNode> childMap =
>>>>>>> children.getAndSet(null);
>>>>>>>              if ( childMap != null )
>>>>>>>              {
>>>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>>>>>          case RECONNECTED:
>>>>>>>              try
>>>>>>>              {
>>>>>>> +                outstandingOps.incrementAndGet();
>>>>>>>                  root.wasReconnected();
>>>>>>>
>>>>>>>  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>>>>>> +                {
>>>>>>> +                    if ( isInitialized.compareAndSet(false, true) )
>>>>>>> +                    {
>>>>>>> +
>>>>>>>  publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>> +                    }
>>>>>>> +                }
>>>>>>>              }
>>>>>>>              catch ( Exception e )
>>>>>>>              {
>>>>>>>
>>>>>>> That should guarantee that the initialized event gets deferred until
>>>>>>> all outstanding refreshes finish.. but it's not.  Something seems to have
>>>>>>> changed under the hood in how background events are getting sent to
>>>>>>> TreeCache, and I don't really understand it yet.  And running the debugger
>>>>>>> seems to affect the timing, like something racy is going on. :(
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dr...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok, that is kind of weird.  I'll take a look.
>>>>>>>>
>>>>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <
>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>
>>>>>>>>> No, sorry. The last few lines of the test currently are:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>
>>>>>>>>> This fails. But, if I switch them it works:
>>>>>>>>>
>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>
>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>>>>
>>>>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dr...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> So you end up with 2 initialized events?
>>>>>>>>>
>>>>>>>>> You mean this?
>>>>>>>>>
>>>>>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>>>>> "data".getBytes());
>>>>>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>
>>>>>>>>> Seems weird if there are two, but I can help look.
>>>>>>>>>
>>>>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Scott,
>>>>>>>>>>
>>>>>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>>>>>>>
>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>>>>>> "data".getBytes());
>>>>>>>>>>
>>>>>>>>>> However, if I change the two asserts to:
>>>>>>>>>>
>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>>>>>> "data".getBytes());
>>>>>>>>>>
>>>>>>>>>> it works. Does that make any sense?
>>>>>>>>>>
>>>>>>>>>> -Jordan
>>>>>>>>>>
>>>>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <
>>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> > Devs,
>>>>>>>>>> >
>>>>>>>>>> > In trying to fix the bad log message "Failed to find watcher”
>>>>>>>>>> (which turns out to be a ZK client issue), I realize that the
>>>>>>>>>> NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still
>>>>>>>>>> working on getting all tests to pass but I’d appreciate more sets of eyes
>>>>>>>>>> on this change. Please review carefully if you can.
>>>>>>>>>> >
>>>>>>>>>> > https://github.com/apache/curator/pull/131
>>>>>>>>>> >
>>>>>>>>>> > -Jordan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
FYI - I added a new method to Timing to help with this:
forSessionSleep()
The new version of KillSession inserts an eventOfDeath directly into the client. So, no, it’s not a real session kill like it used to be. But, now it’s more reliable. Use timing.forSessionSleep() to wait for session timeout.

-JZ


> On Feb 7, 2016, at 8:21 PM, Scott Blum <dr...@gmail.com> wrote:
> 
> Can you describe the change then? Because kill session doesn't seem to now ensure that ephemeral nodes bound the the killed session disappear in a timely manner
> 
> On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>> Are we using a new zookeeper?
> 
> 
> In Curator 3.0 with “new” connection state handling which simulates a Session timeout. However, all tests are run twice. First with the old handling and then with the new.
> 
>> Or did something change with our implementation of KillSession.kill()?
> 
> KillSession did change though. A change I made to ZK got added in 3.5 and we now use that.
> 
> -Jordan
> 
>> On Feb 7, 2016, at 1:55 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Are we using a new zookeeper?  Or did something change with our implementation of KillSession.kill()?
>> 
>> Or maybe there's a timing issue with Curator's ConnectionStateManager State change: LOST?  I don't understand how we could get a LOST event without the ephemeral node attached to that session having disappeared?
>> 
>> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>> I don’t know if anything changed in ZooKeeper itself. I know that the connection states changed in Curator, but Curator now tests both the old mode and the new mode and they both fail here.
>> 
>> -JZ
>> 
>>> On Feb 7, 2016, at 1:06 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0 branch is that the ephemeral node /test/me created in testKilledSession() really isn't disappearing when it should.
>>> 
>>> After the session loss and the reconnect, /test still shows 2 children [foo, me] and /test/me still returns a node.
>>> 
>>> Any idea why the timing here would have changed?
>>> 
>>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>> https://issues.apache.org/jira/browse/CURATOR-302 <https://issues.apache.org/jira/browse/CURATOR-302>
>>> 
>>> I need to trace through what's really going on under the hood rather than band-aid the test.  Should be able to in next couple of days.
>>> 
>>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>> Any update on this? I think you should create a Jira for it.
>>> 
>>> -Jordan
>>> 
>>> 
>>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> BTW, this test passes on master... so it's some kind of 3.0 vs. master issue.  I think I'm going to just have to dump in a ton of log messages and see what differs.
>>>> 
>>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>> OK - please create a new Issue in Jira for this.
>>>> 
>>>> -Jordan
>>>> 
>>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been broken for a while.  Maybe I'll have to git bisect...
>>>>> 
>>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I think there is a legit bug/race in TreeCache, and the following patch *should* remedy:
>>>>> 
>>>>> diff --git a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>> index df4403c..a4a022b 100644
>>>>> --- a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>> +++ b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>>>          void wasDeleted() throws Exception
>>>>>          {
>>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>>> -            client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>>              ConcurrentMap<String, TreeNode> childMap = children.getAndSet(null);
>>>>>              if ( childMap != null )
>>>>>              {
>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>>>          case RECONNECTED:
>>>>>              try
>>>>>              {
>>>>> +                outstandingOps.incrementAndGet();
>>>>>                  root.wasReconnected();
>>>>>                  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>>>> +                {
>>>>> +                    if ( isInitialized.compareAndSet(false, true) )
>>>>> +                    {
>>>>> +                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>> +                    }
>>>>> +                }
>>>>>              }
>>>>>              catch ( Exception e )
>>>>>              {
>>>>> 
>>>>> That should guarantee that the initialized event gets deferred until all outstanding refreshes finish.. but it's not.  Something seems to have changed under the hood in how background events are getting sent to TreeCache, and I don't really understand it yet.  And running the debugger seems to affect the timing, like something racy is going on. :(
>>>>> 
>>>>> 
>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>>> Ok, that is kind of weird.  I'll take a look.
>>>>> 
>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>>> No, sorry. The last few lines of the test currently are:
>>>>> 
>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>> 
>>>>> This fails. But, if I switch them it works:
>>>>> 
>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>> 
>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> So you end up with 2 initialized events?
>>>>>> 
>>>>>> You mean this?
>>>>>> 
>>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>> 
>>>>>> Seems weird if there are two, but I can help look.
>>>>>> 
>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>>>> Hey Scott,
>>>>>> 
>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>>> 
>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>> 
>>>>>> However, if I change the two asserts to:
>>>>>> 
>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>> 
>>>>>> it works. Does that make any sense?
>>>>>> 
>>>>>> -Jordan
>>>>>> 
>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>>>> >
>>>>>> > Devs,
>>>>>> >
>>>>>> > In trying to fix the bad log message "Failed to find watcher” (which turns out to be a ZK client issue), I realize that the NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still working on getting all tests to pass but I’d appreciate more sets of eyes on this change. Please review carefully if you can.
>>>>>> >
>>>>>> > https://github.com/apache/curator/pull/131 <https://github.com/apache/curator/pull/131>
>>>>>> >
>>>>>> > -Jordan
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 


Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Scott Blum <dr...@gmail.com>.
Can you describe the change then? Because kill session doesn't seem to now
ensure that ephemeral nodes bound the the killed session disappear in a
timely manner
On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <jo...@jordanzimmerman.com>
wrote:

> Are we using a new zookeeper?
>
>
> In Curator 3.0 with “new” connection state handling which simulates a
> Session timeout. However, all tests are run twice. First with the old
> handling and then with the new.
>
> Or did something change with our implementation of KillSession.kill()?
>
>
> KillSession did change though. A change I made to ZK got added in 3.5 and
> we now use that.
>
> -Jordan
>
> On Feb 7, 2016, at 1:55 PM, Scott Blum <dr...@gmail.com> wrote:
>
> Are we using a new zookeeper?  Or did something change with our
> implementation of KillSession.kill()?
>
> Or maybe there's a timing issue with Curator's ConnectionStateManager
> State change: LOST?  I don't understand how we could get a LOST event
> without the ephemeral node attached to that session having disappeared?
>
> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman <
> jordan@jordanzimmerman.com> wrote:
>
>> I don’t know if anything changed in ZooKeeper itself. I know that the
>> connection states changed in Curator, but Curator now tests both the old
>> mode and the new mode and they both fail here.
>>
>> -JZ
>>
>> On Feb 7, 2016, at 1:06 AM, Scott Blum <dr...@gmail.com> wrote:
>>
>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0
>> branch is that the ephemeral node /test/me created in testKilledSession() really
>> isn't disappearing when it should.
>>
>> After the session loss and the reconnect, /test still shows 2 children
>> [foo, me] and /test/me still returns a node.
>>
>> Any idea why the timing here would have changed?
>>
>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <dr...@gmail.com> wrote:
>>
>>> https://issues.apache.org/jira/browse/CURATOR-302
>>>
>>> I need to trace through what's really going on under the hood rather
>>> than band-aid the test.  Should be able to in next couple of days.
>>>
>>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <
>>> jordan@jordanzimmerman.com> wrote:
>>>
>>>> Any update on this? I think you should create a Jira for it.
>>>>
>>>> -Jordan
>>>>
>>>>
>>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <dr...@gmail.com> wrote:
>>>>
>>>> BTW, this test passes on master... so it's some kind of 3.0 vs. master
>>>> issue.  I think I'm going to just have to dump in a ton of log messages and
>>>> see what differs.
>>>>
>>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <
>>>> jordan@jordanzimmerman.com> wrote:
>>>>
>>>>> OK - please create a new Issue in Jira for this.
>>>>>
>>>>> -Jordan
>>>>>
>>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dr...@gmail.com> wrote:
>>>>>
>>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been
>>>>> broken for a while.  Maybe I'll have to git bisect...
>>>>>
>>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I
>>>>>> think there is a legit bug/race in TreeCache, and the following patch
>>>>>> *should* remedy:
>>>>>>
>>>>>> diff --git
>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>> index df4403c..a4a022b 100644
>>>>>> ---
>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>> +++
>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>>>>          void wasDeleted() throws Exception
>>>>>>          {
>>>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>>>> -
>>>>>>  client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>>>              ConcurrentMap<String, TreeNode> childMap =
>>>>>> children.getAndSet(null);
>>>>>>              if ( childMap != null )
>>>>>>              {
>>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>>>>          case RECONNECTED:
>>>>>>              try
>>>>>>              {
>>>>>> +                outstandingOps.incrementAndGet();
>>>>>>                  root.wasReconnected();
>>>>>>
>>>>>>  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>>>>> +                {
>>>>>> +                    if ( isInitialized.compareAndSet(false, true) )
>>>>>> +                    {
>>>>>> +
>>>>>>  publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>> +                    }
>>>>>> +                }
>>>>>>              }
>>>>>>              catch ( Exception e )
>>>>>>              {
>>>>>>
>>>>>> That should guarantee that the initialized event gets deferred until
>>>>>> all outstanding refreshes finish.. but it's not.  Something seems to have
>>>>>> changed under the hood in how background events are getting sent to
>>>>>> TreeCache, and I don't really understand it yet.  And running the debugger
>>>>>> seems to affect the timing, like something racy is going on. :(
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dr...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Ok, that is kind of weird.  I'll take a look.
>>>>>>>
>>>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <
>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>
>>>>>>>> No, sorry. The last few lines of the test currently are:
>>>>>>>>
>>>>>>>>
>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>
>>>>>>>> This fails. But, if I switch them it works:
>>>>>>>>
>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>
>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>>>
>>>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dr...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> So you end up with 2 initialized events?
>>>>>>>>
>>>>>>>> You mean this?
>>>>>>>>
>>>>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>>>> "data".getBytes());
>>>>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>
>>>>>>>> Seems weird if there are two, but I can help look.
>>>>>>>>
>>>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>
>>>>>>>>> Hey Scott,
>>>>>>>>>
>>>>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>>>>>>
>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>>>>> "data".getBytes());
>>>>>>>>>
>>>>>>>>> However, if I change the two asserts to:
>>>>>>>>>
>>>>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>>>>> "data".getBytes());
>>>>>>>>>
>>>>>>>>> it works. Does that make any sense?
>>>>>>>>>
>>>>>>>>> -Jordan
>>>>>>>>>
>>>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <
>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Devs,
>>>>>>>>> >
>>>>>>>>> > In trying to fix the bad log message "Failed to find watcher”
>>>>>>>>> (which turns out to be a ZK client issue), I realize that the
>>>>>>>>> NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still
>>>>>>>>> working on getting all tests to pass but I’d appreciate more sets of eyes
>>>>>>>>> on this change. Please review carefully if you can.
>>>>>>>>> >
>>>>>>>>> > https://github.com/apache/curator/pull/131
>>>>>>>>> >
>>>>>>>>> > -Jordan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
> Are we using a new zookeeper?


In Curator 3.0 with “new” connection state handling which simulates a Session timeout. However, all tests are run twice. First with the old handling and then with the new.

> Or did something change with our implementation of KillSession.kill()?

KillSession did change though. A change I made to ZK got added in 3.5 and we now use that.

-Jordan

> On Feb 7, 2016, at 1:55 PM, Scott Blum <dr...@gmail.com> wrote:
> 
> Are we using a new zookeeper?  Or did something change with our implementation of KillSession.kill()?
> 
> Or maybe there's a timing issue with Curator's ConnectionStateManager State change: LOST?  I don't understand how we could get a LOST event without the ephemeral node attached to that session having disappeared?
> 
> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
> I don’t know if anything changed in ZooKeeper itself. I know that the connection states changed in Curator, but Curator now tests both the old mode and the new mode and they both fail here.
> 
> -JZ
> 
>> On Feb 7, 2016, at 1:06 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>> 
>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0 branch is that the ephemeral node /test/me created in testKilledSession() really isn't disappearing when it should.
>> 
>> After the session loss and the reconnect, /test still shows 2 children [foo, me] and /test/me still returns a node.
>> 
>> Any idea why the timing here would have changed?
>> 
>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>> https://issues.apache.org/jira/browse/CURATOR-302 <https://issues.apache.org/jira/browse/CURATOR-302>
>> 
>> I need to trace through what's really going on under the hood rather than band-aid the test.  Should be able to in next couple of days.
>> 
>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>> Any update on this? I think you should create a Jira for it.
>> 
>> -Jordan
>> 
>> 
>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> BTW, this test passes on master... so it's some kind of 3.0 vs. master issue.  I think I'm going to just have to dump in a ton of log messages and see what differs.
>>> 
>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>> OK - please create a new Issue in Jira for this.
>>> 
>>> -Jordan
>>> 
>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been broken for a while.  Maybe I'll have to git bisect...
>>>> 
>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I think there is a legit bug/race in TreeCache, and the following patch *should* remedy:
>>>> 
>>>> diff --git a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>> index df4403c..a4a022b 100644
>>>> --- a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>> +++ b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>>          void wasDeleted() throws Exception
>>>>          {
>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>> -            client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>              ConcurrentMap<String, TreeNode> childMap = children.getAndSet(null);
>>>>              if ( childMap != null )
>>>>              {
>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>>          case RECONNECTED:
>>>>              try
>>>>              {
>>>> +                outstandingOps.incrementAndGet();
>>>>                  root.wasReconnected();
>>>>                  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>>> +                {
>>>> +                    if ( isInitialized.compareAndSet(false, true) )
>>>> +                    {
>>>> +                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>> +                    }
>>>> +                }
>>>>              }
>>>>              catch ( Exception e )
>>>>              {
>>>> 
>>>> That should guarantee that the initialized event gets deferred until all outstanding refreshes finish.. but it's not.  Something seems to have changed under the hood in how background events are getting sent to TreeCache, and I don't really understand it yet.  And running the debugger seems to affect the timing, like something racy is going on. :(
>>>> 
>>>> 
>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>> Ok, that is kind of weird.  I'll take a look.
>>>> 
>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>> No, sorry. The last few lines of the test currently are:
>>>> 
>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>> 
>>>> This fails. But, if I switch them it works:
>>>> 
>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>> 
>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> So you end up with 2 initialized events?
>>>>> 
>>>>> You mean this?
>>>>> 
>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>> 
>>>>> Seems weird if there are two, but I can help look.
>>>>> 
>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>>> Hey Scott,
>>>>> 
>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>> 
>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>> 
>>>>> However, if I change the two asserts to:
>>>>> 
>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>> 
>>>>> it works. Does that make any sense?
>>>>> 
>>>>> -Jordan
>>>>> 
>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>>> >
>>>>> > Devs,
>>>>> >
>>>>> > In trying to fix the bad log message "Failed to find watcher” (which turns out to be a ZK client issue), I realize that the NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still working on getting all tests to pass but I’d appreciate more sets of eyes on this change. Please review carefully if you can.
>>>>> >
>>>>> > https://github.com/apache/curator/pull/131 <https://github.com/apache/curator/pull/131>
>>>>> >
>>>>> > -Jordan
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 


Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Scott Blum <dr...@gmail.com>.
Are we using a new zookeeper?  Or did something change with our
implementation of KillSession.kill()?

Or maybe there's a timing issue with Curator's ConnectionStateManager State
change: LOST?  I don't understand how we could get a LOST event without the
ephemeral node attached to that session having disappeared?

On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman <
jordan@jordanzimmerman.com> wrote:

> I don’t know if anything changed in ZooKeeper itself. I know that the
> connection states changed in Curator, but Curator now tests both the old
> mode and the new mode and they both fail here.
>
> -JZ
>
> On Feb 7, 2016, at 1:06 AM, Scott Blum <dr...@gmail.com> wrote:
>
> I need to analyze this a bit deeper, but what I'm seeing on the 3.0 branch
> is that the ephemeral node /test/me created in testKilledSession() really
> isn't disappearing when it should.
>
> After the session loss and the reconnect, /test still shows 2 children
> [foo, me] and /test/me still returns a node.
>
> Any idea why the timing here would have changed?
>
> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <dr...@gmail.com> wrote:
>
>> https://issues.apache.org/jira/browse/CURATOR-302
>>
>> I need to trace through what's really going on under the hood rather than
>> band-aid the test.  Should be able to in next couple of days.
>>
>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <
>> jordan@jordanzimmerman.com> wrote:
>>
>>> Any update on this? I think you should create a Jira for it.
>>>
>>> -Jordan
>>>
>>>
>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <dr...@gmail.com> wrote:
>>>
>>> BTW, this test passes on master... so it's some kind of 3.0 vs. master
>>> issue.  I think I'm going to just have to dump in a ton of log messages and
>>> see what differs.
>>>
>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <
>>> jordan@jordanzimmerman.com> wrote:
>>>
>>>> OK - please create a new Issue in Jira for this.
>>>>
>>>> -Jordan
>>>>
>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dr...@gmail.com> wrote:
>>>>
>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been
>>>> broken for a while.  Maybe I'll have to git bisect...
>>>>
>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dr...@gmail.com>
>>>> wrote:
>>>>
>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I
>>>>> think there is a legit bug/race in TreeCache, and the following patch
>>>>> *should* remedy:
>>>>>
>>>>> diff --git
>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>> index df4403c..a4a022b 100644
>>>>> ---
>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>> +++
>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>>>          void wasDeleted() throws Exception
>>>>>          {
>>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>>> -
>>>>>  client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>>              ConcurrentMap<String, TreeNode> childMap =
>>>>> children.getAndSet(null);
>>>>>              if ( childMap != null )
>>>>>              {
>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>>>          case RECONNECTED:
>>>>>              try
>>>>>              {
>>>>> +                outstandingOps.incrementAndGet();
>>>>>                  root.wasReconnected();
>>>>>
>>>>>  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>>>> +                {
>>>>> +                    if ( isInitialized.compareAndSet(false, true) )
>>>>> +                    {
>>>>> +                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>> +                    }
>>>>> +                }
>>>>>              }
>>>>>              catch ( Exception e )
>>>>>              {
>>>>>
>>>>> That should guarantee that the initialized event gets deferred until
>>>>> all outstanding refreshes finish.. but it's not.  Something seems to have
>>>>> changed under the hood in how background events are getting sent to
>>>>> TreeCache, and I don't really understand it yet.  And running the debugger
>>>>> seems to affect the timing, like something racy is going on. :(
>>>>>
>>>>>
>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Ok, that is kind of weird.  I'll take a look.
>>>>>>
>>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <
>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>
>>>>>>> No, sorry. The last few lines of the test currently are:
>>>>>>>
>>>>>>>
>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>
>>>>>>> This fails. But, if I switch them it works:
>>>>>>>
>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>
>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>>
>>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dr...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> So you end up with 2 initialized events?
>>>>>>>
>>>>>>> You mean this?
>>>>>>>
>>>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>>> "data".getBytes());
>>>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>
>>>>>>> Seems weird if there are two, but I can help look.
>>>>>>>
>>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>
>>>>>>>> Hey Scott,
>>>>>>>>
>>>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>>>>>
>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>>>> "data".getBytes());
>>>>>>>>
>>>>>>>> However, if I change the two asserts to:
>>>>>>>>
>>>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>>>> "data".getBytes());
>>>>>>>>
>>>>>>>> it works. Does that make any sense?
>>>>>>>>
>>>>>>>> -Jordan
>>>>>>>>
>>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <
>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>> >
>>>>>>>> > Devs,
>>>>>>>> >
>>>>>>>> > In trying to fix the bad log message "Failed to find watcher”
>>>>>>>> (which turns out to be a ZK client issue), I realize that the
>>>>>>>> NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still
>>>>>>>> working on getting all tests to pass but I’d appreciate more sets of eyes
>>>>>>>> on this change. Please review carefully if you can.
>>>>>>>> >
>>>>>>>> > https://github.com/apache/curator/pull/131
>>>>>>>> >
>>>>>>>> > -Jordan
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
I don’t know if anything changed in ZooKeeper itself. I know that the connection states changed in Curator, but Curator now tests both the old mode and the new mode and they both fail here.

-JZ

> On Feb 7, 2016, at 1:06 AM, Scott Blum <dr...@gmail.com> wrote:
> 
> I need to analyze this a bit deeper, but what I'm seeing on the 3.0 branch is that the ephemeral node /test/me created in testKilledSession() really isn't disappearing when it should.
> 
> After the session loss and the reconnect, /test still shows 2 children [foo, me] and /test/me still returns a node.
> 
> Any idea why the timing here would have changed?
> 
> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
> https://issues.apache.org/jira/browse/CURATOR-302 <https://issues.apache.org/jira/browse/CURATOR-302>
> 
> I need to trace through what's really going on under the hood rather than band-aid the test.  Should be able to in next couple of days.
> 
> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
> Any update on this? I think you should create a Jira for it.
> 
> -Jordan
> 
> 
>> On Feb 5, 2016, at 12:30 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>> 
>> BTW, this test passes on master... so it's some kind of 3.0 vs. master issue.  I think I'm going to just have to dump in a ton of log messages and see what differs.
>> 
>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>> OK - please create a new Issue in Jira for this.
>> 
>> -Jordan
>> 
>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been broken for a while.  Maybe I'll have to git bisect...
>>> 
>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I think there is a legit bug/race in TreeCache, and the following patch *should* remedy:
>>> 
>>> diff --git a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>> index df4403c..a4a022b 100644
>>> --- a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>> +++ b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>          void wasDeleted() throws Exception
>>>          {
>>>              ChildData oldChildData = childData.getAndSet(null);
>>> -            client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>              ConcurrentMap<String, TreeNode> childMap = children.getAndSet(null);
>>>              if ( childMap != null )
>>>              {
>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>          case RECONNECTED:
>>>              try
>>>              {
>>> +                outstandingOps.incrementAndGet();
>>>                  root.wasReconnected();
>>>                  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>> +                {
>>> +                    if ( isInitialized.compareAndSet(false, true) )
>>> +                    {
>>> +                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>> +                    }
>>> +                }
>>>              }
>>>              catch ( Exception e )
>>>              {
>>> 
>>> That should guarantee that the initialized event gets deferred until all outstanding refreshes finish.. but it's not.  Something seems to have changed under the hood in how background events are getting sent to TreeCache, and I don't really understand it yet.  And running the debugger seems to affect the timing, like something racy is going on. :(
>>> 
>>> 
>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>> Ok, that is kind of weird.  I'll take a look.
>>> 
>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>> No, sorry. The last few lines of the test currently are:
>>> 
>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>> 
>>> This fails. But, if I switch them it works:
>>> 
>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>> 
>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> So you end up with 2 initialized events?
>>>> 
>>>> You mean this?
>>>> 
>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>> 
>>>> Seems weird if there are two, but I can help look.
>>>> 
>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>> Hey Scott,
>>>> 
>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>> 
>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>> 
>>>> However, if I change the two asserts to:
>>>> 
>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>> 
>>>> it works. Does that make any sense?
>>>> 
>>>> -Jordan
>>>> 
>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>>> >
>>>> > Devs,
>>>> >
>>>> > In trying to fix the bad log message "Failed to find watcher” (which turns out to be a ZK client issue), I realize that the NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still working on getting all tests to pass but I’d appreciate more sets of eyes on this change. Please review carefully if you can.
>>>> >
>>>> > https://github.com/apache/curator/pull/131 <https://github.com/apache/curator/pull/131>
>>>> >
>>>> > -Jordan
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 


Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Scott Blum <dr...@gmail.com>.
I need to analyze this a bit deeper, but what I'm seeing on the 3.0 branch
is that the ephemeral node /test/me created in testKilledSession() really
isn't disappearing when it should.

After the session loss and the reconnect, /test still shows 2 children
[foo, me] and /test/me still returns a node.

Any idea why the timing here would have changed?

On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <dr...@gmail.com> wrote:

> https://issues.apache.org/jira/browse/CURATOR-302
>
> I need to trace through what's really going on under the hood rather than
> band-aid the test.  Should be able to in next couple of days.
>
> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <
> jordan@jordanzimmerman.com> wrote:
>
>> Any update on this? I think you should create a Jira for it.
>>
>> -Jordan
>>
>>
>> On Feb 5, 2016, at 12:30 PM, Scott Blum <dr...@gmail.com> wrote:
>>
>> BTW, this test passes on master... so it's some kind of 3.0 vs. master
>> issue.  I think I'm going to just have to dump in a ton of log messages and
>> see what differs.
>>
>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <
>> jordan@jordanzimmerman.com> wrote:
>>
>>> OK - please create a new Issue in Jira for this.
>>>
>>> -Jordan
>>>
>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dr...@gmail.com> wrote:
>>>
>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been
>>> broken for a while.  Maybe I'll have to git bisect...
>>>
>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dr...@gmail.com>
>>> wrote:
>>>
>>>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I
>>>> think there is a legit bug/race in TreeCache, and the following patch
>>>> *should* remedy:
>>>>
>>>> diff --git
>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>> index df4403c..a4a022b 100644
>>>> ---
>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>> +++
>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>>          void wasDeleted() throws Exception
>>>>          {
>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>> -
>>>>  client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>              ConcurrentMap<String, TreeNode> childMap =
>>>> children.getAndSet(null);
>>>>              if ( childMap != null )
>>>>              {
>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>>          case RECONNECTED:
>>>>              try
>>>>              {
>>>> +                outstandingOps.incrementAndGet();
>>>>                  root.wasReconnected();
>>>>
>>>>  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>>> +                {
>>>> +                    if ( isInitialized.compareAndSet(false, true) )
>>>> +                    {
>>>> +                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>> +                    }
>>>> +                }
>>>>              }
>>>>              catch ( Exception e )
>>>>              {
>>>>
>>>> That should guarantee that the initialized event gets deferred until
>>>> all outstanding refreshes finish.. but it's not.  Something seems to have
>>>> changed under the hood in how background events are getting sent to
>>>> TreeCache, and I don't really understand it yet.  And running the debugger
>>>> seems to affect the timing, like something racy is going on. :(
>>>>
>>>>
>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dr...@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok, that is kind of weird.  I'll take a look.
>>>>>
>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <
>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>
>>>>>> No, sorry. The last few lines of the test currently are:
>>>>>>
>>>>>>
>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>
>>>>>> This fails. But, if I switch them it works:
>>>>>>
>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>
>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>>
>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dr...@gmail.com> wrote:
>>>>>>
>>>>>> So you end up with 2 initialized events?
>>>>>>
>>>>>> You mean this?
>>>>>>
>>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>> "data".getBytes());
>>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>
>>>>>> Seems weird if there are two, but I can help look.
>>>>>>
>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>
>>>>>>> Hey Scott,
>>>>>>>
>>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>>>>
>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>>> "data".getBytes());
>>>>>>>
>>>>>>> However, if I change the two asserts to:
>>>>>>>
>>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>>> "data".getBytes());
>>>>>>>
>>>>>>> it works. Does that make any sense?
>>>>>>>
>>>>>>> -Jordan
>>>>>>>
>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <
>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>> >
>>>>>>> > Devs,
>>>>>>> >
>>>>>>> > In trying to fix the bad log message "Failed to find watcher”
>>>>>>> (which turns out to be a ZK client issue), I realize that the
>>>>>>> NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still
>>>>>>> working on getting all tests to pass but I’d appreciate more sets of eyes
>>>>>>> on this change. Please review carefully if you can.
>>>>>>> >
>>>>>>> > https://github.com/apache/curator/pull/131
>>>>>>> >
>>>>>>> > -Jordan
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Scott Blum <dr...@gmail.com>.
https://issues.apache.org/jira/browse/CURATOR-302

I need to trace through what's really going on under the hood rather than
band-aid the test.  Should be able to in next couple of days.

On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <jordan@jordanzimmerman.com
> wrote:

> Any update on this? I think you should create a Jira for it.
>
> -Jordan
>
>
> On Feb 5, 2016, at 12:30 PM, Scott Blum <dr...@gmail.com> wrote:
>
> BTW, this test passes on master... so it's some kind of 3.0 vs. master
> issue.  I think I'm going to just have to dump in a ton of log messages and
> see what differs.
>
> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <
> jordan@jordanzimmerman.com> wrote:
>
>> OK - please create a new Issue in Jira for this.
>>
>> -Jordan
>>
>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dr...@gmail.com> wrote:
>>
>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been
>> broken for a while.  Maybe I'll have to git bisect...
>>
>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dr...@gmail.com>
>> wrote:
>>
>>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I
>>> think there is a legit bug/race in TreeCache, and the following patch
>>> *should* remedy:
>>>
>>> diff --git
>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>> index df4403c..a4a022b 100644
>>> ---
>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>> +++
>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>          void wasDeleted() throws Exception
>>>          {
>>>              ChildData oldChildData = childData.getAndSet(null);
>>> -
>>>  client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>              ConcurrentMap<String, TreeNode> childMap =
>>> children.getAndSet(null);
>>>              if ( childMap != null )
>>>              {
>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>          case RECONNECTED:
>>>              try
>>>              {
>>> +                outstandingOps.incrementAndGet();
>>>                  root.wasReconnected();
>>>
>>>  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>> +                {
>>> +                    if ( isInitialized.compareAndSet(false, true) )
>>> +                    {
>>> +                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>> +                    }
>>> +                }
>>>              }
>>>              catch ( Exception e )
>>>              {
>>>
>>> That should guarantee that the initialized event gets deferred until all
>>> outstanding refreshes finish.. but it's not.  Something seems to have
>>> changed under the hood in how background events are getting sent to
>>> TreeCache, and I don't really understand it yet.  And running the debugger
>>> seems to affect the timing, like something racy is going on. :(
>>>
>>>
>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dr...@gmail.com>
>>> wrote:
>>>
>>>> Ok, that is kind of weird.  I'll take a look.
>>>>
>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <
>>>> jordan@jordanzimmerman.com> wrote:
>>>>
>>>>> No, sorry. The last few lines of the test currently are:
>>>>>
>>>>>
>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>
>>>>> This fails. But, if I switch them it works:
>>>>>
>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>
>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>>
>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dr...@gmail.com> wrote:
>>>>>
>>>>> So you end up with 2 initialized events?
>>>>>
>>>>> You mean this?
>>>>>
>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>> "data".getBytes());
>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>
>>>>> Seems weird if there are two, but I can help look.
>>>>>
>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>
>>>>>> Hey Scott,
>>>>>>
>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>>>
>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>> "data".getBytes());
>>>>>>
>>>>>> However, if I change the two asserts to:
>>>>>>
>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>>> "data".getBytes());
>>>>>>
>>>>>> it works. Does that make any sense?
>>>>>>
>>>>>> -Jordan
>>>>>>
>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <
>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>> >
>>>>>> > Devs,
>>>>>> >
>>>>>> > In trying to fix the bad log message "Failed to find watcher”
>>>>>> (which turns out to be a ZK client issue), I realize that the
>>>>>> NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still
>>>>>> working on getting all tests to pass but I’d appreciate more sets of eyes
>>>>>> on this change. Please review carefully if you can.
>>>>>> >
>>>>>> > https://github.com/apache/curator/pull/131
>>>>>> >
>>>>>> > -Jordan
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
Any update on this? I think you should create a Jira for it.

-Jordan

> On Feb 5, 2016, at 12:30 PM, Scott Blum <dr...@gmail.com> wrote:
> 
> BTW, this test passes on master... so it's some kind of 3.0 vs. master issue.  I think I'm going to just have to dump in a ton of log messages and see what differs.
> 
> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
> OK - please create a new Issue in Jira for this.
> 
> -Jordan
> 
>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>> 
>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been broken for a while.  Maybe I'll have to git bisect...
>> 
>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I think there is a legit bug/race in TreeCache, and the following patch *should* remedy:
>> 
>> diff --git a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>> index df4403c..a4a022b 100644
>> --- a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>> +++ b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>          void wasDeleted() throws Exception
>>          {
>>              ChildData oldChildData = childData.getAndSet(null);
>> -            client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>              ConcurrentMap<String, TreeNode> childMap = children.getAndSet(null);
>>              if ( childMap != null )
>>              {
>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>          case RECONNECTED:
>>              try
>>              {
>> +                outstandingOps.incrementAndGet();
>>                  root.wasReconnected();
>>                  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>> +                if ( outstandingOps.decrementAndGet() == 0 )
>> +                {
>> +                    if ( isInitialized.compareAndSet(false, true) )
>> +                    {
>> +                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
>> +                    }
>> +                }
>>              }
>>              catch ( Exception e )
>>              {
>> 
>> That should guarantee that the initialized event gets deferred until all outstanding refreshes finish.. but it's not.  Something seems to have changed under the hood in how background events are getting sent to TreeCache, and I don't really understand it yet.  And running the debugger seems to affect the timing, like something racy is going on. :(
>> 
>> 
>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>> Ok, that is kind of weird.  I'll take a look.
>> 
>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>> No, sorry. The last few lines of the test currently are:
>> 
>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>> 
>> This fails. But, if I switch them it works:
>> 
>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>> 
>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> So you end up with 2 initialized events?
>>> 
>>> You mean this?
>>> 
>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>> 
>>> Seems weird if there are two, but I can help look.
>>> 
>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>> Hey Scott,
>>> 
>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>> 
>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>> 
>>> However, if I change the two asserts to:
>>> 
>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>> 
>>> it works. Does that make any sense?
>>> 
>>> -Jordan
>>> 
>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>>> >
>>> > Devs,
>>> >
>>> > In trying to fix the bad log message "Failed to find watcher” (which turns out to be a ZK client issue), I realize that the NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still working on getting all tests to pass but I’d appreciate more sets of eyes on this change. Please review carefully if you can.
>>> >
>>> > https://github.com/apache/curator/pull/131 <https://github.com/apache/curator/pull/131>
>>> >
>>> > -Jordan
>>> 
>>> 
>> 
>> 
>> 
>> 
> 
> 


Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Scott Blum <dr...@gmail.com>.
BTW, this test passes on master... so it's some kind of 3.0 vs. master
issue.  I think I'm going to just have to dump in a ton of log messages and
see what differs.

On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <
jordan@jordanzimmerman.com> wrote:

> OK - please create a new Issue in Jira for this.
>
> -Jordan
>
> On Feb 5, 2016, at 12:24 PM, Scott Blum <dr...@gmail.com> wrote:
>
> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been
> broken for a while.  Maybe I'll have to git bisect...
>
> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dr...@gmail.com> wrote:
>
>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I think
>> there is a legit bug/race in TreeCache, and the following patch *should*
>> remedy:
>>
>> diff --git
>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>> index df4403c..a4a022b 100644
>> ---
>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>> +++
>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>          void wasDeleted() throws Exception
>>          {
>>              ChildData oldChildData = childData.getAndSet(null);
>> -
>>  client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>              ConcurrentMap<String, TreeNode> childMap =
>> children.getAndSet(null);
>>              if ( childMap != null )
>>              {
>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>          case RECONNECTED:
>>              try
>>              {
>> +                outstandingOps.incrementAndGet();
>>                  root.wasReconnected();
>>                  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>> +                if ( outstandingOps.decrementAndGet() == 0 )
>> +                {
>> +                    if ( isInitialized.compareAndSet(false, true) )
>> +                    {
>> +                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
>> +                    }
>> +                }
>>              }
>>              catch ( Exception e )
>>              {
>>
>> That should guarantee that the initialized event gets deferred until all
>> outstanding refreshes finish.. but it's not.  Something seems to have
>> changed under the hood in how background events are getting sent to
>> TreeCache, and I don't really understand it yet.  And running the debugger
>> seems to affect the timing, like something racy is going on. :(
>>
>>
>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dr...@gmail.com>
>> wrote:
>>
>>> Ok, that is kind of weird.  I'll take a look.
>>>
>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <
>>> jordan@jordanzimmerman.com> wrote:
>>>
>>>> No, sorry. The last few lines of the test currently are:
>>>>
>>>>
>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>
>>>> This fails. But, if I switch them it works:
>>>>
>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>
>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>
>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dr...@gmail.com> wrote:
>>>>
>>>> So you end up with 2 initialized events?
>>>>
>>>> You mean this?
>>>>
>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>> "data".getBytes());
>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>
>>>> Seems weird if there are two, but I can help look.
>>>>
>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
>>>> jordan@jordanzimmerman.com> wrote:
>>>>
>>>>> Hey Scott,
>>>>>
>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>>
>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>> "data".getBytes());
>>>>>
>>>>> However, if I change the two asserts to:
>>>>>
>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>>> "data".getBytes());
>>>>>
>>>>> it works. Does that make any sense?
>>>>>
>>>>> -Jordan
>>>>>
>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <
>>>>> jordan@jordanzimmerman.com> wrote:
>>>>> >
>>>>> > Devs,
>>>>> >
>>>>> > In trying to fix the bad log message "Failed to find watcher” (which
>>>>> turns out to be a ZK client issue), I realize that the NamespaceWatcher and
>>>>> WatcherWrapper stuff could be improved. I’m still working on getting all
>>>>> tests to pass but I’d appreciate more sets of eyes on this change. Please
>>>>> review carefully if you can.
>>>>> >
>>>>> > https://github.com/apache/curator/pull/131
>>>>> >
>>>>> > -Jordan
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
OK - please create a new Issue in Jira for this.

-Jordan

> On Feb 5, 2016, at 12:24 PM, Scott Blum <dr...@gmail.com> wrote:
> 
> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been broken for a while.  Maybe I'll have to git bisect...
> 
> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
> Okay, so I looked into this for a bit, and I hit kind of a wall.  I think there is a legit bug/race in TreeCache, and the following patch *should* remedy:
> 
> diff --git a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
> index df4403c..a4a022b 100644
> --- a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
> +++ b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>          void wasDeleted() throws Exception
>          {
>              ChildData oldChildData = childData.getAndSet(null);
> -            client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>              ConcurrentMap<String, TreeNode> childMap = children.getAndSet(null);
>              if ( childMap != null )
>              {
> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>          case RECONNECTED:
>              try
>              {
> +                outstandingOps.incrementAndGet();
>                  root.wasReconnected();
>                  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
> +                if ( outstandingOps.decrementAndGet() == 0 )
> +                {
> +                    if ( isInitialized.compareAndSet(false, true) )
> +                    {
> +                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
> +                    }
> +                }
>              }
>              catch ( Exception e )
>              {
> 
> That should guarantee that the initialized event gets deferred until all outstanding refreshes finish.. but it's not.  Something seems to have changed under the hood in how background events are getting sent to TreeCache, and I don't really understand it yet.  And running the debugger seems to affect the timing, like something racy is going on. :(
> 
> 
> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
> Ok, that is kind of weird.  I'll take a look.
> 
> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
> No, sorry. The last few lines of the test currently are:
> 
> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
> assertEvent(TreeCacheEvent.Type.INITIALIZED);
> 
> This fails. But, if I switch them it works:
> 
> assertEvent(TreeCacheEvent.Type.INITIALIZED);
> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
> 
>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dragonsinth@gmail.com <ma...@gmail.com>> wrote:
>> 
>> So you end up with 2 initialized events?
>> 
>> You mean this?
>> 
>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>> 
>> Seems weird if there are two, but I can help look.
>> 
>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>> Hey Scott,
>> 
>> In this branch, TestTreeCache.testKilledSession() is failing at:
>> 
>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>> 
>> However, if I change the two asserts to:
>> 
>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>> 
>> it works. Does that make any sense?
>> 
>> -Jordan
>> 
>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
>> >
>> > Devs,
>> >
>> > In trying to fix the bad log message "Failed to find watcher” (which turns out to be a ZK client issue), I realize that the NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still working on getting all tests to pass but I’d appreciate more sets of eyes on this change. Please review carefully if you can.
>> >
>> > https://github.com/apache/curator/pull/131 <https://github.com/apache/curator/pull/131>
>> >
>> > -Jordan
>> 
>> 
> 
> 
> 
> 


Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Scott Blum <dr...@gmail.com>.
BTW: this is broken on CURATOR-3.0 as well, so it appears to have been
broken for a while.  Maybe I'll have to git bisect...

On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dr...@gmail.com> wrote:

> Okay, so I looked into this for a bit, and I hit kind of a wall.  I think
> there is a legit bug/race in TreeCache, and the following patch *should*
> remedy:
>
> diff --git
> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
> index df4403c..a4a022b 100644
> ---
> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
> +++
> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>          void wasDeleted() throws Exception
>          {
>              ChildData oldChildData = childData.getAndSet(null);
> -
>  client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>              ConcurrentMap<String, TreeNode> childMap =
> children.getAndSet(null);
>              if ( childMap != null )
>              {
> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>          case RECONNECTED:
>              try
>              {
> +                outstandingOps.incrementAndGet();
>                  root.wasReconnected();
>                  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
> +                if ( outstandingOps.decrementAndGet() == 0 )
> +                {
> +                    if ( isInitialized.compareAndSet(false, true) )
> +                    {
> +                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
> +                    }
> +                }
>              }
>              catch ( Exception e )
>              {
>
> That should guarantee that the initialized event gets deferred until all
> outstanding refreshes finish.. but it's not.  Something seems to have
> changed under the hood in how background events are getting sent to
> TreeCache, and I don't really understand it yet.  And running the debugger
> seems to affect the timing, like something racy is going on. :(
>
>
> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dr...@gmail.com> wrote:
>
>> Ok, that is kind of weird.  I'll take a look.
>>
>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <
>> jordan@jordanzimmerman.com> wrote:
>>
>>> No, sorry. The last few lines of the test currently are:
>>>
>>>
>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>
>>> This fails. But, if I switch them it works:
>>>
>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>
>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>
>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dr...@gmail.com> wrote:
>>>
>>> So you end up with 2 initialized events?
>>>
>>> You mean this?
>>>
>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>> "data".getBytes());
>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>
>>> Seems weird if there are two, but I can help look.
>>>
>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
>>> jordan@jordanzimmerman.com> wrote:
>>>
>>>> Hey Scott,
>>>>
>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>
>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>> "data".getBytes());
>>>>
>>>> However, if I change the two asserts to:
>>>>
>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>>> "data".getBytes());
>>>>
>>>> it works. Does that make any sense?
>>>>
>>>> -Jordan
>>>>
>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <
>>>> jordan@jordanzimmerman.com> wrote:
>>>> >
>>>> > Devs,
>>>> >
>>>> > In trying to fix the bad log message "Failed to find watcher” (which
>>>> turns out to be a ZK client issue), I realize that the NamespaceWatcher and
>>>> WatcherWrapper stuff could be improved. I’m still working on getting all
>>>> tests to pass but I’d appreciate more sets of eyes on this change. Please
>>>> review carefully if you can.
>>>> >
>>>> > https://github.com/apache/curator/pull/131
>>>> >
>>>> > -Jordan
>>>>
>>>>
>>>
>>>
>>
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Scott Blum <dr...@gmail.com>.
Okay, so I looked into this for a bit, and I hit kind of a wall.  I think
there is a legit bug/race in TreeCache, and the following patch *should*
remedy:

diff --git
a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
index df4403c..a4a022b 100644
---
a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
+++
b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
@@ -303,7 +303,6 @@ public class TreeCache implements Closeable
         void wasDeleted() throws Exception
         {
             ChildData oldChildData = childData.getAndSet(null);
-
 client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
             ConcurrentMap<String, TreeNode> childMap =
children.getAndSet(null);
             if ( childMap != null )
             {
@@ -807,8 +806,16 @@ public class TreeCache implements Closeable
         case RECONNECTED:
             try
             {
+                outstandingOps.incrementAndGet();
                 root.wasReconnected();
                 publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
+                if ( outstandingOps.decrementAndGet() == 0 )
+                {
+                    if ( isInitialized.compareAndSet(false, true) )
+                    {
+                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
+                    }
+                }
             }
             catch ( Exception e )
             {

That should guarantee that the initialized event gets deferred until all
outstanding refreshes finish.. but it's not.  Something seems to have
changed under the hood in how background events are getting sent to
TreeCache, and I don't really understand it yet.  And running the debugger
seems to affect the timing, like something racy is going on. :(


On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dr...@gmail.com> wrote:

> Ok, that is kind of weird.  I'll take a look.
>
> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <
> jordan@jordanzimmerman.com> wrote:
>
>> No, sorry. The last few lines of the test currently are:
>>
>>
>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>
>> This fails. But, if I switch them it works:
>>
>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>
>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>
>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dr...@gmail.com> wrote:
>>
>> So you end up with 2 initialized events?
>>
>> You mean this?
>>
>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>> "data".getBytes());
>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>
>> Seems weird if there are two, but I can help look.
>>
>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
>> jordan@jordanzimmerman.com> wrote:
>>
>>> Hey Scott,
>>>
>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>
>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>> "data".getBytes());
>>>
>>> However, if I change the two asserts to:
>>>
>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>>> "data".getBytes());
>>>
>>> it works. Does that make any sense?
>>>
>>> -Jordan
>>>
>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <
>>> jordan@jordanzimmerman.com> wrote:
>>> >
>>> > Devs,
>>> >
>>> > In trying to fix the bad log message "Failed to find watcher” (which
>>> turns out to be a ZK client issue), I realize that the NamespaceWatcher and
>>> WatcherWrapper stuff could be improved. I’m still working on getting all
>>> tests to pass but I’d appreciate more sets of eyes on this change. Please
>>> review carefully if you can.
>>> >
>>> > https://github.com/apache/curator/pull/131
>>> >
>>> > -Jordan
>>>
>>>
>>
>>
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Scott Blum <dr...@gmail.com>.
Ok, that is kind of weird.  I'll take a look.

On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <jordan@jordanzimmerman.com
> wrote:

> No, sorry. The last few lines of the test currently are:
>
>
> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>
> This fails. But, if I switch them it works:
>
> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>
> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>
> On Feb 5, 2016, at 2:57 AM, Scott Blum <dr...@gmail.com> wrote:
>
> So you end up with 2 initialized events?
>
> You mean this?
>
>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
> "data".getBytes());
>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>
> Seems weird if there are two, but I can help look.
>
> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
> jordan@jordanzimmerman.com> wrote:
>
>> Hey Scott,
>>
>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>
>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>> "data".getBytes());
>>
>> However, if I change the two asserts to:
>>
>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
>> "data".getBytes());
>>
>> it works. Does that make any sense?
>>
>> -Jordan
>>
>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <
>> jordan@jordanzimmerman.com> wrote:
>> >
>> > Devs,
>> >
>> > In trying to fix the bad log message "Failed to find watcher” (which
>> turns out to be a ZK client issue), I realize that the NamespaceWatcher and
>> WatcherWrapper stuff could be improved. I’m still working on getting all
>> tests to pass but I’d appreciate more sets of eyes on this change. Please
>> review carefully if you can.
>> >
>> > https://github.com/apache/curator/pull/131
>> >
>> > -Jordan
>>
>>
>
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
No, sorry. The last few lines of the test currently are:

assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
assertEvent(TreeCacheEvent.Type.INITIALIZED);

This fails. But, if I switch them it works:

assertEvent(TreeCacheEvent.Type.INITIALIZED);
assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());

> On Feb 5, 2016, at 2:57 AM, Scott Blum <dr...@gmail.com> wrote:
> 
> So you end up with 2 initialized events?
> 
> You mean this?
> 
>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
> 
> Seems weird if there are two, but I can help look.
> 
> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
> Hey Scott,
> 
> In this branch, TestTreeCache.testKilledSession() is failing at:
> 
>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
> 
> However, if I change the two asserts to:
> 
>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
> 
> it works. Does that make any sense?
> 
> -Jordan
> 
> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <jordan@jordanzimmerman.com <ma...@jordanzimmerman.com>> wrote:
> >
> > Devs,
> >
> > In trying to fix the bad log message "Failed to find watcher” (which turns out to be a ZK client issue), I realize that the NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still working on getting all tests to pass but I’d appreciate more sets of eyes on this change. Please review carefully if you can.
> >
> > https://github.com/apache/curator/pull/131 <https://github.com/apache/curator/pull/131>
> >
> > -Jordan
> 
> 


Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Scott Blum <dr...@gmail.com>.
So you end up with 2 initialized events?

You mean this?

         assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
+        assertEvent(TreeCacheEvent.Type.INITIALIZED);
         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
"data".getBytes());
         assertEvent(TreeCacheEvent.Type.INITIALIZED);

Seems weird if there are two, but I can help look.

On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
jordan@jordanzimmerman.com> wrote:

> Hey Scott,
>
> In this branch, TestTreeCache.testKilledSession() is failing at:
>
>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
> "data".getBytes());
>
> However, if I change the two asserts to:
>
>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
> "data".getBytes());
>
> it works. Does that make any sense?
>
> -Jordan
>
> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <jo...@jordanzimmerman.com>
> wrote:
> >
> > Devs,
> >
> > In trying to fix the bad log message "Failed to find watcher” (which
> turns out to be a ZK client issue), I realize that the NamespaceWatcher and
> WatcherWrapper stuff could be improved. I’m still working on getting all
> tests to pass but I’d appreciate more sets of eyes on this change. Please
> review carefully if you can.
> >
> > https://github.com/apache/curator/pull/131
> >
> > -Jordan
>
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
Hey Scott,

In this branch, TestTreeCache.testKilledSession() is failing at:

	assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());

However, if I change the two asserts to:

	assertEvent(TreeCacheEvent.Type.INITIALIZED);
	assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());

it works. Does that make any sense?

-Jordan

> On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <jo...@jordanzimmerman.com> wrote:
> 
> Devs,
> 
> In trying to fix the bad log message "Failed to find watcher” (which turns out to be a ZK client issue), I realize that the NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still working on getting all tests to pass but I’d appreciate more sets of eyes on this change. Please review carefully if you can.
> 
> https://github.com/apache/curator/pull/131
> 
> -Jordan