Posted to user@curator.apache.org by Alfredo Gimenez <al...@gmail.com> on 2018/12/06 21:42:21 UTC

Possible deadlock in blockUntilConnected?

I ran into what looks like a deadlock in blockUntilConnected and wanted to
give a high-level description in case someone can help me debug the issue.
I can try to make a reproducible example, but for reasons that will be
apparent, that's not straightforward.

I am using Curator within a custom Kafka Connect source. As a result, I
have a process per node on 11 nodes, and up to 12 tasks (threads) per node,
each with its own Curator client. Every node is also running ZooKeeper, so
I initialize the Curator clients by pointing to localhost:2181. On 9 nodes,
everything works perfectly, but on the other 2, all tasks seem to hang at
blockUntilConnected (specifically here:
https://github.com/apache/curator/blob/ae309a29643afc6df511d1d9a162526ce474598b/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateManager.java#L224).
I found this by observing no activity in my Kafka Connect logs and grabbing
a stack trace via jstack on the offending nodes.

I also made a small test program that just initializes a client and runs
blockUntilConnected (nothing else) and ran it at the same time, and it also
hangs there forever. Meanwhile, I can use zookeeper-shell on localhost just
fine, and if I initialize a Curator client pointing to one of the other
nodes (not localhost) the Curator client initializes fine.

Is this a possible deadlock from initializing Curator clients across
multiple threads concurrently?

Re: Possible deadlock in blockUntilConnected?

Posted by Alfredo Gimenez <al...@gmail.com>.
Thanks! If the deadlock shows up elsewhere after removing
blockUntilConnected I'll be sure to fish out the logs and post them.

On Fri, Dec 7, 2018 at 10:55 AM Jordan Zimmerman <jo...@jordanzimmerman.com>
wrote:

> A few things to try:
>
>
>    - Integer.MAX_VALUE is not very useful for the session/connection
>    timeouts. Try reasonable numbers.
>    - The call to client.blockUntilConnected() isn't needed, remove it
>
>
> Other than that, I'd need to see logs to see why ZooKeeper itself isn't
> connecting.
>
> -JZ

Re: Possible deadlock in blockUntilConnected?

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
A few things to try:

   - Integer.MAX_VALUE is not very useful for the session/connection timeouts. Try reasonable numbers.
   - The call to client.blockUntilConnected() isn't needed, remove it

Other than that, I'd need to see logs to see why ZooKeeper itself isn't connecting.

-JZ
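
To put the first suggestion in numbers: both timeout arguments to CuratorFrameworkFactory.newClient are given in milliseconds, so Integer.MAX_VALUE asks for a timeout of roughly 24.8 days. The JDK-only sketch below just does that arithmetic. The class name TimeoutMath and the 60 s / 15 s values are illustrative assumptions, not figures taken from this thread.

```java
import java.time.Duration;

// Hypothetical helper, for illustration only; not part of Curator.
final class TimeoutMath {
    // Illustrative "reasonable" values (assumptions, not from the thread).
    static final int SESSION_TIMEOUT_MS = (int) Duration.ofSeconds(60).toMillis();
    static final int CONNECTION_TIMEOUT_MS = (int) Duration.ofSeconds(15).toMillis();

    // Number of whole days covered by a timeout expressed in milliseconds.
    static long wholeDays(int timeoutMs) {
        return Duration.ofMillis(timeoutMs).toDays();
    }

    public static void main(String[] args) {
        System.out.println("Integer.MAX_VALUE ms = " + Duration.ofMillis(Integer.MAX_VALUE));
        System.out.println("whole days: " + wholeDays(Integer.MAX_VALUE));
        System.out.println("session=" + SESSION_TIMEOUT_MS + " ms, connection=" + CONNECTION_TIMEOUT_MS + " ms");
    }
}
```

A timeout that long never fires in practice, so a client that cannot reach ZooKeeper has no point at which it would report a failure instead of continuing to wait.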

> On Dec 7, 2018, at 1:36 PM, Alfredo Gimenez <al...@gmail.com> wrote:
> 
> Absolutely, I just wanted to give the high-level description in case this was a clear anti-pattern (multiple threads connecting concurrently to ZK on localhost).
> 
> Each thread has its own Curator client right now because of the design of Kafka Connect--Connect "tasks", which run on separate threads, are meant to run independently of each other (no shared state in the VM). I'll see if it's possible to modify them to share a client--if that's necessary, does that mean client initialization is not thread-safe?


Re: Possible deadlock in blockUntilConnected?

Posted by Alfredo Gimenez <al...@gmail.com>.
Absolutely, I just wanted to give the high-level description in case this
was a clear anti-pattern (multiple threads connecting concurrently to ZK on
localhost).

Each thread has its own Curator client right now because of the design of
Kafka Connect--Connect "tasks", which run on separate threads, are meant to
run independently of each other (no shared state in the VM). I'll see if
it's possible to modify them to share a client--if that's necessary, does
that mean client initialization is not thread-safe?

The thread dump of the deadlocked tasks (all have this same dump):

"pool-1-thread-1" #64 prio=5 os_prio=0 tid=0x00002ab5c0062800 nid=0x5b2d in Object.wait() [0x00002ab5a520c000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.curator.framework.state.ConnectionStateManager.blockUntilConnected(ConnectionStateManager.java:224)
- locked <0x00002aac5183c9a0> (a org.apache.curator.framework.state.ConnectionStateManager)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.blockUntilConnected(CuratorFrameworkImpl.java:272)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.blockUntilConnected(CuratorFrameworkImpl.java:278)
at gov.llnl.sonar.kafka.connect.offsetmanager.FileOffsetManager.<init>(FileOffsetManager.java:72)
at gov.llnl.sonar.kafka.connect.connectors.DirectorySourceTask.start(DirectorySourceTask.java:89)
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:182)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
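
A note on reading the dump: the thread is WAITING inside Object.wait(), and jstack typically still lists the monitor as "locked" for a thread parked in wait() even though wait() releases it while parked. That pattern is what a thread waiting for a notification looks like, as opposed to threads blocked on each other's locks. The sketch below is a simplified JDK-only model of such a wait with a deadline. ConnectionGate and its methods are hypothetical names for illustration and are not Curator's actual implementation.

```java
import java.util.concurrent.TimeUnit;

// Simplified stand-in for a connection-state holder; NOT Curator's actual code.
final class ConnectionGate {
    private boolean connected;

    // Called by whichever thread learns that the connection is up.
    synchronized void markConnected() {
        connected = true;
        notifyAll(); // wake every thread parked in awaitConnected
    }

    // Waits up to maxWait for markConnected(); returns whether it happened in time.
    synchronized boolean awaitConnected(long maxWait, TimeUnit unit) throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        while (!connected) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                return false; // gave up: no "connected" notification arrived
            }
            // wait() releases the monitor while parked, so other threads can still enter.
            wait(remainingMs);
        }
        return true;
    }
}
```

Without the deadline, the loop would park until a notification arrives, which matches a thread that stays in Object.wait() for as long as the connection never comes up.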

And my Curator code in FileOffsetManager.java, which does little more than
create a very persistent client with a simple listener
(DirectorySourceTask.java doesn't do any Curator stuff; it just creates a
FileOffsetManager instance with a provided ZooKeeper host/port):

client = CuratorFrameworkFactory.newClient(
        zooKeeperHost + ":" + zooKeeperPort,
        Integer.MAX_VALUE,
        Integer.MAX_VALUE,
        new RetryForever(1000));

FileOffsetManager thisRef = this;
client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        if (!newState.isConnected()) {
            log.warn("Thread {}: Curator state changed to {} with contents: {}", threadID, newState.toString(), thisRef.toString());
        }
    }
});
client.start();
client.blockUntilConnected(); // Sometimes we get stuck here...


And my test code does the same without a listener:

CuratorFramework client = CuratorFrameworkFactory.newClient(
        zooKeeperHost + ":" + zooKeeperPort,
        Integer.MAX_VALUE,
        Integer.MAX_VALUE,
        new RetryForever(1000));

client.start();
client.blockUntilConnected();


On Fri, Dec 7, 2018 at 5:21 AM Jordan Zimmerman <jo...@jordanzimmerman.com>
wrote:

> There isn't much to go on in your description. Please send some sample
> code, logs, possibly a thread dump. Maybe send your test program. One thing
> that sticks out is that you say each thread has its own Curator client. Why
> is that? You only need 1 Curator client per ZK ensemble in a VM.
>
> -Jordan

Re: Possible deadlock in blockUntilConnected?

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
There isn't much to go on in your description. Please send some sample code, logs, possibly a thread dump. Maybe send your test program. One thing that sticks out is that you say each thread has its own Curator client. Why is that? You only need 1 Curator client per ZK ensemble in a VM.

-Jordan
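
One way to get "1 client per ZK ensemble in a VM" when several threads each ask for a client is a small per-JVM registry keyed by connect string. The sketch below is JDK-only and generic: the type parameter C stands in for a client type such as CuratorFramework, and SharedClients, forEnsemble, and the factory function are hypothetical names introduced for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical per-JVM registry; C stands in for a client type such as CuratorFramework.
final class SharedClients<C> {
    private final ConcurrentHashMap<String, C> clients = new ConcurrentHashMap<>();
    private final Function<String, C> factory;

    // The factory builds (and, if needed, starts) a client for a connect string.
    SharedClients(Function<String, C> factory) {
        this.factory = factory;
    }

    // Returns the one client for this connect string, creating it on first use.
    // ConcurrentHashMap.computeIfAbsent applies the factory at most once per key,
    // even when several threads ask at the same time.
    C forEnsemble(String connectString) {
        return clients.computeIfAbsent(connectString, factory);
    }
}
```

Tasks running in the same process would call forEnsemble with the same connect string (for example "localhost:2181") and receive the same instance. Who closes the shared client, and whether sharing fits Kafka Connect's task isolation, are left open here.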
