You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@helix.apache.org by Subbu Subramaniam <ss...@linkedin.com.INVALID> on 2015/10/22 04:09:04 UTC

A possible leak of ZkClient object in helix?

Hi,

We are using helix-0.6.5.22 version of helix for pinot development (
https://github.com/linkedin/pinot)

The pinot controller  constructs a ZkHelixPropertyStore() and adds a
listener to listen to changes on all nodes under the path specified.

If the zookeeper server is restarted, then the controller tries to
re-connect and send the entire watch list to the server. Since there is a
1M limit on the size of the message that the server can receive, the server
drops the connection.

At this point, we see that the controller goes into a spin, creating a new
ZkClient object as the connection moves between SyncConnected and
Disconnected states.

I have not tried to reproduce this problem independent of pinot-controller,
but I suspect the following steps could do it.


   - Start zk server
   - Start a program that calls new ZkHelixPropertyStore() that ends up
   setting watches on a bunch of nodes that may exceed 10k bytes (total of all
   path name lengths).
   - Stop the zk server
   - Restart the zk server setting jute.maxbuffer to 10k.
   - The client will probably go into a spin leaking ZkClient objects.


The objects seem to be in a LinkedBlockingQueue, since jvisualvm also shows
LinkedBlockingQueue$Node increasing proportionately by the same number.

Here is a sequence of error messages that may help:

2015/10/21 17:25:18.780 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
[pinot-controller] [] Unable to read additional data from server sessionid
0x1508cecdce10001, likely server has closed socket, closing socket
connection and attempting reconnect
2015/10/21 17:25:18.880 INFO [ZkClient] [main-EventThread]
[pinot-controller] [] zookeeper state changed (Disconnected)
2015/10/21 17:25:18.880 INFO [ZKHelixManager]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001, instance:
ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
2015/10/21 17:25:19.741 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
[pinot-controller] [] Opening socket connection to server localhost/
127.0.0.1:2181
2015/10/21 17:25:19.741 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
[pinot-controller] [] Socket connection established to localhost/
127.0.0.1:2181, initiating session
2015/10/21 17:25:19.743 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
[pinot-controller] [] Session establishment complete on server localhost/
127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout = 30000
2015/10/21 17:25:19.743 INFO [ZkClient] [main-EventThread]
[pinot-controller] [] zookeeper state changed (SyncConnected)
2015/10/21 17:25:19.743 INFO [ZKHelixManager]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] KeeperState:
SyncConnected, zookeeper:State:CONNECTED Timeout:30000
sessionid:0x1508cecdce10001 local:/127.0.0.1:50492 remoteserver:localhost/
127.0.0.1:2181 lastZxid:730 xid:465 sent:476 recv:476 queuedpkts:0
pendingresp:1 queuedevents:0
2015/10/21 17:25:19.744 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
[pinot-controller] [] Unable to read additional data from server sessionid
0x1508cecdce10001, likely server has closed socket, closing socket
connection and attempting reconnect
2015/10/21 17:25:19.844 INFO [ZkClient] [main-EventThread]
[pinot-controller] [] zookeeper state changed (Disconnected)
2015/10/21 17:25:19.844 INFO [ZKHelixManager]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001, instance:
ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
2015/10/21 17:25:19.844 ERROR [ZKHelixManager]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
instanceName: ssubrama-ld1.linkedin.biz_8862 is flapping. disconnect it.
 maxDisconnectThreshold: 5 disconnects in 300000ms.
2015/10/21 17:25:19.844 INFO [ZKHelixManager]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] disconnect
ssubrama-ld1.linkedin.biz_8862(CONTROLLER) from pinotDevDeploy
2015/10/21 17:25:19.844 INFO [HelixTaskExecutor]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
down HelixTaskExecutor
2015/10/21 17:25:19.845 INFO [HelixTaskExecutor]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
HelixTaskExecutor
2015/10/21 17:25:19.845 INFO [MessageQueueMonitor]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
Unregistering ClusterStatus:
cluster=pinotDevDeploy,messageQueue=ssubrama-ld1.linkedin.biz_8862
2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
exectuor for msgType: PARTICIPANT_ERROR_REPORT, pool:
java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool size = 0,
active threads = 0, queued tasks = 0, completed tasks = 0]
2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
down pool: java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool
size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
exectuor for msgType: CONTROLLER_MSG, pool:
java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool size = 0,
active threads = 0, queued tasks = 0, completed tasks = 0]
2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
down pool: java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool
size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
exectuor for msgType: SCHEDULER_MSG, pool:
java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool size = 0,
active threads = 0, queued tasks = 0, completed tasks = 0]
2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
down pool: java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool
size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
exectuor for msgType: TASK_REPLY, pool:
java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool size = 0,
active threads = 0, queued tasks = 0, completed tasks = 0]
2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
down pool: java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool
size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutdown
HelixTaskExecutor finished
2015/10/21 17:25:19.847 INFO [CallbackHandler]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] 34
START:INVOKE /pinotDevDeploy/CONTROLLER/MESSAGES
listener:org.apache.helix.messaging.handling.HelixTaskExecutor
2015/10/21 17:25:19.847 INFO [CallbackHandler]
[ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
ssubrama-ld1.linkedin.biz_8862 unsubscribe child-change. path:
/pinotDevDeploy/CONTROLLER/MESSAGES, listener:
org.apache.helix.messaging.handling.HelixTaskExecutor@77bbadc
2015/10/21 17:25:20.340 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
[pinot-controller] [] Opening socket connection to server
localhost/0:0:0:0:0:0:0:1:2181
2015/10/21 17:25:20.341 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
[pinot-controller] [] Socket connection established to
localhost/0:0:0:0:0:0:0:1:2181, initiating session
2015/10/21 17:25:20.342 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
[pinot-controller] [] Session establishment complete on server
localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1508cecdce10001, negotiated
timeout = 30000
2015/10/21 17:25:20.342 INFO [ZkClient] [main-EventThread]
[pinot-controller] [] zookeeper state changed (SyncConnected)
2015/10/21 17:25:20.343 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
[pinot-controller] [] Unable to read additional data from server sessionid
0x1508cecdce10001, likely server has closed socket, closing socket
connection and attempting reconnect
2015/10/21 17:25:20.443 INFO [ZkClient] [main-EventThread]
[pinot-controller] [] zookeeper state changed (Disconnected)
2015/10/21 17:25:20.534 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
[pinot-controller] [] Opening socket connection to server localhost/
127.0.0.1:2181
2015/10/21 17:25:20.535 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
[pinot-controller] [] Socket connection established to localhost/
127.0.0.1:2181, initiating session
2015/10/21 17:25:20.536 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
[pinot-controller] [] Session establishment complete on server localhost/
127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout = 30000
2015/10/21 17:25:20.537 INFO [ZkClient] [main-EventThread]
[pinot-controller] [] zookeeper state changed (SyncConnected)
2015/10/21 17:25:20.537 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
[pinot-controller] [] Unable to read additional data from server sessionid
0x1508cecdce10001, likely server has closed socket, closing socket
connection and attempting reconnect
2015/10/21 17:25:20.637 INFO [ZkClient] [main-EventThread]
[pinot-controller] [] zookeeper state changed (Disconnected)


And so on.

So, it does look like Helix is trying to shut down, and prints the
START:INVOKE, but it never prints the END:INVOKE for the same listener.

Any ideas?

-Subbu

Re: A possible leak of ZkClient object in helix?

Posted by Lei Xia <lx...@linkedin.com.INVALID>.
Actually, the fix has been back-ported to 3.4.9 release. I did verify it in
my local environment and the session reestablishment issue is gone after
bump ZK client version in Helix controller.  I will send a pull request to
bump the zk version.


Thanks
Lei

On Tue, Sep 6, 2016 at 10:32 AM, Lei Xia <lx...@linkedin.com> wrote:

> I verified locally and was able to reproduce the error.  This is not a
> ZkWatch leak,  it seems a known issue (https://issues.apache.org/
> jira/browse/ZOOKEEPER-706) which has been fixed in ZK 3.5.x.  However,
> 3.5.x is still in alpha release, not sure whether should we bump it in
> Helix.
>
>
> Thanks
> Lei
>
> On Wed, Oct 21, 2015 at 9:58 PM, kishore g <g....@gmail.com> wrote:
>
>> Hi Subbu,
>>
>> Can you show the jmap output that showed that the number of zkclients
>> increased.
>>
>> Jason, simulating this is straight forward.
>>
>> - start ZK
>> - set up a dummy cluster
>> - start a dummy server/broker and controller
>> - create a resource
>> - stop ZKServer and restart it after setting jute.maxBuffer to very low
>> number like 10.
>>
>> This will make the client go into connecting/disconnecting loop.  All the
>> while we can do
>>
>> jmap -histo:live and see that the zkClient object keeps increasing.
>>
>> I looked at the code and cant find any place where we would be creating
>> new
>> zkclients. I dont think zkproperty store is needed to reproduce this
>> issue.
>>
>>
>>
>>
>>
>>
>> On Wed, Oct 21, 2015 at 9:25 PM, Zhen Zhang <ne...@gmail.com> wrote:
>>
>> > Hi Subbu,
>> >
>> > I don't think it's ZkClient leak. Each ZkClient has a ZkEventThread
>> inside:
>> >
>> > https://github.com/sgroschupf/zkclient/blob/master/src/main/
>> java/org/I0Itec/zkclient/ZkClient.java#L70
>> >
>> > The ZkEventThread has a unique id:
>> >
>> > https://github.com/sgroschupf/zkclient/blob/master/src/main/
>> java/org/I0Itec/zkclient/ZkEventThread.java#L59
>> >
>> > From the log it seems the ZkEventThread remains the same "
>> > ZkClient-EventThread-34", so there is only one zk client.
>> >
>> > One reason I can think of is zk watch leak. This jira describes the
>> issue:
>> > https://issues.apache.org/jira/browse/HELIX-527
>> >
>> > Does restart the controller resolve the issue?
>> >
>> > Thanks,
>> > Jason
>> >
>> > On Wed, Oct 21, 2015 at 7:09 PM, Subbu Subramaniam <
>> > ssubramaniam@linkedin.com.invalid> wrote:
>> >
>> > > Hi,
>> > >
>> > > We are using helix-0.6.5.22 version of helix for pinot development (
>> > > https://github.com/linkedin/pinot)
>> > >
>> > > The pinot controller  constructs a ZkHelixPropertyStore() and adds a
>> > > listener to listen to changes on all nodes under the path specified.
>> > >
>> > > If the zookeeper server is restarted, then the controller tries to
>> > > re-connect and send the entire watch list to the server. Since there
>> is a
>> > > 1M limit on the size of the message that the server can receive, the
>> > server
>> > > drops the connection.
>> > >
>> > > At this point, we see that the controller goes into a spin, creating a
>> > new
>> > > ZkClient object as the connection moves between SyncConnected and
>> > > Disconnected states.
>> > >
>> > > I have not tried to reproduce this problem independent of
>> > pinot-controller,
>> > > but I suspect the following steps could do it.
>> > >
>> > >
>> > >    - Start zk server
>> > >    - Start a program that calls new ZkHelixPropertyStore() that ends
>> up
>> > >    setting watches on a bunch of nodes that may exceed 10k bytes
>> (total
>> > of
>> > > all
>> > >    path name lengths).
>> > >    - Stop the zk server
>> > >    - Restart the zk server setting jute.maxbuffer to 10k.
>> > >    - The client will probably go into a spin leaking ZkClient objects.
>> > >
>> > >
>> > > The objects seem to be in a LinkedBlockingQueue, since jvisualvm also
>> > shows
>> > > LinkedBlockingQueue$Node increasing proportionately by the same
>> number.
>> > >
>> > > Here is a sequence of error messages that may help:
>> > >
>> > > 2015/10/21 17:25:18.780 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Unable to read additional data from server
>> > sessionid
>> > > 0x1508cecdce10001, likely server has closed socket, closing socket
>> > > connection and attempting reconnect
>> > > 2015/10/21 17:25:18.880 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (Disconnected)
>> > > 2015/10/21 17:25:18.880 INFO [ZKHelixManager]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> > > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001,
>> > instance:
>> > > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
>> > > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Opening socket connection to server localhost/
>> > > 127.0.0.1:2181
>> > > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Socket connection established to localhost/
>> > > 127.0.0.1:2181, initiating session
>> > > 2015/10/21 17:25:19.743 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Session establishment complete on server
>> localhost/
>> > > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout =
>> > 30000
>> > > 2015/10/21 17:25:19.743 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (SyncConnected)
>> > > 2015/10/21 17:25:19.743 INFO [ZKHelixManager]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> > KeeperState:
>> > > SyncConnected, zookeeper:State:CONNECTED Timeout:30000
>> > > sessionid:0x1508cecdce10001 local:/127.0.0.1:50492
>> > remoteserver:localhost/
>> > > 127.0.0.1:2181 lastZxid:730 xid:465 sent:476 recv:476 queuedpkts:0
>> > > pendingresp:1 queuedevents:0
>> > > 2015/10/21 17:25:19.744 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Unable to read additional data from server
>> > sessionid
>> > > 0x1508cecdce10001, likely server has closed socket, closing socket
>> > > connection and attempting reconnect
>> > > 2015/10/21 17:25:19.844 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (Disconnected)
>> > > 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> > > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001,
>> > instance:
>> > > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
>> > > 2015/10/21 17:25:19.844 ERROR [ZKHelixManager]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> > > instanceName: ssubrama-ld1.linkedin.biz_8862 is flapping. disconnect
>> it.
>> > >  maxDisconnectThreshold: 5 disconnects in 300000ms.
>> > > 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> disconnect
>> > > ssubrama-ld1.linkedin.biz_8862(CONTROLLER) from pinotDevDeploy
>> > > 2015/10/21 17:25:19.844 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> Shutting
>> > > down HelixTaskExecutor
>> > > 2015/10/21 17:25:19.845 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
>> > > HelixTaskExecutor
>> > > 2015/10/21 17:25:19.845 INFO [MessageQueueMonitor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> > > Unregistering ClusterStatus:
>> > > cluster=pinotDevDeploy,messageQueue=ssubrama-ld1.linkedin.biz_8862
>> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
>> > > exectuor for msgType: PARTICIPANT_ERROR_REPORT, pool:
>> > > java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool size
>> = 0,
>> > > active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> Shutting
>> > > down pool: java.util.concurrent.ThreadPoolExecutor@467b684d[Running,
>> > pool
>> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
>> > > exectuor for msgType: CONTROLLER_MSG, pool:
>> > > java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool size
>> = 0,
>> > > active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> Shutting
>> > > down pool: java.util.concurrent.ThreadPoolExecutor@53e800f9[Running,
>> > pool
>> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
>> > > exectuor for msgType: SCHEDULER_MSG, pool:
>> > > java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool size
>> = 0,
>> > > active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> Shutting
>> > > down pool: java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running,
>> > pool
>> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
>> > > exectuor for msgType: TASK_REPLY, pool:
>> > > java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool size
>> = 0,
>> > > active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> Shutting
>> > > down pool: java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running,
>> > pool
>> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> Shutdown
>> > > HelixTaskExecutor finished
>> > > 2015/10/21 17:25:19.847 INFO [CallbackHandler]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] 34
>> > > START:INVOKE /pinotDevDeploy/CONTROLLER/MESSAGES
>> > > listener:org.apache.helix.messaging.handling.HelixTaskExecutor
>> > > 2015/10/21 17:25:19.847 INFO [CallbackHandler]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> > > ssubrama-ld1.linkedin.biz_8862 unsubscribe child-change. path:
>> > > /pinotDevDeploy/CONTROLLER/MESSAGES, listener:
>> > > org.apache.helix.messaging.handling.HelixTaskExecutor@77bbadc
>> > > 2015/10/21 17:25:20.340 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Opening socket connection to server
>> > > localhost/0:0:0:0:0:0:0:1:2181
>> > > 2015/10/21 17:25:20.341 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Socket connection established to
>> > > localhost/0:0:0:0:0:0:0:1:2181, initiating session
>> > > 2015/10/21 17:25:20.342 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Session establishment complete on server
>> > > localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1508cecdce10001,
>> negotiated
>> > > timeout = 30000
>> > > 2015/10/21 17:25:20.342 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (SyncConnected)
>> > > 2015/10/21 17:25:20.343 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Unable to read additional data from server
>> > sessionid
>> > > 0x1508cecdce10001, likely server has closed socket, closing socket
>> > > connection and attempting reconnect
>> > > 2015/10/21 17:25:20.443 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (Disconnected)
>> > > 2015/10/21 17:25:20.534 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Opening socket connection to server localhost/
>> > > 127.0.0.1:2181
>> > > 2015/10/21 17:25:20.535 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Socket connection established to localhost/
>> > > 127.0.0.1:2181, initiating session
>> > > 2015/10/21 17:25:20.536 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Session establishment complete on server
>> localhost/
>> > > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout =
>> > 30000
>> > > 2015/10/21 17:25:20.537 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (SyncConnected)
>> > > 2015/10/21 17:25:20.537 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Unable to read additional data from server
>> > sessionid
>> > > 0x1508cecdce10001, likely server has closed socket, closing socket
>> > > connection and attempting reconnect
>> > > 2015/10/21 17:25:20.637 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (Disconnected)
>> > >
>> > >
>> > > And so on.
>> > >
>> > > So, it does look like Helix is trying to shut down, and prints the
>> > > START:INVOKE, but it never prints the END:INVOKE for the same
>> listener.
>> > >
>> > > Any ideas?
>> > >
>> > > -Subbu
>> > >
>> >
>>
>
>
>
> --
>
> *Lei Xia *Senior Software Engineer
> Data Infra/Nuage & Helix
> LinkedIn
>
> lxia@linkedin.com
> www.linkedin.com/in/lxia1
>



-- 

*Lei Xia *Senior Software Engineer
Data Infra/Nuage & Helix
LinkedIn

lxia@linkedin.com
www.linkedin.com/in/lxia1

Re: A possible leak of ZkClient object in helix?

Posted by Lei Xia <lx...@linkedin.com.INVALID>.
I verified locally and was able to reproduce the error.  This is not a
ZkWatch leak,  it seems a known issue (
https://issues.apache.org/jira/browse/ZOOKEEPER-706) which has been fixed
in ZK 3.5.x.  However, 3.5.x is still in alpha release, not sure whether
should we bump it in Helix.


Thanks
Lei

On Wed, Oct 21, 2015 at 9:58 PM, kishore g <g....@gmail.com> wrote:

> Hi Subbu,
>
> Can you show the jmap output that showed that the number of zkclients
> increased.
>
> Jason, simulating this is straight forward.
>
> - start ZK
> - set up a dummy cluster
> - start a dummy server/broker and controller
> - create a resource
> - stop ZKServer and restart it after setting jute.maxBuffer to very low
> number like 10.
>
> This will make the client go into connecting/disconnecting loop.  All the
> while we can do
>
> jmap -histo:live and see that the zkClient object keeps increasing.
>
> I looked at the code and cant find any place where we would be creating new
> zkclients. I dont think zkproperty store is needed to reproduce this issue.
>
>
>
>
>
>
> On Wed, Oct 21, 2015 at 9:25 PM, Zhen Zhang <ne...@gmail.com> wrote:
>
> > Hi Subbu,
> >
> > I don't think it's ZkClient leak. Each ZkClient has a ZkEventThread
> inside:
> >
> > https://github.com/sgroschupf/zkclient/blob/master/src/main/
> java/org/I0Itec/zkclient/ZkClient.java#L70
> >
> > The ZkEventThread has a unique id:
> >
> > https://github.com/sgroschupf/zkclient/blob/master/src/main/
> java/org/I0Itec/zkclient/ZkEventThread.java#L59
> >
> > From the log it seems the ZkEventThread remains the same "
> > ZkClient-EventThread-34", so there is only one zk client.
> >
> > One reason I can think of is zk watch leak. This jira describes the
> issue:
> > https://issues.apache.org/jira/browse/HELIX-527
> >
> > Does restart the controller resolve the issue?
> >
> > Thanks,
> > Jason
> >
> > On Wed, Oct 21, 2015 at 7:09 PM, Subbu Subramaniam <
> > ssubramaniam@linkedin.com.invalid> wrote:
> >
> > > Hi,
> > >
> > > We are using helix-0.6.5.22 version of helix for pinot development (
> > > https://github.com/linkedin/pinot)
> > >
> > > The pinot controller  constructs a ZkHelixPropertyStore() and adds a
> > > listener to listen to changes on all nodes under the path specified.
> > >
> > > If the zookeeper server is restarted, then the controller tries to
> > > re-connect and send the entire watch list to the server. Since there
> is a
> > > 1M limit on the size of the message that the server can receive, the
> > server
> > > drops the connection.
> > >
> > > At this point, we see that the controller goes into a spin, creating a
> > new
> > > ZkClient object as the connection moves between SyncConnected and
> > > Disconnected states.
> > >
> > > I have not tried to reproduce this problem independent of
> > pinot-controller,
> > > but I suspect the following steps could do it.
> > >
> > >
> > >    - Start zk server
> > >    - Start a program that calls new ZkHelixPropertyStore() that ends up
> > >    setting watches on a bunch of nodes that may exceed 10k bytes (total
> > of
> > > all
> > >    path name lengths).
> > >    - Stop the zk server
> > >    - Restart the zk server setting jute.maxbuffer to 10k.
> > >    - The client will probably go into a spin leaking ZkClient objects.
> > >
> > >
> > > The objects seem to be in a LinkedBlockingQueue, since jvisualvm also
> > shows
> > > LinkedBlockingQueue$Node increasing proportionately by the same number.
> > >
> > > Here is a sequence of error messages that may help:
> > >
> > > 2015/10/21 17:25:18.780 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Unable to read additional data from server
> > sessionid
> > > 0x1508cecdce10001, likely server has closed socket, closing socket
> > > connection and attempting reconnect
> > > 2015/10/21 17:25:18.880 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (Disconnected)
> > > 2015/10/21 17:25:18.880 INFO [ZKHelixManager]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001,
> > instance:
> > > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
> > > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Opening socket connection to server localhost/
> > > 127.0.0.1:2181
> > > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Socket connection established to localhost/
> > > 127.0.0.1:2181, initiating session
> > > 2015/10/21 17:25:19.743 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Session establishment complete on server
> localhost/
> > > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout =
> > 30000
> > > 2015/10/21 17:25:19.743 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (SyncConnected)
> > > 2015/10/21 17:25:19.743 INFO [ZKHelixManager]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > KeeperState:
> > > SyncConnected, zookeeper:State:CONNECTED Timeout:30000
> > > sessionid:0x1508cecdce10001 local:/127.0.0.1:50492
> > remoteserver:localhost/
> > > 127.0.0.1:2181 lastZxid:730 xid:465 sent:476 recv:476 queuedpkts:0
> > > pendingresp:1 queuedevents:0
> > > 2015/10/21 17:25:19.744 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Unable to read additional data from server
> > sessionid
> > > 0x1508cecdce10001, likely server has closed socket, closing socket
> > > connection and attempting reconnect
> > > 2015/10/21 17:25:19.844 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (Disconnected)
> > > 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001,
> > instance:
> > > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
> > > 2015/10/21 17:25:19.844 ERROR [ZKHelixManager]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > > instanceName: ssubrama-ld1.linkedin.biz_8862 is flapping. disconnect
> it.
> > >  maxDisconnectThreshold: 5 disconnects in 300000ms.
> > > 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> disconnect
> > > ssubrama-ld1.linkedin.biz_8862(CONTROLLER) from pinotDevDeploy
> > > 2015/10/21 17:25:19.844 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Shutting
> > > down HelixTaskExecutor
> > > 2015/10/21 17:25:19.845 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > > HelixTaskExecutor
> > > 2015/10/21 17:25:19.845 INFO [MessageQueueMonitor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > > Unregistering ClusterStatus:
> > > cluster=pinotDevDeploy,messageQueue=ssubrama-ld1.linkedin.biz_8862
> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > > exectuor for msgType: PARTICIPANT_ERROR_REPORT, pool:
> > > java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool size =
> 0,
> > > active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Shutting
> > > down pool: java.util.concurrent.ThreadPoolExecutor@467b684d[Running,
> > pool
> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > > exectuor for msgType: CONTROLLER_MSG, pool:
> > > java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool size =
> 0,
> > > active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Shutting
> > > down pool: java.util.concurrent.ThreadPoolExecutor@53e800f9[Running,
> > pool
> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > > exectuor for msgType: SCHEDULER_MSG, pool:
> > > java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool size =
> 0,
> > > active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Shutting
> > > down pool: java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running,
> > pool
> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > > exectuor for msgType: TASK_REPLY, pool:
> > > java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool size =
> 0,
> > > active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Shutting
> > > down pool: java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running,
> > pool
> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Shutdown
> > > HelixTaskExecutor finished
> > > 2015/10/21 17:25:19.847 INFO [CallbackHandler]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] 34
> > > START:INVOKE /pinotDevDeploy/CONTROLLER/MESSAGES
> > > listener:org.apache.helix.messaging.handling.HelixTaskExecutor
> > > 2015/10/21 17:25:19.847 INFO [CallbackHandler]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > > ssubrama-ld1.linkedin.biz_8862 unsubscribe child-change. path:
> > > /pinotDevDeploy/CONTROLLER/MESSAGES, listener:
> > > org.apache.helix.messaging.handling.HelixTaskExecutor@77bbadc
> > > 2015/10/21 17:25:20.340 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Opening socket connection to server
> > > localhost/0:0:0:0:0:0:0:1:2181
> > > 2015/10/21 17:25:20.341 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Socket connection established to
> > > localhost/0:0:0:0:0:0:0:1:2181, initiating session
> > > 2015/10/21 17:25:20.342 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Session establishment complete on server
> > > localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1508cecdce10001,
> negotiated
> > > timeout = 30000
> > > 2015/10/21 17:25:20.342 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (SyncConnected)
> > > 2015/10/21 17:25:20.343 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Unable to read additional data from server
> > sessionid
> > > 0x1508cecdce10001, likely server has closed socket, closing socket
> > > connection and attempting reconnect
> > > 2015/10/21 17:25:20.443 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (Disconnected)
> > > 2015/10/21 17:25:20.534 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Opening socket connection to server localhost/
> > > 127.0.0.1:2181
> > > 2015/10/21 17:25:20.535 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Socket connection established to localhost/
> > > 127.0.0.1:2181, initiating session
> > > 2015/10/21 17:25:20.536 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Session establishment complete on server
> localhost/
> > > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout =
> > 30000
> > > 2015/10/21 17:25:20.537 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (SyncConnected)
> > > 2015/10/21 17:25:20.537 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Unable to read additional data from server
> > sessionid
> > > 0x1508cecdce10001, likely server has closed socket, closing socket
> > > connection and attempting reconnect
> > > 2015/10/21 17:25:20.637 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (Disconnected)
> > >
> > >
> > > And so on.
> > >
> > > So, it does look like Helix is trying to shut down, and prints the
> > > START:INVOKE, but it never prints the END:INVOKE for the same listener.
> > >
> > > Any ideas?
> > >
> > > -Subbu
> > >
> >
>



-- 

*Lei Xia *Senior Software Engineer
Data Infra/Nuage & Helix
LinkedIn

lxia@linkedin.com
www.linkedin.com/in/lxia1

Re: A possible leak of ZkClient object in helix?

Posted by kishore g <g....@gmail.com>.
Hi Subbu,

Can you show the jmap output that showed that the number of zkclients
increased.

Jason, simulating this is straight forward.

- start ZK
- set up a dummy cluster
- start a dummy server/broker and controller
- create a resource
- stop ZKServer and restart it after setting jute.maxBuffer to very low
number like 10.

This will make the client go into connecting/disconnecting loop.  All the
while we can do

jmap -histo:live and see that the zkClient object keeps increasing.

I looked at the code and cant find any place where we would be creating new
zkclients. I dont think zkproperty store is needed to reproduce this issue.






On Wed, Oct 21, 2015 at 9:25 PM, Zhen Zhang <ne...@gmail.com> wrote:

> Hi Subbu,
>
> I don't think it's ZkClient leak. Each ZkClient has a ZkEventThread inside:
>
> https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkClient.java#L70
>
> The ZkEventThread has a unique id:
>
> https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkEventThread.java#L59
>
> From the log it seems the ZkEventThread remains the same "
> ZkClient-EventThread-34", so there is only one zk client.
>
> One reason I can think of is zk watch leak. This jira describes the issue:
> https://issues.apache.org/jira/browse/HELIX-527
>
> Does restart the controller resolve the issue?
>
> Thanks,
> Jason
>
> On Wed, Oct 21, 2015 at 7:09 PM, Subbu Subramaniam <
> ssubramaniam@linkedin.com.invalid> wrote:
>
> > Hi,
> >
> > We are using helix-0.6.5.22 version of helix for pinot development (
> > https://github.com/linkedin/pinot)
> >
> > The pinot controller  constructs a ZkHelixPropertyStore() and adds a
> > listener to listen to changes on all nodes under the path specified.
> >
> > If the zookeeper server is restarted, then the controller tries to
> > re-connect and send the entire watch list to the server. Since there is a
> > 1M limit on the size of the message that the server can receive, the
> server
> > drops the connection.
> >
> > At this point, we see that the controller goes into a spin, creating a
> new
> > ZkClient object as the connection moves between SyncConnected and
> > Disconnected states.
> >
> > I have not tried to reproduce this problem independent of
> pinot-controller,
> > but I suspect the following steps could do it.
> >
> >
> >    - Start zk server
> >    - Start a program that calls new ZkHelixPropertyStore() that ends up
> >    setting watches on a bunch of nodes that may exceed 10k bytes (total
> of
> > all
> >    path name lengths).
> >    - Stop the zk server
> >    - Restart the zk server setting jute.maxbuffer to 10k.
> >    - The client will probably go into a spin leaking ZkClient objects.
> >
> >
> > The objects seem to be in a LinkedBlockingQueue, since jvisualvm also
> shows
> > LinkedBlockingQueue$Node increasing proportionately by the same number.
> >
> > Here is a sequence of error messages that may help:
> >
> > 2015/10/21 17:25:18.780 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Unable to read additional data from server
> sessionid
> > 0x1508cecdce10001, likely server has closed socket, closing socket
> > connection and attempting reconnect
> > 2015/10/21 17:25:18.880 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (Disconnected)
> > 2015/10/21 17:25:18.880 INFO [ZKHelixManager]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001,
> instance:
> > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
> > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Opening socket connection to server localhost/
> > 127.0.0.1:2181
> > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Socket connection established to localhost/
> > 127.0.0.1:2181, initiating session
> > 2015/10/21 17:25:19.743 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Session establishment complete on server localhost/
> > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout =
> 30000
> > 2015/10/21 17:25:19.743 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (SyncConnected)
> > 2015/10/21 17:25:19.743 INFO [ZKHelixManager]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> KeeperState:
> > SyncConnected, zookeeper:State:CONNECTED Timeout:30000
> > sessionid:0x1508cecdce10001 local:/127.0.0.1:50492
> remoteserver:localhost/
> > 127.0.0.1:2181 lastZxid:730 xid:465 sent:476 recv:476 queuedpkts:0
> > pendingresp:1 queuedevents:0
> > 2015/10/21 17:25:19.744 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Unable to read additional data from server
> sessionid
> > 0x1508cecdce10001, likely server has closed socket, closing socket
> > connection and attempting reconnect
> > 2015/10/21 17:25:19.844 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (Disconnected)
> > 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001,
> instance:
> > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
> > 2015/10/21 17:25:19.844 ERROR [ZKHelixManager]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > instanceName: ssubrama-ld1.linkedin.biz_8862 is flapping. disconnect it.
> >  maxDisconnectThreshold: 5 disconnects in 300000ms.
> > 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] disconnect
> > ssubrama-ld1.linkedin.biz_8862(CONTROLLER) from pinotDevDeploy
> > 2015/10/21 17:25:19.844 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> > down HelixTaskExecutor
> > 2015/10/21 17:25:19.845 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > HelixTaskExecutor
> > 2015/10/21 17:25:19.845 INFO [MessageQueueMonitor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > Unregistering ClusterStatus:
> > cluster=pinotDevDeploy,messageQueue=ssubrama-ld1.linkedin.biz_8862
> > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > exectuor for msgType: PARTICIPANT_ERROR_REPORT, pool:
> > java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool size = 0,
> > active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> > down pool: java.util.concurrent.ThreadPoolExecutor@467b684d[Running,
> pool
> > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > exectuor for msgType: CONTROLLER_MSG, pool:
> > java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool size = 0,
> > active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> > down pool: java.util.concurrent.ThreadPoolExecutor@53e800f9[Running,
> pool
> > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > exectuor for msgType: SCHEDULER_MSG, pool:
> > java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool size = 0,
> > active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> > down pool: java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running,
> pool
> > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > exectuor for msgType: TASK_REPLY, pool:
> > java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool size = 0,
> > active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> > down pool: java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running,
> pool
> > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutdown
> > HelixTaskExecutor finished
> > 2015/10/21 17:25:19.847 INFO [CallbackHandler]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] 34
> > START:INVOKE /pinotDevDeploy/CONTROLLER/MESSAGES
> > listener:org.apache.helix.messaging.handling.HelixTaskExecutor
> > 2015/10/21 17:25:19.847 INFO [CallbackHandler]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > ssubrama-ld1.linkedin.biz_8862 unsubscribe child-change. path:
> > /pinotDevDeploy/CONTROLLER/MESSAGES, listener:
> > org.apache.helix.messaging.handling.HelixTaskExecutor@77bbadc
> > 2015/10/21 17:25:20.340 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Opening socket connection to server
> > localhost/0:0:0:0:0:0:0:1:2181
> > 2015/10/21 17:25:20.341 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Socket connection established to
> > localhost/0:0:0:0:0:0:0:1:2181, initiating session
> > 2015/10/21 17:25:20.342 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Session establishment complete on server
> > localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1508cecdce10001, negotiated
> > timeout = 30000
> > 2015/10/21 17:25:20.342 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (SyncConnected)
> > 2015/10/21 17:25:20.343 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Unable to read additional data from server
> sessionid
> > 0x1508cecdce10001, likely server has closed socket, closing socket
> > connection and attempting reconnect
> > 2015/10/21 17:25:20.443 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (Disconnected)
> > 2015/10/21 17:25:20.534 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Opening socket connection to server localhost/
> > 127.0.0.1:2181
> > 2015/10/21 17:25:20.535 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Socket connection established to localhost/
> > 127.0.0.1:2181, initiating session
> > 2015/10/21 17:25:20.536 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Session establishment complete on server localhost/
> > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout =
> 30000
> > 2015/10/21 17:25:20.537 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (SyncConnected)
> > 2015/10/21 17:25:20.537 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Unable to read additional data from server
> sessionid
> > 0x1508cecdce10001, likely server has closed socket, closing socket
> > connection and attempting reconnect
> > 2015/10/21 17:25:20.637 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (Disconnected)
> >
> >
> > And so on.
> >
> > So, it does look like Helix is trying to shut down, and prints the
> > START:INVOKE, but it never prints the END:INVOKE for the same listener.
> >
> > Any ideas?
> >
> > -Subbu
> >
>

Re: A possible leak of ZkClient object in helix?

Posted by Zhen Zhang <ne...@gmail.com>.
Hi Subbu,

I don't think it's ZkClient leak. Each ZkClient has a ZkEventThread inside:
https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkClient.java#L70

The ZkEventThread has a unique id:
https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkEventThread.java#L59

>From the log it seems the ZkEventThread remains the same "
ZkClient-EventThread-34", so there is only one zk client.

One reason I can think of is zk watch leak. This jira describes the issue:
https://issues.apache.org/jira/browse/HELIX-527

Does restart the controller resolve the issue?

Thanks,
Jason

On Wed, Oct 21, 2015 at 7:09 PM, Subbu Subramaniam <
ssubramaniam@linkedin.com.invalid> wrote:

> Hi,
>
> We are using helix-0.6.5.22 version of helix for pinot development (
> https://github.com/linkedin/pinot)
>
> The pinot controller  constructs a ZkHelixPropertyStore() and adds a
> listener to listen to changes on all nodes under the path specified.
>
> If the zookeeper server is restarted, then the controller tries to
> re-connect and send the entire watch list to the server. Since there is a
> 1M limit on the size of the message that the server can receive, the server
> drops the connection.
>
> At this point, we see that the controller goes into a spin, creating a new
> ZkClient object as the connection moves between SyncConnected and
> Disconnected states.
>
> I have not tried to reproduce this problem independent of pinot-controller,
> but I suspect the following steps could do it.
>
>
>    - Start zk server
>    - Start a program that calls new ZkHelixPropertyStore() that ends up
>    setting watches on a bunch of nodes that may exceed 10k bytes (total of
> all
>    path name lengths).
>    - Stop the zk server
>    - Restart the zk server setting jute.maxbuffer to 10k.
>    - The client will probably go into a spin leaking ZkClient objects.
>
>
> The objects seem to be in a LinkedBlockingQueue, since jvisualvm also shows
> LinkedBlockingQueue$Node increasing proportionately by the same number.
>
> Here is a sequence of error messages that may help:
>
> 2015/10/21 17:25:18.780 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Unable to read additional data from server sessionid
> 0x1508cecdce10001, likely server has closed socket, closing socket
> connection and attempting reconnect
> 2015/10/21 17:25:18.880 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (Disconnected)
> 2015/10/21 17:25:18.880 INFO [ZKHelixManager]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001, instance:
> ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
> 2015/10/21 17:25:19.741 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Opening socket connection to server localhost/
> 127.0.0.1:2181
> 2015/10/21 17:25:19.741 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Socket connection established to localhost/
> 127.0.0.1:2181, initiating session
> 2015/10/21 17:25:19.743 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Session establishment complete on server localhost/
> 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout = 30000
> 2015/10/21 17:25:19.743 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (SyncConnected)
> 2015/10/21 17:25:19.743 INFO [ZKHelixManager]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] KeeperState:
> SyncConnected, zookeeper:State:CONNECTED Timeout:30000
> sessionid:0x1508cecdce10001 local:/127.0.0.1:50492 remoteserver:localhost/
> 127.0.0.1:2181 lastZxid:730 xid:465 sent:476 recv:476 queuedpkts:0
> pendingresp:1 queuedevents:0
> 2015/10/21 17:25:19.744 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Unable to read additional data from server sessionid
> 0x1508cecdce10001, likely server has closed socket, closing socket
> connection and attempting reconnect
> 2015/10/21 17:25:19.844 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (Disconnected)
> 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001, instance:
> ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
> 2015/10/21 17:25:19.844 ERROR [ZKHelixManager]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> instanceName: ssubrama-ld1.linkedin.biz_8862 is flapping. disconnect it.
>  maxDisconnectThreshold: 5 disconnects in 300000ms.
> 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] disconnect
> ssubrama-ld1.linkedin.biz_8862(CONTROLLER) from pinotDevDeploy
> 2015/10/21 17:25:19.844 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> down HelixTaskExecutor
> 2015/10/21 17:25:19.845 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> HelixTaskExecutor
> 2015/10/21 17:25:19.845 INFO [MessageQueueMonitor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Unregistering ClusterStatus:
> cluster=pinotDevDeploy,messageQueue=ssubrama-ld1.linkedin.biz_8862
> 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> exectuor for msgType: PARTICIPANT_ERROR_REPORT, pool:
> java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool size = 0,
> active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> down pool: java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool
> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> exectuor for msgType: CONTROLLER_MSG, pool:
> java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool size = 0,
> active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> down pool: java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool
> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> exectuor for msgType: SCHEDULER_MSG, pool:
> java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool size = 0,
> active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> down pool: java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool
> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> exectuor for msgType: TASK_REPLY, pool:
> java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool size = 0,
> active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> down pool: java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool
> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutdown
> HelixTaskExecutor finished
> 2015/10/21 17:25:19.847 INFO [CallbackHandler]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] 34
> START:INVOKE /pinotDevDeploy/CONTROLLER/MESSAGES
> listener:org.apache.helix.messaging.handling.HelixTaskExecutor
> 2015/10/21 17:25:19.847 INFO [CallbackHandler]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> ssubrama-ld1.linkedin.biz_8862 unsubscribe child-change. path:
> /pinotDevDeploy/CONTROLLER/MESSAGES, listener:
> org.apache.helix.messaging.handling.HelixTaskExecutor@77bbadc
> 2015/10/21 17:25:20.340 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Opening socket connection to server
> localhost/0:0:0:0:0:0:0:1:2181
> 2015/10/21 17:25:20.341 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Socket connection established to
> localhost/0:0:0:0:0:0:0:1:2181, initiating session
> 2015/10/21 17:25:20.342 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Session establishment complete on server
> localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1508cecdce10001, negotiated
> timeout = 30000
> 2015/10/21 17:25:20.342 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (SyncConnected)
> 2015/10/21 17:25:20.343 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Unable to read additional data from server sessionid
> 0x1508cecdce10001, likely server has closed socket, closing socket
> connection and attempting reconnect
> 2015/10/21 17:25:20.443 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (Disconnected)
> 2015/10/21 17:25:20.534 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Opening socket connection to server localhost/
> 127.0.0.1:2181
> 2015/10/21 17:25:20.535 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Socket connection established to localhost/
> 127.0.0.1:2181, initiating session
> 2015/10/21 17:25:20.536 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Session establishment complete on server localhost/
> 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout = 30000
> 2015/10/21 17:25:20.537 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (SyncConnected)
> 2015/10/21 17:25:20.537 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Unable to read additional data from server sessionid
> 0x1508cecdce10001, likely server has closed socket, closing socket
> connection and attempting reconnect
> 2015/10/21 17:25:20.637 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (Disconnected)
>
>
> And so on.
>
> So, it does look like Helix is trying to shut down, and prints the
> START:INVOKE, but it never prints the END:INVOKE for the same listener.
>
> Any ideas?
>
> -Subbu
>