Posted to user@helix.apache.org by Vinoth Chandar <vi...@uber.com> on 2015/04/30 20:53:46 UTC

NPE trying to reconnect, upon ZK Timeout

Hi guys,

I am hitting the following with 0.6.5 upon a ZK connection timeout. We
make this call to the PropertyStore to figure out an offset to resume from.
This error eventually puts every partition into an error state, and
everything comes to a grinding halt. Any pointers on troubleshooting this?
Regardless, there shouldn't be an NPE, right?

NullPointerException

   - org.apache.helix.manager.zk.ZkClient$4 in call at line 241
   - org.apache.helix.manager.zk.ZkClient$4 in call at line 237
   - org.I0Itec.zkclient.ZkClient in retryUntilConnected at line 675
   - org.apache.helix.manager.zk.ZkClient in readData at line 237
   - org.I0Itec.zkclient.ZkClient in readData at line 761
   - org.apache.helix.manager.zk.ZkBaseDataAccessor in get at line 308
   - org.apache.helix.manager.zk.ZkCacheBaseDataAccessor in get at line 377
   - org.apache.helix.store.zk.AutoFallbackPropertyStore in get at line 100

Thanks
Vinoth

Re: NPE trying to reconnect, upon ZK Timeout

Posted by Vinoth Chandar <vi...@uber.com>.

It's awesome that there is flap detection.

The NPE is misleading, though. Let me file a ticket for these as well (will
do by EOW).

Thanks
Vinoth

Re: NPE trying to reconnect, upon ZK Timeout

Posted by kishore g <g....@gmail.com>.

Yep, this is a safety feature: Helix detects flapping (typically caused by
GC pauses) and disconnects the instance from the cluster automatically.
Unfortunately, in some cases it surfaces as an NPE.

We should probably describe the reason for disabling in the instance
config. Currently we just disable the node; we should probably add an
attribute such as DISABLE_CAUSE:"TOO MANY DISCONNECTS FROM ZK. CHECK JAVA
GC LOG" or something like that.
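
To make the proposal concrete, here is a minimal sketch of recording a
disable cause next to the enabled flag. Everything in it is an assumption
for illustration: InstanceConfigSketch is a hypothetical stand-in for the
instance config's key/value fields, the HELIX_ENABLED key name is assumed,
and DISABLE_CAUSE exists only as a suggestion in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an instance config's simple key/value fields.
final class InstanceConfigSketch {
    // Key names are assumptions; DISABLE_CAUSE is only proposed in this thread.
    static final String ENABLED = "HELIX_ENABLED";
    static final String DISABLE_CAUSE = "DISABLE_CAUSE";

    private final Map<String, String> simpleFields = new LinkedHashMap<>();

    InstanceConfigSketch() {
        simpleFields.put(ENABLED, "true");
    }

    // Disable the instance and record why, so the cause sits next to the flag.
    void disable(String cause) {
        simpleFields.put(ENABLED, "false");
        simpleFields.put(DISABLE_CAUSE, cause);
    }

    // Re-enable the instance and clear the now-stale cause.
    void enable() {
        simpleFields.put(ENABLED, "true");
        simpleFields.remove(DISABLE_CAUSE);
    }

    boolean isEnabled() {
        return Boolean.parseBoolean(simpleFields.get(ENABLED));
    }

    // Returns null when no cause is recorded.
    String disableCause() {
        return simpleFields.get(DISABLE_CAUSE);
    }
}
```

Clearing the cause on re-enable keeps an old message from lingering after
the node has recovered.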

thanks,
Kishore G

Re: NPE trying to reconnect, upon ZK Timeout

Posted by Vinoth Chandar <vi...@uber.com>.

Yep, seeing this:

$ grep -i flap /var/log/streamio/streamio.log
2015-04-30 16:08:50,823 ERROR - ZKHelixManager             - instanceName: ??--checkpointer is flapping. disconnect it.  maxDisconnectThreshold: 5 disconnects in 300000ms.
2015-04-30 16:09:30,140 ERROR - ZKHelixManager             - instanceName: ??-controller- is flapping. disconnect it.  maxDisconnectThreshold: 5 disconnects in 300000ms.
2015-04-30 16:11:05,679 ERROR - ZKHelixManager             - instanceName: ??-controller- is flapping. disconnect it.  maxDisconnectThreshold: 5 disconnects in 300000ms.

I also confirmed from the logs that it's GCing. (Sorry, I had a bad
dashboard originally that did not catch this.)
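
As a rough illustration of what "maxDisconnectThreshold: 5 disconnects in
300000ms" implies, here is a minimal sliding-window sketch. The class
FlapDetector and its method names are hypothetical, and tripping only once
the count exceeds the threshold is an assumption; this is not Helix's
actual implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window flap detector (not Helix's real code).
final class FlapDetector {
    private final int maxDisconnects;
    private final long windowMs;
    // Timestamps of recent disconnects, oldest first.
    private final Deque<Long> disconnectTimes = new ArrayDeque<>();

    FlapDetector(int maxDisconnects, long windowMs) {
        this.maxDisconnects = maxDisconnects;
        this.windowMs = windowMs;
    }

    // Records a disconnect at nowMs and returns true if the number of
    // disconnects inside the window now exceeds the threshold (assumption).
    boolean recordDisconnect(long nowMs) {
        disconnectTimes.addLast(nowMs);
        // Drop events that are older than the window.
        while (nowMs - disconnectTimes.peekFirst() > windowMs) {
            disconnectTimes.removeFirst();
        }
        return disconnectTimes.size() > maxDisconnects;
    }
}
```

With new FlapDetector(5, 300000L), disconnects one second apart trip the
detector on the sixth event, while disconnects 100 seconds apart never do,
because older events age out of the window first.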

Thanks
Vinoth

Re: NPE trying to reconnect, upon ZK Timeout

Posted by Zhen Zhang <ne...@gmail.com>.

Hi Vinoth,

The NPE indicates that the ZooKeeper connection in ZkClient is null. The
connection becomes null only when HelixManager#disconnect() is called. This
can happen if you call HelixManager#disconnect() directly, or if there are
frequent GCs and HelixManager disconnects itself. You can grep for
"KeeperState" to trace the connection state changes.

Thanks,
Jason
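
One way an application might avoid the bare NPE on its side is to check the
connection state before reading and fail with a descriptive error instead.
The sketch below is hypothetical: GuardedReader and its parameters are
illustrative names, and the connection check is passed in as a plain
BooleanSupplier (in a real application it could presumably be backed by
the manager's connection state, if your version exposes one).

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical helper that turns a read on a disconnected manager into a
// descriptive IllegalStateException instead of a bare NPE.
final class GuardedReader {
    private GuardedReader() {}

    static <T> T read(BooleanSupplier connected, Supplier<T> reader) {
        if (!connected.getAsBoolean()) {
            throw new IllegalStateException(
                "manager is disconnected (flapping?); refusing to read");
        }
        try {
            return reader.get();
        } catch (NullPointerException e) {
            // The connection can still drop between the check and the read.
            if (!connected.getAsBoolean()) {
                throw new IllegalStateException(
                    "manager disconnected during read", e);
            }
            throw e; // A genuine NPE unrelated to the connection.
        }
    }
}
```

The check alone cannot close the race between the test and the read, which
is why the sketch also rechecks the state when an NPE escapes the reader.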

