You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@helix.apache.org by Varun Sharma <va...@pinterest.com> on 2014/11/18 01:10:39 UTC

Helix issue - External View out of sync

Hi,

I am seeing the following issue for many partitions in helix using a simple
Online->Offline state model factory. The external view says that the
partition has been assigned to 3 hosts. However, when I look at the hosts
only 1 of them executed the OFFLINE --> ONLINE transition.

On the hosts, that did not execute the transition, I see the following:

2014-11-13 09:29:54,394 [pool-3-thread-11]
(HelixStateTransitionHandler.java:206) WARN  *Force CurrentState on Zk to
be stateModel's CurrentState*. *partitionKey: 490*, currentState: ONLINE,
message: 12690ce8-8098-46b1-a93d-279604f0e3db,
{CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
READ_TIMESTAMP=1415870993787,
RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
TO_STATE=ONLINE}{}{}

When I grep the message ID in the controller, I see the following:

2014-11-14 09:34:56,265 [StatusDumpTimerTask] (ZKPathDataDumpTask.java:155)
INFO  {

  "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",

  "mapFields" : {

    "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
c1193025-b416-49d7-adc2-10afe2389141" : {

      "AdditionalInfo" : "Message execution failed. msgId:
12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
org.apache.helix.messaging.handling.
*HelixStateTransitionHandler$HelixStateMismatchException*: Current state of
stateModel does not match the fromState in Message, Current State:ONLINE,
message expected:OFFLINE, partition: 490, from:
hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",

      "Class" : "class
org.apache.helix.messaging.handling.HelixStateTransitionHandler",

      "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",

      "Message state" : "READ"

    },


What could be causing this - when I restart the node, the error disappears
(meaning that the node is able to perform the state transition). What could
be causing this state mismatch ?


Thanks

Varun

Re: Helix issue - External View out of sync

Posted by kishore g <g....@gmail.com>.

Want to hop on to helix irc to debug this?

On Mon, Nov 17, 2014 at 5:08 PM, Varun Sharma <va...@pinterest.com> wrote:

> I looked at the logs and gc was fine as the system was processing other
> events around the same time.
>
> Is there anything else specifically I shold look for in the logs ? Is
> there a way to find out whether a node was removed from the cluster due to
> a ZK issue ?
>
> Thanks !
> Varun
>
> On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <va...@pinterest.com> wrote:
>
>> I am wondering how come a partition was in the online state for a
>> resource that was newly created.
>>
>> Thanks
>> Varun
>>
>> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>>> I am using 0.6.4. In this case, I created a resource and set its ideal
>>> state and the partitions onlined themselves. It seems for that node - it
>>> opened a whole bunch of other partitions at around the same time (~ 30 or
>>> so) but failed to open 3-4 partitions. This was for a brand new resource I
>>> created..
>>>
>>> THanks !
>>> Varun
>>>
>>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com> wrote:
>>>
>>>> One suggestion is to check for GC pauses on the nodes. Nodes loses the
>>>> cluster member ship if they get into long GC or starts flapping. That might
>>>> be cause for state mismatch. However, external view must be up to date. It
>>>> might help if you can attach the controller logs and node logs.
>>>>
>>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am seeing the following issue for many partitions in helix using a
>>>>> simple Online->Offline state model factory. The external view says that the
>>>>> partition has been assigned to 3 hosts. However, when I look at the hosts
>>>>> only 1 of them executed the OFFLINE --> ONLINE transition.
>>>>>
>>>>> On the hosts, that did not execute the transition, I see the following:
>>>>>
>>>>> 2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>>> (HelixStateTransitionHandler.java:206) WARN  *Force CurrentState on
>>>>> Zk to be stateModel's CurrentState*. *partitionKey: 490*,
>>>>> currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>>> READ_TIMESTAMP=1415870993787,
>>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>>>>> TO_STATE=ONLINE}{}{}
>>>>>
>>>>> When I grep the message ID in the controller, I see the following:
>>>>>
>>>>> 2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>>> (ZKPathDataDumpTask.java:155) INFO  {
>>>>>
>>>>>   "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>>
>>>>>   "mapFields" : {
>>>>>
>>>>>     "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
>>>>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>>
>>>>>       "AdditionalInfo" : "Message execution failed. msgId:
>>>>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>>>> org.apache.helix.messaging.handling.
>>>>> *HelixStateTransitionHandler$HelixStateMismatchException*: Current
>>>>> state of stateModel does not match the fromState in Message, Current
>>>>> State:ONLINE, message expected:OFFLINE, partition: 490, from:
>>>>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>>>
>>>>>       "Class" : "class
>>>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>>
>>>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>>
>>>>>       "Message state" : "READ"
>>>>>
>>>>>     },
>>>>>
>>>>>
>>>>> What could be causing this - when I restart the node, the error
>>>>> disappears (meaning that the node is able to perform the state transition).
>>>>> What could be causing this state mismatch ?
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Varun
>>>>>
>>>>
>>>>
>>>
>>
>

RE: Helix issue - External View out of sync

Posted by Zhen Zhang <zz...@linkedin.com>.

Agree. Making partition name unique across all resources will be a simple workaround. I have filed a jira for this:
https://issues.apache.org/jira/browse/HELIX-552

The fix will affect public interface (StateModelFactory), but we can make it backward compatible and deprecated the old methods.

Thanks,
Zhen

________________________________
From: kishore g [g.kishore@gmail.com]
Sent: Tuesday, November 18, 2014 4:12 PM
To: user@helix.apache.org
Subject: Re: Helix issue - External View out of sync

I think a simpler solution is to append the partitionName with resource name. The standard convention is resourceName_partitionId.

This is a stupid bug that was introduced in the original version.

The statemodelfactory has the createStateModel method that takes partition name. Internally helix maintains a mapping of partition name to statemodel. This assumes that the partition name is unique across all resources.

The fix is trivial but might involve change in the interface. Any suggestions?

thanks,
Kishore G



On Tue, Nov 18, 2014 at 3:35 PM, Zhen Zhang <zz...@linkedin.com>> wrote:
Hi Varun,

Factory name needs not to be resource name. You can probably do this:
1) register multiple state model factories using OnlineOffline state model and a set of predefined different factory names.
2) when you create a new resource, set the factory name in ideal state using a factory name that hasn't been used by current resources: IdealState#setStateModelFactoryName(factory-name)
3) when you decommission a resource, I guess you will remove the resource from ideal-state, so the factory name will be no longer used.

Thanks,
Zhen
________________________________
From: Varun Sharma [varun@pinterest.com<ma...@pinterest.com>]
Sent: Tuesday, November 18, 2014 3:16 PM
To: user@helix.apache.org<ma...@helix.apache.org>

Subject: Re: Helix issue - External View out of sync

Hmm, it seems that in my case, the resources are not known in advance and I need to decommision resources/create resources on the fly as data comes in/gets deleted. Is there a way around that ?

Thanks
Varun

On Tue, Nov 18, 2014 at 3:06 PM, Zhen Zhang <zz...@linkedin.com>> wrote:
Hi Varun,

Here is the problem. You are using ONLINE-OFFLINE state model for multiple resources, and in this case when you register state model factory, you need to use your resource name (e.g. $terrapin$data$meta_pin_join$1415866960201) as your factory name instead of using the default factory name (which is "DEFAULT"); sth. like this:

HelixManager#getStateMachineEngine#registerStateModelFactory("ONLINEOFFLINE", factory, "$terrapin$data$meta_pin_join$1415866960201")

Otherwise, Helix can't distinguish the state model factories for the two different resources using the same state model and the same factory name. To confirm, you should have the following message in your participant log:

WARN: "stateModelFactory for " + stateModelName + " using factoryName DEFAULT has already been registered."

Let us know if this solves the problem.

Thanks,
Zhen

________________________________
From: Varun Sharma [varun@pinterest.com<ma...@pinterest.com>]
Sent: Tuesday, November 18, 2014 12:59 PM

To: user@helix.apache.org<ma...@helix.apache.org>
Subject: Re: Helix issue - External View out of sync

I shared the logs with zhen using google drive..

On Tue, Nov 18, 2014 at 12:56 PM, kishore g <g....@gmail.com>> wrote:
Did you try dropbox or any other public file sharing service.

On Tue, Nov 18, 2014 at 10:57 AM, Varun Sharma <va...@pinterest.com>> wrote:
Hi Zhen,

My logs are > 10M and jira does not allow me to attach them. Also, gmail is not allowing me to send them over as it flags them as "blocked for security reasons" - link here<https://support.google.com/mail/answer/6590?hl=en> - Do you have any other options to send over the file. I create HELIX-551 for this issue.

Thanks
Varun

On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang <zz...@linkedin.com>> wrote:
Hi Varun, I missed the conversation on IRC. You could create a jira at:
https://issues.apache.org/jira/browse/HELIX

And attach the zk log in the jira. We will be able to figure it out.

Thanks,
Zhen

________________________________
From: Zhen Zhang [zzhang@linkedin.com<ma...@linkedin.com>]
Sent: Monday, November 17, 2014 5:16 PM
To: user@helix.apache.org<ma...@helix.apache.org>
Subject: RE: Helix issue - External View out of sync

Hi, Varun, you can join us on freenode IRC: http://helix.apache.org/IRC.html

Thanks,
Zhen

________________________________
From: Varun Sharma [varun@pinterest.com<ma...@pinterest.com>]
Sent: Monday, November 17, 2014 5:08 PM
To: user@helix.apache.org<ma...@helix.apache.org>
Subject: Re: Helix issue - External View out of sync

I looked at the logs and gc was fine as the system was processing other events around the same time.

Is there anything else specifically I shold look for in the logs ? Is there a way to find out whether a node was removed from the cluster due to a ZK issue ?

Thanks !
Varun

On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <va...@pinterest.com>> wrote:
I am wondering how come a partition was in the online state for a resource that was newly created.

Thanks
Varun

On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <va...@pinterest.com>> wrote:
I am using 0.6.4. In this case, I created a resource and set its ideal state and the partitions onlined themselves. It seems for that node - it opened a whole bunch of other partitions at around the same time (~ 30 or so) but failed to open 3-4 partitions. This was for a brand new resource I created..

THanks !
Varun

On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com>> wrote:
One suggestion is to check for GC pauses on the nodes. Nodes loses the cluster member ship if they get into long GC or starts flapping. That might be cause for state mismatch. However, external view must be up to date. It might help if you can attach the controller logs and node logs.

On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com>> wrote:
Hi,

I am seeing the following issue for many partitions in helix using a simple Online->Offline state model factory. The external view says that the partition has been assigned to 3 hosts. However, when I look at the hosts only 1 of them executed the OFFLINE --> ONLINE transition.

On the hosts, that did not execute the transition, I see the following:

2014-11-13 09:29:54,394 [pool-3-thread-11] (HelixStateTransitionHandler.java:206) WARN  Force CurrentState on Zk to be stateModel's CurrentState. partitionKey: 490, currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db, {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange, EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013, FROM_STATE=OFFLINE, MSG_ID=12690ce8-8098-46b1-a93d-279604f0e3db, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490, READ_TIMESTAMP=1415870993787, RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201, SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7, STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013, TO_STATE=ONLINE}{}{}

When I grep the message ID in the controller, I see the following:


2014-11-14 09:34:56,265 [StatusDumpTimerTask] (ZKPathDataDumpTask.java:155) INFO  {

  "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",

  "mapFields" : {

    "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION c1193025-b416-49d7-adc2-10afe2389141" : {

      "AdditionalInfo" : "Message execution failed. msgId: 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg: org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException: Current state of stateModel does not match the fromState in Message, Current State:ONLINE, message expected:OFFLINE, partition: 490, from: hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",

      "Class" : "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",

      "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",

      "Message state" : "READ"

    },


What could be causing this - when I restart the node, the error disappears (meaning that the node is able to perform the state transition). What could be causing this state mismatch ?


Thanks

Varun

Re: Helix issue - External View out of sync

Posted by kishore g <g....@gmail.com>.

I think a simpler solution is to append the partitionName with resource
name. The standard convention is resourceName_partitionId.

This is a stupid bug that was introduced in the original version.

The statemodelfactory has the createStateModel method that takes partition
name. Internally helix maintains a mapping of partition name to statemodel.
This assumes that the partition name is unique across all resources.

The fix is trivial but might involve change in the interface. Any
suggestions?

thanks,
Kishore G



On Tue, Nov 18, 2014 at 3:35 PM, Zhen Zhang <zz...@linkedin.com> wrote:

>  Hi Varun,
>
> Factory name needs not to be resource name. You can probably do this:
> 1) register multiple state model factories using OnlineOffline state model
> and a set of predefined different factory names.
> 2) when you create a new resource, set the factory name in ideal state
> using a factory name that hasn't been used by current resources:
> IdealState#setStateModelFactoryName(factory-name)
> 3) when you decommission a resource, I guess you will remove the resource
> from ideal-state, so the factory name will be no longer used.
>
> Thanks,
> Zhen
>  ------------------------------
> *From:* Varun Sharma [varun@pinterest.com]
> *Sent:* Tuesday, November 18, 2014 3:16 PM
> *To:* user@helix.apache.org
>
> *Subject:* Re: Helix issue - External View out of sync
>
>   Hmm, it seems that in my case, the resources are not known in advance
> and I need to decommision resources/create resources on the fly as data
> comes in/gets deleted. Is there a way around that ?
>
>  Thanks
> Varun
>
> On Tue, Nov 18, 2014 at 3:06 PM, Zhen Zhang <zz...@linkedin.com> wrote:
>
>>  Hi Varun,
>>
>> Here is the problem. You are using ONLINE-OFFLINE state model for
>> multiple resources, and in this case when you register state model factory,
>> you need to use your resource name (e.g.
>> $terrapin$data$meta_pin_join$1415866960201) as your factory name instead of
>> using the default factory name (which is "DEFAULT"); sth. like this:
>>
>> HelixManager#getStateMachineEngine#registerStateModelFactory("ONLINEOFFLINE",
>> factory, "$terrapin$data$meta_pin_join$1415866960201")
>>
>> Otherwise, Helix can't distinguish the state model factories for the two
>> different resources using the same state model and the same factory name.
>> To confirm, you should have the following message in your participant log:
>>
>> WARN: "stateModelFactory for " + stateModelName + " using factoryName
>> DEFAULT has already been registered."
>>
>> Let us know if this solves the problem.
>>
>> Thanks,
>> Zhen
>>
>>  ------------------------------
>> *From:* Varun Sharma [varun@pinterest.com]
>> *Sent:* Tuesday, November 18, 2014 12:59 PM
>>
>> *To:* user@helix.apache.org
>> *Subject:* Re: Helix issue - External View out of sync
>>
>>    I shared the logs with zhen using google drive..
>>
>> On Tue, Nov 18, 2014 at 12:56 PM, kishore g <g....@gmail.com> wrote:
>>
>>> Did you try dropbox or any other public file sharing service.
>>>
>>> On Tue, Nov 18, 2014 at 10:57 AM, Varun Sharma <va...@pinterest.com>
>>> wrote:
>>>
>>>> Hi Zhen,
>>>>
>>>>  My logs are > 10M and jira does not allow me to attach them. Also,
>>>> gmail is not allowing me to send them over as it flags them as "blocked for
>>>> security reasons" - link here
>>>> <https://support.google.com/mail/answer/6590?hl=en> - Do you have any
>>>> other options to send over the file. I create HELIX-551 for this issue.
>>>>
>>>>  Thanks
>>>>  Varun
>>>>
>>>> On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang <zz...@linkedin.com>
>>>> wrote:
>>>>
>>>>>  Hi Varun, I missed the conversation on IRC. You could create a jira
>>>>> at:
>>>>> https://issues.apache.org/jira/browse/HELIX
>>>>>
>>>>> And attach the zk log in the jira. We will be able to figure it out.
>>>>>
>>>>> Thanks,
>>>>> Zhen
>>>>>
>>>>>  ------------------------------
>>>>> *From:* Zhen Zhang [zzhang@linkedin.com]
>>>>> *Sent:* Monday, November 17, 2014 5:16 PM
>>>>> *To:* user@helix.apache.org
>>>>> *Subject:* RE: Helix issue - External View out of sync
>>>>>
>>>>>    Hi, Varun, you can join us on freenode IRC:
>>>>> http://helix.apache.org/IRC.html
>>>>>
>>>>> Thanks,
>>>>> Zhen
>>>>>
>>>>>  ------------------------------
>>>>> *From:* Varun Sharma [varun@pinterest.com]
>>>>> *Sent:* Monday, November 17, 2014 5:08 PM
>>>>> *To:* user@helix.apache.org
>>>>> *Subject:* Re: Helix issue - External View out of sync
>>>>>
>>>>>   I looked at the logs and gc was fine as the system was processing
>>>>> other events around the same time.
>>>>>
>>>>>  Is there anything else specifically I shold look for in the logs ?
>>>>> Is there a way to find out whether a node was removed from the cluster due
>>>>> to a ZK issue ?
>>>>>
>>>>>  Thanks !
>>>>> Varun
>>>>>
>>>>> On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <va...@pinterest.com>
>>>>> wrote:
>>>>>
>>>>>> I am wondering how come a partition was in the online state for a
>>>>>> resource that was newly created.
>>>>>>
>>>>>>  Thanks
>>>>>>  Varun
>>>>>>
>>>>>> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <va...@pinterest.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I am using 0.6.4. In this case, I created a resource and set its
>>>>>>> ideal state and the partitions onlined themselves. It seems for that node -
>>>>>>> it opened a whole bunch of other partitions at around the same time (~ 30
>>>>>>> or so) but failed to open 3-4 partitions. This was for a brand new resource
>>>>>>> I created..
>>>>>>>
>>>>>>>  THanks !
>>>>>>>  Varun
>>>>>>>
>>>>>>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> One suggestion is to check for GC pauses on the nodes. Nodes loses
>>>>>>>> the cluster member ship if they get into long GC or starts flapping. That
>>>>>>>> might be cause for state mismatch. However, external view must be up to
>>>>>>>> date. It might help if you can attach the controller logs and node logs.
>>>>>>>>
>>>>>>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>  I am seeing the following issue for many partitions in helix
>>>>>>>>> using a simple Online->Offline state model factory. The external view says
>>>>>>>>> that the partition has been assigned to 3 hosts. However, when I look at
>>>>>>>>> the hosts only 1 of them executed the OFFLINE --> ONLINE transition.
>>>>>>>>>
>>>>>>>>>  On the hosts, that did not execute the transition, I see the
>>>>>>>>> following:
>>>>>>>>>
>>>>>>>>>  2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>>>>>>> (HelixStateTransitionHandler.java:206) WARN  *Force CurrentState
>>>>>>>>> on Zk to be stateModel's CurrentState*. *partitionKey: 490*,
>>>>>>>>> currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>>>>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>>>>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>>>>>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>>>>>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>>>>>>> READ_TIMESTAMP=1415870993787,
>>>>>>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>>>>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>>>>>>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>>>>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>>>>>>>>> TO_STATE=ONLINE}{}{}
>>>>>>>>>
>>>>>>>>>  When I grep the message ID in the controller, I see the
>>>>>>>>> following:
>>>>>>>>>
>>>>>>>>>  2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>>>>>>> (ZKPathDataDumpTask.java:155) INFO  {
>>>>>>>>>
>>>>>>>>>   "id" :
>>>>>>>>> "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>>>>>>
>>>>>>>>>   "mapFields" : {
>>>>>>>>>
>>>>>>>>>     "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
>>>>>>>>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>>>>>>
>>>>>>>>>       "AdditionalInfo" : "Message execution failed. msgId:
>>>>>>>>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>>>>>>>> org.apache.helix.messaging.handling.
>>>>>>>>> *HelixStateTransitionHandler$HelixStateMismatchException*:
>>>>>>>>> Current state of stateModel does not match the fromState in Message,
>>>>>>>>> Current State:ONLINE, message expected:OFFLINE, partition: 490, from:
>>>>>>>>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>>>>>>>
>>>>>>>>>       "Class" : "class
>>>>>>>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>>>>>>
>>>>>>>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>>>>>>
>>>>>>>>>       "Message state" : "READ"
>>>>>>>>>
>>>>>>>>>     },
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  What could be causing this - when I restart the node, the error
>>>>>>>>> disappears (meaning that the node is able to perform the state transition).
>>>>>>>>> What could be causing this state mismatch ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Thanks
>>>>>>>>>
>>>>>>>>> Varun
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

RE: Helix issue - External View out of sync

Posted by Zhen Zhang <zz...@linkedin.com>.

Hi Varun,

Factory name needs not to be resource name. You can probably do this:
1) register multiple state model factories using OnlineOffline state model and a set of predefined different factory names.
2) when you create a new resource, set the factory name in ideal state using a factory name that hasn't been used by current resources: IdealState#setStateModelFactoryName(factory-name)
3) when you decommission a resource, I guess you will remove the resource from ideal-state, so the factory name will be no longer used.

Thanks,
Zhen
________________________________
From: Varun Sharma [varun@pinterest.com]
Sent: Tuesday, November 18, 2014 3:16 PM
To: user@helix.apache.org
Subject: Re: Helix issue - External View out of sync

Hmm, it seems that in my case, the resources are not known in advance and I need to decommision resources/create resources on the fly as data comes in/gets deleted. Is there a way around that ?

Thanks
Varun

On Tue, Nov 18, 2014 at 3:06 PM, Zhen Zhang <zz...@linkedin.com>> wrote:
Hi Varun,

Here is the problem. You are using ONLINE-OFFLINE state model for multiple resources, and in this case when you register state model factory, you need to use your resource name (e.g. $terrapin$data$meta_pin_join$1415866960201) as your factory name instead of using the default factory name (which is "DEFAULT"); sth. like this:

HelixManager#getStateMachineEngine#registerStateModelFactory("ONLINEOFFLINE", factory, "$terrapin$data$meta_pin_join$1415866960201")

Otherwise, Helix can't distinguish the state model factories for the two different resources using the same state model and the same factory name. To confirm, you should have the following message in your participant log:

WARN: "stateModelFactory for " + stateModelName + " using factoryName DEFAULT has already been registered."

Let us know if this solves the problem.

Thanks,
Zhen

________________________________
From: Varun Sharma [varun@pinterest.com<ma...@pinterest.com>]
Sent: Tuesday, November 18, 2014 12:59 PM

To: user@helix.apache.org<ma...@helix.apache.org>
Subject: Re: Helix issue - External View out of sync

I shared the logs with zhen using google drive..

On Tue, Nov 18, 2014 at 12:56 PM, kishore g <g....@gmail.com>> wrote:
Did you try dropbox or any other public file sharing service.

On Tue, Nov 18, 2014 at 10:57 AM, Varun Sharma <va...@pinterest.com>> wrote:
Hi Zhen,

My logs are > 10M and jira does not allow me to attach them. Also, gmail is not allowing me to send them over as it flags them as "blocked for security reasons" - link here<https://support.google.com/mail/answer/6590?hl=en> - Do you have any other options to send over the file. I create HELIX-551 for this issue.

Thanks
Varun

On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang <zz...@linkedin.com>> wrote:
Hi Varun, I missed the conversation on IRC. You could create a jira at:
https://issues.apache.org/jira/browse/HELIX

And attach the zk log in the jira. We will be able to figure it out.

Thanks,
Zhen

________________________________
From: Zhen Zhang [zzhang@linkedin.com<ma...@linkedin.com>]
Sent: Monday, November 17, 2014 5:16 PM
To: user@helix.apache.org<ma...@helix.apache.org>
Subject: RE: Helix issue - External View out of sync

Hi, Varun, you can join us on freenode IRC: http://helix.apache.org/IRC.html

Thanks,
Zhen

________________________________
From: Varun Sharma [varun@pinterest.com<ma...@pinterest.com>]
Sent: Monday, November 17, 2014 5:08 PM
To: user@helix.apache.org<ma...@helix.apache.org>
Subject: Re: Helix issue - External View out of sync

I looked at the logs and gc was fine as the system was processing other events around the same time.

Is there anything else specifically I shold look for in the logs ? Is there a way to find out whether a node was removed from the cluster due to a ZK issue ?

Thanks !
Varun

On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <va...@pinterest.com>> wrote:
I am wondering how come a partition was in the online state for a resource that was newly created.

Thanks
Varun

On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <va...@pinterest.com>> wrote:
I am using 0.6.4. In this case, I created a resource and set its ideal state and the partitions onlined themselves. It seems for that node - it opened a whole bunch of other partitions at around the same time (~ 30 or so) but failed to open 3-4 partitions. This was for a brand new resource I created..

THanks !
Varun

On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com>> wrote:
One suggestion is to check for GC pauses on the nodes. Nodes loses the cluster member ship if they get into long GC or starts flapping. That might be cause for state mismatch. However, external view must be up to date. It might help if you can attach the controller logs and node logs.

On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com>> wrote:
Hi,

I am seeing the following issue for many partitions in helix using a simple Online->Offline state model factory. The external view says that the partition has been assigned to 3 hosts. However, when I look at the hosts only 1 of them executed the OFFLINE --> ONLINE transition.

On the hosts, that did not execute the transition, I see the following:

2014-11-13 09:29:54,394 [pool-3-thread-11] (HelixStateTransitionHandler.java:206) WARN  Force CurrentState on Zk to be stateModel's CurrentState. partitionKey: 490, currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db, {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange, EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013, FROM_STATE=OFFLINE, MSG_ID=12690ce8-8098-46b1-a93d-279604f0e3db, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490, READ_TIMESTAMP=1415870993787, RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201, SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7, STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013, TO_STATE=ONLINE}{}{}

When I grep the message ID in the controller, I see the following:


2014-11-14 09:34:56,265 [StatusDumpTimerTask] (ZKPathDataDumpTask.java:155) INFO  {

  "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",

  "mapFields" : {

    "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION c1193025-b416-49d7-adc2-10afe2389141" : {

      "AdditionalInfo" : "Message execution failed. msgId: 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg: org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException: Current state of stateModel does not match the fromState in Message, Current State:ONLINE, message expected:OFFLINE, partition: 490, from: hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",

      "Class" : "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",

      "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",

      "Message state" : "READ"

    },


What could be causing this - when I restart the node, the error disappears (meaning that the node is able to perform the state transition). What could be causing this state mismatch ?


Thanks

Varun

Re: Helix issue - External View out of sync

Posted by Varun Sharma <va...@pinterest.com>.

Hmm, it seems that in my case, the resources are not known in advance and I
need to decommision resources/create resources on the fly as data comes
in/gets deleted. Is there a way around that ?

Thanks
Varun

On Tue, Nov 18, 2014 at 3:06 PM, Zhen Zhang <zz...@linkedin.com> wrote:

>  Hi Varun,
>
> Here is the problem. You are using ONLINE-OFFLINE state model for multiple
> resources, and in this case when you register state model factory, you need
> to use your resource name (e.g. $terrapin$data$meta_pin_join$1415866960201)
> as your factory name instead of using the default factory name (which is
> "DEFAULT"); sth. like this:
>
> HelixManager#getStateMachineEngine#registerStateModelFactory("ONLINEOFFLINE",
> factory, "$terrapin$data$meta_pin_join$1415866960201")
>
> Otherwise, Helix can't distinguish the state model factories for the two
> different resources using the same state model and the same factory name.
> To confirm, you should have the following message in your participant log:
>
> WARN: "stateModelFactory for " + stateModelName + " using factoryName
> DEFAULT has already been registered."
>
> Let us know if this solves the problem.
>
> Thanks,
> Zhen
>
>  ------------------------------
> *From:* Varun Sharma [varun@pinterest.com]
> *Sent:* Tuesday, November 18, 2014 12:59 PM
>
> *To:* user@helix.apache.org
> *Subject:* Re: Helix issue - External View out of sync
>
>   I shared the logs with zhen using google drive..
>
> On Tue, Nov 18, 2014 at 12:56 PM, kishore g <g....@gmail.com> wrote:
>
>> Did you try dropbox or any other public file sharing service.
>>
>> On Tue, Nov 18, 2014 at 10:57 AM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>>> Hi Zhen,
>>>
>>>  My logs are > 10M and jira does not allow me to attach them. Also,
>>> gmail is not allowing me to send them over as it flags them as "blocked for
>>> security reasons" - link here
>>> <https://support.google.com/mail/answer/6590?hl=en> - Do you have any
>>> other options to send over the file. I create HELIX-551 for this issue.
>>>
>>>  Thanks
>>>  Varun
>>>
>>> On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang <zz...@linkedin.com> wrote:
>>>
>>>>  Hi Varun, I missed the conversation on IRC. You could create a jira
>>>> at:
>>>> https://issues.apache.org/jira/browse/HELIX
>>>>
>>>> And attach the zk log in the jira. We will be able to figure it out.
>>>>
>>>> Thanks,
>>>> Zhen
>>>>
>>>>  ------------------------------
>>>> *From:* Zhen Zhang [zzhang@linkedin.com]
>>>> *Sent:* Monday, November 17, 2014 5:16 PM
>>>> *To:* user@helix.apache.org
>>>> *Subject:* RE: Helix issue - External View out of sync
>>>>
>>>>    Hi, Varun, you can join us on freenode IRC:
>>>> http://helix.apache.org/IRC.html
>>>>
>>>> Thanks,
>>>> Zhen
>>>>
>>>>  ------------------------------
>>>> *From:* Varun Sharma [varun@pinterest.com]
>>>> *Sent:* Monday, November 17, 2014 5:08 PM
>>>> *To:* user@helix.apache.org
>>>> *Subject:* Re: Helix issue - External View out of sync
>>>>
>>>>   I looked at the logs and gc was fine as the system was processing
>>>> other events around the same time.
>>>>
>>>>  Is there anything else specifically I shold look for in the logs ? Is
>>>> there a way to find out whether a node was removed from the cluster due to
>>>> a ZK issue ?
>>>>
>>>>  Thanks !
>>>> Varun
>>>>
>>>> On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <va...@pinterest.com>
>>>> wrote:
>>>>
>>>>> I am wondering how come a partition was in the online state for a
>>>>> resource that was newly created.
>>>>>
>>>>>  Thanks
>>>>>  Varun
>>>>>
>>>>> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <va...@pinterest.com>
>>>>> wrote:
>>>>>
>>>>>> I am using 0.6.4. In this case, I created a resource and set its
>>>>>> ideal state and the partitions onlined themselves. It seems for that node -
>>>>>> it opened a whole bunch of other partitions at around the same time (~ 30
>>>>>> or so) but failed to open 3-4 partitions. This was for a brand new resource
>>>>>> I created..
>>>>>>
>>>>>>  THanks !
>>>>>>  Varun
>>>>>>
>>>>>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> One suggestion is to check for GC pauses on the nodes. Nodes loses
>>>>>>> the cluster member ship if they get into long GC or starts flapping. That
>>>>>>> might be cause for state mismatch. However, external view must be up to
>>>>>>> date. It might help if you can attach the controller logs and node logs.
>>>>>>>
>>>>>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>  I am seeing the following issue for many partitions in helix
>>>>>>>> using a simple Online->Offline state model factory. The external view says
>>>>>>>> that the partition has been assigned to 3 hosts. However, when I look at
>>>>>>>> the hosts only 1 of them executed the OFFLINE --> ONLINE transition.
>>>>>>>>
>>>>>>>>  On the hosts, that did not execute the transition, I see the
>>>>>>>> following:
>>>>>>>>
>>>>>>>>  2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>>>>>> (HelixStateTransitionHandler.java:206) WARN  *Force CurrentState
>>>>>>>> on Zk to be stateModel's CurrentState*. *partitionKey: 490*,
>>>>>>>> currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>>>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>>>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>>>>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>>>>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>>>>>> READ_TIMESTAMP=1415870993787,
>>>>>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>>>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>>>>>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>>>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>>>>>>>> TO_STATE=ONLINE}{}{}
>>>>>>>>
>>>>>>>>  When I grep the message ID in the controller, I see the following:
>>>>>>>>
>>>>>>>>  2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>>>>>> (ZKPathDataDumpTask.java:155) INFO  {
>>>>>>>>
>>>>>>>>   "id" :
>>>>>>>> "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>>>>>
>>>>>>>>   "mapFields" : {
>>>>>>>>
>>>>>>>>     "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
>>>>>>>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>>>>>
>>>>>>>>       "AdditionalInfo" : "Message execution failed. msgId:
>>>>>>>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>>>>>>> org.apache.helix.messaging.handling.
>>>>>>>> *HelixStateTransitionHandler$HelixStateMismatchException*: Current
>>>>>>>> state of stateModel does not match the fromState in Message, Current
>>>>>>>> State:ONLINE, message expected:OFFLINE, partition: 490, from:
>>>>>>>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>>>>>>
>>>>>>>>       "Class" : "class
>>>>>>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>>>>>
>>>>>>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>>>>>
>>>>>>>>       "Message state" : "READ"
>>>>>>>>
>>>>>>>>     },
>>>>>>>>
>>>>>>>>
>>>>>>>>  What could be causing this - when I restart the node, the error
>>>>>>>> disappears (meaning that the node is able to perform the state transition).
>>>>>>>> What could be causing this state mismatch ?
>>>>>>>>
>>>>>>>>
>>>>>>>>  Thanks
>>>>>>>>
>>>>>>>> Varun
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

RE: Helix issue - External View out of sync

Posted by Zhen Zhang <zz...@linkedin.com>.

Hi Varun,

Here is the problem. You are using ONLINE-OFFLINE state model for multiple resources, and in this case when you register state model factory, you need to use your resource name (e.g. $terrapin$data$meta_pin_join$1415866960201) as your factory name instead of using the default factory name (which is "DEFAULT"); sth. like this:

HelixManager#getStateMachineEngine#registerStateModelFactory("ONLINEOFFLINE", factory, "$terrapin$data$meta_pin_join$1415866960201")

Otherwise, Helix can't distinguish the state model factories for the two different resources using the same state model and the same factory name. To confirm, you should have the following message in your participant log:

WARN: "stateModelFactory for " + stateModelName + " using factoryName DEFAULT has already been registered."

Let us know if this solves the problem.

Thanks,
Zhen

________________________________
From: Varun Sharma [varun@pinterest.com]
Sent: Tuesday, November 18, 2014 12:59 PM
To: user@helix.apache.org
Subject: Re: Helix issue - External View out of sync

I shared the logs with zhen using google drive..

On Tue, Nov 18, 2014 at 12:56 PM, kishore g <g....@gmail.com>> wrote:
Did you try dropbox or any other public file sharing service.

On Tue, Nov 18, 2014 at 10:57 AM, Varun Sharma <va...@pinterest.com>> wrote:
Hi Zhen,

My logs are > 10M and jira does not allow me to attach them. Also, gmail is not allowing me to send them over as it flags them as "blocked for security reasons" - link here<https://support.google.com/mail/answer/6590?hl=en> - Do you have any other options to send over the file. I create HELIX-551 for this issue.

Thanks
Varun

On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang <zz...@linkedin.com>> wrote:
Hi Varun, I missed the conversation on IRC. You could create a jira at:
https://issues.apache.org/jira/browse/HELIX

And attach the zk log in the jira. We will be able to figure it out.

Thanks,
Zhen

________________________________
From: Zhen Zhang [zzhang@linkedin.com<ma...@linkedin.com>]
Sent: Monday, November 17, 2014 5:16 PM
To: user@helix.apache.org<ma...@helix.apache.org>
Subject: RE: Helix issue - External View out of sync

Hi, Varun, you can join us on freenode IRC: http://helix.apache.org/IRC.html

Thanks,
Zhen

________________________________
From: Varun Sharma [varun@pinterest.com<ma...@pinterest.com>]
Sent: Monday, November 17, 2014 5:08 PM
To: user@helix.apache.org<ma...@helix.apache.org>
Subject: Re: Helix issue - External View out of sync

I looked at the logs and gc was fine as the system was processing other events around the same time.

Is there anything else specifically I shold look for in the logs ? Is there a way to find out whether a node was removed from the cluster due to a ZK issue ?

Thanks !
Varun

On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <va...@pinterest.com>> wrote:
I am wondering how come a partition was in the online state for a resource that was newly created.

Thanks
Varun

On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <va...@pinterest.com>> wrote:
I am using 0.6.4. In this case, I created a resource and set its ideal state and the partitions onlined themselves. It seems for that node - it opened a whole bunch of other partitions at around the same time (~ 30 or so) but failed to open 3-4 partitions. This was for a brand new resource I created..

THanks !
Varun

On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com>> wrote:
One suggestion is to check for GC pauses on the nodes. Nodes loses the cluster member ship if they get into long GC or starts flapping. That might be cause for state mismatch. However, external view must be up to date. It might help if you can attach the controller logs and node logs.

On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com>> wrote:
Hi,

I am seeing the following issue for many partitions in helix using a simple Online->Offline state model factory. The external view says that the partition has been assigned to 3 hosts. However, when I look at the hosts only 1 of them executed the OFFLINE --> ONLINE transition.

On the hosts, that did not execute the transition, I see the following:

2014-11-13 09:29:54,394 [pool-3-thread-11] (HelixStateTransitionHandler.java:206) WARN  Force CurrentState on Zk to be stateModel's CurrentState. partitionKey: 490, currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db, {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange, EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013, FROM_STATE=OFFLINE, MSG_ID=12690ce8-8098-46b1-a93d-279604f0e3db, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490, READ_TIMESTAMP=1415870993787, RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201, SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7, STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013, TO_STATE=ONLINE}{}{}

When I grep the message ID in the controller, I see the following:


2014-11-14 09:34:56,265 [StatusDumpTimerTask] (ZKPathDataDumpTask.java:155) INFO  {

  "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",

  "mapFields" : {

    "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION c1193025-b416-49d7-adc2-10afe2389141" : {

      "AdditionalInfo" : "Message execution failed. msgId: 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg: org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException: Current state of stateModel does not match the fromState in Message, Current State:ONLINE, message expected:OFFLINE, partition: 490, from: hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",

      "Class" : "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",

      "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",

      "Message state" : "READ"

    },


What could be causing this - when I restart the node, the error disappears (meaning that the node is able to perform the state transition). What could be causing this state mismatch ?


Thanks

Varun

Re: Helix issue - External View out of sync

Posted by Varun Sharma <va...@pinterest.com>.

I shared the logs with zhen using google drive..

On Tue, Nov 18, 2014 at 12:56 PM, kishore g <g....@gmail.com> wrote:

> Did you try dropbox or any other public file sharing service.
>
> On Tue, Nov 18, 2014 at 10:57 AM, Varun Sharma <va...@pinterest.com>
> wrote:
>
>> Hi Zhen,
>>
>> My logs are > 10M and jira does not allow me to attach them. Also, gmail
>> is not allowing me to send them over as it flags them as "blocked for
>> security reasons" - link here
>> <https://support.google.com/mail/answer/6590?hl=en> - Do you have any
>> other options to send over the file. I create HELIX-551 for this issue.
>>
>> Thanks
>> Varun
>>
>> On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang <zz...@linkedin.com> wrote:
>>
>>>  Hi Varun, I missed the conversation on IRC. You could create a jira at:
>>> https://issues.apache.org/jira/browse/HELIX
>>>
>>> And attach the zk log in the jira. We will be able to figure it out.
>>>
>>> Thanks,
>>> Zhen
>>>
>>>  ------------------------------
>>> *From:* Zhen Zhang [zzhang@linkedin.com]
>>> *Sent:* Monday, November 17, 2014 5:16 PM
>>> *To:* user@helix.apache.org
>>> *Subject:* RE: Helix issue - External View out of sync
>>>
>>>   Hi, Varun, you can join us on freenode IRC:
>>> http://helix.apache.org/IRC.html
>>>
>>> Thanks,
>>> Zhen
>>>
>>>  ------------------------------
>>> *From:* Varun Sharma [varun@pinterest.com]
>>> *Sent:* Monday, November 17, 2014 5:08 PM
>>> *To:* user@helix.apache.org
>>> *Subject:* Re: Helix issue - External View out of sync
>>>
>>>   I looked at the logs and gc was fine as the system was processing
>>> other events around the same time.
>>>
>>>  Is there anything else specifically I shold look for in the logs ? Is
>>> there a way to find out whether a node was removed from the cluster due to
>>> a ZK issue ?
>>>
>>>  Thanks !
>>> Varun
>>>
>>> On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <va...@pinterest.com>
>>> wrote:
>>>
>>>> I am wondering how come a partition was in the online state for a
>>>> resource that was newly created.
>>>>
>>>>  Thanks
>>>>  Varun
>>>>
>>>> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <va...@pinterest.com>
>>>> wrote:
>>>>
>>>>> I am using 0.6.4. In this case, I created a resource and set its ideal
>>>>> state and the partitions onlined themselves. It seems for that node - it
>>>>> opened a whole bunch of other partitions at around the same time (~ 30 or
>>>>> so) but failed to open 3-4 partitions. This was for a brand new resource I
>>>>> created..
>>>>>
>>>>>  THanks !
>>>>>  Varun
>>>>>
>>>>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> One suggestion is to check for GC pauses on the nodes. Nodes loses
>>>>>> the cluster member ship if they get into long GC or starts flapping. That
>>>>>> might be cause for state mismatch. However, external view must be up to
>>>>>> date. It might help if you can attach the controller logs and node logs.
>>>>>>
>>>>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>  I am seeing the following issue for many partitions in helix using
>>>>>>> a simple Online->Offline state model factory. The external view says that
>>>>>>> the partition has been assigned to 3 hosts. However, when I look at the
>>>>>>> hosts only 1 of them executed the OFFLINE --> ONLINE transition.
>>>>>>>
>>>>>>>  On the hosts, that did not execute the transition, I see the
>>>>>>> following:
>>>>>>>
>>>>>>>  2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>>>>> (HelixStateTransitionHandler.java:206) WARN  *Force CurrentState on
>>>>>>> Zk to be stateModel's CurrentState*. *partitionKey: 490*,
>>>>>>> currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>>>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>>>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>>>>> READ_TIMESTAMP=1415870993787,
>>>>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>>>>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>>>>>>> TO_STATE=ONLINE}{}{}
>>>>>>>
>>>>>>>  When I grep the message ID in the controller, I see the following:
>>>>>>>
>>>>>>>  2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>>>>> (ZKPathDataDumpTask.java:155) INFO  {
>>>>>>>
>>>>>>>   "id" :
>>>>>>> "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>>>>
>>>>>>>   "mapFields" : {
>>>>>>>
>>>>>>>     "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
>>>>>>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>>>>
>>>>>>>       "AdditionalInfo" : "Message execution failed. msgId:
>>>>>>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>>>>>> org.apache.helix.messaging.handling.
>>>>>>> *HelixStateTransitionHandler$HelixStateMismatchException*: Current
>>>>>>> state of stateModel does not match the fromState in Message, Current
>>>>>>> State:ONLINE, message expected:OFFLINE, partition: 490, from:
>>>>>>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>>>>>
>>>>>>>       "Class" : "class
>>>>>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>>>>
>>>>>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>>>>
>>>>>>>       "Message state" : "READ"
>>>>>>>
>>>>>>>     },
>>>>>>>
>>>>>>>
>>>>>>>  What could be causing this - when I restart the node, the error
>>>>>>> disappears (meaning that the node is able to perform the state transition).
>>>>>>> What could be causing this state mismatch ?
>>>>>>>
>>>>>>>
>>>>>>>  Thanks
>>>>>>>
>>>>>>> Varun
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Helix issue - External View out of sync

Posted by kishore g <g....@gmail.com>.

Did you try dropbox or any other public file sharing service.

On Tue, Nov 18, 2014 at 10:57 AM, Varun Sharma <va...@pinterest.com> wrote:

> Hi Zhen,
>
> My logs are > 10M and jira does not allow me to attach them. Also, gmail
> is not allowing me to send them over as it flags them as "blocked for
> security reasons" - link here
> <https://support.google.com/mail/answer/6590?hl=en> - Do you have any
> other options to send over the file. I create HELIX-551 for this issue.
>
> Thanks
> Varun
>
> On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang <zz...@linkedin.com> wrote:
>
>>  Hi Varun, I missed the conversation on IRC. You could create a jira at:
>> https://issues.apache.org/jira/browse/HELIX
>>
>> And attach the zk log in the jira. We will be able to figure it out.
>>
>> Thanks,
>> Zhen
>>
>>  ------------------------------
>> *From:* Zhen Zhang [zzhang@linkedin.com]
>> *Sent:* Monday, November 17, 2014 5:16 PM
>> *To:* user@helix.apache.org
>> *Subject:* RE: Helix issue - External View out of sync
>>
>>   Hi, Varun, you can join us on freenode IRC:
>> http://helix.apache.org/IRC.html
>>
>> Thanks,
>> Zhen
>>
>>  ------------------------------
>> *From:* Varun Sharma [varun@pinterest.com]
>> *Sent:* Monday, November 17, 2014 5:08 PM
>> *To:* user@helix.apache.org
>> *Subject:* Re: Helix issue - External View out of sync
>>
>>   I looked at the logs and gc was fine as the system was processing
>> other events around the same time.
>>
>>  Is there anything else specifically I shold look for in the logs ? Is
>> there a way to find out whether a node was removed from the cluster due to
>> a ZK issue ?
>>
>>  Thanks !
>> Varun
>>
>> On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>>> I am wondering how come a partition was in the online state for a
>>> resource that was newly created.
>>>
>>>  Thanks
>>>  Varun
>>>
>>> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <va...@pinterest.com>
>>> wrote:
>>>
>>>> I am using 0.6.4. In this case, I created a resource and set its ideal
>>>> state and the partitions onlined themselves. It seems for that node - it
>>>> opened a whole bunch of other partitions at around the same time (~ 30 or
>>>> so) but failed to open 3-4 partitions. This was for a brand new resource I
>>>> created..
>>>>
>>>>  THanks !
>>>>  Varun
>>>>
>>>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com> wrote:
>>>>
>>>>> One suggestion is to check for GC pauses on the nodes. Nodes loses the
>>>>> cluster member ship if they get into long GC or starts flapping. That might
>>>>> be cause for state mismatch. However, external view must be up to date. It
>>>>> might help if you can attach the controller logs and node logs.
>>>>>
>>>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>  I am seeing the following issue for many partitions in helix using
>>>>>> a simple Online->Offline state model factory. The external view says that
>>>>>> the partition has been assigned to 3 hosts. However, when I look at the
>>>>>> hosts only 1 of them executed the OFFLINE --> ONLINE transition.
>>>>>>
>>>>>>  On the hosts, that did not execute the transition, I see the
>>>>>> following:
>>>>>>
>>>>>>  2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>>>> (HelixStateTransitionHandler.java:206) WARN  *Force CurrentState on
>>>>>> Zk to be stateModel's CurrentState*. *partitionKey: 490*,
>>>>>> currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>>>> READ_TIMESTAMP=1415870993787,
>>>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>>>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>>>>>> TO_STATE=ONLINE}{}{}
>>>>>>
>>>>>>  When I grep the message ID in the controller, I see the following:
>>>>>>
>>>>>>  2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>>>> (ZKPathDataDumpTask.java:155) INFO  {
>>>>>>
>>>>>>   "id" :
>>>>>> "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>>>
>>>>>>   "mapFields" : {
>>>>>>
>>>>>>     "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
>>>>>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>>>
>>>>>>       "AdditionalInfo" : "Message execution failed. msgId:
>>>>>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>>>>> org.apache.helix.messaging.handling.
>>>>>> *HelixStateTransitionHandler$HelixStateMismatchException*: Current
>>>>>> state of stateModel does not match the fromState in Message, Current
>>>>>> State:ONLINE, message expected:OFFLINE, partition: 490, from:
>>>>>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>>>>
>>>>>>       "Class" : "class
>>>>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>>>
>>>>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>>>
>>>>>>       "Message state" : "READ"
>>>>>>
>>>>>>     },
>>>>>>
>>>>>>
>>>>>>  What could be causing this - when I restart the node, the error
>>>>>> disappears (meaning that the node is able to perform the state transition).
>>>>>> What could be causing this state mismatch ?
>>>>>>
>>>>>>
>>>>>>  Thanks
>>>>>>
>>>>>> Varun
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Helix issue - External View out of sync

Posted by Varun Sharma <va...@pinterest.com>.

Hi Zhen,

My logs are > 10M and jira does not allow me to attach them. Also, gmail is
not allowing me to send them over as it flags them as "blocked for security
reasons" - link here <https://support.google.com/mail/answer/6590?hl=en> -
Do you have any other options to send over the file. I create HELIX-551 for
this issue.

Thanks
Varun

On Mon, Nov 17, 2014 at 6:49 PM, Zhen Zhang <zz...@linkedin.com> wrote:

>  Hi Varun, I missed the conversation on IRC. You could create a jira at:
> https://issues.apache.org/jira/browse/HELIX
>
> And attach the zk log in the jira. We will be able to figure it out.
>
> Thanks,
> Zhen
>
>  ------------------------------
> *From:* Zhen Zhang [zzhang@linkedin.com]
> *Sent:* Monday, November 17, 2014 5:16 PM
> *To:* user@helix.apache.org
> *Subject:* RE: Helix issue - External View out of sync
>
>   Hi, Varun, you can join us on freenode IRC:
> http://helix.apache.org/IRC.html
>
> Thanks,
> Zhen
>
>  ------------------------------
> *From:* Varun Sharma [varun@pinterest.com]
> *Sent:* Monday, November 17, 2014 5:08 PM
> *To:* user@helix.apache.org
> *Subject:* Re: Helix issue - External View out of sync
>
>   I looked at the logs and gc was fine as the system was processing other
> events around the same time.
>
>  Is there anything else specifically I shold look for in the logs ? Is
> there a way to find out whether a node was removed from the cluster due to
> a ZK issue ?
>
>  Thanks !
> Varun
>
> On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <va...@pinterest.com> wrote:
>
>> I am wondering how come a partition was in the online state for a
>> resource that was newly created.
>>
>>  Thanks
>>  Varun
>>
>> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>>> I am using 0.6.4. In this case, I created a resource and set its ideal
>>> state and the partitions onlined themselves. It seems for that node - it
>>> opened a whole bunch of other partitions at around the same time (~ 30 or
>>> so) but failed to open 3-4 partitions. This was for a brand new resource I
>>> created..
>>>
>>>  THanks !
>>>  Varun
>>>
>>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com> wrote:
>>>
>>>> One suggestion is to check for GC pauses on the nodes. Nodes loses the
>>>> cluster member ship if they get into long GC or starts flapping. That might
>>>> be cause for state mismatch. However, external view must be up to date. It
>>>> might help if you can attach the controller logs and node logs.
>>>>
>>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>  I am seeing the following issue for many partitions in helix using a
>>>>> simple Online->Offline state model factory. The external view says that the
>>>>> partition has been assigned to 3 hosts. However, when I look at the hosts
>>>>> only 1 of them executed the OFFLINE --> ONLINE transition.
>>>>>
>>>>>  On the hosts, that did not execute the transition, I see the
>>>>> following:
>>>>>
>>>>>  2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>>> (HelixStateTransitionHandler.java:206) WARN  *Force CurrentState on
>>>>> Zk to be stateModel's CurrentState*. *partitionKey: 490*,
>>>>> currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>>> READ_TIMESTAMP=1415870993787,
>>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>>>>> TO_STATE=ONLINE}{}{}
>>>>>
>>>>>  When I grep the message ID in the controller, I see the following:
>>>>>
>>>>>  2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>>> (ZKPathDataDumpTask.java:155) INFO  {
>>>>>
>>>>>   "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>>
>>>>>   "mapFields" : {
>>>>>
>>>>>     "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
>>>>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>>
>>>>>       "AdditionalInfo" : "Message execution failed. msgId:
>>>>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>>>> org.apache.helix.messaging.handling.
>>>>> *HelixStateTransitionHandler$HelixStateMismatchException*: Current
>>>>> state of stateModel does not match the fromState in Message, Current
>>>>> State:ONLINE, message expected:OFFLINE, partition: 490, from:
>>>>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>>>
>>>>>       "Class" : "class
>>>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>>
>>>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>>
>>>>>       "Message state" : "READ"
>>>>>
>>>>>     },
>>>>>
>>>>>
>>>>>  What could be causing this - when I restart the node, the error
>>>>> disappears (meaning that the node is able to perform the state transition).
>>>>> What could be causing this state mismatch ?
>>>>>
>>>>>
>>>>>  Thanks
>>>>>
>>>>> Varun
>>>>>
>>>>
>>>>
>>>
>>
>

RE: Helix issue - External View out of sync

Posted by Zhen Zhang <zz...@linkedin.com>.

Hi Varun, I missed the conversation on IRC. You could create a jira at:
https://issues.apache.org/jira/browse/HELIX

And attach the zk log in the jira. We will be able to figure it out.

Thanks,
Zhen

________________________________
From: Zhen Zhang [zzhang@linkedin.com]
Sent: Monday, November 17, 2014 5:16 PM
To: user@helix.apache.org
Subject: RE: Helix issue - External View out of sync

Hi, Varun, you can join us on freenode IRC: http://helix.apache.org/IRC.html

Thanks,
Zhen

________________________________
From: Varun Sharma [varun@pinterest.com]
Sent: Monday, November 17, 2014 5:08 PM
To: user@helix.apache.org
Subject: Re: Helix issue - External View out of sync

I looked at the logs and gc was fine as the system was processing other events around the same time.

Is there anything else specifically I shold look for in the logs ? Is there a way to find out whether a node was removed from the cluster due to a ZK issue ?

Thanks !
Varun

On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <va...@pinterest.com>> wrote:
I am wondering how come a partition was in the online state for a resource that was newly created.

Thanks
Varun

On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <va...@pinterest.com>> wrote:
I am using 0.6.4. In this case, I created a resource and set its ideal state and the partitions onlined themselves. It seems for that node - it opened a whole bunch of other partitions at around the same time (~ 30 or so) but failed to open 3-4 partitions. This was for a brand new resource I created..

THanks !
Varun

On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com>> wrote:
One suggestion is to check for GC pauses on the nodes. Nodes loses the cluster member ship if they get into long GC or starts flapping. That might be cause for state mismatch. However, external view must be up to date. It might help if you can attach the controller logs and node logs.

On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com>> wrote:
Hi,

I am seeing the following issue for many partitions in helix using a simple Online->Offline state model factory. The external view says that the partition has been assigned to 3 hosts. However, when I look at the hosts only 1 of them executed the OFFLINE --> ONLINE transition.

On the hosts, that did not execute the transition, I see the following:

2014-11-13 09:29:54,394 [pool-3-thread-11] (HelixStateTransitionHandler.java:206) WARN  Force CurrentState on Zk to be stateModel's CurrentState. partitionKey: 490, currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db, {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange, EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013, FROM_STATE=OFFLINE, MSG_ID=12690ce8-8098-46b1-a93d-279604f0e3db, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490, READ_TIMESTAMP=1415870993787, RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201, SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7, STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013, TO_STATE=ONLINE}{}{}

When I grep the message ID in the controller, I see the following:


2014-11-14 09:34:56,265 [StatusDumpTimerTask] (ZKPathDataDumpTask.java:155) INFO  {

  "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",

  "mapFields" : {

    "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION c1193025-b416-49d7-adc2-10afe2389141" : {

      "AdditionalInfo" : "Message execution failed. msgId: 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg: org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException: Current state of stateModel does not match the fromState in Message, Current State:ONLINE, message expected:OFFLINE, partition: 490, from: hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",

      "Class" : "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",

      "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",

      "Message state" : "READ"

    },


What could be causing this - when I restart the node, the error disappears (meaning that the node is able to perform the state transition). What could be causing this state mismatch ?


Thanks

Varun

RE: Helix issue - External View out of sync

Posted by Zhen Zhang <zz...@linkedin.com>.

Hi, Varun, you can join us on freenode IRC: http://helix.apache.org/IRC.html

Thanks,
Zhen

________________________________
From: Varun Sharma [varun@pinterest.com]
Sent: Monday, November 17, 2014 5:08 PM
To: user@helix.apache.org
Subject: Re: Helix issue - External View out of sync

I looked at the logs and gc was fine as the system was processing other events around the same time.

Is there anything else specifically I shold look for in the logs ? Is there a way to find out whether a node was removed from the cluster due to a ZK issue ?

Thanks !
Varun

On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <va...@pinterest.com>> wrote:
I am wondering how come a partition was in the online state for a resource that was newly created.

Thanks
Varun

On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <va...@pinterest.com>> wrote:
I am using 0.6.4. In this case, I created a resource and set its ideal state and the partitions onlined themselves. It seems for that node - it opened a whole bunch of other partitions at around the same time (~ 30 or so) but failed to open 3-4 partitions. This was for a brand new resource I created..

THanks !
Varun

On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com>> wrote:
One suggestion is to check for GC pauses on the nodes. Nodes loses the cluster member ship if they get into long GC or starts flapping. That might be cause for state mismatch. However, external view must be up to date. It might help if you can attach the controller logs and node logs.

On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com>> wrote:
Hi,

I am seeing the following issue for many partitions in helix using a simple Online->Offline state model factory. The external view says that the partition has been assigned to 3 hosts. However, when I look at the hosts only 1 of them executed the OFFLINE --> ONLINE transition.

On the hosts, that did not execute the transition, I see the following:

2014-11-13 09:29:54,394 [pool-3-thread-11] (HelixStateTransitionHandler.java:206) WARN  Force CurrentState on Zk to be stateModel's CurrentState. partitionKey: 490, currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db, {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange, EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013, FROM_STATE=OFFLINE, MSG_ID=12690ce8-8098-46b1-a93d-279604f0e3db, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490, READ_TIMESTAMP=1415870993787, RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201, SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7, STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013, TO_STATE=ONLINE}{}{}

When I grep the message ID in the controller, I see the following:

2014-11-14 09:34:56,265 [StatusDumpTimerTask] (ZKPathDataDumpTask.java:155) INFO  {

  "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",

  "mapFields" : {

    "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION c1193025-b416-49d7-adc2-10afe2389141" : {

      "AdditionalInfo" : "Message execution failed. msgId: 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg: org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException: Current state of stateModel does not match the fromState in Message, Current State:ONLINE, message expected:OFFLINE, partition: 490, from: hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",

      "Class" : "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",

      "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",

      "Message state" : "READ"

    },

What could be causing this - when I restart the node, the error disappears (meaning that the node is able to perform the state transition). What could be causing this state mismatch ?

Thanks

Varun

Re: Helix issue - External View out of sync

Posted by Varun Sharma <va...@pinterest.com>.

I looked at the logs and gc was fine as the system was processing other
events around the same time.

Is there anything else specifically I shold look for in the logs ? Is there
a way to find out whether a node was removed from the cluster due to a ZK
issue ?

Thanks !
Varun

On Mon, Nov 17, 2014 at 4:32 PM, Varun Sharma <va...@pinterest.com> wrote:

> I am wondering how come a partition was in the online state for a resource
> that was newly created.
>
> Thanks
> Varun
>
> On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <va...@pinterest.com> wrote:
>
>> I am using 0.6.4. In this case, I created a resource and set its ideal
>> state and the partitions onlined themselves. It seems for that node - it
>> opened a whole bunch of other partitions at around the same time (~ 30 or
>> so) but failed to open 3-4 partitions. This was for a brand new resource I
>> created..
>>
>> THanks !
>> Varun
>>
>> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com> wrote:
>>
>>> One suggestion is to check for GC pauses on the nodes. Nodes loses the
>>> cluster member ship if they get into long GC or starts flapping. That might
>>> be cause for state mismatch. However, external view must be up to date. It
>>> might help if you can attach the controller logs and node logs.
>>>
>>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am seeing the following issue for many partitions in helix using a
>>>> simple Online->Offline state model factory. The external view says that the
>>>> partition has been assigned to 3 hosts. However, when I look at the hosts
>>>> only 1 of them executed the OFFLINE --> ONLINE transition.
>>>>
>>>> On the hosts, that did not execute the transition, I see the following:
>>>>
>>>> 2014-11-13 09:29:54,394 [pool-3-thread-11]
>>>> (HelixStateTransitionHandler.java:206) WARN  *Force CurrentState on Zk
>>>> to be stateModel's CurrentState*. *partitionKey: 490*, currentState:
>>>> ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>>> READ_TIMESTAMP=1415870993787,
>>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>>>> TO_STATE=ONLINE}{}{}
>>>>
>>>> When I grep the message ID in the controller, I see the following:
>>>>
>>>> 2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>>> (ZKPathDataDumpTask.java:155) INFO  {
>>>>
>>>>   "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>>
>>>>   "mapFields" : {
>>>>
>>>>     "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
>>>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>>>
>>>>       "AdditionalInfo" : "Message execution failed. msgId:
>>>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>>> org.apache.helix.messaging.handling.
>>>> *HelixStateTransitionHandler$HelixStateMismatchException*: Current
>>>> state of stateModel does not match the fromState in Message, Current
>>>> State:ONLINE, message expected:OFFLINE, partition: 490, from:
>>>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>>
>>>>       "Class" : "class
>>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>>
>>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>>
>>>>       "Message state" : "READ"
>>>>
>>>>     },
>>>>
>>>>
>>>> What could be causing this - when I restart the node, the error
>>>> disappears (meaning that the node is able to perform the state transition).
>>>> What could be causing this state mismatch ?
>>>>
>>>>
>>>> Thanks
>>>>
>>>> Varun
>>>>
>>>
>>>
>>
>

Re: Helix issue - External View out of sync

Posted by Varun Sharma <va...@pinterest.com>.

I am wondering how come a partition was in the online state for a resource
that was newly created.

Thanks
Varun

On Mon, Nov 17, 2014 at 4:31 PM, Varun Sharma <va...@pinterest.com> wrote:

> I am using 0.6.4. In this case, I created a resource and set its ideal
> state and the partitions onlined themselves. It seems for that node - it
> opened a whole bunch of other partitions at around the same time (~ 30 or
> so) but failed to open 3-4 partitions. This was for a brand new resource I
> created..
>
> THanks !
> Varun
>
> On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com> wrote:
>
>> One suggestion is to check for GC pauses on the nodes. Nodes loses the
>> cluster member ship if they get into long GC or starts flapping. That might
>> be cause for state mismatch. However, external view must be up to date. It
>> might help if you can attach the controller logs and node logs.
>>
>> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am seeing the following issue for many partitions in helix using a
>>> simple Online->Offline state model factory. The external view says that the
>>> partition has been assigned to 3 hosts. However, when I look at the hosts
>>> only 1 of them executed the OFFLINE --> ONLINE transition.
>>>
>>> On the hosts, that did not execute the transition, I see the following:
>>>
>>> 2014-11-13 09:29:54,394 [pool-3-thread-11]
>>> (HelixStateTransitionHandler.java:206) WARN  *Force CurrentState on Zk
>>> to be stateModel's CurrentState*. *partitionKey: 490*, currentState:
>>> ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>>> READ_TIMESTAMP=1415870993787,
>>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>>> TO_STATE=ONLINE}{}{}
>>>
>>> When I grep the message ID in the controller, I see the following:
>>>
>>> 2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>>> (ZKPathDataDumpTask.java:155) INFO  {
>>>
>>>   "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>>
>>>   "mapFields" : {
>>>
>>>     "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
>>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>>
>>>       "AdditionalInfo" : "Message execution failed. msgId:
>>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>>> org.apache.helix.messaging.handling.
>>> *HelixStateTransitionHandler$HelixStateMismatchException*: Current
>>> state of stateModel does not match the fromState in Message, Current
>>> State:ONLINE, message expected:OFFLINE, partition: 490, from:
>>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>>
>>>       "Class" : "class
>>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>>
>>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>>
>>>       "Message state" : "READ"
>>>
>>>     },
>>>
>>>
>>> What could be causing this - when I restart the node, the error
>>> disappears (meaning that the node is able to perform the state transition).
>>> What could be causing this state mismatch ?
>>>
>>>
>>> Thanks
>>>
>>> Varun
>>>
>>
>>
>

Re: Helix issue - External View out of sync

Posted by Varun Sharma <va...@pinterest.com>.

I am using 0.6.4. In this case, I created a resource and set its ideal
state and the partitions onlined themselves. It seems for that node - it
opened a whole bunch of other partitions at around the same time (~ 30 or
so) but failed to open 3-4 partitions. This was for a brand new resource I
created..

THanks !
Varun

On Mon, Nov 17, 2014 at 4:24 PM, kishore g <g....@gmail.com> wrote:

> One suggestion is to check for GC pauses on the nodes. Nodes loses the
> cluster member ship if they get into long GC or starts flapping. That might
> be cause for state mismatch. However, external view must be up to date. It
> might help if you can attach the controller logs and node logs.
>
> On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com> wrote:
>
>> Hi,
>>
>> I am seeing the following issue for many partitions in helix using a
>> simple Online->Offline state model factory. The external view says that the
>> partition has been assigned to 3 hosts. However, when I look at the hosts
>> only 1 of them executed the OFFLINE --> ONLINE transition.
>>
>> On the hosts, that did not execute the transition, I see the following:
>>
>> 2014-11-13 09:29:54,394 [pool-3-thread-11]
>> (HelixStateTransitionHandler.java:206) WARN  *Force CurrentState on Zk
>> to be stateModel's CurrentState*. *partitionKey: 490*, currentState:
>> ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db,
>> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
>> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
>> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
>> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
>> READ_TIMESTAMP=1415870993787,
>> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
>> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
>> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
>> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
>> TO_STATE=ONLINE}{}{}
>>
>> When I grep the message ID in the controller, I see the following:
>>
>> 2014-11-14 09:34:56,265 [StatusDumpTimerTask]
>> (ZKPathDataDumpTask.java:155) INFO  {
>>
>>   "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>>
>>   "mapFields" : {
>>
>>     "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
>> c1193025-b416-49d7-adc2-10afe2389141" : {
>>
>>       "AdditionalInfo" : "Message execution failed. msgId:
>> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
>> org.apache.helix.messaging.handling.
>> *HelixStateTransitionHandler$HelixStateMismatchException*: Current state
>> of stateModel does not match the fromState in Message, Current
>> State:ONLINE, message expected:OFFLINE, partition: 490, from:
>> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>>
>>       "Class" : "class
>> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>>
>>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>>
>>       "Message state" : "READ"
>>
>>     },
>>
>>
>> What could be causing this - when I restart the node, the error
>> disappears (meaning that the node is able to perform the state transition).
>> What could be causing this state mismatch ?
>>
>>
>> Thanks
>>
>> Varun
>>
>
>

Re: Helix issue - External View out of sync

Posted by kishore g <g....@gmail.com>.

One suggestion is to check for GC pauses on the nodes. Nodes loses the
cluster member ship if they get into long GC or starts flapping. That might
be cause for state mismatch. However, external view must be up to date. It
might help if you can attach the controller logs and node logs.

On Mon, Nov 17, 2014 at 4:10 PM, Varun Sharma <va...@pinterest.com> wrote:

> Hi,
>
> I am seeing the following issue for many partitions in helix using a
> simple Online->Offline state model factory. The external view says that the
> partition has been assigned to 3 hosts. However, when I look at the hosts
> only 1 of them executed the OFFLINE --> ONLINE transition.
>
> On the hosts, that did not execute the transition, I see the following:
>
> 2014-11-13 09:29:54,394 [pool-3-thread-11]
> (HelixStateTransitionHandler.java:206) WARN  *Force CurrentState on Zk to
> be stateModel's CurrentState*. *partitionKey: 490*, currentState: ONLINE,
> message: 12690ce8-8098-46b1-a93d-279604f0e3db,
> {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange,
> EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013,
> FROM_STATE=OFFLINE, MSG_ID=*12690ce8-8098-46b1-a93d-279604f0e3db*,
> MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490,
> READ_TIMESTAMP=1415870993787,
> RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201,
> SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7,
> STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT,
> TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013,
> TO_STATE=ONLINE}{}{}
>
> When I grep the message ID in the controller, I see the following:
>
> 2014-11-14 09:34:56,265 [StatusDumpTimerTask]
> (ZKPathDataDumpTask.java:155) INFO  {
>
>   "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",
>
>   "mapFields" : {
>
>     "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION
> c1193025-b416-49d7-adc2-10afe2389141" : {
>
>       "AdditionalInfo" : "Message execution failed. msgId:
> 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg:
> org.apache.helix.messaging.handling.
> *HelixStateTransitionHandler$HelixStateMismatchException*: Current state
> of stateModel does not match the fromState in Message, Current
> State:ONLINE, message expected:OFFLINE, partition: 490, from:
> hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",
>
>       "Class" : "class
> org.apache.helix.messaging.handling.HelixStateTransitionHandler",
>
>       "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",
>
>       "Message state" : "READ"
>
>     },
>
>
> What could be causing this - when I restart the node, the error disappears
> (meaning that the node is able to perform the state transition). What could
> be causing this state mismatch ?
>
>
> Thanks
>
> Varun
>

RE: Helix issue - External View out of sync

Posted by Zhen Zhang <zz...@linkedin.com>.

Hi Varun, which helix version you are using?

Thanks,
Zhen
________________________________
From: Varun Sharma [varun@pinterest.com]
Sent: Monday, November 17, 2014 4:10 PM
To: user@helix.apache.org
Subject: Helix issue - External View out of sync

Hi,

I am seeing the following issue for many partitions in helix using a simple Online->Offline state model factory. The external view says that the partition has been assigned to 3 hosts. However, when I look at the hosts only 1 of them executed the OFFLINE --> ONLINE transition.

On the hosts, that did not execute the transition, I see the following:

2014-11-13 09:29:54,394 [pool-3-thread-11] (HelixStateTransitionHandler.java:206) WARN  Force CurrentState on Zk to be stateModel's CurrentState. partitionKey: 490, currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db, {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange, EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013, FROM_STATE=OFFLINE, MSG_ID=12690ce8-8098-46b1-a93d-279604f0e3db, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490, READ_TIMESTAMP=1415870993787, RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201, SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7, STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013, TO_STATE=ONLINE}{}{}

When I grep the message ID in the controller, I see the following:

2014-11-14 09:34:56,265 [StatusDumpTimerTask] (ZKPathDataDumpTask.java:155) INFO  {

  "id" : "149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201",

  "mapFields" : {

    "HELIX_ERROR     20141113-092954.000419 STATE_TRANSITION c1193025-b416-49d7-adc2-10afe2389141" : {

      "AdditionalInfo" : "Message execution failed. msgId: 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg: org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException: Current state of stateModel does not match the fromState in Message, Current State:ONLINE, message expected:OFFLINE, partition: 490, from: hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256",

      "Class" : "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",

      "MSG_ID" : "12690ce8-8098-46b1-a93d-279604f0e3db",

      "Message state" : "READ"

    },

What could be causing this - when I restart the node, the error disappears (meaning that the node is able to perform the state transition). What could be causing this state mismatch ?

Thanks

Varun