Posted to dev@samza.apache.org by Su Yi <su...@hust.edu.cn> on 2014/12/31 19:30:17 UTC

High Availability of Samza

Hello,

Here are some thoughts about HA of Samza.

1. Failure detection

The problem is that container failure detection in Samza depends entirely on YARN. YARN relies on the Node Manager to report container failures, but the Node Manager can fail too (for example, if the machine fails, the NM fails with it). Node Manager failures can be detected by the Resource Manager through heartbeats, but by default it takes 10 minutes to confirm a Node Manager failure. I think that is acceptable for batch processing, but not for stream processing.

Setting the YARN failure-confirmation interval to 1s results in an unstable YARN cluster (4 nodes in total). With 2s, everything works fine, but it takes 10s~20s to get a lost container (machine shut down) back. Considering that the test stream task is very simple (stateless), the recovery time is relatively long.

I am not an expert on YARN, so I don't know why it takes such a long time by default to confirm a node failure. My understanding is that YARN tries to be general-purpose, and that is not sufficient for a stream processing framework. Extra effort beyond YARN is needed for failure detection in stream processing.

2. Task redeployment

After the Resource Manager informs Samza of a container failure, Samza has to request resources from YARN to redeploy the failed tasks, which adds time to recovery. And recovery time is critical for HA in stream processing. I think maintaining a few standby containers could eliminate this overhead: Samza could deploy failed tasks onto the standby containers rather than requesting new ones from YARN.

Hot standby containers, as described in SAMZA-406 (https://issues.apache.org/jira/browse/SAMZA-406), may help reduce recovery time, but they are costly (they double the resources needed).

I'm wondering what these points mean to you, and how feasible they are. By the way, I'm using Samza 0.7.

Thank you for reading.

Happy New Year!;-)

Su Yi

Re: Re: High Availability of Samza

Posted by Milinda Pathirage <mp...@umail.iu.edu>.
I think what Su has experienced is true in the case of Node Manager failure.
This was also the case in old Hadoop (Task Tracker failures); this paper [1]
discusses its effects. I think this behavior exists for Node Manager failures
(in YARN) too; that's what I discovered some time back (about a year ago) by
going through the YARN code. But I am not sure whether it is still true now.

Thanks
Milinda

[1] http://www.cs.rice.edu/~fd2/pdf/hpdc106-dinu.pdf

On Sat, Jan 3, 2015 at 11:24 PM, Yi Su <su...@hust.edu.cn> wrote:

> Hi Fang,
>
> I have verified the failure detection issue. It takes 10mins for recovery,
> if I kill the Node Manager process first. I will detail the experiment, in
> case I have make any mistakes.
>
> The nodes arrangement is same as before.
>
> Workload :
>         Every second, a python program generates a record with the system
> current time. And it sends the record 10 times to kafka topic
> "suyi-test-input".
>
> Stream Task :
>         It gets the tuples from input stream and sends the tuples to the
> output stream "suyi-test-output". Checkpiont is disabled.
>
> YARN Configuration :
>         <property>
>                 <description>How long to wait until a node manager is
> considered dead.</description>
>                 <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
>                 <value>600000</value>
>         </property>
>         <property>
>                 <description>How often to check that node managers are
> still alive.</description>
>                 <name>yarn.resourcemanager.nm.
> liveness-monitor.interval-ms</name>
>                 <value>1000</value>
>         </property>
>
> Experiment Process :
>         (1) I submited the stream task, the application master ran on node
> "a", and the other container ran on node "b".
>         (2) I killed the node manager process on node "b".
>         (3) At about 09:59:07, I killed the container process on node "b".
>
> Results :
>         (1) At 10:09:33 the application master tried to redeploy the lost
> container, according to application master logs.
>         (2) The output during 09:59:35 to 10:09:53 is lost.
>
> A fraction from application master logs:
>         2015-01-04 09:47:20 ContainerManagementProtocolProxy [INFO]
> Opening proxy : b:35889
>         2015-01-04 09:47:20 SamzaAppMasterTaskManager [INFO] Claimed task
> ID 0 for container container_1420335573203_0001_01_000002 on node b (
> http://b:8042/node/containerlogs/container_1420335573203_0001_01_000002).
>         2015-01-04 09:47:20 SamzaAppMasterTaskManager [INFO] Started task
> ID 0
>         2015-01-04 09:57:19 ClientUtils$ [INFO] Fetching metadata from
> broker id:0,host:192.168.3.141,port:9092 with correlation id 41 for 1
> topic(s) Set(metrics)
>         2015-01-04 09:57:19 SyncProducer [INFO] Connected to
> 192.168.3.141:9092 for producing
>         2015-01-04 09:57:19 SyncProducer [INFO] Disconnecting from
> 192.168.3.141:9092
>         2015-01-04 09:57:19 SyncProducer [INFO] Disconnecting from a:9092
>         2015-01-04 09:57:19 SyncProducer [INFO] Connected to a:9092 for
> producing
>         2015-01-04 10:07:19 ClientUtils$ [INFO] Fetching metadata from
> broker id:0,host:192.168.3.141,port:9092 with correlation id 82 for 1
> topic(s) Set(metrics)
>         2015-01-04 10:07:19 SyncProducer [INFO] Connected to
> 192.168.3.141:9092 for producing
>         2015-01-04 10:07:19 SyncProducer [INFO] Disconnecting from
> 192.168.3.141:9092
>         2015-01-04 10:07:19 SyncProducer [INFO] Disconnecting from a:9092
>         2015-01-04 10:07:19 SyncProducer [INFO] Connected to a:9092 for
> producing
>         2015-01-04 10:09:33 SamzaAppMasterTaskManager [INFO] Got an exit
> code of -100. This means that container container_1420335573203_0001_01_000002
> was killed by YARN, either due to being released by the application master
> or being 'lost' due to node failures etc.
>         2015-01-04 10:09:33 SamzaAppMasterTaskManager [INFO] Released
> container container_1420335573203_0001_01_000002 was assigned task ID 0.
> Requesting a new container for the task.
>         2015-01-04 10:09:33 SamzaAppMasterTaskManager [INFO] Requesting 1
> container(s) with 1024mb of memory
>
> A fraction from output with my comment added:
>         {"time":"1420336774.0"}
>         {"time":"1420336774.0"}
>         {"time":"1420336774.0"}
>         {"time":"1420336775.0"}
>         {"time":"1420336775.0"}
>         {"time":"1420336775.0"}
>         {"time":"1420336775.0"}
>         {"time":"1420336775.0"}
>         {"time":"1420336775.0"}
>         {"time":"1420336775.0"}
>         {"time":"1420336775.0"}
>         {"time":"1420336775.0"}
>         {"time":"1420336775.0"} \\ 09:59:35
>         {"time":"1420337393.0"} \\ 10:09:53
>         {"time":"1420337393.0"}
>         {"time":"1420337393.0"}
>         {"time":"1420337393.0"}
>         {"time":"1420337393.0"}
>         {"time":"1420337393.0"}
>         {"time":"1420337393.0"}
>         {"time":"1420337393.0"}
>         {"time":"1420337393.0"}
>         {"time":"1420337393.0"}
>         {"time":"1420337394.0"}
>         {"time":"1420337394.0"}
>         {"time":"1420337394.0"}
>         {"time":"1420337394.0"}
>         {"time":"1420337394.0"}
>
> For the Task redeployment issue, I worries that if the Resource Manager is
> busy or there are no available containers in the system, redeployment of
> failure task might be delayed.
>
> Thank you for your help.
>
> Su Yi
>
>
> On Sat, 03 Jan 2015 06:04:20 +0800, Yan Fang <ya...@gmail.com> wrote:
>
>  Hi Su Yi,
>>
>> I think there maybe a misunderstanding. For the failure detection, if the
>> containers die ( because of NM failure or whatever reason ), AM will bring
>> up new containers in the same NM or a different NM according to the
>> resource availability. It does not take as much as 10 mins to recover. One
>> way you can test is that, you run a Samza job and manually kill the NM or
>> the thread to see how quickly it recovers. In terms of how
>> yarn.nm.liveness-monitor.expiry-interval-ms
>> plays the role here, not very sure. Hope any yarn expert in the community
>> can explain it a little.
>>
>> The goal of standby container in SAMZA-406 is to recover quickly when the
>> task has a lot of local state and so reading changelog takes a long time,
>> not to reduce the time of *allocating* the container, which, I believe, is
>> taken care by the YARN.
>>
>> Hope this help a little. Thanks.
>>
>> Cheers,
>>
>> Fang, Yan
>> yanfang724@gmail.com
>> +1 (206) 849-4108
>>
>> On Thu, Jan 1, 2015 at 4:20 AM, Su Yi <su...@hust.edu.cn> wrote:
>>
>>  Hi Timothy,
>>>
>>> There are 4 nodes in total : a,b,c,d
>>> Resource manager : a
>>> Node manager : a,b,c,d
>>> Kafka and zookeeper running on : a
>>>
>>> YARN configuration is :
>>>
>>> <property>
>>>     <description>How long to wait until a node manager is considered
>>> dead.</description>
>>>     <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
>>>     <value>1000</value>
>>> </property>
>>>
>>> <property>
>>>     <description>How often to check that node managers are still
>>> alive.</description>
>>>     <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
>>>     <value>100</value>
>>> </property>
>>>
>>> From web UI of Samza, I found that node 'a' appeared and disappeared
>>> again
>>> and again in the node list.
>>>
>>> Su Yi
>>>
>>> On 2015-01-01 02:54:48,"Timothy Chen" <tn...@gmail.com> wrote:
>>>
>>> >Hi Su Yi,
>>> >
>>> >Can you elaborate a bit more what you mean by unstable cluster when
>>> >you configured the heartbeat interval to be 1s?
>>> >
>>> >Tim
>>> >
>>> >On Wed, Dec 31, 2014 at 10:30 AM, Su Yi <su...@hust.edu.cn> wrote:
>>> >> Hello,
>>> >>
>>> >> Here are some thoughts about HA of Samza.
>>> >>
>>> >> 1. Failure detection
>>> >>
>>> >> The problem is, failure detection of container completely depends on
>>> YARN in Samza. YARN counts on Node Manager reporting container failures,
>>> however Node Manager could fail, too (like, if the machine failed, NM
>>> would
>>> fail). Node Manager failures can be detected through heartbeat by
>>> Resource
>>> Manager, but, by default it'll take 10 mins to confirm Node Manager
>>> failure. I think, that's OK with batch processing, but not stream
>>> processing.
>>> >>
>>> >> Configuring yarn failure confirm interval to 1s, result in an unstable
>>> yarn cluster(4 node in total). With 2s, all things works fine, but it
>>> takes
>>> 10s~20s to get lost container(machine shut down) back. Considering that
>>> testing stream task is very simple(stateless), the recovery time is
>>> relatively long.
>>> >>
>>> >> I am not an expert on YARN, I don't know why it, by default, takes
>>> such
>>> a long time to confirm node failure. To my understanding, YARN is
>>> something
>>> trying to be general, and it is not sufficient for stream processing
>>> framework. Extra effort should be done beyond YARN on failure detection
>>> in
>>> stream processing.
>>> >>
>>> >> 2. Task redeployment
>>> >>
>>> >> After Resource Manager informed Samza of container failure, Samza
>>> should apply for resources from YARN to redeploy failed tasks, which
>>> consumes time during recovery. And, recovery time is critical for HA in
>>> stream processing. I think, maintaining a few standby containers may
>>> eliminate this overhead on recovery time. Samza could deploy failed tasks
>>> on the standby containers than requesting from YARN.
>>> >>
>>> >> Hot standby containers, which is described in SAMZA-406(
>>> https://issues.apache.org/jira/browse/SAMZA-406), may help save recovery
>>> time, however it's costly(it doubles the resources needed).
>>> >>
>>> >> I'm wondering, what does these stuffs means to you, and how about the
>>> feasibility. By the way, I'm using Samza 0.7 .
>>> >>
>>> >> Thank you for reading.
>>> >>
>>> >> Happy New Year!;-)
>>> >>
>>> >> Su Yi
>>>
>>
>


-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org

Re: Re: High Availability of Samza

Posted by Yi Su <su...@hust.edu.cn>.
Hi Fang,

I have verified the failure detection issue. It takes 10 minutes to recover
if I kill the Node Manager process first. I will detail the experiment in
case I have made any mistakes.

The node arrangement is the same as before.

Workload :
	Every second, a Python program generates a record containing the current
system time and sends that record 10 times to the Kafka topic
"suyi-test-input".
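
For reference, here is a minimal sketch of such a generator. The actual
program was a small Python script; this sketch uses the Java Kafka producer
client instead, and the class name and the broker address (a:9092) are
assumptions, not taken from the real setup.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SuyiTestInputGenerator {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "a:9092"); // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                // One record per second carrying the current wall-clock time,
                // sent 10 times so that a gap in the output is easy to spot.
                String record = "{\"time\":\"" + (System.currentTimeMillis() / 1000) + ".0\"}";
                for (int i = 0; i < 10; i++) {
                    producer.send(new ProducerRecord<>("suyi-test-input", record));
                }
                producer.flush();
                Thread.sleep(1000);
            }
        }
    }
}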
	
Stream Task :
	It reads the tuples from the input stream and sends them unchanged to the
output stream "suyi-test-output". Checkpointing is disabled.
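
For completeness, a minimal sketch of such a pass-through task against the
Samza StreamTask API (the class name and the "kafka" system name are
assumptions; this is not the exact job that was run):

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class ForwardingStreamTask implements StreamTask {
    // Output stream "suyi-test-output" on the Kafka system.
    private static final SystemStream OUTPUT =
        new SystemStream("kafka", "suyi-test-output");

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Forward each incoming tuple unchanged to the output stream.
        collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getMessage()));
    }
}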

YARN Configuration :
	<property>
		<description>How long to wait until a node manager is considered  
dead.</description>
		<name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
		<value>600000</value>
	</property>
	<property>
		<description>How often to check that node managers are still  
alive.</description>
		<name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
		<value>1000</value>
	</property>

Experiment Process :
	(1) I submitted the stream task; the application master ran on node "a",
and the other container ran on node "b".
	(2) I killed the Node Manager process on node "b".
	(3) At about 09:59:07, I killed the container process on node "b".
	
Results :
	(1) At 10:09:33 the application master tried to redeploy the lost
container, according to the application master logs (roughly ten minutes
after the container was killed).
	(2) The output from 09:59:35 to 10:09:53 is lost.
	
An excerpt from the application master logs:
	2015-01-04 09:47:20 ContainerManagementProtocolProxy [INFO] Opening proxy  
: b:35889
	2015-01-04 09:47:20 SamzaAppMasterTaskManager [INFO] Claimed task ID 0  
for container container_1420335573203_0001_01_000002 on node b  
(http://b:8042/node/containerlogs/container_1420335573203_0001_01_000002).
	2015-01-04 09:47:20 SamzaAppMasterTaskManager [INFO] Started task ID 0
	2015-01-04 09:57:19 ClientUtils$ [INFO] Fetching metadata from broker  
id:0,host:192.168.3.141,port:9092 with correlation id 41 for 1 topic(s)  
Set(metrics)
	2015-01-04 09:57:19 SyncProducer [INFO] Connected to 192.168.3.141:9092  
for producing
	2015-01-04 09:57:19 SyncProducer [INFO] Disconnecting from  
192.168.3.141:9092
	2015-01-04 09:57:19 SyncProducer [INFO] Disconnecting from a:9092
	2015-01-04 09:57:19 SyncProducer [INFO] Connected to a:9092 for producing
	2015-01-04 10:07:19 ClientUtils$ [INFO] Fetching metadata from broker  
id:0,host:192.168.3.141,port:9092 with correlation id 82 for 1 topic(s)  
Set(metrics)
	2015-01-04 10:07:19 SyncProducer [INFO] Connected to 192.168.3.141:9092  
for producing
	2015-01-04 10:07:19 SyncProducer [INFO] Disconnecting from  
192.168.3.141:9092
	2015-01-04 10:07:19 SyncProducer [INFO] Disconnecting from a:9092
	2015-01-04 10:07:19 SyncProducer [INFO] Connected to a:9092 for producing
	2015-01-04 10:09:33 SamzaAppMasterTaskManager [INFO] Got an exit code of  
-100. This means that container container_1420335573203_0001_01_000002 was  
killed by YARN, either due to being released by the application master or  
being 'lost' due to node failures etc.
	2015-01-04 10:09:33 SamzaAppMasterTaskManager [INFO] Released container  
container_1420335573203_0001_01_000002 was assigned task ID 0. Requesting  
a new container for the task.
	2015-01-04 10:09:33 SamzaAppMasterTaskManager [INFO] Requesting 1  
container(s) with 1024mb of memory
	
An excerpt from the output, with my comments added:
	{"time":"1420336774.0"}
	{"time":"1420336774.0"}
	{"time":"1420336774.0"}
	{"time":"1420336775.0"}
	{"time":"1420336775.0"}
	{"time":"1420336775.0"}
	{"time":"1420336775.0"}
	{"time":"1420336775.0"}
	{"time":"1420336775.0"}
	{"time":"1420336775.0"}
	{"time":"1420336775.0"}
	{"time":"1420336775.0"}
	{"time":"1420336775.0"} // 09:59:35
	{"time":"1420337393.0"} // 10:09:53
	{"time":"1420337393.0"}
	{"time":"1420337393.0"}
	{"time":"1420337393.0"}
	{"time":"1420337393.0"}
	{"time":"1420337393.0"}
	{"time":"1420337393.0"}
	{"time":"1420337393.0"}
	{"time":"1420337393.0"}
	{"time":"1420337393.0"}
	{"time":"1420337394.0"}
	{"time":"1420337394.0"}
	{"time":"1420337394.0"}
	{"time":"1420337394.0"}
	{"time":"1420337394.0"}
	
For the task redeployment issue, I worry that if the Resource Manager is
busy or there are no containers available in the cluster, redeployment of
the failed task might be delayed.

Thank you for your help.

Su Yi

On Sat, 03 Jan 2015 06:04:20 +0800, Yan Fang <ya...@gmail.com> wrote:

> Hi Su Yi,
>
> I think there maybe a misunderstanding. For the failure detection, if the
> containers die ( because of NM failure or whatever reason ), AM will  
> bring
> up new containers in the same NM or a different NM according to the
> resource availability. It does not take as much as 10 mins to recover.  
> One
> way you can test is that, you run a Samza job and manually kill the NM or
> the thread to see how quickly it recovers. In terms of how
> yarn.nm.liveness-monitor.expiry-interval-ms
> plays the role here, not very sure. Hope any yarn expert in the community
> can explain it a little.
>
> The goal of standby container in SAMZA-406 is to recover quickly when the
> task has a lot of local state and so reading changelog takes a long time,
> not to reduce the time of *allocating* the container, which, I believe,  
> is
> taken care by the YARN.
>
> Hope this help a little. Thanks.
>
> Cheers,
>
> Fang, Yan
> yanfang724@gmail.com
> +1 (206) 849-4108
>
> On Thu, Jan 1, 2015 at 4:20 AM, Su Yi <su...@hust.edu.cn> wrote:
>
>> Hi Timothy,
>>
>> There are 4 nodes in total : a,b,c,d
>> Resource manager : a
>> Node manager : a,b,c,d
>> Kafka and zookeeper running on : a
>>
>> YARN configuration is :
>>
>> <property>
>>     <description>How long to wait until a node manager is considered
>> dead.</description>
>>     <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
>>     <value>1000</value>
>> </property>
>>
>> <property>
>>     <description>How often to check that node managers are still
>> alive.</description>
>>     <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
>>     <value>100</value>
>> </property>
>>
>> From web UI of Samza, I found that node 'a' appeared and disappeared  
>> again
>> and again in the node list.
>>
>> Su Yi
>>
>> On 2015-01-01 02:54:48,"Timothy Chen" <tn...@gmail.com> wrote:
>>
>> >Hi Su Yi,
>> >
>> >Can you elaborate a bit more what you mean by unstable cluster when
>> >you configured the heartbeat interval to be 1s?
>> >
>> >Tim
>> >
>> >On Wed, Dec 31, 2014 at 10:30 AM, Su Yi <su...@hust.edu.cn> wrote:
>> >> Hello,
>> >>
>> >> Here are some thoughts about HA of Samza.
>> >>
>> >> 1. Failure detection
>> >>
>> >> The problem is, failure detection of container completely depends on
>> YARN in Samza. YARN counts on Node Manager reporting container failures,
>> however Node Manager could fail, too (like, if the machine failed, NM  
>> would
>> fail). Node Manager failures can be detected through heartbeat by  
>> Resource
>> Manager, but, by default it'll take 10 mins to confirm Node Manager
>> failure. I think, that's OK with batch processing, but not stream
>> processing.
>> >>
>> >> Configuring yarn failure confirm interval to 1s, result in an  
>> unstable
>> yarn cluster(4 node in total). With 2s, all things works fine, but it  
>> takes
>> 10s~20s to get lost container(machine shut down) back. Considering that
>> testing stream task is very simple(stateless), the recovery time is
>> relatively long.
>> >>
>> >> I am not an expert on YARN, I don't know why it, by default, takes  
>> such
>> a long time to confirm node failure. To my understanding, YARN is  
>> something
>> trying to be general, and it is not sufficient for stream processing
>> framework. Extra effort should be done beyond YARN on failure detection  
>> in
>> stream processing.
>> >>
>> >> 2. Task redeployment
>> >>
>> >> After Resource Manager informed Samza of container failure, Samza
>> should apply for resources from YARN to redeploy failed tasks, which
>> consumes time during recovery. And, recovery time is critical for HA in
>> stream processing. I think, maintaining a few standby containers may
>> eliminate this overhead on recovery time. Samza could deploy failed  
>> tasks
>> on the standby containers than requesting from YARN.
>> >>
>> >> Hot standby containers, which is described in SAMZA-406(
>> https://issues.apache.org/jira/browse/SAMZA-406), may help save recovery
>> time, however it's costly(it doubles the resources needed).
>> >>
>> >> I'm wondering, what does these stuffs means to you, and how about the
>> feasibility. By the way, I'm using Samza 0.7 .
>> >>
>> >> Thank you for reading.
>> >>
>> >> Happy New Year!;-)
>> >>
>> >> Su Yi


Re: Re: High Availability of Samza

Posted by Yan Fang <ya...@gmail.com>.
Hi Su Yi,

I think there may be a misunderstanding. For failure detection, if the
containers die (because of an NM failure or any other reason), the AM will
bring up new containers on the same NM or a different NM, according to
resource availability. It does not take as much as 10 minutes to recover. One
way you can test this is to run a Samza job and manually kill the NM or the
container process to see how quickly it recovers. As for how
yarn.nm.liveness-monitor.expiry-interval-ms plays a role here, I am not very
sure; I hope a YARN expert in the community can explain it a little.

The goal of the standby container in SAMZA-406 is to recover quickly when
the task has a lot of local state and reading the changelog therefore takes a
long time; it is not to reduce the time spent *allocating* the container,
which, I believe, is taken care of by YARN.
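
To make the "local state" concrete: a stateful task typically keeps its
state in a key-value store whose changelog is replayed from Kafka when the
container is brought up elsewhere, and that replay is what dominates
recovery time for large state. A rough sketch against the low-level API
(class, store, and topic names are made up for illustration):

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class CountingTask implements StreamTask, InitableTask {
    private KeyValueStore<String, Integer> counts;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        // The "counts" store must be declared in the job config with a
        // changelog topic, e.g. stores.counts.changelog=kafka.counts-changelog
        // (store and topic names are assumptions). The changelog is what gets
        // replayed on recovery.
        counts = (KeyValueStore<String, Integer>) context.getStore("counts");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Count messages per key; every write also goes to the changelog.
        String key = (String) envelope.getKey();
        Integer current = counts.get(key);
        counts.put(key, current == null ? 1 : current + 1);
    }
}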

Hope this helps a little. Thanks.

Cheers,

Fang, Yan
yanfang724@gmail.com
+1 (206) 849-4108

On Thu, Jan 1, 2015 at 4:20 AM, Su Yi <su...@hust.edu.cn> wrote:

> Hi Timothy,
>
> There are 4 nodes in total : a,b,c,d
> Resource manager : a
> Node manager : a,b,c,d
> Kafka and zookeeper running on : a
>
> YARN configuration is :
>
> <property>
>     <description>How long to wait until a node manager is considered
> dead.</description>
>     <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
>     <value>1000</value>
> </property>
>
> <property>
>     <description>How often to check that node managers are still
> alive.</description>
>     <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
>     <value>100</value>
> </property>
>
> From web UI of Samza, I found that node 'a' appeared and disappeared again
> and again in the node list.
>
> Su Yi
>
> On 2015-01-01 02:54:48,"Timothy Chen" <tn...@gmail.com> wrote:
>
> >Hi Su Yi,
> >
> >Can you elaborate a bit more what you mean by unstable cluster when
> >you configured the heartbeat interval to be 1s?
> >
> >Tim
> >
> >On Wed, Dec 31, 2014 at 10:30 AM, Su Yi <su...@hust.edu.cn> wrote:
> >> Hello,
> >>
> >> Here are some thoughts about HA of Samza.
> >>
> >> 1. Failure detection
> >>
> >> The problem is, failure detection of container completely depends on
> YARN in Samza. YARN counts on Node Manager reporting container failures,
> however Node Manager could fail, too (like, if the machine failed, NM would
> fail). Node Manager failures can be detected through heartbeat by Resource
> Manager, but, by default it'll take 10 mins to confirm Node Manager
> failure. I think, that's OK with batch processing, but not stream
> processing.
> >>
> >> Configuring yarn failure confirm interval to 1s, result in an unstable
> yarn cluster(4 node in total). With 2s, all things works fine, but it takes
> 10s~20s to get lost container(machine shut down) back. Considering that
> testing stream task is very simple(stateless), the recovery time is
> relatively long.
> >>
> >> I am not an expert on YARN, I don't know why it, by default, takes such
> a long time to confirm node failure. To my understanding, YARN is something
> trying to be general, and it is not sufficient for stream processing
> framework. Extra effort should be done beyond YARN on failure detection in
> stream processing.
> >>
> >> 2. Task redeployment
> >>
> >> After Resource Manager informed Samza of container failure, Samza
> should apply for resources from YARN to redeploy failed tasks, which
> consumes time during recovery. And, recovery time is critical for HA in
> stream processing. I think, maintaining a few standby containers may
> eliminate this overhead on recovery time. Samza could deploy failed tasks
> on the standby containers than requesting from YARN.
> >>
> >> Hot standby containers, which is described in SAMZA-406(
> https://issues.apache.org/jira/browse/SAMZA-406), may help save recovery
> time, however it's costly(it doubles the resources needed).
> >>
> >> I'm wondering, what does these stuffs means to you, and how about the
> feasibility. By the way, I'm using Samza 0.7 .
> >>
> >> Thank you for reading.
> >>
> >> Happy New Year!;-)
> >>
> >> Su Yi
>

Re:Re: High Availability of Samza

Posted by Su Yi <su...@hust.edu.cn>.
Hi Timothy,

There are 4 nodes in total : a,b,c,d
Resource manager : a
Node manager : a,b,c,d
Kafka and zookeeper running on : a

YARN configuration is :

<property>
    <description>How long to wait until a node manager is considered dead.</description>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <value>1000</value>
</property>

<property>
    <description>How often to check that node managers are still alive.</description>
    <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
    <value>100</value>
</property>

From the web UI of Samza, I found that node 'a' kept appearing in and disappearing from the node list.

Su Yi

On 2015-01-01 02:54:48,"Timothy Chen" <tn...@gmail.com> wrote:

>Hi Su Yi,
>
>Can you elaborate a bit more what you mean by unstable cluster when
>you configured the heartbeat interval to be 1s?
>
>Tim
>
>On Wed, Dec 31, 2014 at 10:30 AM, Su Yi <su...@hust.edu.cn> wrote:
>> Hello,
>>
>> Here are some thoughts about HA of Samza.
>>
>> 1. Failure detection
>>
>> The problem is, failure detection of container completely depends on YARN in Samza. YARN counts on Node Manager reporting container failures, however Node Manager could fail, too (like, if the machine failed, NM would fail). Node Manager failures can be detected through heartbeat by Resource Manager, but, by default it'll take 10 mins to confirm Node Manager failure. I think, that's OK with batch processing, but not stream processing.
>>
>> Configuring yarn failure confirm interval to 1s, result in an unstable yarn cluster(4 node in total). With 2s, all things works fine, but it takes 10s~20s to get lost container(machine shut down) back. Considering that testing stream task is very simple(stateless), the recovery time is relatively long.
>>
>> I am not an expert on YARN, I don't know why it, by default, takes such a long time to confirm node failure. To my understanding, YARN is something trying to be general, and it is not sufficient for stream processing framework. Extra effort should be done beyond YARN on failure detection in stream processing.
>>
>> 2. Task redeployment
>>
>> After Resource Manager informed Samza of container failure, Samza should apply for resources from YARN to redeploy failed tasks, which consumes time during recovery. And, recovery time is critical for HA in stream processing. I think, maintaining a few standby containers may eliminate this overhead on recovery time. Samza could deploy failed tasks on the standby containers than requesting from YARN.
>>
>> Hot standby containers, which is described in SAMZA-406(https://issues.apache.org/jira/browse/SAMZA-406), may help save recovery time, however it's costly(it doubles the resources needed).
>>
>> I'm wondering, what does these stuffs means to you, and how about the feasibility. By the way, I'm using Samza 0.7 .
>>
>> Thank you for reading.
>>
>> Happy New Year!;-)
>>
>> Su Yi

Re: High Availability of Samza

Posted by Timothy Chen <tn...@gmail.com>.
Hi Su Yi,

Can you elaborate a bit more on what you mean by an unstable cluster when
you configured the heartbeat interval to be 1s?

Tim

On Wed, Dec 31, 2014 at 10:30 AM, Su Yi <su...@hust.edu.cn> wrote:
> Hello,
>
> Here are some thoughts about HA of Samza.
>
> 1. Failure detection
>
> The problem is, failure detection of container completely depends on YARN in Samza. YARN counts on Node Manager reporting container failures, however Node Manager could fail, too (like, if the machine failed, NM would fail). Node Manager failures can be detected through heartbeat by Resource Manager, but, by default it'll take 10 mins to confirm Node Manager failure. I think, that's OK with batch processing, but not stream processing.
>
> Configuring yarn failure confirm interval to 1s, result in an unstable yarn cluster(4 node in total). With 2s, all things works fine, but it takes 10s~20s to get lost container(machine shut down) back. Considering that testing stream task is very simple(stateless), the recovery time is relatively long.
>
> I am not an expert on YARN, I don't know why it, by default, takes such a long time to confirm node failure. To my understanding, YARN is something trying to be general, and it is not sufficient for stream processing framework. Extra effort should be done beyond YARN on failure detection in stream processing.
>
> 2. Task redeployment
>
> After Resource Manager informed Samza of container failure, Samza should apply for resources from YARN to redeploy failed tasks, which consumes time during recovery. And, recovery time is critical for HA in stream processing. I think, maintaining a few standby containers may eliminate this overhead on recovery time. Samza could deploy failed tasks on the standby containers than requesting from YARN.
>
> Hot standby containers, which is described in SAMZA-406(https://issues.apache.org/jira/browse/SAMZA-406), may help save recovery time, however it's costly(it doubles the resources needed).
>
> I'm wondering, what does these stuffs means to you, and how about the feasibility. By the way, I'm using Samza 0.7 .
>
> Thank you for reading.
>
> Happy New Year!;-)
>
> Su Yi