Posted to user@flink.apache.org by Saiph Kappa <sa...@gmail.com> on 2016/02/19 20:04:19 UTC

How to increase akka heartbeat?

Hi,

I have a Flink client application that launches jobs to remote clusters.
However I'm getting my jobs cancelled:
"18:25:29,650 WARN
akka.remote.ReliableDeliverySupervisor                        - Association
with remote system [akka.tcp://flink@127.0.0.1:52929] has failed, address
is now gated for [5000] ms. Reason is: [Disassociated]."

How can I increase the akka heartbeat interval? Where should I set that
configuration parameter: in the client or in the Flink clusters, and in
which file?

Thanks.

Re: How to increase akka heartbeat?

Posted by Saiph Kappa <sa...@gmail.com>.
Thanks for your help. Apparently the problem was not in Akka. When a
.socketTextStream source is used with maxRetry = -1, it keeps retrying
until it connects to the socket for the first time. Once it is connected,
however, if no data is sent, the job seems to be terminated.
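
To make the retry convention concrete, here is a minimal plain-Java sketch
of a "negative maxRetry means retry forever" connect loop. It is not Flink's
implementation of socketTextStream; the names RetryingConnect,
connectWithRetry and delayMs are hypothetical and only illustrate the
behaviour described above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Flink.
class RetryingConnect {

    /**
     * Tries to open a TCP connection to host:port.
     * maxRetry < 0  : keep retrying until a connection succeeds.
     * maxRetry >= 0 : allow that many retries after the first failed attempt,
     *                 then rethrow the last IOException.
     */
    static Socket connectWithRetry(String host, int port, long maxRetry, long delayMs)
            throws IOException, InterruptedException {
        long attempt = 0;
        while (true) {
            Socket socket = new Socket();
            try {
                // 1000 ms connect timeout per attempt (arbitrary choice).
                socket.connect(new InetSocketAddress(host, port), 1000);
                return socket;
            } catch (IOException e) {
                socket.close();
                if (maxRetry >= 0 && attempt >= maxRetry) {
                    throw e; // retries exhausted
                }
                attempt++;
                Thread.sleep(delayMs); // wait before the next attempt
            }
        }
    }
}
```

With maxRetry = -1 the call only returns once a server is listening. What
happens after the connection is established (for example when the peer sends
nothing or closes the stream) is a separate question that this loop does not
cover.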

On Fri, Feb 19, 2016 at 9:13 PM, Stephan Ewen <se...@apache.org> wrote:

> Hi Saiph!
>
> What is the problem that is happening? The log actually looks like the Job
> is successfully sent to the JobManager.
>
> Stephan
>
>
>
> On Fri, Feb 19, 2016 at 8:49 PM, Robert Metzger <rm...@apache.org>
> wrote:
>
>> Hi,
>> can you maybe (if you want also private) send me the full logs of the
>> jobmanager? The messages you've posted here are logged at DEBUG level. They
>> don't indicate an erroneous behavior of the system.
>>
>> On Fri, Feb 19, 2016 at 8:44 PM, Saiph Kappa <sa...@gmail.com>
>> wrote:
>>
>>> These were the parameters that I set btw:
>>>
>>> akka.watch.heartbeat.interval: 100
>>> akka.transport.heartbeat.interval: 1000
>>>
>>> On Fri, Feb 19, 2016 at 7:43 PM, Saiph Kappa <sa...@gmail.com>
>>> wrote:
>>>
>>>> I am not sure.
>>>>
>>>> For that particular machine I get messages like these:
>>>> «
>>>> myip:6123/user/jobmanager#291801197])) at akka://flink/user/$a from
>>>> Actor[akka://flink/deadLetters].
>>>> [INFO] o.a.f.r.c.JobClientActor    - Connected to new
>>>> JobManager akka.tcp://flink@myip:6123/user/jobmanager.
>>>>
>>>> [INFO] o.a.f.r.c.JobClientActor    - Sending message to
>>>> JobManager akka.tcp://flink@myip:6123/user/jobmanager to submit job
>>>> JOB1 (5f9cef0c2e4b69530bf1e2485e94d326) and wait for progress
>>>>
>>>>
>>>> [DEBUG] o.a.f.r.c.JobClientActor    - Handled message
>>>> LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197]))
>>>> in 48 ms from Actor[akka://flink/deadLetters].
>>>>
>>>>
>>>> [DEBUG] o.a.f.r.c.JobClientActor    - Handled message
>>>> LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197]))
>>>> in 48 ms from Actor[akka://flink/deadLetters].
>>>>
>>>> [DEBUG] o.a.f.r.c.JobClientActor    - Received message
>>>> JobSubmitSuccess(2575d5ff5c10336beb7820a052a63623) at akka://flink/user/$a
>>>> from Actor[akka.tcp://flink@myip:6123/user/jobmanager#1144818256].
>>>> »
>>>>
>>>> I tried to set the heartbeat interval in the cluster but it didn't
>>>> solve the problem, should I try to set it in the client (how can I do it)?
>>>> I see no other errors or exceptions on the log files.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Feb 19, 2016 at 7:07 PM, Robert Metzger <rm...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Saiph,
>>>>>
>>>>> are you sure that the jobs are cancelled because the client
>>>>> disconnects?
>>>>>
>>>>> For the different timeouts, check the configuration page:
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html
>>>>> and search for "heartbeat".
>>>>>
>>>>> On Fri, Feb 19, 2016 at 8:04 PM, Saiph Kappa <sa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a Flink client application that launches jobs to remote
>>>>>> clusters. However I'm getting my jobs cancelled:
>>>>>> "18:25:29,650 WARN
>>>>>> akka.remote.ReliableDeliverySupervisor                        - Association
>>>>>> with remote system [akka.tcp://flink@127.0.0.1:52929] has failed,
>>>>>> address is now gated for [5000] ms. Reason is: [Disassociated]."
>>>>>>
>>>>>> How can I increase the akka heartbeat interval? Where should I set
>>>>>> that configuration parameter, in the client or in the Flink clusters, and
>>>>>> in which file.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: How to increase akka heartbeat?

Posted by Stephan Ewen <se...@apache.org>.
Hi Saiph!

What is the problem that is happening? The log actually looks like the Job
is successfully sent to the JobManager.

Stephan




Re: How to increase akka heartbeat?

Posted by Robert Metzger <rm...@apache.org>.
Hi,
Can you send me the full logs of the JobManager (privately, if you
prefer)? The messages you've posted here are logged at DEBUG level. They
don't indicate erroneous behavior of the system.


Re: How to increase akka heartbeat?

Posted by Saiph Kappa <sa...@gmail.com>.
These were the parameters that I set btw:

akka.watch.heartbeat.interval: 100
akka.transport.heartbeat.interval: 1000
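
A minimal sketch of where such entries would normally go, assuming a
standard setup in which the JobManager and TaskManagers read
conf/flink-conf.yaml at startup. The duration format with a unit and the
values themselves are assumptions for illustration, not recommendations; the
configuration page for the Flink version in use has the authoritative format
and defaults.

```yaml
# conf/flink-conf.yaml (illustrative values only)
akka.watch.heartbeat.interval: 10 s
akka.transport.heartbeat.interval: 1000 s
```

Changes to this file most likely only take effect after the affected Flink
processes are restarted.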


Re: How to increase akka heartbeat?

Posted by Saiph Kappa <sa...@gmail.com>.
I am not sure.

For that particular machine I get messages like these:
«
myip:6123/user/jobmanager#291801197])) at akka://flink/user/$a from
Actor[akka://flink/deadLetters].
[INFO] o.a.f.r.c.JobClientActor    - Connected to new
JobManager akka.tcp://flink@myip:6123/user/jobmanager.

[INFO] o.a.f.r.c.JobClientActor    - Sending message to
JobManager akka.tcp://flink@myip:6123/user/jobmanager to submit job JOB1
(5f9cef0c2e4b69530bf1e2485e94d326) and wait for progress


[DEBUG] o.a.f.r.c.JobClientActor    - Handled message
LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197]))
in 48 ms from Actor[akka://flink/deadLetters].


[DEBUG] o.a.f.r.c.JobClientActor    - Handled message
LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197]))
in 48 ms from Actor[akka://flink/deadLetters].

[DEBUG] o.a.f.r.c.JobClientActor    - Received message
JobSubmitSuccess(2575d5ff5c10336beb7820a052a63623) at akka://flink/user/$a
from Actor[akka.tcp://flink@myip:6123/user/jobmanager#1144818256].
»

I tried to set the heartbeat interval in the cluster, but it didn't solve
the problem. Should I try to set it in the client (and how can I do that)?
I see no other errors or exceptions in the log files.





Re: How to increase akka heartbeat?

Posted by Robert Metzger <rm...@apache.org>.
Hi Saiph,

Are you sure that the jobs are cancelled because the client disconnects?

For the different timeouts, check the configuration page:
https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html
and search for "heartbeat".
