Posted to dev@flink.apache.org by Deepak Jha <dk...@gmail.com> on 2016/03/04 02:48:31 UTC

Remote TaskManager Connection Problem

Hi All,
I've created two Docker containers on my local machine, one running the
JM (192.168.99.104) and the other running the TM. I expected to see the TM in
the JM UI, but it did not show up. The TM logs contain the following
lines:


01:29:50,862 DEBUG org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager process reaper
01:29:50,868 INFO  org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-be63f351-2bce-48ef-bbc4-fb0f40fecd49
01:29:51,093 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#1222392284.
01:29:51,095 INFO  org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: 140efeb188cc (dataPort=6122)
01:29:51,096 INFO  org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 1 task slot(s).
01:29:51,097 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 386/494/494 MB, NON HEAP: 30/31/-1 MB (used/committed/max)]
01:29:51,104 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.99.104:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
01:29:51,633 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.99.104:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
01:29:52,652 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.99.104:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
01:29:54,672 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.99.104:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
01:29:58,693 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.99.104:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)
01:30:06,702 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.99.104:6123/user/jobmanager (attempt 6, timeout: 16000 milliseconds)
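
The registration attempts above use a timeout that starts at 500 ms and doubles on every retry. The sketch below only reproduces the schedule visible in this log; the function name and its parameters are hypothetical, and it is not Flink's actual registration code.

```python
def registration_timeouts(initial_ms=500, attempts=6):
    """Return per-attempt timeouts (ms) of a doubling backoff.

    Mirrors the schedule seen in the TaskManager log; illustration only,
    not Flink's registration logic.
    """
    timeouts = []
    timeout = initial_ms
    for _ in range(attempts):
        timeouts.append(timeout)
        timeout *= 2  # each failed attempt doubles the next timeout
    return timeouts
```

`registration_timeouts()` returns [500, 1000, 2000, 4000, 8000, 16000], matching the six logged attempts, which together add up to 31.5 seconds of waiting.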


However, from the TM container I am able to reach the JM on port 6123:
root@140efeb188cc:/# nc -v 192.168.99.104 6123
Connection to 192.168.99.104 6123 port [tcp/*] succeeded!
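
The nc check only shows that a TCP connection to port 6123 can be opened; it does not show that the process behind the port accepts the TaskManager's registration messages. A minimal Python equivalent of `nc -v host port`, useful for scripting the same check (the function name is hypothetical):

```python
import socket


def port_reachable(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Like `nc -v host port`, this only proves that the TCP handshake
    succeeds; it says nothing about whether the listening process will
    accept application-level (e.g. Akka) messages.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, unreachable, timed out, unresolvable, ...
        return False
```

Run inside the TM container, `port_reachable("192.168.99.104", 6123)` performs the same check as the nc command above.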


The masters file on the TM contains:
192.168.99.104:8080

Has anyone run into this issue with a remote JM/TM setup?

-- 
Thanks,
Deepak Jha

Re: Remote TaskManager Connection Problem

Posted by Deepak Jha <dk...@gmail.com>.
Hi Stephan,
Thanks for the response. I was able to resolve the issue: I was using
localhost as the JobManager name instead of the container name. There were
a few more issues I would like to mention:
- I'm using S3 for storage/checkpoints in Flink HA mode. I realized that I
have to set fs.hdfs.hadoopconf in conf/flink-conf.yaml and add
core-site.xml to conf/. Since I'm deploying on AWS, I had to provide
hadoop-aws.jar as well.
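
A minimal sketch of the S3 setup described above, assuming Hadoop's S3A file system is used for s3:// paths. It writes a core-site.xml that maps the s3 scheme to org.apache.hadoop.fs.s3a.S3AFileSystem and returns the flink-conf.yaml line that points fs.hdfs.hadoopconf at the directory holding that file. The helper name and the buffer directory are hypothetical; credentials are not handled here, and hadoop-aws.jar still has to be made available to Flink separately.

```python
import os
import xml.etree.ElementTree as ET

# Assumed Hadoop properties for S3 access through S3A. Credentials are left
# out; S3A can read them from fs.s3a.access.key / fs.s3a.secret.key.
S3A_PROPERTIES = {
    "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3a.buffer.dir": "/tmp",  # hypothetical local buffer directory
}


def write_core_site(conf_dir, properties=S3A_PROPERTIES):
    """Write conf_dir/core-site.xml and return the matching flink-conf.yaml line."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    path = os.path.join(conf_dir, "core-site.xml")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    # Flink looks for Hadoop config files in the directory named by this key.
    return "fs.hdfs.hadoopconf: " + os.path.abspath(conf_dir)
```

For example, `write_core_site("conf")` creates conf/core-site.xml, and the returned line can be appended to conf/flink-conf.yaml.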


On Fri, Mar 4, 2016 at 1:22 AM, Stephan Ewen <se...@apache.org> wrote:

> The  pull request https://github.com/apache/flink/pull/1758 should improve
> the TaskManager's network interface selection.

-- 
Thanks,
Deepak Jha

Re: Remote TaskManager Connection Problem

Posted by Stephan Ewen <se...@apache.org>.
The pull request https://github.com/apache/flink/pull/1758 should improve
the TaskManager's network interface selection.
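
The general idea of choosing a network interface can be illustrated with a common trick: ask the operating system which local address it would use to reach the JobManager. The sketch below does this with a connected UDP socket; it is an assumption-level illustration, not the logic implemented in that pull request, and the function name is hypothetical.

```python
import socket


def local_address_towards(remote_host, remote_port):
    """Return the local IP the OS would use to reach remote_host:remote_port.

    connect() on a UDP socket only selects a route and a source address;
    no packet is sent, so this works even if nothing listens on the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((remote_host, remote_port))
        return sock.getsockname()[0]
```

Inside the TM container, `local_address_towards("192.168.99.104", 6123)` would return the container address that routes to the JM, a reasonable candidate for the address the TM advertises.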

Re: Remote TaskManager Connection Problem

Posted by Stephan Ewen <se...@apache.org>.
Hi!

During this registration phase, the TaskManager tries to tell the
JobManager that it is available.
If that fails, there are two possible reasons:

  1) Network communication to the port is not possible
      1.1) The JobManager IP is really not reachable (not the case, as you described)
      1.2) The TaskManager selected the wrong network interface
  2) The JobManager is not listening
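
Reason 2 has a less obvious variant that fits the resolution reported in this thread (the JobManager was configured with localhost instead of the container name). Akka remoting is understood to deliver a message only if the address the sender targets matches, host name included, the address the receiving actor system advertises; otherwise the message is dropped, so a TCP check like nc succeeds while registration keeps timing out. This is an assumption about the mechanism, not something confirmed in the thread. The sketch compares two akka.tcp addresses in that strict, textual way; the helper names are hypothetical and only the simple address form from the log is handled.

```python
import re

# Matches the simple form akka.tcp://<system>@<host>:<port>[/path]
AKKA_ADDRESS = re.compile(
    r"^akka\.tcp://(?P<system>[^@]+)@(?P<host>[^:/]+):(?P<port>\d+)"
)


def akka_endpoint(url):
    """Return (system, host, port) parsed from an akka.tcp URL."""
    match = AKKA_ADDRESS.match(url)
    if match is None:
        raise ValueError("not an akka.tcp address: " + url)
    return match.group("system"), match.group("host"), int(match.group("port"))


def addresses_match(target_url, advertised_url):
    """True if the sender's target equals the receiver's advertised address.

    Assumption: the comparison is purely textual, so 'localhost' and
    '192.168.99.104' count as different even if they reach the same host.
    """
    return akka_endpoint(target_url) == akka_endpoint(advertised_url)
```

With the TM targeting `akka.tcp://flink@192.168.99.104:6123/user/jobmanager` and a JM advertising `akka.tcp://flink@localhost:6123`, `addresses_match` returns False.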


To look into 1.2, can you check the beginning of the TaskManager log, where
it says which interface/hostname the TaskManager selected?
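
The hostname the TaskManager chose appears in the "TaskManager data connection information" line near the start of its log; in the log from this thread it is 140efeb188cc (dataPort=6122). A sketch that extracts that value, assuming the line format shown in this thread (other Flink versions may word it differently); the function name is hypothetical.

```python
import re

# Pattern for the log line shown in this thread, e.g.
# "... TaskManager data connection information: 140efeb188cc (dataPort=6122)"
DATA_CONNECTION = re.compile(
    r"TaskManager data connection information:\s*(?P<host>\S+)\s*\(dataPort=(?P<port>\d+)\)"
)


def selected_host_and_port(log_lines):
    """Return (host, dataPort) from the first matching log line, or None."""
    for line in log_lines:
        match = DATA_CONNECTION.search(line)
        if match:
            return match.group("host"), int(match.group("port"))
    return None
```

Passing an open TaskManager log file (any iterable of lines works) returns the selected host and data port, which can then be compared with the address the JobManager can actually reach.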

Thanks,
Stephan