You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Saliya Ekanayake <es...@gmail.com> on 2016/06/29 23:44:30 UTC

Parameters to Control Intra-node Parallelism

Hi,

We are trying to scale some of our scientific applications written in
Flink. A few questions on tuning Flink performance.

1. What parameters are available to control parallelism within a node?
2. Does Flink support shared memory-based messaging within a node (without
doing TCP calls)?
3. Is there support for Infiniband interconnect?

Thank you,
Saliya

-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Parameters to Control Intra-node Parallelism

Posted by Saliya Ekanayake <es...@gmail.com>.
Thank you!

On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uc...@apache.org> wrote:

> Yes, exactly.
>
> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <es...@gmail.com>
> wrote:
> > Thank you, yes, it can be done externally, if not supported within Flink.
> >
> > So the way to spawn multiple task managers would be to list the same
> slave
> > machines N times as necessary in the slaves file?
> >
> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org> wrote:
> >>
> >> No, not inside of Flink. That sounds like something like the OS or
> >> resource manager should handle.
> >>
> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <es...@gmail.com>
> >> wrote:
> >> > That's great, so is there support to pin task managers to sockets as
> >> > well?
> >> >
> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org> wrote:
> >> >>
> >> >> Regarding 2) if you don't manually configure something else, that
> >> >> should happen always.
> >> >>
> >> >> Yes, you can run more than one task manager per node depending on the
> >> >> process isolation you want. Within a task manager, there are multiple
> >> >> threads for each slot. For example, if you have 2 task managers with
> 2
> >> >> slots each and submit a job with parallelism 4, each task manager
> will
> >> >> execute 2 sub tasks in separate Threads.
> >> >>
> >> >>
> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <es...@gmail.com>
> >> >> wrote:
> >> >> > Hi Ufuk,
> >> >> >
> >> >> > Looking at the document you sent it seems only 1 task manager per
> >> >> > node
> >> >> > exist
> >> >> > and within that you have multiple slots. Is it possible to run more
> >> >> > than
> >> >> > 1
> >> >> > task manager per node? Also, within a task manager is the
> parallelism
> >> >> > done
> >> >> > through threads or processes?
> >> >> >
> >> >> > Thank you,
> >> >> > Saliya
> >> >> >
> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake <
> esaliya@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> Thank you, I'll check these.
> >> >> >>
> >> >> >> In 2.) you said they are likely to exchange through memory. Is
> there
> >> >> >> a
> >> >> >> case why they wouldn't?
> >> >> >>
> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uc...@apache.org>
> wrote:
> >> >> >>>
> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
> >> >> >>> <es...@gmail.com>
> >> >> >>> wrote:
> >> >> >>> > 1. What parameters are available to control parallelism within
> a
> >> >> >>> > node?
> >> >> >>>
> >> >> >>> Task Manager processing slots:
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
> >> >> >>>
> >> >> >>> > 2. Does Flink support shared memory-based messaging within a
> node
> >> >> >>> > (without
> >> >> >>> > doing TCP calls)?
> >> >> >>>
> >> >> >>> Yes, local exchanges happen via memory and not TCP, for example
> if
> >> >> >>> you
> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are likely
> to
> >> >> >>> exchange data locally.
> >> >> >>>
> >> >> >>> > 3. Is there support for Infiniband interconnect?
> >> >> >>>
> >> >> >>> No, not that I'm aware of.
> >> >> >>>
> >> >> >>> – Ufuk
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Saliya Ekanayake
> >> >> >> Ph.D. Candidate | Research Assistant
> >> >> >> School of Informatics and Computing | Digital Science Center
> >> >> >> Indiana University, Bloomington
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Saliya Ekanayake
> >> >> > Ph.D. Candidate | Research Assistant
> >> >> > School of Informatics and Computing | Digital Science Center
> >> >> > Indiana University, Bloomington
> >> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Saliya Ekanayake
> >> > Ph.D. Candidate | Research Assistant
> >> > School of Informatics and Computing | Digital Science Center
> >> > Indiana University, Bloomington
> >> >
> >
> >
> >
> >
> > --
> > Saliya Ekanayake
> > Ph.D. Candidate | Research Assistant
> > School of Informatics and Computing | Digital Science Center
> > Indiana University, Bloomington
> >
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Parameters to Control Intra-node Parallelism

Posted by Saliya Ekanayake <es...@gmail.com>.
Thank you, Ovidiu.

On Wed, Jul 13, 2016 at 3:34 PM, Ovidiu-Cristian MARCU <
ovidiu-cristian.marcu@inria.fr> wrote:

> Hi,
>
> I would pay attention to the memory settings such that
> heap+off-heap+network buffers can be served from your node’s RAM for both
> TMs.
> Also, there is some correlation between the number of buffers, parallelism
> and your workflow’s operators. The suggestion to be used for the
> numberOfBuffers does not work in every case.
>
> I guess the numberOfBuffers could be automatically determined based on the
> parallelism and workflow’s operators, not sure how to do that.
>
> Best,
> Ovidiu
>
> On 12 Jul 2016, at 21:18, Saliya Ekanayake <es...@gmail.com> wrote:
>
> Hi Ovidiu,
>
> Checking the /var/log/messages based on Greg's response revealed TMs were
> killed due to out of memory. Here's the node architecture. Each node has
> 128GB of RAM. I was trying to run 2 TMs per node binding each to 12 cores
> (or 1 socket). The total number of nodes were 16. I finally, managed to get
> it working with the following (non-default) settings.
>
> taskmanager.heap.mb: 12288
> taskmanager.numberOfTaskSlots: 12
> akka.ask.timeout: 1000 s
> taskmanager.network.numberOfBuffers: 36864
>
> Note, the number of buffers value, this had to be higher (twice in this
> case) than what's suggested in Flink (#slots-per-TM^2 * #TMs * 4, which
> would be 12*12*32*4 = 18432). Otherwise, it would throw me the not enough
> buffers error.
>
> Thank you,
> Saliya
>
> <juliet65.png>
>
> On Tue, Jul 12, 2016 at 7:39 AM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.marcu@inria.fr> wrote:
>
>> Hi,
>>
>> Can you post your configuration parameters (exclude default settings) and
>> cluster description?
>>
>> Best,
>> Ovidiu
>>
>> On 11 Jul 2016, at 17:49, Saliya Ekanayake <es...@gmail.com> wrote:
>>
>> Thank you Greg, I'll check if this was the cause for my TMs to disappear.
>>
>> On Mon, Jul 11, 2016 at 11:34 AM, Greg Hogan <co...@greghogan.com> wrote:
>>
>>> The OOM killer doesn't give warning so you'll need to call dmesg or look
>>> in /var/log/messages or similar. The following reports that Debian flavors
>>> may use /var/log/syslog.
>>>
>>> http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
>>>
>>> On Sun, Jul 10, 2016 at 11:55 PM, Saliya Ekanayake <es...@gmail.com>
>>> wrote:
>>>
>>>> Greg,
>>>>
>>>> where did you see the OOM log as shown in this mail thread? In my case
>>>> none of the TaskManagers nor JobManger reports an error like this.
>>>>
>>>> On Sun, Jul 10, 2016 at 8:45 PM, Greg Hogan <co...@greghogan.com> wrote:
>>>>
>>>>> These symptoms sounds similar to what I was experiencing in the
>>>>> following thread. Flink can have some unexpected memory usage which can
>>>>> result in an OOM kill by the kernel, and this becomes more pronounced as
>>>>> the cluster size grows.
>>>>>   https://www.mail-archive.com/dev@flink.apache.org/msg06346.html
>>>>>
>>>>> On Fri, Jul 8, 2016 at 12:46 PM, Saliya Ekanayake <es...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I checked, but JVMs didn't crash. No puppet or other services like
>>>>>> that.
>>>>>>
>>>>>> One thing I found is that things work OK when I have a smaller number
>>>>>> of slaves. For example, here I was trying to run on 16 nodes giving 2 TMs
>>>>>> each. Then I reduced it to 4 nodes each with 2 TMs, which worked.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rm...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> from the TaskManager logs, I can not see anything suspicious.
>>>>>>> Its a bit weird that the TaskManager logs just end, without any
>>>>>>> shutdown messages. Usually the TMs log some shut down stuff when they are
>>>>>>> stopping.
>>>>>>> Also, if they would be still running, I would expect some error
>>>>>>> messages from akka about the connection status.
>>>>>>> So the only thing I conclude is that one of the TMs was killed by
>>>>>>> the OS or the JVM crashed. Did you check if that happened?
>>>>>>>
>>>>>>> Do you have any service like puppet that is controlling processes?
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <es...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I see two logs (attached), but there's only 1 TaskManger process.
>>>>>>>> Also, the Web console says it can find only 1 TM.
>>>>>>>>
>>>>>>>> However, I see this part in JM log, which shows there was a second
>>>>>>>> TM at one point, but it was unregistered. Any thoughts?
>>>>>>>>
>>>>>>>> --------------------------
>>>>>>>>
>>>>>>>> - Registered TaskManager at j-002 (akka.tcp://
>>>>>>>> flink@172.16.0.2:42888/user/taskmanager) as
>>>>>>>> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1.
>>>>>>>> Current number of alive task slots is 12.
>>>>>>>>
>>>>>>>> 2016-07-07 11:32:40,363 WARN
>>>>>>>>  akka.remote.ReliableDeliverySupervisor - Association with remote system
>>>>>>>> [akka.tcp://flink@172.16.0.2:42888] has failed, address is now
>>>>>>>> gated for [5000] ms. Reason is: [Disassociated].
>>>>>>>>
>>>>>>>> 2016-07-07 11:32:42,722 INFO
>>>>>>>>  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager
>>>>>>>> at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as
>>>>>>>> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2.
>>>>>>>> Current number of alive task slots is 24.
>>>>>>>>
>>>>>>>> 2016-07-07 11:33:15,316 WARN  Remoting - Tried to associate with
>>>>>>>> unreachable remote address [akka.tcp://flink@172.16.0.2:42888].
>>>>>>>> Address is now gated for 5000 ms, all messages to this address will be
>>>>>>>> delivered to dead letters. Reason: Connection refused: /
>>>>>>>> 172.16.0.2:42888
>>>>>>>>
>>>>>>>> 2016-07-07 11:33:15,320 INFO
>>>>>>>>  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://
>>>>>>>> flink@172.16.0.2:42888/user/taskmanager terminated.
>>>>>>>> 2016-07-07 11:33:15,320 INFO
>>>>>>>>  org.apache.flink.runtime.instance.InstanceManager - Unregistered task
>>>>>>>> manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number
>>>>>>>> of registered task managers 1. Number of available slots 12.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> No that should suffice. Can you check whether there are any task
>>>>>>>>> manager logs for the second TM on that machine
>>>>>>>>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the
>>>>>>>>> task
>>>>>>>>> manager process does start up and there is another problem. If not,
>>>>>>>>> the task managers seems not to start even.
>>>>>>>>>
>>>>>>>>> – Ufuk
>>>>>>>>>
>>>>>>>>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <
>>>>>>>>> esaliya@gmail.com> wrote:
>>>>>>>>> > I tried to run more than one task manager per node by
>>>>>>>>> duplicating the slave
>>>>>>>>> > IPs. At startup it says for example,
>>>>>>>>> >
>>>>>>>>> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
>>>>>>>>> > Starting taskmanager daemon on host j-011.
>>>>>>>>> >
>>>>>>>>> > but I only see 1 task manager process running.
>>>>>>>>> >
>>>>>>>>> > Is there anything else I need to do?
>>>>>>>>> >
>>>>>>>>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uc...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>> >>
>>>>>>>>> >> Yes, exactly.
>>>>>>>>> >>
>>>>>>>>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <
>>>>>>>>> esaliya@gmail.com>
>>>>>>>>> >> wrote:
>>>>>>>>> >> > Thank you, yes, it can be done externally, if not supported
>>>>>>>>> within
>>>>>>>>> >> > Flink.
>>>>>>>>> >> >
>>>>>>>>> >> > So the way to spawn multiple task managers would be to list
>>>>>>>>> the same
>>>>>>>>> >> > slave
>>>>>>>>> >> > machines N times as necessary in the slaves file?
>>>>>>>>> >> >
>>>>>>>>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>> >> >>
>>>>>>>>> >> >> No, not inside of Flink. That sounds like something like the
>>>>>>>>> OS or
>>>>>>>>> >> >> resource manager should handle.
>>>>>>>>> >> >>
>>>>>>>>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <
>>>>>>>>> esaliya@gmail.com>
>>>>>>>>> >> >> wrote:
>>>>>>>>> >> >> > That's great, so is there support to pin task managers to
>>>>>>>>> sockets as
>>>>>>>>> >> >> > well?
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <
>>>>>>>>> uce@apache.org> wrote:
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >> Regarding 2) if you don't manually configure something
>>>>>>>>> else, that
>>>>>>>>> >> >> >> should happen always.
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >> Yes, you can run more than one task manager per node
>>>>>>>>> depending on
>>>>>>>>> >> >> >> the
>>>>>>>>> >> >> >> process isolation you want. Within a task manager, there
>>>>>>>>> are
>>>>>>>>> >> >> >> multiple
>>>>>>>>> >> >> >> threads for each slot. For example, if you have 2 task
>>>>>>>>> managers with
>>>>>>>>> >> >> >> 2
>>>>>>>>> >> >> >> slots each and submit a job with parallelism 4, each task
>>>>>>>>> manager
>>>>>>>>> >> >> >> will
>>>>>>>>> >> >> >> execute 2 sub tasks in separate Threads.
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <
>>>>>>>>> esaliya@gmail.com>
>>>>>>>>> >> >> >> wrote:
>>>>>>>>> >> >> >> > Hi Ufuk,
>>>>>>>>> >> >> >> >
>>>>>>>>> >> >> >> > Looking at the document you sent it seems only 1 task
>>>>>>>>> manager per
>>>>>>>>> >> >> >> > node
>>>>>>>>> >> >> >> > exist
>>>>>>>>> >> >> >> > and within that you have multiple slots. Is it possible
>>>>>>>>> to run
>>>>>>>>> >> >> >> > more
>>>>>>>>> >> >> >> > than
>>>>>>>>> >> >> >> > 1
>>>>>>>>> >> >> >> > task manager per node? Also, within a task manager is
>>>>>>>>> the
>>>>>>>>> >> >> >> > parallelism
>>>>>>>>> >> >> >> > done
>>>>>>>>> >> >> >> > through threads or processes?
>>>>>>>>> >> >> >> >
>>>>>>>>> >> >> >> > Thank you,
>>>>>>>>> >> >> >> > Saliya
>>>>>>>>> >> >> >> >
>>>>>>>>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
>>>>>>>>> >> >> >> > <es...@gmail.com>
>>>>>>>>> >> >> >> > wrote:
>>>>>>>>> >> >> >> >>
>>>>>>>>> >> >> >> >> Thank you, I'll check these.
>>>>>>>>> >> >> >> >>
>>>>>>>>> >> >> >> >> In 2.) you said they are likely to exchange through
>>>>>>>>> memory. Is
>>>>>>>>> >> >> >> >> there
>>>>>>>>> >> >> >> >> a
>>>>>>>>> >> >> >> >> case why they wouldn't?
>>>>>>>>> >> >> >> >>
>>>>>>>>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <
>>>>>>>>> uce@apache.org>
>>>>>>>>> >> >> >> >> wrote:
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
>>>>>>>>> >> >> >> >>> <es...@gmail.com>
>>>>>>>>> >> >> >> >>> wrote:
>>>>>>>>> >> >> >> >>> > 1. What parameters are available to control
>>>>>>>>> parallelism within
>>>>>>>>> >> >> >> >>> > a
>>>>>>>>> >> >> >> >>> > node?
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> Task Manager processing slots:
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>>
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging
>>>>>>>>> within a
>>>>>>>>> >> >> >> >>> > node
>>>>>>>>> >> >> >> >>> > (without
>>>>>>>>> >> >> >> >>> > doing TCP calls)?
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP,
>>>>>>>>> for example
>>>>>>>>> >> >> >> >>> if
>>>>>>>>> >> >> >> >>> you
>>>>>>>>> >> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1
>>>>>>>>> are likely
>>>>>>>>> >> >> >> >>> to
>>>>>>>>> >> >> >> >>> exchange data locally.
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> No, not that I'm aware of.
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> – Ufuk
>>>>>>>>> >> >> >> >>
>>>>>>>>> >> >> >> >>
>>>>>>>>> >> >> >> >>
>>>>>>>>> >> >> >> >>
>>>>>>>>> >> >> >> >> --
>>>>>>>>> >> >> >> >> Saliya Ekanayake
>>>>>>>>> >> >> >> >> Ph.D. Candidate | Research Assistant
>>>>>>>>> >> >> >> >> School of Informatics and Computing | Digital Science
>>>>>>>>> Center
>>>>>>>>> >> >> >> >> Indiana University, Bloomington
>>>>>>>>> >> >> >> >>
>>>>>>>>> >> >> >> >
>>>>>>>>> >> >> >> >
>>>>>>>>> >> >> >> >
>>>>>>>>> >> >> >> > --
>>>>>>>>> >> >> >> > Saliya Ekanayake
>>>>>>>>> >> >> >> > Ph.D. Candidate | Research Assistant
>>>>>>>>> >> >> >> > School of Informatics and Computing | Digital Science
>>>>>>>>> Center
>>>>>>>>> >> >> >> > Indiana University, Bloomington
>>>>>>>>> >> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > --
>>>>>>>>> >> >> > Saliya Ekanayake
>>>>>>>>> >> >> > Ph.D. Candidate | Research Assistant
>>>>>>>>> >> >> > School of Informatics and Computing | Digital Science
>>>>>>>>> Center
>>>>>>>>> >> >> > Indiana University, Bloomington
>>>>>>>>> >> >> >
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >> > --
>>>>>>>>> >> > Saliya Ekanayake
>>>>>>>>> >> > Ph.D. Candidate | Research Assistant
>>>>>>>>> >> > School of Informatics and Computing | Digital Science Center
>>>>>>>>> >> > Indiana University, Bloomington
>>>>>>>>> >> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > --
>>>>>>>>> > Saliya Ekanayake
>>>>>>>>> > Ph.D. Candidate | Research Assistant
>>>>>>>>> > School of Informatics and Computing | Digital Science Center
>>>>>>>>> > Indiana University, Bloomington
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Saliya Ekanayake
>>>>>>>> Ph.D. Candidate | Research Assistant
>>>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>>>> Indiana University, Bloomington
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Saliya Ekanayake
>>>>>> Ph.D. Candidate | Research Assistant
>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>> Indiana University, Bloomington
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Saliya Ekanayake
>>>> Ph.D. Candidate | Research Assistant
>>>> School of Informatics and Computing | Digital Science Center
>>>> Indiana University, Bloomington
>>>>
>>>>
>>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>>
>>
>>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>
>
>


-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Parameters to Control Intra-node Parallelism

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.
Hi,

I would pay attention to the memory settings such that heap+off-heap+network buffers can be served from your node’s RAM for both TMs.
Also, there is some correlation between the number of buffers, parallelism and your workflow’s operators. The suggestion to be used for the numberOfBuffers does not work in every case.

I guess the numberOfBuffers could be automatically determined based on the parallelism and workflow’s operators, not sure how to do that.

Best,
Ovidiu

> On 12 Jul 2016, at 21:18, Saliya Ekanayake <es...@gmail.com> wrote:
> 
> Hi Ovidiu,
> 
> Checking the /var/log/messages based on Greg's response revealed TMs were killed due to out of memory. Here's the node architecture. Each node has 128GB of RAM. I was trying to run 2 TMs per node binding each to 12 cores (or 1 socket). The total number of nodes were 16. I finally, managed to get it working with the following (non-default) settings.
> 
> taskmanager.heap.mb: 12288
> taskmanager.numberOfTaskSlots: 12
> akka.ask.timeout: 1000 s
> taskmanager.network.numberOfBuffers: 36864
> 
> Note, the number of buffers value, this had to be higher (twice in this case) than what's suggested in Flink (#slots-per-TM^2 * #TMs * 4, which would be 12*12*32*4 = 18432). Otherwise, it would throw me the not enough buffers error.
> 
> Thank you,
> Saliya
> 
> <juliet65.png>
> 
> On Tue, Jul 12, 2016 at 7:39 AM, Ovidiu-Cristian MARCU <ovidiu-cristian.marcu@inria.fr <ma...@inria.fr>> wrote:
> Hi,
> 
> Can you post your configuration parameters (exclude default settings) and cluster description?
> 
> Best,
> Ovidiu
> 
>> On 11 Jul 2016, at 17:49, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Thank you Greg, I'll check if this was the cause for my TMs to disappear.
>> 
>> On Mon, Jul 11, 2016 at 11:34 AM, Greg Hogan <code@greghogan.com <ma...@greghogan.com>> wrote:
>> The OOM killer doesn't give warning so you'll need to call dmesg or look in /var/log/messages or similar. The following reports that Debian flavors may use /var/log/syslog.
>>   http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer <http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer>
>> 
>> On Sun, Jul 10, 2016 at 11:55 PM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>> wrote:
>> Greg,
>> 
>> where did you see the OOM log as shown in this mail thread? In my case none of the TaskManagers nor JobManger reports an error like this.
>> 
>> On Sun, Jul 10, 2016 at 8:45 PM, Greg Hogan <code@greghogan.com <ma...@greghogan.com>> wrote:
>> These symptoms sounds similar to what I was experiencing in the following thread. Flink can have some unexpected memory usage which can result in an OOM kill by the kernel, and this becomes more pronounced as the cluster size grows.
>>   https://www.mail-archive.com/dev@flink.apache.org/msg06346.html <https://www.mail-archive.com/dev@flink.apache.org/msg06346.html>
>> 
>> On Fri, Jul 8, 2016 at 12:46 PM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>> wrote:
>> I checked, but JVMs didn't crash. No puppet or other services like that.
>> 
>> One thing I found is that things work OK when I have a smaller number of slaves. For example, here I was trying to run on 16 nodes giving 2 TMs each. Then I reduced it to 4 nodes each with 2 TMs, which worked.
>> 
>> 
>> 
>> On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rmetzger@apache.org <ma...@apache.org>> wrote:
>> Hi,
>> from the TaskManager logs, I can not see anything suspicious.
>> Its a bit weird that the TaskManager logs just end, without any shutdown messages. Usually the TMs log some shut down stuff when they are stopping.
>> Also, if they would be still running, I would expect some error messages from akka about the connection status.
>> So the only thing I conclude is that one of the TMs was killed by the OS or the JVM crashed. Did you check if that happened?
>> 
>> Do you have any service like puppet that is controlling processes?
>> 
>> 
>> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>> wrote:
>> I see two logs (attached), but there's only 1 TaskManger process. Also, the Web console says it can find only 1 TM. 
>> 
>> However, I see this part in JM log, which shows there was a second TM at one point, but it was unregistered. Any thoughts?
>> 
>> --------------------------
>> 
>> - Registered TaskManager at j-002 (akka.tcp://flink@172.16.0.2:42888/user/taskmanager <http://flink@172.16.0.2:42888/user/taskmanager>) as 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1. Current number of alive task slots is 12.
>> 
>> 2016-07-07 11:32:40,363 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@172.16.0.2:42888 <http://flink@172.16.0.2:42888/>] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>> 
>> 2016-07-07 11:32:42,722 INFO  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager <http://flink@172.16.0.2:37373/user/taskmanager>) as 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2. Current number of alive task slots is 24.
>> 
>> 2016-07-07 11:33:15,316 WARN  Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@172.16.0.2:42888 <http://flink@172.16.0.2:42888/>]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.16.0.2:42888 <http://172.16.0.2:42888/>
>> 
>> 2016-07-07 11:33:15,320 INFO  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager <http://flink@172.16.0.2:42888/user/taskmanager> terminated.
>> 2016-07-07 11:33:15,320 INFO  org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager <http://flink@172.16.0.2:42888/user/taskmanager>. Number of registered task managers 1. Number of available slots 12.
>> 
>> 
>> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <uce@apache.org <ma...@apache.org>> wrote:
>> No that should suffice. Can you check whether there are any task
>> manager logs for the second TM on that machine
>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
>> manager process does start up and there is another problem. If not,
>> the task managers seems not to start even.
>> 
>> – Ufuk
>> 
>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>> wrote:
>> > I tried to run more than one task manager per node by duplicating the slave
>> > IPs. At startup it says for example,
>> >
>> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
>> > Starting taskmanager daemon on host j-011.
>> >
>> > but I only see 1 task manager process running.
>> >
>> > Is there anything else I need to do?
>> >
>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uce@apache.org <ma...@apache.org>> wrote:
>> >>
>> >> Yes, exactly.
>> >>
>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>>
>> >> wrote:
>> >> > Thank you, yes, it can be done externally, if not supported within
>> >> > Flink.
>> >> >
>> >> > So the way to spawn multiple task managers would be to list the same
>> >> > slave
>> >> > machines N times as necessary in the slaves file?
>> >> >
>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uce@apache.org <ma...@apache.org>> wrote:
>> >> >>
>> >> >> No, not inside of Flink. That sounds like something like the OS or
>> >> >> resource manager should handle.
>> >> >>
>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>>
>> >> >> wrote:
>> >> >> > That's great, so is there support to pin task managers to sockets as
>> >> >> > well?
>> >> >> >
>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uce@apache.org <ma...@apache.org>> wrote:
>> >> >> >>
>> >> >> >> Regarding 2) if you don't manually configure something else, that
>> >> >> >> should happen always.
>> >> >> >>
>> >> >> >> Yes, you can run more than one task manager per node depending on
>> >> >> >> the
>> >> >> >> process isolation you want. Within a task manager, there are
>> >> >> >> multiple
>> >> >> >> threads for each slot. For example, if you have 2 task managers with
>> >> >> >> 2
>> >> >> >> slots each and submit a job with parallelism 4, each task manager
>> >> >> >> will
>> >> >> >> execute 2 sub tasks in separate Threads.
>> >> >> >>
>> >> >> >>
>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>>
>> >> >> >> wrote:
>> >> >> >> > Hi Ufuk,
>> >> >> >> >
>> >> >> >> > Looking at the document you sent it seems only 1 task manager per
>> >> >> >> > node
>> >> >> >> > exist
>> >> >> >> > and within that you have multiple slots. Is it possible to run
>> >> >> >> > more
>> >> >> >> > than
>> >> >> >> > 1
>> >> >> >> > task manager per node? Also, within a task manager is the
>> >> >> >> > parallelism
>> >> >> >> > done
>> >> >> >> > through threads or processes?
>> >> >> >> >
>> >> >> >> > Thank you,
>> >> >> >> > Saliya
>> >> >> >> >
>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
>> >> >> >> > <esaliya@gmail.com <ma...@gmail.com>>
>> >> >> >> > wrote:
>> >> >> >> >>
>> >> >> >> >> Thank you, I'll check these.
>> >> >> >> >>
>> >> >> >> >> In 2.) you said they are likely to exchange through memory. Is
>> >> >> >> >> there
>> >> >> >> >> a
>> >> >> >> >> case why they wouldn't?
>> >> >> >> >>
>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uce@apache.org <ma...@apache.org>>
>> >> >> >> >> wrote:
>> >> >> >> >>>
>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
>> >> >> >> >>> <esaliya@gmail.com <ma...@gmail.com>>
>> >> >> >> >>> wrote:
>> >> >> >> >>> > 1. What parameters are available to control parallelism within
>> >> >> >> >>> > a
>> >> >> >> >>> > node?
>> >> >> >> >>>
>> >> >> >> >>> Task Manager processing slots:
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots <https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots>
>> >> >> >> >>>
>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging within a
>> >> >> >> >>> > node
>> >> >> >> >>> > (without
>> >> >> >> >>> > doing TCP calls)?
>> >> >> >> >>>
>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP, for example
>> >> >> >> >>> if
>> >> >> >> >>> you
>> >> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are likely
>> >> >> >> >>> to
>> >> >> >> >>> exchange data locally.
>> >> >> >> >>>
>> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
>> >> >> >> >>>
>> >> >> >> >>> No, not that I'm aware of.
>> >> >> >> >>>
>> >> >> >> >>> – Ufuk
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> --
>> >> >> >> >> Saliya Ekanayake
>> >> >> >> >> Ph.D. Candidate | Research Assistant
>> >> >> >> >> School of Informatics and Computing | Digital Science Center
>> >> >> >> >> Indiana University, Bloomington
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Saliya Ekanayake
>> >> >> >> > Ph.D. Candidate | Research Assistant
>> >> >> >> > School of Informatics and Computing | Digital Science Center
>> >> >> >> > Indiana University, Bloomington
>> >> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Saliya Ekanayake
>> >> >> > Ph.D. Candidate | Research Assistant
>> >> >> > School of Informatics and Computing | Digital Science Center
>> >> >> > Indiana University, Bloomington
>> >> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Saliya Ekanayake
>> >> > Ph.D. Candidate | Research Assistant
>> >> > School of Informatics and Computing | Digital Science Center
>> >> > Indiana University, Bloomington
>> >> >
>> >
>> >
>> >
>> >
>> > --
>> > Saliya Ekanayake
>> > Ph.D. Candidate | Research Assistant
>> > School of Informatics and Computing | Digital Science Center
>> > Indiana University, Bloomington
>> >
>> 
>> 
>> 
>> -- 
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> 
> 
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> 


Re: Parameters to Control Intra-node Parallelism

Posted by Saliya Ekanayake <es...@gmail.com>.
Hi Ovidiu,

Checking the /var/log/messages based on Greg's response revealed TMs were
killed due to out of memory. Here's the node architecture. Each node has
128GB of RAM. I was trying to run 2 TMs per node binding each to 12 cores
(or 1 socket). The total number of nodes were 16. I finally, managed to get
it working with the following (non-default) settings.

taskmanager.heap.mb: 12288
taskmanager.numberOfTaskSlots: 12
akka.ask.timeout: 1000 s
taskmanager.network.numberOfBuffers: 36864

Note, the number of buffers value, this had to be higher (twice in this
case) than what's suggested in Flink (#slots-per-TM^2 * #TMs * 4, which
would be 12*12*32*4 = 18432). Otherwise, it would throw me the not enough
buffers error.

Thank you,
Saliya

[image: Inline image 2]

On Tue, Jul 12, 2016 at 7:39 AM, Ovidiu-Cristian MARCU <
ovidiu-cristian.marcu@inria.fr> wrote:

> Hi,
>
> Can you post your configuration parameters (exclude default settings) and
> cluster description?
>
> Best,
> Ovidiu
>
> On 11 Jul 2016, at 17:49, Saliya Ekanayake <es...@gmail.com> wrote:
>
> Thank you Greg, I'll check if this was the cause for my TMs to disappear.
>
> On Mon, Jul 11, 2016 at 11:34 AM, Greg Hogan <co...@greghogan.com> wrote:
>
>> The OOM killer doesn't give warning so you'll need to call dmesg or look
>> in /var/log/messages or similar. The following reports that Debian flavors
>> may use /var/log/syslog.
>>
>> http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
>>
>> On Sun, Jul 10, 2016 at 11:55 PM, Saliya Ekanayake <es...@gmail.com>
>> wrote:
>>
>>> Greg,
>>>
>>> where did you see the OOM log as shown in this mail thread? In my case
>>> none of the TaskManagers nor JobManger reports an error like this.
>>>
>>> On Sun, Jul 10, 2016 at 8:45 PM, Greg Hogan <co...@greghogan.com> wrote:
>>>
>>>> These symptoms sounds similar to what I was experiencing in the
>>>> following thread. Flink can have some unexpected memory usage which can
>>>> result in an OOM kill by the kernel, and this becomes more pronounced as
>>>> the cluster size grows.
>>>>   https://www.mail-archive.com/dev@flink.apache.org/msg06346.html
>>>>
>>>> On Fri, Jul 8, 2016 at 12:46 PM, Saliya Ekanayake <es...@gmail.com>
>>>> wrote:
>>>>
>>>>> I checked, but JVMs didn't crash. No puppet or other services like
>>>>> that.
>>>>>
>>>>> One thing I found is that things work OK when I have a smaller number
>>>>> of slaves. For example, here I was trying to run on 16 nodes giving 2 TMs
>>>>> each. Then I reduced it to 4 nodes each with 2 TMs, which worked.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rm...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>> from the TaskManager logs, I can not see anything suspicious.
>>>>>> Its a bit weird that the TaskManager logs just end, without any
>>>>>> shutdown messages. Usually the TMs log some shut down stuff when they are
>>>>>> stopping.
>>>>>> Also, if they would be still running, I would expect some error
>>>>>> messages from akka about the connection status.
>>>>>> So the only thing I conclude is that one of the TMs was killed by the
>>>>>> OS or the JVM crashed. Did you check if that happened?
>>>>>>
>>>>>> Do you have any service like puppet that is controlling processes?
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <es...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I see two logs (attached), but there's only 1 TaskManger process.
>>>>>>> Also, the Web console says it can find only 1 TM.
>>>>>>>
>>>>>>> However, I see this part in JM log, which shows there was a second
>>>>>>> TM at one point, but it was unregistered. Any thoughts?
>>>>>>>
>>>>>>> --------------------------
>>>>>>>
>>>>>>> - Registered TaskManager at j-002 (akka.tcp://
>>>>>>> flink@172.16.0.2:42888/user/taskmanager) as
>>>>>>> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1.
>>>>>>> Current number of alive task slots is 12.
>>>>>>>
>>>>>>> 2016-07-07 11:32:40,363 WARN  akka.remote.ReliableDeliverySupervisor
>>>>>>> - Association with remote system [akka.tcp://flink@172.16.0.2:42888]
>>>>>>> has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>>>>>>
>>>>>>> 2016-07-07 11:32:42,722 INFO
>>>>>>>  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager
>>>>>>> at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as
>>>>>>> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2.
>>>>>>> Current number of alive task slots is 24.
>>>>>>>
>>>>>>> 2016-07-07 11:33:15,316 WARN  Remoting - Tried to associate with
>>>>>>> unreachable remote address [akka.tcp://flink@172.16.0.2:42888].
>>>>>>> Address is now gated for 5000 ms, all messages to this address will be
>>>>>>> delivered to dead letters. Reason: Connection refused: /
>>>>>>> 172.16.0.2:42888
>>>>>>>
>>>>>>> 2016-07-07 11:33:15,320 INFO
>>>>>>>  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://
>>>>>>> flink@172.16.0.2:42888/user/taskmanager terminated.
>>>>>>> 2016-07-07 11:33:15,320 INFO
>>>>>>>  org.apache.flink.runtime.instance.InstanceManager - Unregistered task
>>>>>>> manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number
>>>>>>> of registered task managers 1. Number of available slots 12.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>>>>>>
>>>>>>>> No that should suffice. Can you check whether there are any task
>>>>>>>> manager logs for the second TM on that machine
>>>>>>>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
>>>>>>>> manager process does start up and there is another problem. If not,
>>>>>>>> the task managers seems not to start even.
>>>>>>>>
>>>>>>>> – Ufuk
>>>>>>>>
>>>>>>>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <es...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> > I tried to run more than one task manager per node by duplicating
>>>>>>>> the slave
>>>>>>>> > IPs. At startup it says for example,
>>>>>>>> >
>>>>>>>> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
>>>>>>>> > Starting taskmanager daemon on host j-011.
>>>>>>>> >
>>>>>>>> > but I only see 1 task manager process running.
>>>>>>>> >
>>>>>>>> > Is there anything else I need to do?
>>>>>>>> >
>>>>>>>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uc...@apache.org>
>>>>>>>> wrote:
>>>>>>>> >>
>>>>>>>> >> Yes, exactly.
>>>>>>>> >>
>>>>>>>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <
>>>>>>>> esaliya@gmail.com>
>>>>>>>> >> wrote:
>>>>>>>> >> > Thank you, yes, it can be done externally, if not supported
>>>>>>>> within
>>>>>>>> >> > Flink.
>>>>>>>> >> >
>>>>>>>> >> > So the way to spawn multiple task managers would be to list
>>>>>>>> the same
>>>>>>>> >> > slave
>>>>>>>> >> > machines N times as necessary in the slaves file?
>>>>>>>> >> >
>>>>>>>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org>
>>>>>>>> wrote:
>>>>>>>> >> >>
>>>>>>>> >> >> No, not inside of Flink. That sounds like something like the
>>>>>>>> OS or
>>>>>>>> >> >> resource manager should handle.
>>>>>>>> >> >>
>>>>>>>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <
>>>>>>>> esaliya@gmail.com>
>>>>>>>> >> >> wrote:
>>>>>>>> >> >> > That's great, so is there support to pin task managers to
>>>>>>>> sockets as
>>>>>>>> >> >> > well?
>>>>>>>> >> >> >
>>>>>>>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <
>>>>>>>> uce@apache.org> wrote:
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> Regarding 2) if you don't manually configure something
>>>>>>>> else, that
>>>>>>>> >> >> >> should happen always.
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> Yes, you can run more than one task manager per node
>>>>>>>> depending on
>>>>>>>> >> >> >> the
>>>>>>>> >> >> >> process isolation you want. Within a task manager, there
>>>>>>>> are
>>>>>>>> >> >> >> multiple
>>>>>>>> >> >> >> threads for each slot. For example, if you have 2 task
>>>>>>>> managers with
>>>>>>>> >> >> >> 2
>>>>>>>> >> >> >> slots each and submit a job with parallelism 4, each task
>>>>>>>> manager
>>>>>>>> >> >> >> will
>>>>>>>> >> >> >> execute 2 sub tasks in separate Threads.
>>>>>>>> >> >> >>
>>>>>>>> >> >> >>
>>>>>>>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <
>>>>>>>> esaliya@gmail.com>
>>>>>>>> >> >> >> wrote:
>>>>>>>> >> >> >> > Hi Ufuk,
>>>>>>>> >> >> >> >
>>>>>>>> >> >> >> > Looking at the document you sent it seems only 1 task
>>>>>>>> manager per
>>>>>>>> >> >> >> > node
>>>>>>>> >> >> >> > exist
>>>>>>>> >> >> >> > and within that you have multiple slots. Is it possible
>>>>>>>> to run
>>>>>>>> >> >> >> > more
>>>>>>>> >> >> >> > than
>>>>>>>> >> >> >> > 1
>>>>>>>> >> >> >> > task manager per node? Also, within a task manager is the
>>>>>>>> >> >> >> > parallelism
>>>>>>>> >> >> >> > done
>>>>>>>> >> >> >> > through threads or processes?
>>>>>>>> >> >> >> >
>>>>>>>> >> >> >> > Thank you,
>>>>>>>> >> >> >> > Saliya
>>>>>>>> >> >> >> >
>>>>>>>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
>>>>>>>> >> >> >> > <es...@gmail.com>
>>>>>>>> >> >> >> > wrote:
>>>>>>>> >> >> >> >>
>>>>>>>> >> >> >> >> Thank you, I'll check these.
>>>>>>>> >> >> >> >>
>>>>>>>> >> >> >> >> In 2.) you said they are likely to exchange through
>>>>>>>> memory. Is
>>>>>>>> >> >> >> >> there
>>>>>>>> >> >> >> >> a
>>>>>>>> >> >> >> >> case why they wouldn't?
>>>>>>>> >> >> >> >>
>>>>>>>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <
>>>>>>>> uce@apache.org>
>>>>>>>> >> >> >> >> wrote:
>>>>>>>> >> >> >> >>>
>>>>>>>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
>>>>>>>> >> >> >> >>> <es...@gmail.com>
>>>>>>>> >> >> >> >>> wrote:
>>>>>>>> >> >> >> >>> > 1. What parameters are available to control
>>>>>>>> parallelism within
>>>>>>>> >> >> >> >>> > a
>>>>>>>> >> >> >> >>> > node?
>>>>>>>> >> >> >> >>>
>>>>>>>> >> >> >> >>> Task Manager processing slots:
>>>>>>>> >> >> >> >>>
>>>>>>>> >> >> >> >>>
>>>>>>>> >> >> >> >>>
>>>>>>>> >> >> >> >>>
>>>>>>>> >> >> >> >>>
>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>>>>>>>> >> >> >> >>>
>>>>>>>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging
>>>>>>>> within a
>>>>>>>> >> >> >> >>> > node
>>>>>>>> >> >> >> >>> > (without
>>>>>>>> >> >> >> >>> > doing TCP calls)?
>>>>>>>> >> >> >> >>>
>>>>>>>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP,
>>>>>>>> for example
>>>>>>>> >> >> >> >>> if
>>>>>>>> >> >> >> >>> you
>>>>>>>> >> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1
>>>>>>>> are likely
>>>>>>>> >> >> >> >>> to
>>>>>>>> >> >> >> >>> exchange data locally.
>>>>>>>> >> >> >> >>>
>>>>>>>> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
>>>>>>>> >> >> >> >>>
>>>>>>>> >> >> >> >>> No, not that I'm aware of.
>>>>>>>> >> >> >> >>>
>>>>>>>> >> >> >> >>> – Ufuk
>>>>>>>> >> >> >> >>
>>>>>>>> >> >> >> >>
>>>>>>>> >> >> >> >>
>>>>>>>> >> >> >> >>
>>>>>>>> >> >> >> >> --
>>>>>>>> >> >> >> >> Saliya Ekanayake
>>>>>>>> >> >> >> >> Ph.D. Candidate | Research Assistant
>>>>>>>> >> >> >> >> School of Informatics and Computing | Digital Science
>>>>>>>> Center
>>>>>>>> >> >> >> >> Indiana University, Bloomington
>>>>>>>> >> >> >> >>
>>>>>>>> >> >> >> >
>>>>>>>> >> >> >> >
>>>>>>>> >> >> >> >
>>>>>>>> >> >> >> > --
>>>>>>>> >> >> >> > Saliya Ekanayake
>>>>>>>> >> >> >> > Ph.D. Candidate | Research Assistant
>>>>>>>> >> >> >> > School of Informatics and Computing | Digital Science
>>>>>>>> Center
>>>>>>>> >> >> >> > Indiana University, Bloomington
>>>>>>>> >> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> > --
>>>>>>>> >> >> > Saliya Ekanayake
>>>>>>>> >> >> > Ph.D. Candidate | Research Assistant
>>>>>>>> >> >> > School of Informatics and Computing | Digital Science Center
>>>>>>>> >> >> > Indiana University, Bloomington
>>>>>>>> >> >> >
>>>>>>>> >> >
>>>>>>>> >> >
>>>>>>>> >> >
>>>>>>>> >> >
>>>>>>>> >> > --
>>>>>>>> >> > Saliya Ekanayake
>>>>>>>> >> > Ph.D. Candidate | Research Assistant
>>>>>>>> >> > School of Informatics and Computing | Digital Science Center
>>>>>>>> >> > Indiana University, Bloomington
>>>>>>>> >> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Saliya Ekanayake
>>>>>>>> > Ph.D. Candidate | Research Assistant
>>>>>>>> > School of Informatics and Computing | Digital Science Center
>>>>>>>> > Indiana University, Bloomington
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Saliya Ekanayake
>>>>>>> Ph.D. Candidate | Research Assistant
>>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>>> Indiana University, Bloomington
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Saliya Ekanayake
>>>>> Ph.D. Candidate | Research Assistant
>>>>> School of Informatics and Computing | Digital Science Center
>>>>> Indiana University, Bloomington
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>>
>>>
>>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>
>
>


-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Parameters to Control Intra-node Parallelism

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.
Hi,

Can you post your configuration parameters (exclude default settings) and cluster description?

Best,
Ovidiu
> On 11 Jul 2016, at 17:49, Saliya Ekanayake <es...@gmail.com> wrote:
> 
> Thank you Greg, I'll check if this was the cause for my TMs to disappear.
> 
> On Mon, Jul 11, 2016 at 11:34 AM, Greg Hogan <code@greghogan.com <ma...@greghogan.com>> wrote:
> The OOM killer doesn't give warning so you'll need to call dmesg or look in /var/log/messages or similar. The following reports that Debian flavors may use /var/log/syslog.
>   http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer <http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer>
> 
> On Sun, Jul 10, 2016 at 11:55 PM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>> wrote:
> Greg,
> 
> where did you see the OOM log as shown in this mail thread? In my case none of the TaskManagers nor JobManger reports an error like this.
> 
> On Sun, Jul 10, 2016 at 8:45 PM, Greg Hogan <code@greghogan.com <ma...@greghogan.com>> wrote:
> These symptoms sounds similar to what I was experiencing in the following thread. Flink can have some unexpected memory usage which can result in an OOM kill by the kernel, and this becomes more pronounced as the cluster size grows.
>   https://www.mail-archive.com/dev@flink.apache.org/msg06346.html <https://www.mail-archive.com/dev@flink.apache.org/msg06346.html>
> 
> On Fri, Jul 8, 2016 at 12:46 PM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>> wrote:
> I checked, but JVMs didn't crash. No puppet or other services like that.
> 
> One thing I found is that things work OK when I have a smaller number of slaves. For example, here I was trying to run on 16 nodes giving 2 TMs each. Then I reduced it to 4 nodes each with 2 TMs, which worked.
> 
> 
> 
> On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rmetzger@apache.org <ma...@apache.org>> wrote:
> Hi,
> from the TaskManager logs, I can not see anything suspicious.
> Its a bit weird that the TaskManager logs just end, without any shutdown messages. Usually the TMs log some shut down stuff when they are stopping.
> Also, if they would be still running, I would expect some error messages from akka about the connection status.
> So the only thing I conclude is that one of the TMs was killed by the OS or the JVM crashed. Did you check if that happened?
> 
> Do you have any service like puppet that is controlling processes?
> 
> 
> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>> wrote:
> I see two logs (attached), but there's only 1 TaskManger process. Also, the Web console says it can find only 1 TM. 
> 
> However, I see this part in JM log, which shows there was a second TM at one point, but it was unregistered. Any thoughts?
> 
> --------------------------
> 
> - Registered TaskManager at j-002 (akka.tcp://flink@172.16.0.2:42888/user/taskmanager <http://flink@172.16.0.2:42888/user/taskmanager>) as 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1. Current number of alive task slots is 12.
> 
> 2016-07-07 11:32:40,363 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@172.16.0.2:42888 <http://flink@172.16.0.2:42888/>] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 
> 2016-07-07 11:32:42,722 INFO  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager <http://flink@172.16.0.2:37373/user/taskmanager>) as 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2. Current number of alive task slots is 24.
> 
> 2016-07-07 11:33:15,316 WARN  Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@172.16.0.2:42888 <http://flink@172.16.0.2:42888/>]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.16.0.2:42888 <http://172.16.0.2:42888/>
> 
> 2016-07-07 11:33:15,320 INFO  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager <http://flink@172.16.0.2:42888/user/taskmanager> terminated.
> 2016-07-07 11:33:15,320 INFO  org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager <http://flink@172.16.0.2:42888/user/taskmanager>. Number of registered task managers 1. Number of available slots 12.
> 
> 
> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <uce@apache.org <ma...@apache.org>> wrote:
> No that should suffice. Can you check whether there are any task
> manager logs for the second TM on that machine
> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
> manager process does start up and there is another problem. If not,
> the task managers seems not to start even.
> 
> – Ufuk
> 
> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>> wrote:
> > I tried to run more than one task manager per node by duplicating the slave
> > IPs. At startup it says for example,
> >
> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
> > Starting taskmanager daemon on host j-011.
> >
> > but I only see 1 task manager process running.
> >
> > Is there anything else I need to do?
> >
> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uce@apache.org <ma...@apache.org>> wrote:
> >>
> >> Yes, exactly.
> >>
> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>>
> >> wrote:
> >> > Thank you, yes, it can be done externally, if not supported within
> >> > Flink.
> >> >
> >> > So the way to spawn multiple task managers would be to list the same
> >> > slave
> >> > machines N times as necessary in the slaves file?
> >> >
> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uce@apache.org <ma...@apache.org>> wrote:
> >> >>
> >> >> No, not inside of Flink. That sounds like something like the OS or
> >> >> resource manager should handle.
> >> >>
> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>>
> >> >> wrote:
> >> >> > That's great, so is there support to pin task managers to sockets as
> >> >> > well?
> >> >> >
> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uce@apache.org <ma...@apache.org>> wrote:
> >> >> >>
> >> >> >> Regarding 2) if you don't manually configure something else, that
> >> >> >> should happen always.
> >> >> >>
> >> >> >> Yes, you can run more than one task manager per node depending on
> >> >> >> the
> >> >> >> process isolation you want. Within a task manager, there are
> >> >> >> multiple
> >> >> >> threads for each slot. For example, if you have 2 task managers with
> >> >> >> 2
> >> >> >> slots each and submit a job with parallelism 4, each task manager
> >> >> >> will
> >> >> >> execute 2 sub tasks in separate Threads.
> >> >> >>
> >> >> >>
> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <esaliya@gmail.com <ma...@gmail.com>>
> >> >> >> wrote:
> >> >> >> > Hi Ufuk,
> >> >> >> >
> >> >> >> > Looking at the document you sent it seems only 1 task manager per
> >> >> >> > node
> >> >> >> > exist
> >> >> >> > and within that you have multiple slots. Is it possible to run
> >> >> >> > more
> >> >> >> > than
> >> >> >> > 1
> >> >> >> > task manager per node? Also, within a task manager is the
> >> >> >> > parallelism
> >> >> >> > done
> >> >> >> > through threads or processes?
> >> >> >> >
> >> >> >> > Thank you,
> >> >> >> > Saliya
> >> >> >> >
> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
> >> >> >> > <esaliya@gmail.com <ma...@gmail.com>>
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> Thank you, I'll check these.
> >> >> >> >>
> >> >> >> >> In 2.) you said they are likely to exchange through memory. Is
> >> >> >> >> there
> >> >> >> >> a
> >> >> >> >> case why they wouldn't?
> >> >> >> >>
> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uce@apache.org <ma...@apache.org>>
> >> >> >> >> wrote:
> >> >> >> >>>
> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
> >> >> >> >>> <esaliya@gmail.com <ma...@gmail.com>>
> >> >> >> >>> wrote:
> >> >> >> >>> > 1. What parameters are available to control parallelism within
> >> >> >> >>> > a
> >> >> >> >>> > node?
> >> >> >> >>>
> >> >> >> >>> Task Manager processing slots:
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots <https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots>
> >> >> >> >>>
> >> >> >> >>> > 2. Does Flink support shared memory-based messaging within a
> >> >> >> >>> > node
> >> >> >> >>> > (without
> >> >> >> >>> > doing TCP calls)?
> >> >> >> >>>
> >> >> >> >>> Yes, local exchanges happen via memory and not TCP, for example
> >> >> >> >>> if
> >> >> >> >>> you
> >> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are likely
> >> >> >> >>> to
> >> >> >> >>> exchange data locally.
> >> >> >> >>>
> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
> >> >> >> >>>
> >> >> >> >>> No, not that I'm aware of.
> >> >> >> >>>
> >> >> >> >>> – Ufuk
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Saliya Ekanayake
> >> >> >> >> Ph.D. Candidate | Research Assistant
> >> >> >> >> School of Informatics and Computing | Digital Science Center
> >> >> >> >> Indiana University, Bloomington
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> > Saliya Ekanayake
> >> >> >> > Ph.D. Candidate | Research Assistant
> >> >> >> > School of Informatics and Computing | Digital Science Center
> >> >> >> > Indiana University, Bloomington
> >> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Saliya Ekanayake
> >> >> > Ph.D. Candidate | Research Assistant
> >> >> > School of Informatics and Computing | Digital Science Center
> >> >> > Indiana University, Bloomington
> >> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Saliya Ekanayake
> >> > Ph.D. Candidate | Research Assistant
> >> > School of Informatics and Computing | Digital Science Center
> >> > Indiana University, Bloomington
> >> >
> >
> >
> >
> >
> > --
> > Saliya Ekanayake
> > Ph.D. Candidate | Research Assistant
> > School of Informatics and Computing | Digital Science Center
> > Indiana University, Bloomington
> >
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> 
> 
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> 
> 
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> 
> 
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> 


Re: Parameters to Control Intra-node Parallelism

Posted by Saliya Ekanayake <es...@gmail.com>.
Thank you Greg, I'll check if this was the cause for my TMs to disappear.

On Mon, Jul 11, 2016 at 11:34 AM, Greg Hogan <co...@greghogan.com> wrote:

> The OOM killer doesn't give warning so you'll need to call dmesg or look
> in /var/log/messages or similar. The following reports that Debian flavors
> may use /var/log/syslog.
>
> http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
>
> On Sun, Jul 10, 2016 at 11:55 PM, Saliya Ekanayake <es...@gmail.com>
> wrote:
>
>> Greg,
>>
>> where did you see the OOM log as shown in this mail thread? In my case
>> none of the TaskManagers nor JobManger reports an error like this.
>>
>> On Sun, Jul 10, 2016 at 8:45 PM, Greg Hogan <co...@greghogan.com> wrote:
>>
>>> These symptoms sounds similar to what I was experiencing in the
>>> following thread. Flink can have some unexpected memory usage which can
>>> result in an OOM kill by the kernel, and this becomes more pronounced as
>>> the cluster size grows.
>>>   https://www.mail-archive.com/dev@flink.apache.org/msg06346.html
>>>
>>> On Fri, Jul 8, 2016 at 12:46 PM, Saliya Ekanayake <es...@gmail.com>
>>> wrote:
>>>
>>>> I checked, but JVMs didn't crash. No puppet or other services like that.
>>>>
>>>> One thing I found is that things work OK when I have a smaller number
>>>> of slaves. For example, here I was trying to run on 16 nodes giving 2 TMs
>>>> each. Then I reduced it to 4 nodes each with 2 TMs, which worked.
>>>>
>>>>
>>>>
>>>> On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rm...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> from the TaskManager logs, I can not see anything suspicious.
>>>>> Its a bit weird that the TaskManager logs just end, without any
>>>>> shutdown messages. Usually the TMs log some shut down stuff when they are
>>>>> stopping.
>>>>> Also, if they would be still running, I would expect some error
>>>>> messages from akka about the connection status.
>>>>> So the only thing I conclude is that one of the TMs was killed by the
>>>>> OS or the JVM crashed. Did you check if that happened?
>>>>>
>>>>> Do you have any service like puppet that is controlling processes?
>>>>>
>>>>>
>>>>> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <es...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I see two logs (attached), but there's only 1 TaskManger process.
>>>>>> Also, the Web console says it can find only 1 TM.
>>>>>>
>>>>>> However, I see this part in JM log, which shows there was a second TM
>>>>>> at one point, but it was unregistered. Any thoughts?
>>>>>>
>>>>>> --------------------------
>>>>>>
>>>>>> - Registered TaskManager at j-002 (akka.tcp://
>>>>>> flink@172.16.0.2:42888/user/taskmanager) as
>>>>>> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1.
>>>>>> Current number of alive task slots is 12.
>>>>>>
>>>>>> 2016-07-07 11:32:40,363 WARN  akka.remote.ReliableDeliverySupervisor
>>>>>> - Association with remote system [akka.tcp://flink@172.16.0.2:42888]
>>>>>> has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>>>>>
>>>>>> 2016-07-07 11:32:42,722 INFO
>>>>>>  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager
>>>>>> at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as
>>>>>> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2.
>>>>>> Current number of alive task slots is 24.
>>>>>>
>>>>>> 2016-07-07 11:33:15,316 WARN  Remoting - Tried to associate with
>>>>>> unreachable remote address [akka.tcp://flink@172.16.0.2:42888].
>>>>>> Address is now gated for 5000 ms, all messages to this address will be
>>>>>> delivered to dead letters. Reason: Connection refused: /
>>>>>> 172.16.0.2:42888
>>>>>>
>>>>>> 2016-07-07 11:33:15,320 INFO
>>>>>>  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://
>>>>>> flink@172.16.0.2:42888/user/taskmanager terminated.
>>>>>> 2016-07-07 11:33:15,320 INFO
>>>>>>  org.apache.flink.runtime.instance.InstanceManager - Unregistered task
>>>>>> manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number
>>>>>> of registered task managers 1. Number of available slots 12.
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>>>>>
>>>>>>> No that should suffice. Can you check whether there are any task
>>>>>>> manager logs for the second TM on that machine
>>>>>>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
>>>>>>> manager process does start up and there is another problem. If not,
>>>>>>> the task managers seems not to start even.
>>>>>>>
>>>>>>> – Ufuk
>>>>>>>
>>>>>>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <es...@gmail.com>
>>>>>>> wrote:
>>>>>>> > I tried to run more than one task manager per node by duplicating
>>>>>>> the slave
>>>>>>> > IPs. At startup it says for example,
>>>>>>> >
>>>>>>> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
>>>>>>> > Starting taskmanager daemon on host j-011.
>>>>>>> >
>>>>>>> > but I only see 1 task manager process running.
>>>>>>> >
>>>>>>> > Is there anything else I need to do?
>>>>>>> >
>>>>>>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uc...@apache.org>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> Yes, exactly.
>>>>>>> >>
>>>>>>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <
>>>>>>> esaliya@gmail.com>
>>>>>>> >> wrote:
>>>>>>> >> > Thank you, yes, it can be done externally, if not supported
>>>>>>> within
>>>>>>> >> > Flink.
>>>>>>> >> >
>>>>>>> >> > So the way to spawn multiple task managers would be to list the
>>>>>>> same
>>>>>>> >> > slave
>>>>>>> >> > machines N times as necessary in the slaves file?
>>>>>>> >> >
>>>>>>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org>
>>>>>>> wrote:
>>>>>>> >> >>
>>>>>>> >> >> No, not inside of Flink. That sounds like something like the
>>>>>>> OS or
>>>>>>> >> >> resource manager should handle.
>>>>>>> >> >>
>>>>>>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <
>>>>>>> esaliya@gmail.com>
>>>>>>> >> >> wrote:
>>>>>>> >> >> > That's great, so is there support to pin task managers to
>>>>>>> sockets as
>>>>>>> >> >> > well?
>>>>>>> >> >> >
>>>>>>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org>
>>>>>>> wrote:
>>>>>>> >> >> >>
>>>>>>> >> >> >> Regarding 2) if you don't manually configure something
>>>>>>> else, that
>>>>>>> >> >> >> should happen always.
>>>>>>> >> >> >>
>>>>>>> >> >> >> Yes, you can run more than one task manager per node
>>>>>>> depending on
>>>>>>> >> >> >> the
>>>>>>> >> >> >> process isolation you want. Within a task manager, there are
>>>>>>> >> >> >> multiple
>>>>>>> >> >> >> threads for each slot. For example, if you have 2 task
>>>>>>> managers with
>>>>>>> >> >> >> 2
>>>>>>> >> >> >> slots each and submit a job with parallelism 4, each task
>>>>>>> manager
>>>>>>> >> >> >> will
>>>>>>> >> >> >> execute 2 sub tasks in separate Threads.
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <
>>>>>>> esaliya@gmail.com>
>>>>>>> >> >> >> wrote:
>>>>>>> >> >> >> > Hi Ufuk,
>>>>>>> >> >> >> >
>>>>>>> >> >> >> > Looking at the document you sent it seems only 1 task
>>>>>>> manager per
>>>>>>> >> >> >> > node
>>>>>>> >> >> >> > exist
>>>>>>> >> >> >> > and within that you have multiple slots. Is it possible
>>>>>>> to run
>>>>>>> >> >> >> > more
>>>>>>> >> >> >> > than
>>>>>>> >> >> >> > 1
>>>>>>> >> >> >> > task manager per node? Also, within a task manager is the
>>>>>>> >> >> >> > parallelism
>>>>>>> >> >> >> > done
>>>>>>> >> >> >> > through threads or processes?
>>>>>>> >> >> >> >
>>>>>>> >> >> >> > Thank you,
>>>>>>> >> >> >> > Saliya
>>>>>>> >> >> >> >
>>>>>>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
>>>>>>> >> >> >> > <es...@gmail.com>
>>>>>>> >> >> >> > wrote:
>>>>>>> >> >> >> >>
>>>>>>> >> >> >> >> Thank you, I'll check these.
>>>>>>> >> >> >> >>
>>>>>>> >> >> >> >> In 2.) you said they are likely to exchange through
>>>>>>> memory. Is
>>>>>>> >> >> >> >> there
>>>>>>> >> >> >> >> a
>>>>>>> >> >> >> >> case why they wouldn't?
>>>>>>> >> >> >> >>
>>>>>>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <
>>>>>>> uce@apache.org>
>>>>>>> >> >> >> >> wrote:
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
>>>>>>> >> >> >> >>> <es...@gmail.com>
>>>>>>> >> >> >> >>> wrote:
>>>>>>> >> >> >> >>> > 1. What parameters are available to control
>>>>>>> parallelism within
>>>>>>> >> >> >> >>> > a
>>>>>>> >> >> >> >>> > node?
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> Task Manager processing slots:
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>>
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging
>>>>>>> within a
>>>>>>> >> >> >> >>> > node
>>>>>>> >> >> >> >>> > (without
>>>>>>> >> >> >> >>> > doing TCP calls)?
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP, for
>>>>>>> example
>>>>>>> >> >> >> >>> if
>>>>>>> >> >> >> >>> you
>>>>>>> >> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1
>>>>>>> are likely
>>>>>>> >> >> >> >>> to
>>>>>>> >> >> >> >>> exchange data locally.
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> No, not that I'm aware of.
>>>>>>> >> >> >> >>>
>>>>>>> >> >> >> >>> – Ufuk
>>>>>>> >> >> >> >>
>>>>>>> >> >> >> >>
>>>>>>> >> >> >> >>
>>>>>>> >> >> >> >>
>>>>>>> >> >> >> >> --
>>>>>>> >> >> >> >> Saliya Ekanayake
>>>>>>> >> >> >> >> Ph.D. Candidate | Research Assistant
>>>>>>> >> >> >> >> School of Informatics and Computing | Digital Science
>>>>>>> Center
>>>>>>> >> >> >> >> Indiana University, Bloomington
>>>>>>> >> >> >> >>
>>>>>>> >> >> >> >
>>>>>>> >> >> >> >
>>>>>>> >> >> >> >
>>>>>>> >> >> >> > --
>>>>>>> >> >> >> > Saliya Ekanayake
>>>>>>> >> >> >> > Ph.D. Candidate | Research Assistant
>>>>>>> >> >> >> > School of Informatics and Computing | Digital Science
>>>>>>> Center
>>>>>>> >> >> >> > Indiana University, Bloomington
>>>>>>> >> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> > --
>>>>>>> >> >> > Saliya Ekanayake
>>>>>>> >> >> > Ph.D. Candidate | Research Assistant
>>>>>>> >> >> > School of Informatics and Computing | Digital Science Center
>>>>>>> >> >> > Indiana University, Bloomington
>>>>>>> >> >> >
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> > --
>>>>>>> >> > Saliya Ekanayake
>>>>>>> >> > Ph.D. Candidate | Research Assistant
>>>>>>> >> > School of Informatics and Computing | Digital Science Center
>>>>>>> >> > Indiana University, Bloomington
>>>>>>> >> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > --
>>>>>>> > Saliya Ekanayake
>>>>>>> > Ph.D. Candidate | Research Assistant
>>>>>>> > School of Informatics and Computing | Digital Science Center
>>>>>>> > Indiana University, Bloomington
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Saliya Ekanayake
>>>>>> Ph.D. Candidate | Research Assistant
>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>> Indiana University, Bloomington
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Saliya Ekanayake
>>>> Ph.D. Candidate | Research Assistant
>>>> School of Informatics and Computing | Digital Science Center
>>>> Indiana University, Bloomington
>>>>
>>>>
>>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>>
>>
>


-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Parameters to Control Intra-node Parallelism

Posted by Greg Hogan <co...@greghogan.com>.
The OOM killer doesn't give warning so you'll need to call dmesg or look in
/var/log/messages or similar. The following reports that Debian flavors may
use /var/log/syslog.

http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer

On Sun, Jul 10, 2016 at 11:55 PM, Saliya Ekanayake <es...@gmail.com>
wrote:

> Greg,
>
> where did you see the OOM log as shown in this mail thread? In my case
> none of the TaskManagers nor JobManger reports an error like this.
>
> On Sun, Jul 10, 2016 at 8:45 PM, Greg Hogan <co...@greghogan.com> wrote:
>
>> These symptoms sounds similar to what I was experiencing in the following
>> thread. Flink can have some unexpected memory usage which can result in an
>> OOM kill by the kernel, and this becomes more pronounced as the cluster
>> size grows.
>>   https://www.mail-archive.com/dev@flink.apache.org/msg06346.html
>>
>> On Fri, Jul 8, 2016 at 12:46 PM, Saliya Ekanayake <es...@gmail.com>
>> wrote:
>>
>>> I checked, but JVMs didn't crash. No puppet or other services like that.
>>>
>>> One thing I found is that things work OK when I have a smaller number of
>>> slaves. For example, here I was trying to run on 16 nodes giving 2 TMs
>>> each. Then I reduced it to 4 nodes each with 2 TMs, which worked.
>>>
>>>
>>>
>>> On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rm...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>> from the TaskManager logs, I can not see anything suspicious.
>>>> Its a bit weird that the TaskManager logs just end, without any
>>>> shutdown messages. Usually the TMs log some shut down stuff when they are
>>>> stopping.
>>>> Also, if they would be still running, I would expect some error
>>>> messages from akka about the connection status.
>>>> So the only thing I conclude is that one of the TMs was killed by the
>>>> OS or the JVM crashed. Did you check if that happened?
>>>>
>>>> Do you have any service like puppet that is controlling processes?
>>>>
>>>>
>>>> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <es...@gmail.com>
>>>> wrote:
>>>>
>>>>> I see two logs (attached), but there's only 1 TaskManger process.
>>>>> Also, the Web console says it can find only 1 TM.
>>>>>
>>>>> However, I see this part in JM log, which shows there was a second TM
>>>>> at one point, but it was unregistered. Any thoughts?
>>>>>
>>>>> --------------------------
>>>>>
>>>>> - Registered TaskManager at j-002 (akka.tcp://
>>>>> flink@172.16.0.2:42888/user/taskmanager) as
>>>>> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1.
>>>>> Current number of alive task slots is 12.
>>>>>
>>>>> 2016-07-07 11:32:40,363 WARN  akka.remote.ReliableDeliverySupervisor -
>>>>> Association with remote system [akka.tcp://flink@172.16.0.2:42888]
>>>>> has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>>>>
>>>>> 2016-07-07 11:32:42,722 INFO
>>>>>  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager
>>>>> at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as
>>>>> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2.
>>>>> Current number of alive task slots is 24.
>>>>>
>>>>> 2016-07-07 11:33:15,316 WARN  Remoting - Tried to associate with
>>>>> unreachable remote address [akka.tcp://flink@172.16.0.2:42888].
>>>>> Address is now gated for 5000 ms, all messages to this address will be
>>>>> delivered to dead letters. Reason: Connection refused: /
>>>>> 172.16.0.2:42888
>>>>>
>>>>> 2016-07-07 11:33:15,320 INFO
>>>>>  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://
>>>>> flink@172.16.0.2:42888/user/taskmanager terminated.
>>>>> 2016-07-07 11:33:15,320 INFO
>>>>>  org.apache.flink.runtime.instance.InstanceManager - Unregistered task
>>>>> manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number of
>>>>> registered task managers 1. Number of available slots 12.
>>>>>
>>>>>
>>>>> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>>>>
>>>>>> No that should suffice. Can you check whether there are any task
>>>>>> manager logs for the second TM on that machine
>>>>>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
>>>>>> manager process does start up and there is another problem. If not,
>>>>>> the task managers seems not to start even.
>>>>>>
>>>>>> – Ufuk
>>>>>>
>>>>>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <es...@gmail.com>
>>>>>> wrote:
>>>>>> > I tried to run more than one task manager per node by duplicating
>>>>>> the slave
>>>>>> > IPs. At startup it says for example,
>>>>>> >
>>>>>> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
>>>>>> > Starting taskmanager daemon on host j-011.
>>>>>> >
>>>>>> > but I only see 1 task manager process running.
>>>>>> >
>>>>>> > Is there anything else I need to do?
>>>>>> >
>>>>>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uc...@apache.org>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> Yes, exactly.
>>>>>> >>
>>>>>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <
>>>>>> esaliya@gmail.com>
>>>>>> >> wrote:
>>>>>> >> > Thank you, yes, it can be done externally, if not supported
>>>>>> within
>>>>>> >> > Flink.
>>>>>> >> >
>>>>>> >> > So the way to spawn multiple task managers would be to list the
>>>>>> same
>>>>>> >> > slave
>>>>>> >> > machines N times as necessary in the slaves file?
>>>>>> >> >
>>>>>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org>
>>>>>> wrote:
>>>>>> >> >>
>>>>>> >> >> No, not inside of Flink. That sounds like something like the OS
>>>>>> or
>>>>>> >> >> resource manager should handle.
>>>>>> >> >>
>>>>>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <
>>>>>> esaliya@gmail.com>
>>>>>> >> >> wrote:
>>>>>> >> >> > That's great, so is there support to pin task managers to
>>>>>> sockets as
>>>>>> >> >> > well?
>>>>>> >> >> >
>>>>>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org>
>>>>>> wrote:
>>>>>> >> >> >>
>>>>>> >> >> >> Regarding 2) if you don't manually configure something else,
>>>>>> that
>>>>>> >> >> >> should happen always.
>>>>>> >> >> >>
>>>>>> >> >> >> Yes, you can run more than one task manager per node
>>>>>> depending on
>>>>>> >> >> >> the
>>>>>> >> >> >> process isolation you want. Within a task manager, there are
>>>>>> >> >> >> multiple
>>>>>> >> >> >> threads for each slot. For example, if you have 2 task
>>>>>> managers with
>>>>>> >> >> >> 2
>>>>>> >> >> >> slots each and submit a job with parallelism 4, each task
>>>>>> manager
>>>>>> >> >> >> will
>>>>>> >> >> >> execute 2 sub tasks in separate Threads.
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <
>>>>>> esaliya@gmail.com>
>>>>>> >> >> >> wrote:
>>>>>> >> >> >> > Hi Ufuk,
>>>>>> >> >> >> >
>>>>>> >> >> >> > Looking at the document you sent it seems only 1 task
>>>>>> manager per
>>>>>> >> >> >> > node
>>>>>> >> >> >> > exist
>>>>>> >> >> >> > and within that you have multiple slots. Is it possible to
>>>>>> run
>>>>>> >> >> >> > more
>>>>>> >> >> >> > than
>>>>>> >> >> >> > 1
>>>>>> >> >> >> > task manager per node? Also, within a task manager is the
>>>>>> >> >> >> > parallelism
>>>>>> >> >> >> > done
>>>>>> >> >> >> > through threads or processes?
>>>>>> >> >> >> >
>>>>>> >> >> >> > Thank you,
>>>>>> >> >> >> > Saliya
>>>>>> >> >> >> >
>>>>>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
>>>>>> >> >> >> > <es...@gmail.com>
>>>>>> >> >> >> > wrote:
>>>>>> >> >> >> >>
>>>>>> >> >> >> >> Thank you, I'll check these.
>>>>>> >> >> >> >>
>>>>>> >> >> >> >> In 2.) you said they are likely to exchange through
>>>>>> memory. Is
>>>>>> >> >> >> >> there
>>>>>> >> >> >> >> a
>>>>>> >> >> >> >> case why they wouldn't?
>>>>>> >> >> >> >>
>>>>>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <
>>>>>> uce@apache.org>
>>>>>> >> >> >> >> wrote:
>>>>>> >> >> >> >>>
>>>>>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
>>>>>> >> >> >> >>> <es...@gmail.com>
>>>>>> >> >> >> >>> wrote:
>>>>>> >> >> >> >>> > 1. What parameters are available to control
>>>>>> parallelism within
>>>>>> >> >> >> >>> > a
>>>>>> >> >> >> >>> > node?
>>>>>> >> >> >> >>>
>>>>>> >> >> >> >>> Task Manager processing slots:
>>>>>> >> >> >> >>>
>>>>>> >> >> >> >>>
>>>>>> >> >> >> >>>
>>>>>> >> >> >> >>>
>>>>>> >> >> >> >>>
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>>>>>> >> >> >> >>>
>>>>>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging
>>>>>> within a
>>>>>> >> >> >> >>> > node
>>>>>> >> >> >> >>> > (without
>>>>>> >> >> >> >>> > doing TCP calls)?
>>>>>> >> >> >> >>>
>>>>>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP, for
>>>>>> example
>>>>>> >> >> >> >>> if
>>>>>> >> >> >> >>> you
>>>>>> >> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1
>>>>>> are likely
>>>>>> >> >> >> >>> to
>>>>>> >> >> >> >>> exchange data locally.
>>>>>> >> >> >> >>>
>>>>>> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
>>>>>> >> >> >> >>>
>>>>>> >> >> >> >>> No, not that I'm aware of.
>>>>>> >> >> >> >>>
>>>>>> >> >> >> >>> – Ufuk
>>>>>> >> >> >> >>
>>>>>> >> >> >> >>
>>>>>> >> >> >> >>
>>>>>> >> >> >> >>
>>>>>> >> >> >> >> --
>>>>>> >> >> >> >> Saliya Ekanayake
>>>>>> >> >> >> >> Ph.D. Candidate | Research Assistant
>>>>>> >> >> >> >> School of Informatics and Computing | Digital Science
>>>>>> Center
>>>>>> >> >> >> >> Indiana University, Bloomington
>>>>>> >> >> >> >>
>>>>>> >> >> >> >
>>>>>> >> >> >> >
>>>>>> >> >> >> >
>>>>>> >> >> >> > --
>>>>>> >> >> >> > Saliya Ekanayake
>>>>>> >> >> >> > Ph.D. Candidate | Research Assistant
>>>>>> >> >> >> > School of Informatics and Computing | Digital Science
>>>>>> Center
>>>>>> >> >> >> > Indiana University, Bloomington
>>>>>> >> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > --
>>>>>> >> >> > Saliya Ekanayake
>>>>>> >> >> > Ph.D. Candidate | Research Assistant
>>>>>> >> >> > School of Informatics and Computing | Digital Science Center
>>>>>> >> >> > Indiana University, Bloomington
>>>>>> >> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > --
>>>>>> >> > Saliya Ekanayake
>>>>>> >> > Ph.D. Candidate | Research Assistant
>>>>>> >> > School of Informatics and Computing | Digital Science Center
>>>>>> >> > Indiana University, Bloomington
>>>>>> >> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Saliya Ekanayake
>>>>>> > Ph.D. Candidate | Research Assistant
>>>>>> > School of Informatics and Computing | Digital Science Center
>>>>>> > Indiana University, Bloomington
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Saliya Ekanayake
>>>>> Ph.D. Candidate | Research Assistant
>>>>> School of Informatics and Computing | Digital Science Center
>>>>> Indiana University, Bloomington
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>>
>>>
>>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>
>

Re: Parameters to Control Intra-node Parallelism

Posted by Saliya Ekanayake <es...@gmail.com>.
Greg,

where did you see the OOM log as shown in this mail thread? In my case none
of the TaskManagers nor JobManger reports an error like this.

On Sun, Jul 10, 2016 at 8:45 PM, Greg Hogan <co...@greghogan.com> wrote:

> These symptoms sounds similar to what I was experiencing in the following
> thread. Flink can have some unexpected memory usage which can result in an
> OOM kill by the kernel, and this becomes more pronounced as the cluster
> size grows.
>   https://www.mail-archive.com/dev@flink.apache.org/msg06346.html
>
> On Fri, Jul 8, 2016 at 12:46 PM, Saliya Ekanayake <es...@gmail.com>
> wrote:
>
>> I checked, but JVMs didn't crash. No puppet or other services like that.
>>
>> One thing I found is that things work OK when I have a smaller number of
>> slaves. For example, here I was trying to run on 16 nodes giving 2 TMs
>> each. Then I reduced it to 4 nodes each with 2 TMs, which worked.
>>
>>
>>
>> On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rm...@apache.org>
>> wrote:
>>
>>> Hi,
>>> from the TaskManager logs, I can not see anything suspicious.
>>> Its a bit weird that the TaskManager logs just end, without any shutdown
>>> messages. Usually the TMs log some shut down stuff when they are stopping.
>>> Also, if they would be still running, I would expect some error messages
>>> from akka about the connection status.
>>> So the only thing I conclude is that one of the TMs was killed by the OS
>>> or the JVM crashed. Did you check if that happened?
>>>
>>> Do you have any service like puppet that is controlling processes?
>>>
>>>
>>> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <es...@gmail.com>
>>> wrote:
>>>
>>>> I see two logs (attached), but there's only 1 TaskManger process. Also,
>>>> the Web console says it can find only 1 TM.
>>>>
>>>> However, I see this part in JM log, which shows there was a second TM
>>>> at one point, but it was unregistered. Any thoughts?
>>>>
>>>> --------------------------
>>>>
>>>> - Registered TaskManager at j-002 (akka.tcp://
>>>> flink@172.16.0.2:42888/user/taskmanager) as
>>>> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1.
>>>> Current number of alive task slots is 12.
>>>>
>>>> 2016-07-07 11:32:40,363 WARN  akka.remote.ReliableDeliverySupervisor -
>>>> Association with remote system [akka.tcp://flink@172.16.0.2:42888] has
>>>> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>>>
>>>> 2016-07-07 11:32:42,722 INFO
>>>>  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager
>>>> at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as
>>>> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2.
>>>> Current number of alive task slots is 24.
>>>>
>>>> 2016-07-07 11:33:15,316 WARN  Remoting - Tried to associate with
>>>> unreachable remote address [akka.tcp://flink@172.16.0.2:42888].
>>>> Address is now gated for 5000 ms, all messages to this address will be
>>>> delivered to dead letters. Reason: Connection refused: /
>>>> 172.16.0.2:42888
>>>>
>>>> 2016-07-07 11:33:15,320 INFO
>>>>  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://
>>>> flink@172.16.0.2:42888/user/taskmanager terminated.
>>>> 2016-07-07 11:33:15,320 INFO
>>>>  org.apache.flink.runtime.instance.InstanceManager - Unregistered task
>>>> manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number of
>>>> registered task managers 1. Number of available slots 12.
>>>>
>>>>
>>>> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>>>
>>>>> No that should suffice. Can you check whether there are any task
>>>>> manager logs for the second TM on that machine
>>>>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
>>>>> manager process does start up and there is another problem. If not,
>>>>> the task managers seems not to start even.
>>>>>
>>>>> – Ufuk
>>>>>
>>>>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <es...@gmail.com>
>>>>> wrote:
>>>>> > I tried to run more than one task manager per node by duplicating
>>>>> the slave
>>>>> > IPs. At startup it says for example,
>>>>> >
>>>>> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
>>>>> > Starting taskmanager daemon on host j-011.
>>>>> >
>>>>> > but I only see 1 task manager process running.
>>>>> >
>>>>> > Is there anything else I need to do?
>>>>> >
>>>>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>>>> >>
>>>>> >> Yes, exactly.
>>>>> >>
>>>>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <esaliya@gmail.com
>>>>> >
>>>>> >> wrote:
>>>>> >> > Thank you, yes, it can be done externally, if not supported within
>>>>> >> > Flink.
>>>>> >> >
>>>>> >> > So the way to spawn multiple task managers would be to list the
>>>>> same
>>>>> >> > slave
>>>>> >> > machines N times as necessary in the slaves file?
>>>>> >> >
>>>>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org>
>>>>> wrote:
>>>>> >> >>
>>>>> >> >> No, not inside of Flink. That sounds like something like the OS
>>>>> or
>>>>> >> >> resource manager should handle.
>>>>> >> >>
>>>>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <
>>>>> esaliya@gmail.com>
>>>>> >> >> wrote:
>>>>> >> >> > That's great, so is there support to pin task managers to
>>>>> sockets as
>>>>> >> >> > well?
>>>>> >> >> >
>>>>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org>
>>>>> wrote:
>>>>> >> >> >>
>>>>> >> >> >> Regarding 2) if you don't manually configure something else,
>>>>> that
>>>>> >> >> >> should happen always.
>>>>> >> >> >>
>>>>> >> >> >> Yes, you can run more than one task manager per node
>>>>> depending on
>>>>> >> >> >> the
>>>>> >> >> >> process isolation you want. Within a task manager, there are
>>>>> >> >> >> multiple
>>>>> >> >> >> threads for each slot. For example, if you have 2 task
>>>>> managers with
>>>>> >> >> >> 2
>>>>> >> >> >> slots each and submit a job with parallelism 4, each task
>>>>> manager
>>>>> >> >> >> will
>>>>> >> >> >> execute 2 sub tasks in separate Threads.
>>>>> >> >> >>
>>>>> >> >> >>
>>>>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <
>>>>> esaliya@gmail.com>
>>>>> >> >> >> wrote:
>>>>> >> >> >> > Hi Ufuk,
>>>>> >> >> >> >
>>>>> >> >> >> > Looking at the document you sent it seems only 1 task
>>>>> manager per
>>>>> >> >> >> > node
>>>>> >> >> >> > exist
>>>>> >> >> >> > and within that you have multiple slots. Is it possible to
>>>>> run
>>>>> >> >> >> > more
>>>>> >> >> >> > than
>>>>> >> >> >> > 1
>>>>> >> >> >> > task manager per node? Also, within a task manager is the
>>>>> >> >> >> > parallelism
>>>>> >> >> >> > done
>>>>> >> >> >> > through threads or processes?
>>>>> >> >> >> >
>>>>> >> >> >> > Thank you,
>>>>> >> >> >> > Saliya
>>>>> >> >> >> >
>>>>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
>>>>> >> >> >> > <es...@gmail.com>
>>>>> >> >> >> > wrote:
>>>>> >> >> >> >>
>>>>> >> >> >> >> Thank you, I'll check these.
>>>>> >> >> >> >>
>>>>> >> >> >> >> In 2.) you said they are likely to exchange through
>>>>> memory. Is
>>>>> >> >> >> >> there
>>>>> >> >> >> >> a
>>>>> >> >> >> >> case why they wouldn't?
>>>>> >> >> >> >>
>>>>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <
>>>>> uce@apache.org>
>>>>> >> >> >> >> wrote:
>>>>> >> >> >> >>>
>>>>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
>>>>> >> >> >> >>> <es...@gmail.com>
>>>>> >> >> >> >>> wrote:
>>>>> >> >> >> >>> > 1. What parameters are available to control parallelism
>>>>> within
>>>>> >> >> >> >>> > a
>>>>> >> >> >> >>> > node?
>>>>> >> >> >> >>>
>>>>> >> >> >> >>> Task Manager processing slots:
>>>>> >> >> >> >>>
>>>>> >> >> >> >>>
>>>>> >> >> >> >>>
>>>>> >> >> >> >>>
>>>>> >> >> >> >>>
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>>>>> >> >> >> >>>
>>>>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging
>>>>> within a
>>>>> >> >> >> >>> > node
>>>>> >> >> >> >>> > (without
>>>>> >> >> >> >>> > doing TCP calls)?
>>>>> >> >> >> >>>
>>>>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP, for
>>>>> example
>>>>> >> >> >> >>> if
>>>>> >> >> >> >>> you
>>>>> >> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are
>>>>> likely
>>>>> >> >> >> >>> to
>>>>> >> >> >> >>> exchange data locally.
>>>>> >> >> >> >>>
>>>>> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
>>>>> >> >> >> >>>
>>>>> >> >> >> >>> No, not that I'm aware of.
>>>>> >> >> >> >>>
>>>>> >> >> >> >>> – Ufuk
>>>>> >> >> >> >>
>>>>> >> >> >> >>
>>>>> >> >> >> >>
>>>>> >> >> >> >>
>>>>> >> >> >> >> --
>>>>> >> >> >> >> Saliya Ekanayake
>>>>> >> >> >> >> Ph.D. Candidate | Research Assistant
>>>>> >> >> >> >> School of Informatics and Computing | Digital Science
>>>>> Center
>>>>> >> >> >> >> Indiana University, Bloomington
>>>>> >> >> >> >>
>>>>> >> >> >> >
>>>>> >> >> >> >
>>>>> >> >> >> >
>>>>> >> >> >> > --
>>>>> >> >> >> > Saliya Ekanayake
>>>>> >> >> >> > Ph.D. Candidate | Research Assistant
>>>>> >> >> >> > School of Informatics and Computing | Digital Science Center
>>>>> >> >> >> > Indiana University, Bloomington
>>>>> >> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> > --
>>>>> >> >> > Saliya Ekanayake
>>>>> >> >> > Ph.D. Candidate | Research Assistant
>>>>> >> >> > School of Informatics and Computing | Digital Science Center
>>>>> >> >> > Indiana University, Bloomington
>>>>> >> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> > --
>>>>> >> > Saliya Ekanayake
>>>>> >> > Ph.D. Candidate | Research Assistant
>>>>> >> > School of Informatics and Computing | Digital Science Center
>>>>> >> > Indiana University, Bloomington
>>>>> >> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Saliya Ekanayake
>>>>> > Ph.D. Candidate | Research Assistant
>>>>> > School of Informatics and Computing | Digital Science Center
>>>>> > Indiana University, Bloomington
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Saliya Ekanayake
>>>> Ph.D. Candidate | Research Assistant
>>>> School of Informatics and Computing | Digital Science Center
>>>> Indiana University, Bloomington
>>>>
>>>>
>>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>>
>>
>


-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Parameters to Control Intra-node Parallelism

Posted by Greg Hogan <co...@greghogan.com>.
These symptoms sounds similar to what I was experiencing in the following
thread. Flink can have some unexpected memory usage which can result in an
OOM kill by the kernel, and this becomes more pronounced as the cluster
size grows.
  https://www.mail-archive.com/dev@flink.apache.org/msg06346.html

On Fri, Jul 8, 2016 at 12:46 PM, Saliya Ekanayake <es...@gmail.com> wrote:

> I checked, but JVMs didn't crash. No puppet or other services like that.
>
> One thing I found is that things work OK when I have a smaller number of
> slaves. For example, here I was trying to run on 16 nodes giving 2 TMs
> each. Then I reduced it to 4 nodes each with 2 TMs, which worked.
>
>
>
> On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rm...@apache.org>
> wrote:
>
>> Hi,
>> from the TaskManager logs, I can not see anything suspicious.
>> Its a bit weird that the TaskManager logs just end, without any shutdown
>> messages. Usually the TMs log some shut down stuff when they are stopping.
>> Also, if they would be still running, I would expect some error messages
>> from akka about the connection status.
>> So the only thing I conclude is that one of the TMs was killed by the OS
>> or the JVM crashed. Did you check if that happened?
>>
>> Do you have any service like puppet that is controlling processes?
>>
>>
>> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <es...@gmail.com>
>> wrote:
>>
>>> I see two logs (attached), but there's only 1 TaskManger process. Also,
>>> the Web console says it can find only 1 TM.
>>>
>>> However, I see this part in JM log, which shows there was a second TM at
>>> one point, but it was unregistered. Any thoughts?
>>>
>>> --------------------------
>>>
>>> - Registered TaskManager at j-002 (akka.tcp://
>>> flink@172.16.0.2:42888/user/taskmanager) as
>>> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1.
>>> Current number of alive task slots is 12.
>>>
>>> 2016-07-07 11:32:40,363 WARN  akka.remote.ReliableDeliverySupervisor -
>>> Association with remote system [akka.tcp://flink@172.16.0.2:42888] has
>>> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>>
>>> 2016-07-07 11:32:42,722 INFO
>>>  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager
>>> at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as
>>> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2.
>>> Current number of alive task slots is 24.
>>>
>>> 2016-07-07 11:33:15,316 WARN  Remoting - Tried to associate with
>>> unreachable remote address [akka.tcp://flink@172.16.0.2:42888]. Address
>>> is now gated for 5000 ms, all messages to this address will be delivered to
>>> dead letters. Reason: Connection refused: /172.16.0.2:42888
>>>
>>> 2016-07-07 11:33:15,320 INFO
>>>  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://
>>> flink@172.16.0.2:42888/user/taskmanager terminated.
>>> 2016-07-07 11:33:15,320 INFO
>>>  org.apache.flink.runtime.instance.InstanceManager - Unregistered task
>>> manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number of
>>> registered task managers 1. Number of available slots 12.
>>>
>>>
>>> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>>
>>>> No that should suffice. Can you check whether there are any task
>>>> manager logs for the second TM on that machine
>>>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
>>>> manager process does start up and there is another problem. If not,
>>>> the task managers seems not to start even.
>>>>
>>>> – Ufuk
>>>>
>>>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <es...@gmail.com>
>>>> wrote:
>>>> > I tried to run more than one task manager per node by duplicating the
>>>> slave
>>>> > IPs. At startup it says for example,
>>>> >
>>>> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
>>>> > Starting taskmanager daemon on host j-011.
>>>> >
>>>> > but I only see 1 task manager process running.
>>>> >
>>>> > Is there anything else I need to do?
>>>> >
>>>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>>> >>
>>>> >> Yes, exactly.
>>>> >>
>>>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <es...@gmail.com>
>>>> >> wrote:
>>>> >> > Thank you, yes, it can be done externally, if not supported within
>>>> >> > Flink.
>>>> >> >
>>>> >> > So the way to spawn multiple task managers would be to list the
>>>> same
>>>> >> > slave
>>>> >> > machines N times as necessary in the slaves file?
>>>> >> >
>>>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org>
>>>> wrote:
>>>> >> >>
>>>> >> >> No, not inside of Flink. That sounds like something like the OS or
>>>> >> >> resource manager should handle.
>>>> >> >>
>>>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <
>>>> esaliya@gmail.com>
>>>> >> >> wrote:
>>>> >> >> > That's great, so is there support to pin task managers to
>>>> sockets as
>>>> >> >> > well?
>>>> >> >> >
>>>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org>
>>>> wrote:
>>>> >> >> >>
>>>> >> >> >> Regarding 2) if you don't manually configure something else,
>>>> that
>>>> >> >> >> should happen always.
>>>> >> >> >>
>>>> >> >> >> Yes, you can run more than one task manager per node depending
>>>> on
>>>> >> >> >> the
>>>> >> >> >> process isolation you want. Within a task manager, there are
>>>> >> >> >> multiple
>>>> >> >> >> threads for each slot. For example, if you have 2 task
>>>> managers with
>>>> >> >> >> 2
>>>> >> >> >> slots each and submit a job with parallelism 4, each task
>>>> manager
>>>> >> >> >> will
>>>> >> >> >> execute 2 sub tasks in separate Threads.
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <
>>>> esaliya@gmail.com>
>>>> >> >> >> wrote:
>>>> >> >> >> > Hi Ufuk,
>>>> >> >> >> >
>>>> >> >> >> > Looking at the document you sent it seems only 1 task
>>>> manager per
>>>> >> >> >> > node
>>>> >> >> >> > exist
>>>> >> >> >> > and within that you have multiple slots. Is it possible to
>>>> run
>>>> >> >> >> > more
>>>> >> >> >> > than
>>>> >> >> >> > 1
>>>> >> >> >> > task manager per node? Also, within a task manager is the
>>>> >> >> >> > parallelism
>>>> >> >> >> > done
>>>> >> >> >> > through threads or processes?
>>>> >> >> >> >
>>>> >> >> >> > Thank you,
>>>> >> >> >> > Saliya
>>>> >> >> >> >
>>>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
>>>> >> >> >> > <es...@gmail.com>
>>>> >> >> >> > wrote:
>>>> >> >> >> >>
>>>> >> >> >> >> Thank you, I'll check these.
>>>> >> >> >> >>
>>>> >> >> >> >> In 2.) you said they are likely to exchange through memory.
>>>> Is
>>>> >> >> >> >> there
>>>> >> >> >> >> a
>>>> >> >> >> >> case why they wouldn't?
>>>> >> >> >> >>
>>>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <
>>>> uce@apache.org>
>>>> >> >> >> >> wrote:
>>>> >> >> >> >>>
>>>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
>>>> >> >> >> >>> <es...@gmail.com>
>>>> >> >> >> >>> wrote:
>>>> >> >> >> >>> > 1. What parameters are available to control parallelism
>>>> within
>>>> >> >> >> >>> > a
>>>> >> >> >> >>> > node?
>>>> >> >> >> >>>
>>>> >> >> >> >>> Task Manager processing slots:
>>>> >> >> >> >>>
>>>> >> >> >> >>>
>>>> >> >> >> >>>
>>>> >> >> >> >>>
>>>> >> >> >> >>>
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>>>> >> >> >> >>>
>>>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging
>>>> within a
>>>> >> >> >> >>> > node
>>>> >> >> >> >>> > (without
>>>> >> >> >> >>> > doing TCP calls)?
>>>> >> >> >> >>>
>>>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP, for
>>>> example
>>>> >> >> >> >>> if
>>>> >> >> >> >>> you
>>>> >> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are
>>>> likely
>>>> >> >> >> >>> to
>>>> >> >> >> >>> exchange data locally.
>>>> >> >> >> >>>
>>>> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
>>>> >> >> >> >>>
>>>> >> >> >> >>> No, not that I'm aware of.
>>>> >> >> >> >>>
>>>> >> >> >> >>> – Ufuk
>>>> >> >> >> >>
>>>> >> >> >> >>
>>>> >> >> >> >>
>>>> >> >> >> >>
>>>> >> >> >> >> --
>>>> >> >> >> >> Saliya Ekanayake
>>>> >> >> >> >> Ph.D. Candidate | Research Assistant
>>>> >> >> >> >> School of Informatics and Computing | Digital Science Center
>>>> >> >> >> >> Indiana University, Bloomington
>>>> >> >> >> >>
>>>> >> >> >> >
>>>> >> >> >> >
>>>> >> >> >> >
>>>> >> >> >> > --
>>>> >> >> >> > Saliya Ekanayake
>>>> >> >> >> > Ph.D. Candidate | Research Assistant
>>>> >> >> >> > School of Informatics and Computing | Digital Science Center
>>>> >> >> >> > Indiana University, Bloomington
>>>> >> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > --
>>>> >> >> > Saliya Ekanayake
>>>> >> >> > Ph.D. Candidate | Research Assistant
>>>> >> >> > School of Informatics and Computing | Digital Science Center
>>>> >> >> > Indiana University, Bloomington
>>>> >> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > --
>>>> >> > Saliya Ekanayake
>>>> >> > Ph.D. Candidate | Research Assistant
>>>> >> > School of Informatics and Computing | Digital Science Center
>>>> >> > Indiana University, Bloomington
>>>> >> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Saliya Ekanayake
>>>> > Ph.D. Candidate | Research Assistant
>>>> > School of Informatics and Computing | Digital Science Center
>>>> > Indiana University, Bloomington
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>>
>>>
>>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>
>

Re: Parameters to Control Intra-node Parallelism

Posted by Saliya Ekanayake <es...@gmail.com>.
I checked, but JVMs didn't crash. No puppet or other services like that.

One thing I found is that things work OK when I have a smaller number of
slaves. For example, here I was trying to run on 16 nodes giving 2 TMs
each. Then I reduced it to 4 nodes each with 2 TMs, which worked.



On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rm...@apache.org> wrote:

> Hi,
> from the TaskManager logs, I can not see anything suspicious.
> Its a bit weird that the TaskManager logs just end, without any shutdown
> messages. Usually the TMs log some shut down stuff when they are stopping.
> Also, if they would be still running, I would expect some error messages
> from akka about the connection status.
> So the only thing I conclude is that one of the TMs was killed by the OS
> or the JVM crashed. Did you check if that happened?
>
> Do you have any service like puppet that is controlling processes?
>
>
> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <es...@gmail.com>
> wrote:
>
>> I see two logs (attached), but there's only 1 TaskManger process. Also,
>> the Web console says it can find only 1 TM.
>>
>> However, I see this part in JM log, which shows there was a second TM at
>> one point, but it was unregistered. Any thoughts?
>>
>> --------------------------
>>
>> - Registered TaskManager at j-002 (akka.tcp://
>> flink@172.16.0.2:42888/user/taskmanager) as
>> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1.
>> Current number of alive task slots is 12.
>>
>> 2016-07-07 11:32:40,363 WARN  akka.remote.ReliableDeliverySupervisor -
>> Association with remote system [akka.tcp://flink@172.16.0.2:42888] has
>> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>
>> 2016-07-07 11:32:42,722 INFO
>>  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager
>> at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as
>> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2.
>> Current number of alive task slots is 24.
>>
>> 2016-07-07 11:33:15,316 WARN  Remoting - Tried to associate with
>> unreachable remote address [akka.tcp://flink@172.16.0.2:42888]. Address
>> is now gated for 5000 ms, all messages to this address will be delivered to
>> dead letters. Reason: Connection refused: /172.16.0.2:42888
>>
>> 2016-07-07 11:33:15,320 INFO
>>  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://
>> flink@172.16.0.2:42888/user/taskmanager terminated.
>> 2016-07-07 11:33:15,320 INFO
>>  org.apache.flink.runtime.instance.InstanceManager - Unregistered task
>> manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number of
>> registered task managers 1. Number of available slots 12.
>>
>>
>> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>
>>> No that should suffice. Can you check whether there are any task
>>> manager logs for the second TM on that machine
>>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
>>> manager process does start up and there is another problem. If not,
>>> the task managers seems not to start even.
>>>
>>> – Ufuk
>>>
>>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <es...@gmail.com>
>>> wrote:
>>> > I tried to run more than one task manager per node by duplicating the
>>> slave
>>> > IPs. At startup it says for example,
>>> >
>>> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
>>> > Starting taskmanager daemon on host j-011.
>>> >
>>> > but I only see 1 task manager process running.
>>> >
>>> > Is there anything else I need to do?
>>> >
>>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>> >>
>>> >> Yes, exactly.
>>> >>
>>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <es...@gmail.com>
>>> >> wrote:
>>> >> > Thank you, yes, it can be done externally, if not supported within
>>> >> > Flink.
>>> >> >
>>> >> > So the way to spawn multiple task managers would be to list the same
>>> >> > slave
>>> >> > machines N times as necessary in the slaves file?
>>> >> >
>>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org>
>>> wrote:
>>> >> >>
>>> >> >> No, not inside of Flink. That sounds like something like the OS or
>>> >> >> resource manager should handle.
>>> >> >>
>>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <
>>> esaliya@gmail.com>
>>> >> >> wrote:
>>> >> >> > That's great, so is there support to pin task managers to
>>> sockets as
>>> >> >> > well?
>>> >> >> >
>>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org>
>>> wrote:
>>> >> >> >>
>>> >> >> >> Regarding 2) if you don't manually configure something else,
>>> that
>>> >> >> >> should happen always.
>>> >> >> >>
>>> >> >> >> Yes, you can run more than one task manager per node depending
>>> on
>>> >> >> >> the
>>> >> >> >> process isolation you want. Within a task manager, there are
>>> >> >> >> multiple
>>> >> >> >> threads for each slot. For example, if you have 2 task managers
>>> with
>>> >> >> >> 2
>>> >> >> >> slots each and submit a job with parallelism 4, each task
>>> manager
>>> >> >> >> will
>>> >> >> >> execute 2 sub tasks in separate Threads.
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <
>>> esaliya@gmail.com>
>>> >> >> >> wrote:
>>> >> >> >> > Hi Ufuk,
>>> >> >> >> >
>>> >> >> >> > Looking at the document you sent it seems only 1 task manager
>>> per
>>> >> >> >> > node
>>> >> >> >> > exist
>>> >> >> >> > and within that you have multiple slots. Is it possible to run
>>> >> >> >> > more
>>> >> >> >> > than
>>> >> >> >> > 1
>>> >> >> >> > task manager per node? Also, within a task manager is the
>>> >> >> >> > parallelism
>>> >> >> >> > done
>>> >> >> >> > through threads or processes?
>>> >> >> >> >
>>> >> >> >> > Thank you,
>>> >> >> >> > Saliya
>>> >> >> >> >
>>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
>>> >> >> >> > <es...@gmail.com>
>>> >> >> >> > wrote:
>>> >> >> >> >>
>>> >> >> >> >> Thank you, I'll check these.
>>> >> >> >> >>
>>> >> >> >> >> In 2.) you said they are likely to exchange through memory.
>>> Is
>>> >> >> >> >> there
>>> >> >> >> >> a
>>> >> >> >> >> case why they wouldn't?
>>> >> >> >> >>
>>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uce@apache.org
>>> >
>>> >> >> >> >> wrote:
>>> >> >> >> >>>
>>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
>>> >> >> >> >>> <es...@gmail.com>
>>> >> >> >> >>> wrote:
>>> >> >> >> >>> > 1. What parameters are available to control parallelism
>>> within
>>> >> >> >> >>> > a
>>> >> >> >> >>> > node?
>>> >> >> >> >>>
>>> >> >> >> >>> Task Manager processing slots:
>>> >> >> >> >>>
>>> >> >> >> >>>
>>> >> >> >> >>>
>>> >> >> >> >>>
>>> >> >> >> >>>
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>>> >> >> >> >>>
>>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging
>>> within a
>>> >> >> >> >>> > node
>>> >> >> >> >>> > (without
>>> >> >> >> >>> > doing TCP calls)?
>>> >> >> >> >>>
>>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP, for
>>> example
>>> >> >> >> >>> if
>>> >> >> >> >>> you
>>> >> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are
>>> likely
>>> >> >> >> >>> to
>>> >> >> >> >>> exchange data locally.
>>> >> >> >> >>>
>>> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
>>> >> >> >> >>>
>>> >> >> >> >>> No, not that I'm aware of.
>>> >> >> >> >>>
>>> >> >> >> >>> – Ufuk
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> --
>>> >> >> >> >> Saliya Ekanayake
>>> >> >> >> >> Ph.D. Candidate | Research Assistant
>>> >> >> >> >> School of Informatics and Computing | Digital Science Center
>>> >> >> >> >> Indiana University, Bloomington
>>> >> >> >> >>
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > --
>>> >> >> >> > Saliya Ekanayake
>>> >> >> >> > Ph.D. Candidate | Research Assistant
>>> >> >> >> > School of Informatics and Computing | Digital Science Center
>>> >> >> >> > Indiana University, Bloomington
>>> >> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > --
>>> >> >> > Saliya Ekanayake
>>> >> >> > Ph.D. Candidate | Research Assistant
>>> >> >> > School of Informatics and Computing | Digital Science Center
>>> >> >> > Indiana University, Bloomington
>>> >> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Saliya Ekanayake
>>> >> > Ph.D. Candidate | Research Assistant
>>> >> > School of Informatics and Computing | Digital Science Center
>>> >> > Indiana University, Bloomington
>>> >> >
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Saliya Ekanayake
>>> > Ph.D. Candidate | Research Assistant
>>> > School of Informatics and Computing | Digital Science Center
>>> > Indiana University, Bloomington
>>> >
>>>
>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>>
>>
>


-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Parameters to Control Intra-node Parallelism

Posted by Robert Metzger <rm...@apache.org>.
Hi,
from the TaskManager logs, I can not see anything suspicious.
Its a bit weird that the TaskManager logs just end, without any shutdown
messages. Usually the TMs log some shut down stuff when they are stopping.
Also, if they would be still running, I would expect some error messages
from akka about the connection status.
So the only thing I conclude is that one of the TMs was killed by the OS or
the JVM crashed. Did you check if that happened?

Do you have any service like puppet that is controlling processes?


On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <es...@gmail.com> wrote:

> I see two logs (attached), but there's only 1 TaskManger process. Also,
> the Web console says it can find only 1 TM.
>
> However, I see this part in JM log, which shows there was a second TM at
> one point, but it was unregistered. Any thoughts?
>
> --------------------------
>
> - Registered TaskManager at j-002 (akka.tcp://
> flink@172.16.0.2:42888/user/taskmanager) as
> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1.
> Current number of alive task slots is 12.
>
> 2016-07-07 11:32:40,363 WARN  akka.remote.ReliableDeliverySupervisor -
> Association with remote system [akka.tcp://flink@172.16.0.2:42888] has
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>
> 2016-07-07 11:32:42,722 INFO
>  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager
> at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as
> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2.
> Current number of alive task slots is 24.
>
> 2016-07-07 11:33:15,316 WARN  Remoting - Tried to associate with
> unreachable remote address [akka.tcp://flink@172.16.0.2:42888]. Address
> is now gated for 5000 ms, all messages to this address will be delivered to
> dead letters. Reason: Connection refused: /172.16.0.2:42888
>
> 2016-07-07 11:33:15,320 INFO
>  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://
> flink@172.16.0.2:42888/user/taskmanager terminated.
> 2016-07-07 11:33:15,320 INFO
>  org.apache.flink.runtime.instance.InstanceManager - Unregistered task
> manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number of
> registered task managers 1. Number of available slots 12.
>
>
> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <uc...@apache.org> wrote:
>
>> No that should suffice. Can you check whether there are any task
>> manager logs for the second TM on that machine
>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
>> manager process does start up and there is another problem. If not,
>> the task managers seems not to start even.
>>
>> – Ufuk
>>
>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <es...@gmail.com>
>> wrote:
>> > I tried to run more than one task manager per node by duplicating the
>> slave
>> > IPs. At startup it says for example,
>> >
>> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
>> > Starting taskmanager daemon on host j-011.
>> >
>> > but I only see 1 task manager process running.
>> >
>> > Is there anything else I need to do?
>> >
>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uc...@apache.org> wrote:
>> >>
>> >> Yes, exactly.
>> >>
>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <es...@gmail.com>
>> >> wrote:
>> >> > Thank you, yes, it can be done externally, if not supported within
>> >> > Flink.
>> >> >
>> >> > So the way to spawn multiple task managers would be to list the same
>> >> > slave
>> >> > machines N times as necessary in the slaves file?
>> >> >
>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org> wrote:
>> >> >>
>> >> >> No, not inside of Flink. That sounds like something like the OS or
>> >> >> resource manager should handle.
>> >> >>
>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <esaliya@gmail.com
>> >
>> >> >> wrote:
>> >> >> > That's great, so is there support to pin task managers to sockets
>> as
>> >> >> > well?
>> >> >> >
>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org>
>> wrote:
>> >> >> >>
>> >> >> >> Regarding 2) if you don't manually configure something else, that
>> >> >> >> should happen always.
>> >> >> >>
>> >> >> >> Yes, you can run more than one task manager per node depending on
>> >> >> >> the
>> >> >> >> process isolation you want. Within a task manager, there are
>> >> >> >> multiple
>> >> >> >> threads for each slot. For example, if you have 2 task managers
>> with
>> >> >> >> 2
>> >> >> >> slots each and submit a job with parallelism 4, each task manager
>> >> >> >> will
>> >> >> >> execute 2 sub tasks in separate Threads.
>> >> >> >>
>> >> >> >>
>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <
>> esaliya@gmail.com>
>> >> >> >> wrote:
>> >> >> >> > Hi Ufuk,
>> >> >> >> >
>> >> >> >> > Looking at the document you sent it seems only 1 task manager
>> per
>> >> >> >> > node
>> >> >> >> > exist
>> >> >> >> > and within that you have multiple slots. Is it possible to run
>> >> >> >> > more
>> >> >> >> > than
>> >> >> >> > 1
>> >> >> >> > task manager per node? Also, within a task manager is the
>> >> >> >> > parallelism
>> >> >> >> > done
>> >> >> >> > through threads or processes?
>> >> >> >> >
>> >> >> >> > Thank you,
>> >> >> >> > Saliya
>> >> >> >> >
>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
>> >> >> >> > <es...@gmail.com>
>> >> >> >> > wrote:
>> >> >> >> >>
>> >> >> >> >> Thank you, I'll check these.
>> >> >> >> >>
>> >> >> >> >> In 2.) you said they are likely to exchange through memory. Is
>> >> >> >> >> there
>> >> >> >> >> a
>> >> >> >> >> case why they wouldn't?
>> >> >> >> >>
>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uc...@apache.org>
>> >> >> >> >> wrote:
>> >> >> >> >>>
>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
>> >> >> >> >>> <es...@gmail.com>
>> >> >> >> >>> wrote:
>> >> >> >> >>> > 1. What parameters are available to control parallelism
>> within
>> >> >> >> >>> > a
>> >> >> >> >>> > node?
>> >> >> >> >>>
>> >> >> >> >>> Task Manager processing slots:
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>>
>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>> >> >> >> >>>
>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging within
>> a
>> >> >> >> >>> > node
>> >> >> >> >>> > (without
>> >> >> >> >>> > doing TCP calls)?
>> >> >> >> >>>
>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP, for
>> example
>> >> >> >> >>> if
>> >> >> >> >>> you
>> >> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are
>> likely
>> >> >> >> >>> to
>> >> >> >> >>> exchange data locally.
>> >> >> >> >>>
>> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
>> >> >> >> >>>
>> >> >> >> >>> No, not that I'm aware of.
>> >> >> >> >>>
>> >> >> >> >>> – Ufuk
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> --
>> >> >> >> >> Saliya Ekanayake
>> >> >> >> >> Ph.D. Candidate | Research Assistant
>> >> >> >> >> School of Informatics and Computing | Digital Science Center
>> >> >> >> >> Indiana University, Bloomington
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Saliya Ekanayake
>> >> >> >> > Ph.D. Candidate | Research Assistant
>> >> >> >> > School of Informatics and Computing | Digital Science Center
>> >> >> >> > Indiana University, Bloomington
>> >> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Saliya Ekanayake
>> >> >> > Ph.D. Candidate | Research Assistant
>> >> >> > School of Informatics and Computing | Digital Science Center
>> >> >> > Indiana University, Bloomington
>> >> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Saliya Ekanayake
>> >> > Ph.D. Candidate | Research Assistant
>> >> > School of Informatics and Computing | Digital Science Center
>> >> > Indiana University, Bloomington
>> >> >
>> >
>> >
>> >
>> >
>> > --
>> > Saliya Ekanayake
>> > Ph.D. Candidate | Research Assistant
>> > School of Informatics and Computing | Digital Science Center
>> > Indiana University, Bloomington
>> >
>>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>
>

Re: Parameters to Control Intra-node Parallelism

Posted by Saliya Ekanayake <es...@gmail.com>.
I see two logs (attached), but there's only 1 TaskManger process. Also, the
Web console says it can find only 1 TM.

However, I see this part in JM log, which shows there was a second TM at
one point, but it was unregistered. Any thoughts?

--------------------------

- Registered TaskManager at j-002 (akka.tcp://
flink@172.16.0.2:42888/user/taskmanager) as
1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1.
Current number of alive task slots is 12.

2016-07-07 11:32:40,363 WARN  akka.remote.ReliableDeliverySupervisor -
Association with remote system [akka.tcp://flink@172.16.0.2:42888] has
failed, address is now gated for [5000] ms. Reason is: [Disassociated].

2016-07-07 11:32:42,722 INFO
 org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager
at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as
9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2.
Current number of alive task slots is 24.

2016-07-07 11:33:15,316 WARN  Remoting - Tried to associate with
unreachable remote address [akka.tcp://flink@172.16.0.2:42888]. Address is
now gated for 5000 ms, all messages to this address will be delivered to
dead letters. Reason: Connection refused: /172.16.0.2:42888

2016-07-07 11:33:15,320 INFO
 org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://
flink@172.16.0.2:42888/user/taskmanager terminated.
2016-07-07 11:33:15,320 INFO
 org.apache.flink.runtime.instance.InstanceManager - Unregistered task
manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number of
registered task managers 1. Number of available slots 12.


On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <uc...@apache.org> wrote:

> No that should suffice. Can you check whether there are any task
> manager logs for the second TM on that machine
> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
> manager process does start up and there is another problem. If not,
> the task managers seems not to start even.
>
> – Ufuk
>
> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <es...@gmail.com>
> wrote:
> > I tried to run more than one task manager per node by duplicating the
> slave
> > IPs. At startup it says for example,
> >
> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
> > Starting taskmanager daemon on host j-011.
> >
> > but I only see 1 task manager process running.
> >
> > Is there anything else I need to do?
> >
> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uc...@apache.org> wrote:
> >>
> >> Yes, exactly.
> >>
> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <es...@gmail.com>
> >> wrote:
> >> > Thank you, yes, it can be done externally, if not supported within
> >> > Flink.
> >> >
> >> > So the way to spawn multiple task managers would be to list the same
> >> > slave
> >> > machines N times as necessary in the slaves file?
> >> >
> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org> wrote:
> >> >>
> >> >> No, not inside of Flink. That sounds like something like the OS or
> >> >> resource manager should handle.
> >> >>
> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <es...@gmail.com>
> >> >> wrote:
> >> >> > That's great, so is there support to pin task managers to sockets
> as
> >> >> > well?
> >> >> >
> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org>
> wrote:
> >> >> >>
> >> >> >> Regarding 2) if you don't manually configure something else, that
> >> >> >> should happen always.
> >> >> >>
> >> >> >> Yes, you can run more than one task manager per node depending on
> >> >> >> the
> >> >> >> process isolation you want. Within a task manager, there are
> >> >> >> multiple
> >> >> >> threads for each slot. For example, if you have 2 task managers
> with
> >> >> >> 2
> >> >> >> slots each and submit a job with parallelism 4, each task manager
> >> >> >> will
> >> >> >> execute 2 sub tasks in separate Threads.
> >> >> >>
> >> >> >>
> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <
> esaliya@gmail.com>
> >> >> >> wrote:
> >> >> >> > Hi Ufuk,
> >> >> >> >
> >> >> >> > Looking at the document you sent it seems only 1 task manager
> per
> >> >> >> > node
> >> >> >> > exist
> >> >> >> > and within that you have multiple slots. Is it possible to run
> >> >> >> > more
> >> >> >> > than
> >> >> >> > 1
> >> >> >> > task manager per node? Also, within a task manager is the
> >> >> >> > parallelism
> >> >> >> > done
> >> >> >> > through threads or processes?
> >> >> >> >
> >> >> >> > Thank you,
> >> >> >> > Saliya
> >> >> >> >
> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
> >> >> >> > <es...@gmail.com>
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> Thank you, I'll check these.
> >> >> >> >>
> >> >> >> >> In 2.) you said they are likely to exchange through memory. Is
> >> >> >> >> there
> >> >> >> >> a
> >> >> >> >> case why they wouldn't?
> >> >> >> >>
> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uc...@apache.org>
> >> >> >> >> wrote:
> >> >> >> >>>
> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
> >> >> >> >>> <es...@gmail.com>
> >> >> >> >>> wrote:
> >> >> >> >>> > 1. What parameters are available to control parallelism
> within
> >> >> >> >>> > a
> >> >> >> >>> > node?
> >> >> >> >>>
> >> >> >> >>> Task Manager processing slots:
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
> >> >> >> >>>
> >> >> >> >>> > 2. Does Flink support shared memory-based messaging within a
> >> >> >> >>> > node
> >> >> >> >>> > (without
> >> >> >> >>> > doing TCP calls)?
> >> >> >> >>>
> >> >> >> >>> Yes, local exchanges happen via memory and not TCP, for
> example
> >> >> >> >>> if
> >> >> >> >>> you
> >> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are
> likely
> >> >> >> >>> to
> >> >> >> >>> exchange data locally.
> >> >> >> >>>
> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
> >> >> >> >>>
> >> >> >> >>> No, not that I'm aware of.
> >> >> >> >>>
> >> >> >> >>> – Ufuk
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Saliya Ekanayake
> >> >> >> >> Ph.D. Candidate | Research Assistant
> >> >> >> >> School of Informatics and Computing | Digital Science Center
> >> >> >> >> Indiana University, Bloomington
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> > Saliya Ekanayake
> >> >> >> > Ph.D. Candidate | Research Assistant
> >> >> >> > School of Informatics and Computing | Digital Science Center
> >> >> >> > Indiana University, Bloomington
> >> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Saliya Ekanayake
> >> >> > Ph.D. Candidate | Research Assistant
> >> >> > School of Informatics and Computing | Digital Science Center
> >> >> > Indiana University, Bloomington
> >> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Saliya Ekanayake
> >> > Ph.D. Candidate | Research Assistant
> >> > School of Informatics and Computing | Digital Science Center
> >> > Indiana University, Bloomington
> >> >
> >
> >
> >
> >
> > --
> > Saliya Ekanayake
> > Ph.D. Candidate | Research Assistant
> > School of Informatics and Computing | Digital Science Center
> > Indiana University, Bloomington
> >
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Parameters to Control Intra-node Parallelism

Posted by Ufuk Celebi <uc...@apache.org>.
No that should suffice. Can you check whether there are any task
manager logs for the second TM on that machine
(taskmanager-X-j-011.log where X is the TM number)? If yes, the task
manager process does start up and there is another problem. If not,
the task managers seems not to start even.

– Ufuk

On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <es...@gmail.com> wrote:
> I tried to run more than one task manager per node by duplicating the slave
> IPs. At startup it says for example,
>
> [INFO] 1 instance(s) of taskmanager are already running on j-011.
> Starting taskmanager daemon on host j-011.
>
> but I only see 1 task manager process running.
>
> Is there anything else I need to do?
>
> On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>
>> Yes, exactly.
>>
>> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <es...@gmail.com>
>> wrote:
>> > Thank you, yes, it can be done externally, if not supported within
>> > Flink.
>> >
>> > So the way to spawn multiple task managers would be to list the same
>> > slave
>> > machines N times as necessary in the slaves file?
>> >
>> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org> wrote:
>> >>
>> >> No, not inside of Flink. That sounds like something like the OS or
>> >> resource manager should handle.
>> >>
>> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <es...@gmail.com>
>> >> wrote:
>> >> > That's great, so is there support to pin task managers to sockets as
>> >> > well?
>> >> >
>> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org> wrote:
>> >> >>
>> >> >> Regarding 2) if you don't manually configure something else, that
>> >> >> should happen always.
>> >> >>
>> >> >> Yes, you can run more than one task manager per node depending on
>> >> >> the
>> >> >> process isolation you want. Within a task manager, there are
>> >> >> multiple
>> >> >> threads for each slot. For example, if you have 2 task managers with
>> >> >> 2
>> >> >> slots each and submit a job with parallelism 4, each task manager
>> >> >> will
>> >> >> execute 2 sub tasks in separate Threads.
>> >> >>
>> >> >>
>> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <es...@gmail.com>
>> >> >> wrote:
>> >> >> > Hi Ufuk,
>> >> >> >
>> >> >> > Looking at the document you sent it seems only 1 task manager per
>> >> >> > node
>> >> >> > exist
>> >> >> > and within that you have multiple slots. Is it possible to run
>> >> >> > more
>> >> >> > than
>> >> >> > 1
>> >> >> > task manager per node? Also, within a task manager is the
>> >> >> > parallelism
>> >> >> > done
>> >> >> > through threads or processes?
>> >> >> >
>> >> >> > Thank you,
>> >> >> > Saliya
>> >> >> >
>> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
>> >> >> > <es...@gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Thank you, I'll check these.
>> >> >> >>
>> >> >> >> In 2.) you said they are likely to exchange through memory. Is
>> >> >> >> there
>> >> >> >> a
>> >> >> >> case why they wouldn't?
>> >> >> >>
>> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uc...@apache.org>
>> >> >> >> wrote:
>> >> >> >>>
>> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
>> >> >> >>> <es...@gmail.com>
>> >> >> >>> wrote:
>> >> >> >>> > 1. What parameters are available to control parallelism within
>> >> >> >>> > a
>> >> >> >>> > node?
>> >> >> >>>
>> >> >> >>> Task Manager processing slots:
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>> >> >> >>>
>> >> >> >>> > 2. Does Flink support shared memory-based messaging within a
>> >> >> >>> > node
>> >> >> >>> > (without
>> >> >> >>> > doing TCP calls)?
>> >> >> >>>
>> >> >> >>> Yes, local exchanges happen via memory and not TCP, for example
>> >> >> >>> if
>> >> >> >>> you
>> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are likely
>> >> >> >>> to
>> >> >> >>> exchange data locally.
>> >> >> >>>
>> >> >> >>> > 3. Is there support for Infiniband interconnect?
>> >> >> >>>
>> >> >> >>> No, not that I'm aware of.
>> >> >> >>>
>> >> >> >>> – Ufuk
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Saliya Ekanayake
>> >> >> >> Ph.D. Candidate | Research Assistant
>> >> >> >> School of Informatics and Computing | Digital Science Center
>> >> >> >> Indiana University, Bloomington
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Saliya Ekanayake
>> >> >> > Ph.D. Candidate | Research Assistant
>> >> >> > School of Informatics and Computing | Digital Science Center
>> >> >> > Indiana University, Bloomington
>> >> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Saliya Ekanayake
>> >> > Ph.D. Candidate | Research Assistant
>> >> > School of Informatics and Computing | Digital Science Center
>> >> > Indiana University, Bloomington
>> >> >
>> >
>> >
>> >
>> >
>> > --
>> > Saliya Ekanayake
>> > Ph.D. Candidate | Research Assistant
>> > School of Informatics and Computing | Digital Science Center
>> > Indiana University, Bloomington
>> >
>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>

Re: Parameters to Control Intra-node Parallelism

Posted by Saliya Ekanayake <es...@gmail.com>.
I tried to run more than one task manager per node by duplicating the slave
IPs. At startup it says for example,

[INFO] 1 instance(s) of taskmanager are already running on j-011.
Starting taskmanager daemon on host j-011.

but I only see 1 task manager process running.

Is there anything else I need to do?

On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <uc...@apache.org> wrote:

> Yes, exactly.
>
> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <es...@gmail.com>
> wrote:
> > Thank you, yes, it can be done externally, if not supported within Flink.
> >
> > So the way to spawn multiple task managers would be to list the same
> slave
> > machines N times as necessary in the slaves file?
> >
> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org> wrote:
> >>
> >> No, not inside of Flink. That sounds like something like the OS or
> >> resource manager should handle.
> >>
> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <es...@gmail.com>
> >> wrote:
> >> > That's great, so is there support to pin task managers to sockets as
> >> > well?
> >> >
> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org> wrote:
> >> >>
> >> >> Regarding 2) if you don't manually configure something else, that
> >> >> should happen always.
> >> >>
> >> >> Yes, you can run more than one task manager per node depending on the
> >> >> process isolation you want. Within a task manager, there are multiple
> >> >> threads for each slot. For example, if you have 2 task managers with
> 2
> >> >> slots each and submit a job with parallelism 4, each task manager
> will
> >> >> execute 2 sub tasks in separate Threads.
> >> >>
> >> >>
> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <es...@gmail.com>
> >> >> wrote:
> >> >> > Hi Ufuk,
> >> >> >
> >> >> > Looking at the document you sent it seems only 1 task manager per
> >> >> > node
> >> >> > exist
> >> >> > and within that you have multiple slots. Is it possible to run more
> >> >> > than
> >> >> > 1
> >> >> > task manager per node? Also, within a task manager is the
> parallelism
> >> >> > done
> >> >> > through threads or processes?
> >> >> >
> >> >> > Thank you,
> >> >> > Saliya
> >> >> >
> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake <
> esaliya@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> Thank you, I'll check these.
> >> >> >>
> >> >> >> In 2.) you said they are likely to exchange through memory. Is
> there
> >> >> >> a
> >> >> >> case why they wouldn't?
> >> >> >>
> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uc...@apache.org>
> wrote:
> >> >> >>>
> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
> >> >> >>> <es...@gmail.com>
> >> >> >>> wrote:
> >> >> >>> > 1. What parameters are available to control parallelism within
> a
> >> >> >>> > node?
> >> >> >>>
> >> >> >>> Task Manager processing slots:
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
> >> >> >>>
> >> >> >>> > 2. Does Flink support shared memory-based messaging within a
> node
> >> >> >>> > (without
> >> >> >>> > doing TCP calls)?
> >> >> >>>
> >> >> >>> Yes, local exchanges happen via memory and not TCP, for example
> if
> >> >> >>> you
> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are likely
> to
> >> >> >>> exchange data locally.
> >> >> >>>
> >> >> >>> > 3. Is there support for Infiniband interconnect?
> >> >> >>>
> >> >> >>> No, not that I'm aware of.
> >> >> >>>
> >> >> >>> – Ufuk
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Saliya Ekanayake
> >> >> >> Ph.D. Candidate | Research Assistant
> >> >> >> School of Informatics and Computing | Digital Science Center
> >> >> >> Indiana University, Bloomington
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Saliya Ekanayake
> >> >> > Ph.D. Candidate | Research Assistant
> >> >> > School of Informatics and Computing | Digital Science Center
> >> >> > Indiana University, Bloomington
> >> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Saliya Ekanayake
> >> > Ph.D. Candidate | Research Assistant
> >> > School of Informatics and Computing | Digital Science Center
> >> > Indiana University, Bloomington
> >> >
> >
> >
> >
> >
> > --
> > Saliya Ekanayake
> > Ph.D. Candidate | Research Assistant
> > School of Informatics and Computing | Digital Science Center
> > Indiana University, Bloomington
> >
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Parameters to Control Intra-node Parallelism

Posted by Ufuk Celebi <uc...@apache.org>.
Yes, exactly.

On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <es...@gmail.com> wrote:
> Thank you, yes, it can be done externally, if not supported within Flink.
>
> So the way to spawn multiple task managers would be to list the same slave
> machines N times as necessary in the slaves file?
>
> On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>
>> No, not inside of Flink. That sounds like something like the OS or
>> resource manager should handle.
>>
>> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <es...@gmail.com>
>> wrote:
>> > That's great, so is there support to pin task managers to sockets as
>> > well?
>> >
>> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org> wrote:
>> >>
>> >> Regarding 2) if you don't manually configure something else, that
>> >> should happen always.
>> >>
>> >> Yes, you can run more than one task manager per node depending on the
>> >> process isolation you want. Within a task manager, there are multiple
>> >> threads for each slot. For example, if you have 2 task managers with 2
>> >> slots each and submit a job with parallelism 4, each task manager will
>> >> execute 2 sub tasks in separate Threads.
>> >>
>> >>
>> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <es...@gmail.com>
>> >> wrote:
>> >> > Hi Ufuk,
>> >> >
>> >> > Looking at the document you sent it seems only 1 task manager per
>> >> > node
>> >> > exist
>> >> > and within that you have multiple slots. Is it possible to run more
>> >> > than
>> >> > 1
>> >> > task manager per node? Also, within a task manager is the parallelism
>> >> > done
>> >> > through threads or processes?
>> >> >
>> >> > Thank you,
>> >> > Saliya
>> >> >
>> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake <es...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Thank you, I'll check these.
>> >> >>
>> >> >> In 2.) you said they are likely to exchange through memory. Is there
>> >> >> a
>> >> >> case why they wouldn't?
>> >> >>
>> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uc...@apache.org> wrote:
>> >> >>>
>> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
>> >> >>> <es...@gmail.com>
>> >> >>> wrote:
>> >> >>> > 1. What parameters are available to control parallelism within a
>> >> >>> > node?
>> >> >>>
>> >> >>> Task Manager processing slots:
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>> >> >>>
>> >> >>> > 2. Does Flink support shared memory-based messaging within a node
>> >> >>> > (without
>> >> >>> > doing TCP calls)?
>> >> >>>
>> >> >>> Yes, local exchanges happen via memory and not TCP, for example if
>> >> >>> you
>> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are likely to
>> >> >>> exchange data locally.
>> >> >>>
>> >> >>> > 3. Is there support for Infiniband interconnect?
>> >> >>>
>> >> >>> No, not that I'm aware of.
>> >> >>>
>> >> >>> – Ufuk
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Saliya Ekanayake
>> >> >> Ph.D. Candidate | Research Assistant
>> >> >> School of Informatics and Computing | Digital Science Center
>> >> >> Indiana University, Bloomington
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Saliya Ekanayake
>> >> > Ph.D. Candidate | Research Assistant
>> >> > School of Informatics and Computing | Digital Science Center
>> >> > Indiana University, Bloomington
>> >> >
>> >
>> >
>> >
>> >
>> > --
>> > Saliya Ekanayake
>> > Ph.D. Candidate | Research Assistant
>> > School of Informatics and Computing | Digital Science Center
>> > Indiana University, Bloomington
>> >
>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>

Re: Parameters to Control Intra-node Parallelism

Posted by Saliya Ekanayake <es...@gmail.com>.
Thank you, yes, it can be done externally, if not supported within Flink.

So the way to spawn multiple task managers would be to list the same slave
machines N times as necessary in the slaves file?

On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <uc...@apache.org> wrote:

> No, not inside of Flink. That sounds like something like the OS or
> resource manager should handle.
>
> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <es...@gmail.com>
> wrote:
> > That's great, so is there support to pin task managers to sockets as
> well?
> >
> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org> wrote:
> >>
> >> Regarding 2) if you don't manually configure something else, that
> >> should happen always.
> >>
> >> Yes, you can run more than one task manager per node depending on the
> >> process isolation you want. Within a task manager, there are multiple
> >> threads for each slot. For example, if you have 2 task managers with 2
> >> slots each and submit a job with parallelism 4, each task manager will
> >> execute 2 sub tasks in separate Threads.
> >>
> >>
> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <es...@gmail.com>
> >> wrote:
> >> > Hi Ufuk,
> >> >
> >> > Looking at the document you sent it seems only 1 task manager per node
> >> > exist
> >> > and within that you have multiple slots. Is it possible to run more
> than
> >> > 1
> >> > task manager per node? Also, within a task manager is the parallelism
> >> > done
> >> > through threads or processes?
> >> >
> >> > Thank you,
> >> > Saliya
> >> >
> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake <es...@gmail.com>
> >> > wrote:
> >> >>
> >> >> Thank you, I'll check these.
> >> >>
> >> >> In 2.) you said they are likely to exchange through memory. Is there
> a
> >> >> case why they wouldn't?
> >> >>
> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uc...@apache.org> wrote:
> >> >>>
> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake <
> esaliya@gmail.com>
> >> >>> wrote:
> >> >>> > 1. What parameters are available to control parallelism within a
> >> >>> > node?
> >> >>>
> >> >>> Task Manager processing slots:
> >> >>>
> >> >>>
> >> >>>
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
> >> >>>
> >> >>> > 2. Does Flink support shared memory-based messaging within a node
> >> >>> > (without
> >> >>> > doing TCP calls)?
> >> >>>
> >> >>> Yes, local exchanges happen via memory and not TCP, for example if
> you
> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are likely to
> >> >>> exchange data locally.
> >> >>>
> >> >>> > 3. Is there support for Infiniband interconnect?
> >> >>>
> >> >>> No, not that I'm aware of.
> >> >>>
> >> >>> – Ufuk
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Saliya Ekanayake
> >> >> Ph.D. Candidate | Research Assistant
> >> >> School of Informatics and Computing | Digital Science Center
> >> >> Indiana University, Bloomington
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Saliya Ekanayake
> >> > Ph.D. Candidate | Research Assistant
> >> > School of Informatics and Computing | Digital Science Center
> >> > Indiana University, Bloomington
> >> >
> >
> >
> >
> >
> > --
> > Saliya Ekanayake
> > Ph.D. Candidate | Research Assistant
> > School of Informatics and Computing | Digital Science Center
> > Indiana University, Bloomington
> >
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Parameters to Control Intra-node Parallelism

Posted by Ufuk Celebi <uc...@apache.org>.
No, not inside of Flink. That sounds like something like the OS or
resource manager should handle.

On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <es...@gmail.com> wrote:
> That's great, so is there support to pin task managers to sockets as well?
>
> On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>
>> Regarding 2) if you don't manually configure something else, that
>> should happen always.
>>
>> Yes, you can run more than one task manager per node depending on the
>> process isolation you want. Within a task manager, there are multiple
>> threads for each slot. For example, if you have 2 task managers with 2
>> slots each and submit a job with parallelism 4, each task manager will
>> execute 2 sub tasks in separate Threads.
>>
>>
>> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <es...@gmail.com>
>> wrote:
>> > Hi Ufuk,
>> >
>> > Looking at the document you sent it seems only 1 task manager per node
>> > exist
>> > and within that you have multiple slots. Is it possible to run more than
>> > 1
>> > task manager per node? Also, within a task manager is the parallelism
>> > done
>> > through threads or processes?
>> >
>> > Thank you,
>> > Saliya
>> >
>> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake <es...@gmail.com>
>> > wrote:
>> >>
>> >> Thank you, I'll check these.
>> >>
>> >> In 2.) you said they are likely to exchange through memory. Is there a
>> >> case why they wouldn't?
>> >>
>> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uc...@apache.org> wrote:
>> >>>
>> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake <es...@gmail.com>
>> >>> wrote:
>> >>> > 1. What parameters are available to control parallelism within a
>> >>> > node?
>> >>>
>> >>> Task Manager processing slots:
>> >>>
>> >>>
>> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>> >>>
>> >>> > 2. Does Flink support shared memory-based messaging within a node
>> >>> > (without
>> >>> > doing TCP calls)?
>> >>>
>> >>> Yes, local exchanges happen via memory and not TCP, for example if you
>> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are likely to
>> >>> exchange data locally.
>> >>>
>> >>> > 3. Is there support for Infiniband interconnect?
>> >>>
>> >>> No, not that I'm aware of.
>> >>>
>> >>> – Ufuk
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Saliya Ekanayake
>> >> Ph.D. Candidate | Research Assistant
>> >> School of Informatics and Computing | Digital Science Center
>> >> Indiana University, Bloomington
>> >>
>> >
>> >
>> >
>> > --
>> > Saliya Ekanayake
>> > Ph.D. Candidate | Research Assistant
>> > School of Informatics and Computing | Digital Science Center
>> > Indiana University, Bloomington
>> >
>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>

Re: Parameters to Control Intra-node Parallelism

Posted by Saliya Ekanayake <es...@gmail.com>.
That's great, so is there support to pin task managers to sockets as well?

On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <uc...@apache.org> wrote:

> Regarding 2) if you don't manually configure something else, that
> should happen always.
>
> Yes, you can run more than one task manager per node depending on the
> process isolation you want. Within a task manager, there are multiple
> threads for each slot. For example, if you have 2 task managers with 2
> slots each and submit a job with parallelism 4, each task manager will
> execute 2 sub tasks in separate Threads.
>
>
> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <es...@gmail.com>
> wrote:
> > Hi Ufuk,
> >
> > Looking at the document you sent it seems only 1 task manager per node
> exist
> > and within that you have multiple slots. Is it possible to run more than
> 1
> > task manager per node? Also, within a task manager is the parallelism
> done
> > through threads or processes?
> >
> > Thank you,
> > Saliya
> >
> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake <es...@gmail.com>
> wrote:
> >>
> >> Thank you, I'll check these.
> >>
> >> In 2.) you said they are likely to exchange through memory. Is there a
> >> case why they wouldn't?
> >>
> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uc...@apache.org> wrote:
> >>>
> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake <es...@gmail.com>
> >>> wrote:
> >>> > 1. What parameters are available to control parallelism within a
> node?
> >>>
> >>> Task Manager processing slots:
> >>>
> >>>
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
> >>>
> >>> > 2. Does Flink support shared memory-based messaging within a node
> >>> > (without
> >>> > doing TCP calls)?
> >>>
> >>> Yes, local exchanges happen via memory and not TCP, for example if you
> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are likely to
> >>> exchange data locally.
> >>>
> >>> > 3. Is there support for Infiniband interconnect?
> >>>
> >>> No, not that I'm aware of.
> >>>
> >>> – Ufuk
> >>
> >>
> >>
> >>
> >> --
> >> Saliya Ekanayake
> >> Ph.D. Candidate | Research Assistant
> >> School of Informatics and Computing | Digital Science Center
> >> Indiana University, Bloomington
> >>
> >
> >
> >
> > --
> > Saliya Ekanayake
> > Ph.D. Candidate | Research Assistant
> > School of Informatics and Computing | Digital Science Center
> > Indiana University, Bloomington
> >
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Parameters to Control Intra-node Parallelism

Posted by Ufuk Celebi <uc...@apache.org>.
Regarding 2) if you don't manually configure something else, that
should happen always.

Yes, you can run more than one task manager per node depending on the
process isolation you want. Within a task manager, there are multiple
threads for each slot. For example, if you have 2 task managers with 2
slots each and submit a job with parallelism 4, each task manager will
execute 2 sub tasks in separate Threads.


On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <es...@gmail.com> wrote:
> Hi Ufuk,
>
> Looking at the document you sent it seems only 1 task manager per node exist
> and within that you have multiple slots. Is it possible to run more than 1
> task manager per node? Also, within a task manager is the parallelism done
> through threads or processes?
>
> Thank you,
> Saliya
>
> On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake <es...@gmail.com> wrote:
>>
>> Thank you, I'll check these.
>>
>> In 2.) you said they are likely to exchange through memory. Is there a
>> case why they wouldn't?
>>
>> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uc...@apache.org> wrote:
>>>
>>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake <es...@gmail.com>
>>> wrote:
>>> > 1. What parameters are available to control parallelism within a node?
>>>
>>> Task Manager processing slots:
>>>
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>>>
>>> > 2. Does Flink support shared memory-based messaging within a node
>>> > (without
>>> > doing TCP calls)?
>>>
>>> Yes, local exchanges happen via memory and not TCP, for example if you
>>> have a map-reduce, map subtask 1 and reduce subtask 1 are likely to
>>> exchange data locally.
>>>
>>> > 3. Is there support for Infiniband interconnect?
>>>
>>> No, not that I'm aware of.
>>>
>>> – Ufuk
>>
>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>

Re: Parameters to Control Intra-node Parallelism

Posted by Saliya Ekanayake <es...@gmail.com>.
Hi Ufuk,

Looking at the document you sent it seems only 1 task manager per node
exist and within that you have multiple slots. Is it possible to run more
than 1 task manager per node? Also, within a task manager is the
parallelism done through threads or processes?

Thank you,
Saliya

On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake <es...@gmail.com> wrote:

> Thank you, I'll check these.
>
> In 2.) you said they are likely to exchange through memory. Is there a
> case why they wouldn't?
>
> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uc...@apache.org> wrote:
>
>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake <es...@gmail.com>
>> wrote:
>> > 1. What parameters are available to control parallelism within a node?
>>
>> Task Manager processing slots:
>>
>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>>
>> > 2. Does Flink support shared memory-based messaging within a node
>> (without
>> > doing TCP calls)?
>>
>> Yes, local exchanges happen via memory and not TCP, for example if you
>> have a map-reduce, map subtask 1 and reduce subtask 1 are likely to
>> exchange data locally.
>>
>> > 3. Is there support for Infiniband interconnect?
>>
>> No, not that I'm aware of.
>>
>> – Ufuk
>>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>
>


-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Parameters to Control Intra-node Parallelism

Posted by Saliya Ekanayake <es...@gmail.com>.
Thank you, I'll check these.

In 2.) you said they are likely to exchange through memory. Is there a case
why they wouldn't?

On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <uc...@apache.org> wrote:

> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake <es...@gmail.com>
> wrote:
> > 1. What parameters are available to control parallelism within a node?
>
> Task Manager processing slots:
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>
> > 2. Does Flink support shared memory-based messaging within a node
> (without
> > doing TCP calls)?
>
> Yes, local exchanges happen via memory and not TCP, for example if you
> have a map-reduce, map subtask 1 and reduce subtask 1 are likely to
> exchange data locally.
>
> > 3. Is there support for Infiniband interconnect?
>
> No, not that I'm aware of.
>
> – Ufuk
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Parameters to Control Intra-node Parallelism

Posted by Ufuk Celebi <uc...@apache.org>.
On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake <es...@gmail.com> wrote:
> 1. What parameters are available to control parallelism within a node?

Task Manager processing slots:
https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots

> 2. Does Flink support shared memory-based messaging within a node (without
> doing TCP calls)?

Yes, local exchanges happen via memory and not TCP, for example if you
have a map-reduce, map subtask 1 and reduce subtask 1 are likely to
exchange data locally.

> 3. Is there support for Infiniband interconnect?

No, not that I'm aware of.

– Ufuk