Posted to user@flink.apache.org by Rafael Leira Osuna <ra...@uam.es> on 2018/08/12 16:16:20 UTC

Select node for a Flink DataStream execution

Hi!

I have been searching a lot but I haven't found a solution for this.

Let's suppose some of the steps in the streaming process must be executed
on just a subset of the available nodes/taskmanagers, while the rest of
the tasks are free to be computed anywhere.

**How can I assign a DataStream to be executed ONLY on a subset of
nodes?**

This is required mainly for input/sink tasks, as not every node in the
cluster has the same connectivity / security restrictions.

I'm new to Flink, so please forgive me if I'm asking for something
obvious.

Thanks a lot.

Rafael Leira.

PS: Currently, we have a static standalone Flink cluster.

Re: Select node for a Flink DataStream execution

Posted by Rafael Leira Osuna <ra...@uam.es>.
Hi Vino,

The cluster is not completely homogeneous in CPU/memory, but it is quite
similar. The real issue comes from the networking (which is what we want
to test). The interfaces on each node vary from 100Mb/s to 1Gb/s, 10Gb/s,
25Gb/s, 40Gb/s and 100Gb/s (all Ethernet), and each node has more than one
(different) network link, each with its own Flink jobmanager (yes, more
than one jobmanager per node).

That is, if we want to measure latencies, throughputs, etc., we need to
know, and control, where each task is deployed for each test / measurement.

I think some of the tests could be done anyway by stopping the jobmanagers
that are not under test, plus some annoying trial and error.

Anyway, your knowledge helped us plan our tests.

Rafael.

Re: Select node for a Flink DataStream execution

Posted by vino yang <ya...@gmail.com>.
Hi Rafael,

One thing in your needs is achievable: *having 2 different tasks executed
on 2 different nodes*.
If each node provides a single slot and operator chaining is disabled, the
tasks will end up on different nodes, but you can't choose which node runs
which task.
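For example, a rough sketch in the Java DataStream API of what I mean (the
class name, the input data and the slot sharing group names are only
placeholders, and it assumes every TaskManager is configured with
taskmanager.numberOfTaskSlots: 1):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotPerNodeSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Placeholder input; any source would do.
            DataStream<String> lines = env.fromElements("foo", "", "bar");

            lines
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase();
                    }
                })
                .slotSharingGroup("group-a")   // request a slot outside the default group
                .disableChaining()             // prevent chaining with neighbouring operators
                .filter(value -> !value.isEmpty())
                .slotSharingGroup("group-b")   // another group, hence another slot;
                                               // with one slot per TaskManager, another node
                .print();

            env.execute("slot-per-node sketch");
        }
    }

This spreads the tasks over nodes, but the scheduler still decides which
concrete node gets which slot.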
I don't understand why you need to artificially specify this mapping.
Do you have very different machine configurations in your cluster, or do
you have other considerations?

Thanks, vino.

Re: Select node for a Flink DataStream execution

Posted by Rafael Leira Osuna <ra...@uam.es>.
Thanks Vino,

As an example of what we need: we want 2 different tasks executed on 2
different nodes, and we want to choose which one is going to execute what.

Given what you have taught me:
- A job can be assigned to a node in 1.6 with YARN
- A set of tasks (a DataStream) is executed inside a job
- Two different jobs can't talk to each other, nor can their tasks

I would assume that what we want is not possible in Flink (excluding
custom solutions like sockets, or third parties like Kafka).

Thanks for your help!
Rafael.

Re: Select node for a Flink DataStream execution

Posted by vino yang <ya...@gmail.com>.
Hi Rafael,

Flink does not support direct interaction between DataStreams belonging to
two different jobs.
I don't know what your scenario is.
Usually, if you need two streams to interact, you can bring them into the
same job.
You can do this through the DataStream join/connect API.[1]

[1]:
https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/
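For example, a rough sketch with connect() (Java DataStream API; the two
inputs and the merge logic are only placeholders):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.co.CoMapFunction;

    public class ConnectSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Two placeholder inputs that would otherwise live in two separate jobs.
            DataStream<String> words = env.fromElements("flink", "datastream");
            DataStream<Long> numbers = env.fromElements(1L, 2L, 3L);

            // connect() keeps both streams inside one job, so no cross-job plumbing is needed.
            DataStream<String> merged = words
                .connect(numbers)
                .map(new CoMapFunction<String, Long, String>() {
                    @Override
                    public String map1(String value) {
                        return "word: " + value;
                    }

                    @Override
                    public String map2(Long value) {
                        return "number: " + value;
                    }
                });

            merged.print();
            env.execute("connect sketch");
        }
    }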

Thanks, vino.

Re: Select node for a Flink DataStream execution

Posted by Rafael Leira Osuna <ra...@uam.es>.
Thanks for your reply Vino,

I've checked your solution and it may meet my requirements. However, it
raises a new issue: how can 2 different jobs interact with each other? As
said, I'm new at this, and all I know is the basics of creating and
interconnecting different DataStreams within the same job / Java class.
I know a possible solution may be to create TCP sockets and interconnect
the jobs myself, but my final objective here is to evaluate how Flink
performs over different media and data while I measure its vertical
scalability (for an academic study + my PhD). With a custom TCP-socket
solution, I would be testing my own (and maybe not the best)
implementation instead of Flink's.
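For reference, the kind of hand-rolled bridge I mean would look roughly
like this, using Flink's built-in socket sink/source instead of raw
sockets (the host, port, schema and class names are only placeholders):

    // --- ProducerJob.java: job A pushes its results out over a TCP connection ---
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ProducerJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<String> results = env.fromElements("a", "b", "c"); // placeholder pipeline
            // Note: both writeToSocket and socketTextStream act as TCP clients, so a small
            // relay listening on bridge-host:9000 would have to sit between the two jobs.
            results.writeToSocket("bridge-host", 9000, new SimpleStringSchema());
            env.execute("producer job");
        }
    }

    // --- ConsumerJob.java: job B reads whatever job A wrote ---
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ConsumerJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<String> fromJobA = env.socketTextStream("bridge-host", 9000);
            fromJobA.print();
            env.execute("consumer job");
        }
    }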
Is it possible to interconnect two different jobs using Flink's API? I did
not find much on Google, maybe because I don't know what I really should
search for. If there is a way to interconnect 2 jobs, then your proposed
solution should work fine; if not, should I assume Flink's current
implementation won't allow me to pin "job steps" to "defined nodes"?

Thanks again Vino and anyone who helps,
Rafael.

Re: Select node for a Flink DataStream execution

Posted by vino yang <ya...@gmail.com>.
Hi Rafael,

For Standalone clusters, it seems that Flink does not provide such a
feature.
In general, at the execution level, we don't talk about a DataStream but
about a Job.
If your Flink is running on YARN, you can use YARN's Node Label feature to
assign a label to some nodes.
Earlier this year, I resolved an issue that addresses the problem of
specifying a node label when submitting a Flink job on YARN.[1]
*This feature is available in the recently released Flink 1.6.0.*
I don't know whether it meets your requirements.

[1]: https://issues.apache.org/jira/browse/FLINK-7836

Thanks, vino.
