You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by Kay Fleischmann <fl...@gmail.com> on 2014/08/15 13:09:33 UTC

How to execute one operator per Machine that reads a full dataset

Hey guys. Lets say i have a cluster setup with 5 machines.

I am wondering how to distribute an operator (e.g. MapPartition) to all
machines, exactly one per machine. All of them should read a specific
source completely.

As i see flink broadcast variable do not work because i nee two different
sources. Is there an other way?

Kay

Re: How to execute one operator per Machine that reads a full dataset

Posted by Kay Fleischmann <fl...@gmail.com>.
2014-08-15 13:31 GMT+02:00 Stephan Ewen <se...@apache.org>:

> Hey!
>
> You basically want five different pipelines, on different machines?
>
I need 5 different pipelines from the same dataset to each of the 5
machines. One operator e.g. MapPartition() on each node, that reads the
fulldataset by using the Iterator and afterwards can do some operations on
it and write out results to hdfs.

dataset.mapPartition( new MapPartition() ).executeOnePerMachine().save(..)


>
> I think that is simply a different scheduling the tasks, not sharing
> resources. I am working on the scheduler, I can see how to take this
> usecase into account...
>

Nice!, would be really awesome! Can you keep me informed about the
progress. Is there an issue?


>
> Stephan
>
>
>
> On Fri, Aug 15, 2014 at 1:09 PM, Kay Fleischmann <
> fleischmann.kay@gmail.com>
> wrote:
>
> > Hey guys. Lets say i have a cluster setup with 5 machines.
> >
> > I am wondering how to distribute an operator (e.g. MapPartition) to all
> > machines, exactly one per machine. All of them should read a specific
> > source completely.
> >
> > As i see flink broadcast variable do not work because i nee two different
> > sources. Is there an other way?
> >
> > Kay
> >
>

Re: How to execute one operator per Machine that reads a full dataset

Posted by Stephan Ewen <se...@apache.org>.
Hey!

You basically want five different pipelines, on different machines?

I think that is simply a different scheduling the tasks, not sharing
resources. I am working on the scheduler, I can see how to take this
usecase into account...

Stephan



On Fri, Aug 15, 2014 at 1:09 PM, Kay Fleischmann <fl...@gmail.com>
wrote:

> Hey guys. Lets say i have a cluster setup with 5 machines.
>
> I am wondering how to distribute an operator (e.g. MapPartition) to all
> machines, exactly one per machine. All of them should read a specific
> source completely.
>
> As i see flink broadcast variable do not work because i nee two different
> sources. Is there an other way?
>
> Kay
>