Posted to user@beam.apache.org by "Stadin, Benjamin" <Be...@heidelberg-mobil.com> on 2016/05/21 14:22:27 UTC

Force pipe executions to run on same node

Hi,

I need to control Beam pipes/filters so that pipe executions that match certain criteria are executed on the same node.

In Spring XD this can be controlled by defining groups (http://docs.spring.io/spring-xd/docs/1.2.0.RELEASE/reference/html/#deployment) and then specifying deployment criteria to match the group.

Is this possible with Beam?

Best
Ben

Re: Force pipe executions to run on same node

Posted by Davor Bonaci <da...@google.com>.
From a programming model perspective, Beam doesn't provide knobs to control
which nodes execute which tasks. Instead, Beam is a higher-level model where
application logic is abstracted away from the execution engine. It is the
Beam runner's job to do this effectively, without user knobs.

It seems unlikely that you can avoid at least one data transfer -- to the
set of machines that are processing your data. If your application logic
doesn't require shuffling data between machines, a Beam runner should
figure this out and schedule work accordingly. Even further, a good runner
can automatically detect stragglers / bottlenecks and adjust the work
distribution on the fly.

With this, perhaps I can tilt your thinking a bit. With Beam, you focus on
your application logic, and many execution considerations are handled
automatically by the system.

--

Now, I'll comment a bit on your specific use case. It sounds like a
standard use case of job orchestration via Beam, which should work just
fine.

A file size of 1 MB doesn't seem too big. On the other hand, users sometimes
weigh the tradeoff between passing file contents vs. file names through
the Beam pipeline. Depending on your specific application logic, passing
file names could be the right choice, and could potentially minimize the
amount of data transfer.
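(Not Beam API code, just an illustrative sketch in plain Python.) The tradeoff above can be pictured like this: ship short file names between pipeline stages and let the stage that does the work open the file itself, rather than pushing the file bytes through every stage. The file names and contents below are made up.

```python
import os
import tempfile

def make_files(dir_, contents):
    """Create a few sample files (stand-ins for uploaded CAD files)."""
    paths = []
    for i, data in enumerate(contents):
        path = os.path.join(dir_, "part%d.cad" % i)
        with open(path, "w") as f:
            f.write(data)
        paths.append(path)
    return paths

def process_by_name(path):
    """A stage that receives only a file *name*; the bytes are read
    where the work actually runs, not shipped through the pipeline."""
    with open(path) as f:
        return len(f.read())

with tempfile.TemporaryDirectory() as workdir:
    paths = make_files(workdir, ["abc", "de"])
    # Only the short path strings travel between stages, not file contents.
    sizes = [process_by_name(p) for p in paths]

print(sizes)  # -> [3, 2]
```

This only pays off when every worker that might run the stage can reach the files (shared or distributed storage), which is exactly the constraint discussed in this thread.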


Re: Force pipe executions to run on same node

Posted by "Stadin, Benjamin" <Be...@heidelberg-mobil.com>.
Hi Jesse,

I didn't provide more details on this. The input CAD data is small (though rather 20-50 MB per file), but as I said, there is a lot of very IO-bound data processing, which produces a large amount of temporary data (still not Big Data; rather a few hundred MB up to a few GB at most).

This is about distributing many of these local data processing tasks to several nodes, in order to provide a scalable realtime service for users of a web site. So I'd mostly use Beam as a building block for distributing and monitoring jobs, rather than for anything Big Data.

Thanks
Ben

Sent from my iPad


Re: Force pipe executions to run on same node

Posted by Jesse Anderson <je...@smokinghand.com>.
Benjamin,

Sorry, the successes and failures are a bit too nuanced for an email.

A quick check on average CAD files says they're around 1 MB. That'd be a
poor use of HDFS.

Thanks,

Jesse


Re: Force pipe executions to run on same node

Posted by "Stadin, Benjamin" <Be...@heidelberg-mobil.com>.
Hi Jesse,

Yes, this is what I'm looking for. I want to deploy and run the same code, mostly written in Python as well as C++, on different nodes. I also want to benefit from the job distribution and job monitoring / administration capabilities. I only need parallelization to a minor degree later.

Though I'm hesitant to use HDFS, or any other distributed file system. Since I process the data only on one node, it would probably be a big disadvantage for this data to be distributed to other nodes as well via HDFS.

Could you maybe share some info about successful implementations and configurations of such a distributed job engine?

Thanks
Ben


Re: Force pipe executions to run on same node

Posted by Jesse Anderson <je...@smokinghand.com>.
Benjamin,

I've had a few students using Big Data frameworks as a distributed job
engine. They work with varying degrees of success.

With Beam, your success will really depend on the runner, as JB said. If I
understand your use case correctly, then with Hadoop MapReduce you'd be
running a map-only job. Beam would give you the ability to run the same
code on several different execution engines. If that isn't your goal,
you might look elsewhere.
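For readers unfamiliar with the term, here is a minimal plain-Python sketch of what "map-only" means (this is not Hadoop or Beam code; the thread pool merely stands in for cluster workers): every record is transformed independently, so there is no shuffle or reduce phase and no data moves between records.

```python
from multiprocessing.dummy import Pool  # thread pool standing in for workers

def extract(record):
    # Hypothetical per-file "extraction": each record is processed
    # independently, so this is a pure map with no shuffle/reduce phase.
    name, payload = record
    return (name, payload.upper())

records = [("a.cad", "x1"), ("b.cad", "y2")]
with Pool(2) as pool:
    # Map-only: output order matches input order.
    results = pool.map(extract, records)

print(results)  # -> [('a.cad', 'X1'), ('b.cad', 'Y2')]
```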

Thanks,

Jesse


Re: Force pipe executions to run on same node

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Benjamin,

Your data processing doesn't seem to be fully big data oriented and 
distributed.

Maybe Apache Camel is more appropriate for such a scenario. You can always
delegate part of the data processing to Beam from Camel (using a Kafka
topic, for instance).

Regards
JB


Re: Force pipe executions to run on same node

Posted by Emanuele Cesena <em...@shopkick.com>.
Hi Ben,

With (Beam over) Spark over Yarn, data locality is taken into account when containers are created to process data.

So, in principle, say you have your CAD file in HDFS on hostA; then the processing will likely happen on hostA or “near enough”.
Supported localities vary from same process, to same host, to same rack (if configured).
I’m using (virtual/fake) racks to specify virtual machines/containers on the same physical host, and it works like a charm.
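For reference, Hadoop's rack awareness (which Spark-on-YARN locality builds on) is typically configured via the `net.topology.script.file.name` property in core-site.xml, pointing at a script that maps host names to rack paths. Below is a hypothetical mapping script in Python; the host names and rack labels are invented, and the "virtual racks" mirror the trick described above of grouping VMs by physical host.

```python
#!/usr/bin/env python3
# Hypothetical topology script for Hadoop's net.topology.script.file.name:
# given host names as arguments, print one rack path per host.
import sys

# Invented layout: VMs on the same physical host share a "virtual rack".
RACKS = {
    "vm-a1": "/phys-host-a",
    "vm-a2": "/phys-host-a",
    "vm-b1": "/phys-host-b",
}

def rack_for(host):
    # Hosts not in the map fall back to the default rack.
    return RACKS.get(host, "/default-rack")

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_for(host))
```

Whether your cluster uses this exact mechanism (and the exact output contract of the script) depends on the Hadoop/YARN version and distribution, so check the docs for your setup.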

Be aware, though, that the whole spark+yarn model is to provide reliable computation, i.e. it tries to allocate resources where it thinks it’s best, but it may allocate them somewhere else. In my experiments, for instance, a minor percentage of tasks is allocated “far away” for no apparent reason. This is fine for my use case, not sure for yours.

In addition, you need to store your cad files “somewhere” that yarn understands, so for instance hdfs would be the natural choice.

Finally, note that I have experience with Spark over Yarn (actually, without Beam), but Flink also supports running on Yarn. I’m not sure if/how they manage data locality, but they probably do too.

Best,




Re: Force pipe executions to run on same node

Posted by "Stadin, Benjamin" <Be...@heidelberg-mobil.com>.
Is somebody else using Beam for a similar scenario who can advise on a
setup?

Regards
Ben


Am 22.05.16, 23:01 schrieb "Stadin, Benjamin" unter
<Be...@heidelberg-mobil.com>:

>Hi JB,
>
>None so far. I¹m still thinking about how to achieve what I want to do,
>and whether Beam makes sense for my usage scenario.
>
>I¹m mostly interested to just orchestrate tasks to individual machines and
>service endpoints, depending on their workload. My application is not so
>much about Big Data and parallelism, but local data processing and local
>parallelization. 
>
>An example scenario:
>- A user uploads a set of CAD files
>- data from CAD files are extracted in parallel
>- a whole bunch of native tools operate on this extracted data set in an
>own pipe. Due to the amount of data generated and consumed, it doesn¹t
>- data from CAD files are extracted in parallel
>- a whole bunch of native tools operate on this extracted data set in a
>pipe of their own. Due to the amount of data generated and consumed, it
>doesn't make sense at all to distribute these tasks to other machines.
>It's very IO-bound.
>- For the same reason, it doesn't make sense to distribute data using
>RDDs. It's rather favorable to do only some tasks (such as CAD data
>extraction) in parallel, and otherwise run other data tasks as a group on
>a single node, in order to avoid IO bottlenecks.
>
>So I don't have typical Big Data processing in mind. What I'm looking
>for is rather an integrated environment that provides some kind of
>parallel task execution, task management and administration, as well
>as a message bus and event system.
>
>Is Beam a good choice for such a non-Big-Data scenario?
>
>Regards,
>Ben
>
>
>On 21.05.16, 18:59, "Jean-Baptiste Onofré" <jb...@nanthrax.net> wrote:
>
>>Hi Ben,
>>
>>it's not SDK related; it depends more on the runner.
>>
>>What runner are you using ?
>>
>>Regards
>>JB
>>
>>On 05/21/2016 04:22 PM, Stadin, Benjamin wrote:
>>> Hi,
>>>
>>> I need to control beam pipes/filters so that pipe executions that match
>>> a certain criteria are executed on the same node.
>>>
>>> In Spring XD this can be controlled by defining groups
>>> 
>>>(http://docs.spring.io/spring-xd/docs/1.2.0.RELEASE/reference/html/#deployment)
>>> and then specify deployment criteria to match this group.
>>>
>>> Is this possible with Beam?
>>>
>>> Best
>>> Ben
>>
>>-- 
>>Jean-Baptiste Onofré
>>jbonofre@apache.org
>>http://blog.nanthrax.net
>>Talend - http://www.talend.com
>


Re: Force pipe executions to run on same node

Posted by "Stadin, Benjamin" <Be...@heidelberg-mobil.com>.
Hi JB,

None so far. I'm still thinking about how to achieve what I want to do,
and whether Beam makes sense for my usage scenario.

I'm mostly interested in just orchestrating tasks to individual machines
and service endpoints, depending on their workload. My application is not
so much about Big Data and parallelism, but about local data processing
and local parallelization.

An example scenario:
- A user uploads a set of CAD files
- data from CAD files are extracted in parallel
- a whole bunch of native tools operate on this extracted data set in a
pipe of their own. Due to the amount of data generated and consumed, it
doesn't make sense at all to distribute these tasks to other machines.
It's very IO-bound.
- For the same reason, it doesn't make sense to distribute data using
RDDs. It's rather favorable to do only some tasks (such as CAD data
extraction) in parallel, and otherwise run other data tasks as a group on
a single node, in order to avoid IO bottlenecks.
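The grouping described above can be sketched in plain Python (not Beam
itself; `extract_cad` and `process_locally` are hypothetical stand-ins for
the native tools): extraction fans out in parallel, while the IO-heavy
steps stay together on one node, and only file paths flow between stages.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_cad(path):
    # Hypothetical stand-in: in practice this would invoke the native
    # extraction tool as a subprocess, so threads are sufficient here.
    return path + ".extracted"

def process_locally(extracted_paths):
    # IO-bound steps run as one group on this node, so the large
    # intermediate files never cross the network.
    return [p + ".result" for p in extracted_paths]

def run_pipeline(uploaded_files):
    # Stage 1: only the CAD extraction is worth fanning out.
    with ThreadPoolExecutor() as pool:
        extracted = list(pool.map(extract_cad, uploaded_files))
    # Stage 2: keep the IO-heavy work on the same machine.
    return process_locally(extracted)

print(run_pipeline(["model_a.dwg", "model_b.dwg"]))
# → ['model_a.dwg.extracted.result', 'model_b.dwg.extracted.result']
```

Note that the pipeline passes file names, not file contents, which matches
the suggestion to minimize data transfer through the pipeline itself.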

So I don't have typical Big Data processing in mind. What I'm looking
for is rather an integrated environment that provides some kind of
parallel task execution, task management and administration, as well
as a message bus and event system.

Is Beam a good choice for such a non-Big-Data scenario?

Regards,
Ben


On 21.05.16, 18:59, "Jean-Baptiste Onofré" <jb...@nanthrax.net> wrote:

>Hi Ben,
>
>it's not SDK related; it depends more on the runner.
>
>What runner are you using ?
>
>Regards
>JB
>
>On 05/21/2016 04:22 PM, Stadin, Benjamin wrote:
>> Hi,
>>
>> I need to control beam pipes/filters so that pipe executions that match
>> a certain criteria are executed on the same node.
>>
>> In Spring XD this can be controlled by defining groups
>> 
>>(http://docs.spring.io/spring-xd/docs/1.2.0.RELEASE/reference/html/#deployment)
>> and then specify deployment criteria to match this group.
>>
>> Is this possible with Beam?
>>
>> Best
>> Ben
>
>-- 
>Jean-Baptiste Onofré
>jbonofre@apache.org
>http://blog.nanthrax.net
>Talend - http://www.talend.com


Re: Force pipe executions to run on same node

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Ben,

it's not SDK related; it depends more on the runner.

What runner are you using ?

Regards
JB

On 05/21/2016 04:22 PM, Stadin, Benjamin wrote:
> Hi,
>
> I need to control beam pipes/filters so that pipe executions that match
> a certain criteria are executed on the same node.
>
> In Spring XD this can be controlled by defining groups
> (http://docs.spring.io/spring-xd/docs/1.2.0.RELEASE/reference/html/#deployment)
> and then specify deployment criteria to match this group.
>
> Is this possible with Beam?
>
> Best
> Ben

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com