Posted to user@beam.apache.org by David Olsen <da...@gmail.com> on 2016/05/24 14:38:17 UTC

Parallelism

A naive question about DirectPipelineRunner: is it possible to
execute DirectPipelineRunner with multiple threads or instances (across
machines), or is parallelism only supported by runners such as
SparkPipelineRunner?

My requirement is to run a pipeline in parallel, using either threads or
multiple machines. I have just started investigating Apache Beam.

When reading the Google Dataflow docs [1], I saw that the options settings
mention that numWorkers can be configured for the instances to use (I
understand it's still different from Apache Beam). However, searching the
Apache Beam source on GitHub for the keyword 'numWorkers' doesn't turn up any
related source snippet. So I am wondering whether the only way to execute a
pipeline in parallel is to use SparkPipelineRunner/FlinkPipelineRunner
(meaning I have to use Apache Beam + Spark/Flink) or to make use of Google
Cloud Platform?

Thanks

[1].
https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options

Re: Parallelism

Posted by David Olsen <da...@gmail.com>.
I can do that. Thanks for the suggestion and the tracking. I appreciate it!

On 26 May 2016 at 02:36, Thomas Groh <tg...@google.com> wrote:

> Each source (e.g. TextIO.Read.from("/foo/bar")) will currently be invoked
> by only a single thread at a time; multiple sources (e.g.
> TextIO.Read.from("/foo/bar"); ... TextIO.Read.from("/foo/baz"); ...) will
> be read independently and in different threads. If you have a known
> distribution of sources, splitting the read into multiple sources (and
> flattening the produced PCollections together) is a workaround to produce
> additional parallelism in the InProcessPipelineRunner. Additionally,
> downstream transforms prior to a GroupByKey will also be executed by a
> single thread at a time, but independently, so if multiple transforms are
> applied to the same PCollection, they will be executed in parallel.
>
> I've added https://issues.apache.org/jira/browse/BEAM-310 to track the
> feature to split input sources.

Re: Parallelism

Posted by Thomas Groh <tg...@google.com>.
Each source (e.g. TextIO.Read.from("/foo/bar")) will currently be invoked
by only a single thread at a time; multiple sources (e.g.
TextIO.Read.from("/foo/bar"); ... TextIO.Read.from("/foo/baz"); ...) will
be read independently and in different threads. If you have a known
distribution of sources, splitting the read into multiple sources (and
flattening the produced PCollections together) is a workaround to produce
additional parallelism in the InProcessPipelineRunner. Additionally,
downstream transforms prior to a GroupByKey will also be executed by a
single thread at a time, but independently, so if multiple transforms are
applied to the same PCollection, they will be executed in parallel.
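
As a rough illustration of the workaround (one reader thread per source,
results flattened into a single collection), here is a plain-Java sketch.
It does not use the Beam API: SplitSourceRead and readAndFlatten are
hypothetical names, and in-memory lists stand in for files such as
"/foo/bar" and "/foo/baz".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model of the workaround, not Beam code: each source is read
// by exactly one task, so splitting the input into more sources gives more
// tasks that can run at the same time.
class SplitSourceRead {
    static List<String> readAndFlatten(List<List<String>> sources) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sources.size()));
        try {
            List<Future<List<String>>> reads = new ArrayList<>();
            for (List<String> source : sources) {
                // Stand-in for a sequential read of one source.
                Callable<List<String>> read = () -> new ArrayList<>(source);
                reads.add(pool.submit(read));
            }
            // "Flatten": merge the per-source results into one collection.
            // Results are concatenated in source order here for determinism;
            // a real flatten need not preserve any order.
            List<String> flattened = new ArrayList<>();
            for (Future<List<String>> read : reads) {
                flattened.addAll(read.get());
            }
            return flattened;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> sources = Arrays.asList(
            Arrays.asList("bar-1", "bar-2"),   // stand-in for /foo/bar
            Arrays.asList("baz-1"));           // stand-in for /foo/baz
        System.out.println(readAndFlatten(sources));
    }
}
```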

I've added https://issues.apache.org/jira/browse/BEAM-310 to track the
feature to split input sources.

On Wed, May 25, 2016 at 8:41 AM, David Olsen <da...@gmail.com>
wrote:

> At the moment I need to read split data locally, perform external
> calls, and then aggregate the results by a particular key (e.g. per
> user) if needed.
>
> It seems to me that the current InProcessPipelineRunner fulfills my
> requirement, but I am not sure whether 'Read transforms with one thread'
> means that reading the underlying data (in my case, reading split data
> locally) will use a single thread to go through all the split data. If so,
> is there any way to read the splits with more than one thread, or any
> workaround? (Eventually I will move to a cross-machine pipeline runner; for
> now I don't have enough resources, so I have to start on a single machine.)
>
> Thanks for all the input. It's very useful!

Re: Parallelism

Posted by David Olsen <da...@gmail.com>.
At the moment I need to read split data locally, perform external
calls, and then aggregate the results by a particular key (e.g. per
user) if needed.

It seems to me that the current InProcessPipelineRunner fulfills my
requirement, but I am not sure whether 'Read transforms with one thread'
means that reading the underlying data (in my case, reading split data
locally) will use a single thread to go through all the split data. If so,
is there any way to read the splits with more than one thread, or any
workaround? (Eventually I will move to a cross-machine pipeline runner; for
now I don't have enough resources, so I have to start on a single machine.)

Thanks for all the input. It's very useful!

On 25 May 2016 at 02:41, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:

> I second Thomas: thanks for the detailed explanation (I forgot to mention
> the "unique" JVM ;)).
>
> Regards
> JB

Re: Parallelism

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
I second Thomas: thanks for the detailed explanation (I forgot to
mention the "unique" JVM ;)).

Regards
JB

On 05/24/2016 07:28 PM, Thomas Groh wrote:
> More specifically, the InProcessPipelineRunner (soon to be renamed to
> the DirectRunner) will run on a single machine, with a number of threads
> based on the number of available processors in the JVM, fanning out work
> to these threads as appropriate; it will not perform any cross-process
> (including cross-machine) communication. No configuration is required to
> get this threading behavior, but the number of threads is also not
> currently configurable.
>
> Can you say more about what you require to be parallel? In the current
> implementation, Read transforms (and the Source that underlies them) are
> exercised by only one thread, as are PTransforms downstream of
> them prior to a GroupByKey, based on how work is scheduled. However, all
> transforms after a GroupByKey execute in parallel based on the number of
> available keys.

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Parallelism

Posted by Thomas Groh <tg...@google.com>.
More specifically, the InProcessPipelineRunner (soon to be renamed to the
DirectRunner) will run on a single machine, with a number of threads based
on the number of available processors in the JVM, fanning out work to these
threads as appropriate; it will not perform any cross-process (including
cross-machine) communication. No configuration is required to get this
threading behavior, but the number of threads is also not currently
configurable.
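
To make the threading model concrete, here is a minimal plain-Java sketch
of a worker pool sized from the available processors, with work fanned out
to it. This is an assumption-laden illustration, not the runner's actual
code: ProcessorSizedPool and squareAll are hypothetical names, and the
thread does not say how the runner really builds its pool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration: a single-JVM pool whose size follows the
// number of available processors, with no cross-process communication.
class ProcessorSizedPool {
    static List<Integer> squareAll(List<Integer> inputs) throws Exception {
        // Assumption: pool size equals the processor count reported by the JVM.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Fan out one task per input element.
            List<Future<Integer>> futures = new ArrayList<>();
            for (Integer n : inputs) {
                Callable<Integer> task = () -> n * n;
                futures.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(squareAll(Arrays.asList(1, 2, 3, 4)));
    }
}
```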

Can you say more about what you require to be parallel? In the current
implementation, Read transforms (and the Source that underlies them) are
exercised by only one thread, as are PTransforms downstream of
them prior to a GroupByKey, based on how work is scheduled. However, all
transforms after a GroupByKey execute in parallel based on the number of
available keys.
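
The per-key behavior after a GroupByKey can be pictured with the following
plain-Java sketch: records are grouped by key first, and each key's group
then becomes an independent task, so the usable parallelism is bounded by
the number of distinct keys. PerKeyParallelism and sumPerKey are
hypothetical names and this is not Beam code; summing is only an example
of a per-key aggregation (e.g. per user).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model of "group by key, then process each key in parallel".
class PerKeyParallelism {
    static Map<String, Integer> sumPerKey(String[] keys, int[] values) throws Exception {
        // Grouping step: collect all values that share a key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            grouped.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            // One independent task per key.
            Map<String, Future<Integer>> pending = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
                List<Integer> groupValues = group.getValue();
                Callable<Integer> sum =
                    () -> groupValues.stream().mapToInt(Integer::intValue).sum();
                pending.put(group.getKey(), pool.submit(sum));
            }
            // Wait for every key's result.
            Map<String, Integer> totals = new TreeMap<>();
            for (Map.Entry<String, Future<Integer>> entry : pending.entrySet()) {
                totals.put(entry.getKey(), entry.getValue().get());
            }
            return totals;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            sumPerKey(new String[] {"alice", "bob", "alice"}, new int[] {1, 2, 3}));
    }
}
```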

On Tue, May 24, 2016 at 7:43 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi David,
>
> if you use the InProcessPipelineRunner (the "new" DirectPipelineRunner),
> then it can create several threads.
>
> Regards
> JB

Re: Parallelism

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi David,

if you use the InProcessPipelineRunner (the "new" DirectPipelineRunner),
then it can create several threads.

Regards
JB

On 05/24/2016 04:38 PM, David Olsen wrote:
> A naive question about DirectPipelineRunner: is it possible to
> execute DirectPipelineRunner with multiple threads or instances (across
> machines), or is parallelism only supported by runners such as
> SparkPipelineRunner?
>
> My requirement is to run a pipeline in parallel, using either threads or
> multiple machines. I have just started investigating Apache Beam.
>
> When reading the Google Dataflow docs [1], I saw that the options settings
> mention that numWorkers can be configured for the instances to use (I
> understand it's still different from Apache Beam). However, searching the
> Apache Beam source on GitHub for the keyword 'numWorkers' doesn't turn up
> any related source snippet. So I am wondering whether the only way to
> execute a pipeline in parallel is to use
> SparkPipelineRunner/FlinkPipelineRunner (meaning I have to use Apache
> Beam + Spark/Flink) or to make use of Google Cloud Platform?
>
> Thanks
>
> [1].
> https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com