Posted to user@beam.apache.org by au...@gmail.com on 2019/03/27 12:05:39 UTC

Imposing parallelism

Hi all,

The input of my pipeline is a file (~360k lines) of S3 paths, which I read with TextIO.read(). I then use these S3 paths to retrieve the files in my next transform. When I ran it on a Spark cluster on EMR, even though the machine I set up had 16 cores, the "read from S3" task only got split into 8 parts. Since this operation is very I/O-intensive, I believe it would benefit from a lot more parallelism. Is there a way to define how parallel a certain operation should be (I believe this option exists in Spark at least), or is this the wrong way to go about it?
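
For reference, the relevant part of the pipeline looks roughly like this (a simplified sketch; the bucket path and the fetchFromS3 helper are placeholders for my actual code):

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.ParDo;

  Pipeline p = Pipeline.create(options);

  p.apply("ReadPathList", TextIO.read().from("s3://my-bucket/paths.txt"))
   // Each of the ~360k lines is an S3 path; fetch the file it points to.
   .apply("FetchFiles", ParDo.of(new DoFn<String, byte[]>() {
     @ProcessElement
     public void process(@Element String path, OutputReceiver<byte[]> out) {
       out.output(fetchFromS3(path)); // placeholder for my actual S3 client call
     }
   }));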

Best regards,
Augusto 

Re: Imposing parallelism

Posted by au...@gmail.com.
Hi Max,

Thanks again for your answer. Can anyone point me to an example or some documentation on how to develop a custom reader?

Is this (https://beam.apache.org/documentation/io/developing-io-java/) the best reference to look at? 
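
In the meantime, a possible alternative to a fully custom source that I am looking at is Beam's built-in FileIO, which can expand a PCollection of file patterns and then read the matched files in parallel. A rough sketch, assuming every line of my input file is a valid S3 path or glob:

  import org.apache.beam.sdk.io.FileIO;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.values.PCollection;

  PCollection<String> paths =
      p.apply("ReadPathList", TextIO.read().from("s3://my-bucket/paths.txt"));

  PCollection<String> contents =
      paths
          .apply("MatchFiles", FileIO.matchAll())    // expand each path/glob into file metadata
          .apply("OpenFiles", FileIO.readMatches())  // turn matches into readable file handles
          .apply("ReadLines", TextIO.readFiles());   // read the matched files in parallel

If the files are not line-oriented text, I assume the last step would instead be a DoFn over the ReadableFile elements.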

Best regards,
Augusto 

On 2019/03/29 14:23:42, Maximilian Michels <mx...@apache.org> wrote: 
> Hi Augusto,
> 
> In Beam there is no way to specify how parallel a specific transform
> should be. There is only a general setting for how parallel a pipeline
> should be, e.g. Dataflow has "numWorkers" and Spark/Flink have "parallelism".
> 
> You should see 16 parallel operations for your Read if you configure a 
> parallelism of 16.
> 
> The easiest way to influence the parallelism would be to write a custom
> Read operation.
> 
> Cheers,
> Max
> 
> On 27.03.19 13:05, augusto.mcc@gmail.com wrote:
> > Hi all,
> > 
> > The input of my pipeline is a file (~360k lines) of S3 paths, which I read with TextIO.read(). I then use these S3 paths to retrieve the files in my next transform. When I ran it on a Spark cluster on EMR, even though the machine I set up had 16 cores, the "read from S3" task only got split into 8 parts. Since this operation is very I/O-intensive, I believe it would benefit from a lot more parallelism. Is there a way to define how parallel a certain operation should be (I believe this option exists in Spark at least), or is this the wrong way to go about it?
> > 
> > Best regards,
> > Augusto
> > 
> 

Re: Imposing parallelism

Posted by Maximilian Michels <mx...@apache.org>.
Hi Augusto,

In Beam there is no way to specify how parallel a specific transform
should be. There is only a general setting for how parallel a pipeline
should be, e.g. Dataflow has "numWorkers" and Spark/Flink have "parallelism".

You should see 16 parallel operations for your Read if you configure a 
parallelism of 16.
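
For example, with the Flink runner that is a single pipeline option. A minimal sketch (for the Spark runner the equivalent knob is Spark's own spark.default.parallelism, passed to spark-submit):

  import org.apache.beam.runners.flink.FlinkPipelineOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  FlinkPipelineOptions options =
      PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
  options.setParallelism(16); // applies to the whole pipeline, not one transform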

The easiest way to influence the parallelism would be to write a custom
Read operation.

Cheers,
Max

On 27.03.19 13:05, augusto.mcc@gmail.com wrote:
> Hi all,
> 
> The input of my pipeline is a file (~360k lines) of S3 paths, which I read with TextIO.read(). I then use these S3 paths to retrieve the files in my next transform. When I ran it on a Spark cluster on EMR, even though the machine I set up had 16 cores, the "read from S3" task only got split into 8 parts. Since this operation is very I/O-intensive, I believe it would benefit from a lot more parallelism. Is there a way to define how parallel a certain operation should be (I believe this option exists in Spark at least), or is this the wrong way to go about it?
> 
> Best regards,
> Augusto
>