
Posted to user@beam.apache.org by vivek chaurasiya <vi...@gmail.com> on 2020/02/14 06:24:30 UTC

GCS numShards doubt

Hi folks, I have this in my code:

    globalIndexJson.apply("GCSOutput",
        TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));

The same code is executed for 50 GB, 3 TB, and 5 TB of data. Would changing
numShards for the larger data sizes make the write to GCS faster?
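
To put the numbers in the question in perspective, here is a small arithmetic sketch of the average file size that a fixed shard count of 500 implies at each data size. It assumes, purely for illustration, that the output volume equals the stated data size, that shards are evenly filled, and that units are decimal; the class and method names are hypothetical and not part of Beam.

```java
public class ShardSizeSketch {
    // Average bytes per shard, assuming output is spread evenly across shards.
    static long bytesPerShard(long totalBytes, int numShards) {
        return totalBytes / numShards;
    }

    public static void main(String[] args) {
        long gb = 1_000_000_000L;   // decimal gigabyte (assumption)
        long tb = 1_000L * gb;      // decimal terabyte (assumption)
        long[] totals = {50 * gb, 3 * tb, 5 * tb};
        int numShards = 500;        // the value passed to withNumShards in the question

        for (long total : totals) {
            double gbPerShard = (double) bytesPerShard(total, numShards) / gb;
            System.out.println((total / gb) + " GB total -> " + gbPerShard + " GB per shard");
        }
    }
}
```

Under those assumptions the same setting yields files of roughly 0.1 GB, 6 GB, and 10 GB, so one fixed shard count behaves quite differently across the three runs.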

Re: GCS numShards doubt

Posted by Kenneth Knowles <ke...@apache.org>.

For bounded data, when num shards is left unset and the runner does not
override it, each bundle becomes a file:
https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L356
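
As a rough illustration of "each bundle becomes a file" (this is not the linked WriteFiles code, only a toy model of it), the stdlib-only Java sketch below treats a bundle as a list of lines and writes one file per bundle. The class name, the writeBundles helper, and the part-N.txt file names are hypothetical; in a real pipeline the runner decides where bundle boundaries fall.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BundleFilesSketch {
    // Hypothetical helper: writes every bundle to its own file and returns the paths.
    static List<Path> writeBundles(Path dir, List<List<String>> bundles) throws IOException {
        List<Path> written = new ArrayList<>();
        for (int i = 0; i < bundles.size(); i++) {
            Path file = dir.resolve("part-" + i + ".txt"); // illustrative naming only
            Files.write(file, bundles.get(i));             // one line per element
            written.add(file);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("bundles");
        // Three bundles of different sizes, as a runner might produce them.
        List<List<String>> bundles = List.of(
            List.of("a", "b"),
            List.of("c"),
            List.of("d", "e", "f"));
        List<Path> files = writeBundles(dir, bundles);
        System.out.println(bundles.size() + " bundles -> " + files.size() + " files");
    }
}
```

In this model the number of output files follows from how the input was split into bundles, not from any setting on the write itself.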

Kenn

On Mon, Mar 2, 2020 at 6:18 PM Kyle Weaver <kc...@google.com> wrote:

> 2. How is num shards determined if it's set to 0 and not overridden by the
> runner?

Re: GCS numShards doubt

Posted by Kyle Weaver <kc...@google.com>.

As Luke and Robert indicated, unsetting num shards _may_ cause the runner
to optimize it automatically.

For example, the Flink [1] and Dataflow [2] runners override num shards.

However, in the Spark runner, I don't see any such override. So I have two
questions:
1. Does the Spark runner override num shards somehow?
2. How is num shards determined if it's set to 0 and not overridden by the
runner?

[1]
https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPipelineTranslator.java#L240-L243
[2]
https://github.com/apache/beam/blob/a149b6b040e9573e53cd41b6bd69b7e7603ac2a2/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1853-L1866

On Fri, Feb 14, 2020 at 10:09 AM Robert Bradshaw <ro...@google.com>
wrote:

> To let Dataflow choose the optimal number of shards and maximize
> performance, it's often significantly better to simply leave it
> unspecified. A higher numShards only helps if you have at least that
> many workers.

Re: GCS numShards doubt

Posted by Robert Bradshaw <ro...@google.com>.

To let Dataflow choose the optimal number of shards and maximize
performance, it's often significantly better to simply leave it
unspecified. A higher numShards only helps if you have at least that
many workers.

Re: GCS numShards doubt

Posted by Luke Cwik <lc...@google.com>.

Prefer never to specify num shards, since leaving it unset gives the runner
the greatest flexibility in how it executes and is the most performant as well.

Increasing num shards lets more workers do the work in parallel, but
there is no guarantee that it will be significantly faster, since you could
have only 5 workers, for example.
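
A minimal sketch of that limit, under a deliberately simplified model in which each worker writes one shard at a time (real workers may run several threads, so this is an assumption, not how any runner actually schedules work). The class and method names are hypothetical.

```java
public class WriteParallelismSketch {
    // Simplified model: concurrent writers are capped by both shard count and worker count.
    static int effectiveParallelism(int numShards, int numWorkers) {
        return Math.min(numShards, numWorkers);
    }

    public static void main(String[] args) {
        int numWorkers = 5; // the worker count used as an example above
        for (int numShards : new int[] {3, 50, 500, 5000}) {
            System.out.println(numShards + " shards, " + numWorkers + " workers -> "
                + effectiveParallelism(numShards, numWorkers) + " concurrent writers");
        }
    }
}
```

In this model, raising num shards from 50 to 5000 changes nothing once the 5 workers are all busy; only a shard count below the worker count limits the write.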
