Posted to dev@beam.apache.org by "devinduan (段丁瑞)" <de...@tencent.com> on 2018/09/18 06:39:19 UTC
How to optimize the performance of Beam on Spark
Hi,
I'm testing Beam on Spark.
I use the Spark example WordCount to process a 1G data file; it takes 1 minute.
However, the Beam example WordCount on the same file takes 30 minutes.
My Spark parameters are: --deploy-mode client --executor-memory 1g --num-executors 1 --driver-memory 1g
My Spark version is 2.3.1 and my Beam version is 2.5.
Is there any optimization method?
Thank you.
Re: Re: How to optimize the performance of Beam on Spark(Internet mail)
Posted by Tim Robertson <ti...@gmail.com>.
Thanks for sharing those results.
The second set of results (with 20-30 executors) looks similar to what I would
have expected.
BEAM-5036 definitely plays a part here as the data is not moved on HDFS
efficiently (fix in PR awaiting review now [1]).
To give an idea of the impact, here are some numbers from my own tests.
Without knowing your code, I presume mine is similar to your filter (take
data, modify it, write data with no shuffle/group/join)
My environment: 10 node YARN CDH 5.12.2 cluster, rewriting a 1.5TB AvroIO
file (code here [2]) I observed:
- Using Spark API: 35 minutes
- Beam AvroIO (2.6.0): 1.7hrs
- Beam AvroIO with the 5036 fix: 42 minutes
Related: I also anticipate that varying the spark.default.parallelism will
affect Beam runtime.
Thanks,
Tim
[1] https://github.com/apache/beam/pull/6289
[2] https://github.com/gbif/beam-perf/tree/master/avro-to-avro
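Tim's point about spark.default.parallelism can be tried without any code changes, by passing it as a --conf flag to spark-submit. A sketch (the class and jar names are placeholders, and 32 is only an illustrative value):

```shell
# Illustrative only: raise the default RDD parallelism for a Beam-on-Spark job.
# A common rule of thumb is 2-4 partitions per available core.
./bin/spark-submit \
  --master yarn --deploy-mode client \
  --conf spark.default.parallelism=32 \
  --class com.test.BeamTest \
  beam-job.jar
```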
On Fri, Sep 28, 2018 at 9:27 AM Robert Bradshaw <ro...@google.com> wrote:
> Something here on the Beam side is clearly linear in the input size, as if
> there's a bottleneck where we're not able to get any parallelization. Is
> the Spark variant running in parallel?
>
> On Fri, Sep 28, 2018 at 4:57 AM devinduan(段丁瑞) <de...@tencent.com>
> wrote:
>
>> Hi
>> I have completed my test.
>> 1. Spark parameter :
>> deploy-mode client
>> executor-memory 1g
>> num-executors 1
>> driver-memory 1g
>>
>> WordCount:
>>
>>          300MB    600MB    1.2G
>> Spark    1min8s   1min11s  1min18s
>> Beam     6.4min   11min    22min
>>
>> Filter:
>>
>>          300MB    600MB    1.2G
>> Spark    1.2min   1.7min   2.8min
>> Beam     2.7min   4.1min   5.7min
>>
>> GroupByKey + sum:
>>
>>          300MB                 600MB  1.2G
>> Spark    3.6min                -      -
>> Beam     Failed, executor OOM  -      -
>>
>> Union:
>>
>>          300MB    600MB    1.2G
>> Spark    1.7min   2.6min   5.1min
>> Beam     3.6min   6.2min   11min
>>
>>
>>
>> 2. Spark parameter :
>>
>> deploy-mode client
>>
>> executor-memory 1g
>>
>> driver-memory 1g
>>
>> spark.dynamicAllocation.enabled true
>>
>
Re: Re: How to optimize the performance of Beam on Spark(Internet mail)
Posted by Robert Bradshaw <ro...@google.com>.
Something here on the Beam side is clearly linear in the input size, as if
there's a bottleneck where we're not able to get any parallelization. Is
the Spark variant running in parallel?
On Fri, Sep 28, 2018 at 4:57 AM devinduan(段丁瑞) <de...@tencent.com>
wrote:
> Hi
> I have completed my test.
> 1. Spark parameter :
> deploy-mode client
> executor-memory 1g
> num-executors 1
> driver-memory 1g
>
> WordCount:
>
>          300MB    600MB    1.2G
> Spark    1min8s   1min11s  1min18s
> Beam     6.4min   11min    22min
>
> Filter:
>
>          300MB    600MB    1.2G
> Spark    1.2min   1.7min   2.8min
> Beam     2.7min   4.1min   5.7min
>
> GroupByKey + sum:
>
>          300MB                 600MB  1.2G
> Spark    3.6min                -      -
> Beam     Failed, executor OOM  -      -
>
> Union:
>
>          300MB    600MB    1.2G
> Spark    1.7min   2.6min   5.1min
> Beam     3.6min   6.2min   11min
>
>
>
> 2. Spark parameter :
>
> deploy-mode client
>
> executor-memory 1g
>
> driver-memory 1g
>
> spark.dynamicAllocation.enabled true
>
Re: Re: How to optimize the performance of Beam on Spark(Internet
mail)
Posted by "devinduan (段丁瑞)" <de...@tencent.com>.
Hi
I have completed my test.
1. Spark parameter :
deploy-mode client
executor-memory 1g
num-executors 1
driver-memory 1g
WordCount:

         300MB   600MB    1.2G
Spark    1min8s  1min11s  1min18s
Beam     6.4min  11min    22min

Filter:

         300MB   600MB   1.2G
Spark    1.2min  1.7min  2.8min
Beam     2.7min  4.1min  5.7min

GroupByKey + sum:

         300MB                 600MB  1.2G
Spark    3.6min                -      -
Beam     Failed, executor OOM  -      -

Union:

         300MB   600MB   1.2G
Spark    1.7min  2.6min  5.1min
Beam     3.6min  6.2min  11min
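A quick back-of-the-envelope check of the single-executor WordCount numbers (times transcribed from the results above) shows the behaviour Robert describes: Spark's runtime barely moves as the input grows, while Beam's roughly doubles with each doubling of input, as if one worker is processing everything.

```python
# WordCount runtimes from the single-executor test, in seconds.
spark_s = [68, 71, 78]      # 1min8s, 1min11s, 1min18s for 300MB, 600MB, 1.2G
beam_s = [384, 660, 1320]   # 6.4min, 11min, 22min for the same inputs

# How much Beam's runtime grows per input doubling (close to 2x = linear
# in input size), and the Beam/Spark slowdown at each input size.
beam_growth = [beam_s[i + 1] / beam_s[i] for i in range(len(beam_s) - 1)]
slowdown = [b / s for b, s in zip(beam_s, spark_s)]

print(beam_growth)  # each step is close to 2x
print(slowdown)     # the gap widens from about 5.6x to about 17x
```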
2. Spark parameter :
deploy-mode client
executor-memory 1g
driver-memory 1g
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 30
WordCount:

         300MB                  600MB                1.2G
Spark    53s (9 executors)      53s (9 executors)    53s (9 executors)
Beam     1.6min (29 executors)  2min (30 executors)  2.5min (30 executors)

Filter:

         300MB                  600MB                  1.2G
Spark    1.2min (22 executors)  1.3min (23 executors)  1.6min (29 executors)
Beam     1.8min (22 executors)  2.7min (30 executors)  2.5min (30 executors)

GroupByKey + sum:

         300MB                 600MB  1.2G
Spark    (not reported)        -      -
Beam     Failed, executor OOM  -      -

Union:

         300MB                  600MB                  1.2G
Spark    2.3min (30 executors)  2.3min (30 executors)  2.3min (30 executors)
Beam     2.7min (29 executors)  3.4min (30 executors)  4.7min (30 executors)
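For reference, the second test's settings can be passed directly as spark-submit flags; a sketch (the class and jar names are placeholders). Note that dynamic allocation on YARN also requires the external shuffle service, which is assumed enabled here:

```shell
# Sketch of the second test configuration as explicit spark-submit flags.
# com.test.BeamTest / beam-job.jar are placeholders.
./bin/spark-submit \
  --master yarn --deploy-mode client \
  --executor-memory 1g --driver-memory 1g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=30 \
  --conf spark.shuffle.service.enabled=true \
  --class com.test.BeamTest \
  beam-job.jar
```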
From: devinduan(段丁瑞)<ma...@tencent.com>
Date: 2018-09-19 17:17
To: Tim Robertson<ma...@gmail.com>
CC: dev<ma...@beam.apache.org>; jb<ma...@nanthrax.net>
Subject: Re: Re: How to optimize the performance of Beam on Spark(Internet mail)
Got it.
I will also set "spark.dynamicAllocation.enabled=true" to test.
From: Tim Robertson<ma...@gmail.com>
Date: 2018-09-19 17:04
To: dev@beam.apache.org<ma...@beam.apache.org>
CC: jb@nanthrax.net<ma...@nanthrax.net>
Subject: Re: Re: How to optimize the performance of Beam on Spark(Internet mail)
Thank you Devin
Can you also please try Beam with more spark executors if you are able?
On Wed, Sep 19, 2018 at 10:47 AM devinduan(段丁瑞) <de...@tencent.com>> wrote:
Thanks for your help!
I will test other examples of Beam on Spark in the future and then report back the results.
Regards
devin
From: Jean-Baptiste Onofré<ma...@nanthrax.net>
Date: 2018-09-19 16:32
To: devinduan(段丁瑞)<ma...@tencent.com>; dev<ma...@beam.apache.org>
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)
Thanks for the details.
I will take a look later tomorrow (I have another issue to investigate
on the Spark runner today for Beam 2.7.0 release).
Regards
JB
On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> Hi,
> I test 300MB data file.
> Use command like:
> ./spark-submit --master yarn --deploy-mode client --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
>
> I set only one executor, so tasks run in sequence. One task costs 10s.
> However, a Spark task costs only 0.4s.
>
>
>
> *From:* Jean-Baptiste Onofré <ma...@nanthrax.net>
> *Date:* 2018-09-19 12:22
> *To:* dev@beam.apache.org<ma...@beam.apache.org> <ma...@beam.apache.org>
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
>
> Hi,
>
> did you compare the stages in the Spark UI in order to identify which
> stage is taking time ?
>
> You use spark-submit in both cases for the bootstrapping ?
>
> I will do a test here as well.
>
> Regards
> JB
>
> On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > Hi,
> > Thanks for your reply.
> > Our team plans to use Beam instead of Spark, so I'm testing the
> > performance of the Beam API.
> > I'm writing some examples with both the Spark API and the Beam API,
> > like "WordCount", "Join", "OrderBy", "Union"...
> > I use the same resources and configuration to run these jobs.
> > Tim said I should remove "withNumShards(1)" and
> > set spark.default.parallelism=32. I did it and tried again, but the
> > Beam job still runs very slowly.
> > Here is My Beam code and Spark code:
> > Beam "WordCount":
> >
> > Spark "WordCount":
> >
> > I will try the other example later.
> >
> > Regards
> > devin
> >
> >
> > *From:* Jean-Baptiste Onofré <ma...@nanthrax.net>
> > *Date:* 2018-09-18 22:43
> > *To:* dev@beam.apache.org<ma...@beam.apache.org> <ma...@beam.apache.org>
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > The first huge difference is the fact that the spark runner
> still uses
> > RDD whereas directly using spark, you are using dataset. A
> bunch of
> > optimization in spark are related to dataset.
> >
> > I started a large refactoring of the spark runner to leverage
> Spark 2.x
> > (and dataset).
> > It's not yet ready as it includes other improvements (the
> portability
> > layer with Job API, a first check of state API, ...).
> >
> > Anyway, by Spark wordcount, you mean the one included in the spark
> > distribution ?
> >
> > Regards
> > JB
> >
> > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > Hi,
> > > I'm testing Beam on Spark.
> > > I use the Spark example WordCount to process a 1G data file; it
> > > takes 1 minute.
> > > However, the Beam example WordCount on the same file takes 30 minutes.
> > > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > > --num-executors 1 --driver-memory 1g
> > > My Spark version is 2.3.1, Beam version is 2.5
> > > Is there any optimization method?
> > > Thank you.
> > >
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org<ma...@apache.org>
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org<ma...@apache.org>
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
--
Jean-Baptiste Onofré
jbonofre@apache.org<ma...@apache.org>
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: Re: How to optimize the performance of Beam on Spark(Internet
mail)
Posted by "devinduan (段丁瑞)" <de...@tencent.com>.
Got it.
I will also set "spark.dynamicAllocation.enabled=true" to test.
From: Tim Robertson<ma...@gmail.com>
Date: 2018-09-19 17:04
To: dev@beam.apache.org<ma...@beam.apache.org>
CC: jb@nanthrax.net<ma...@nanthrax.net>
Subject: Re: Re: How to optimize the performance of Beam on Spark(Internet mail)
Thank you Devin
Can you also please try Beam with more spark executors if you are able?
On Wed, Sep 19, 2018 at 10:47 AM devinduan(段丁瑞) <de...@tencent.com>> wrote:
Thanks for your help!
I will test other examples of Beam on Spark in the future and then report back the results.
Regards
devin
From: Jean-Baptiste Onofré<ma...@nanthrax.net>
Date: 2018-09-19 16:32
To: devinduan(段丁瑞)<ma...@tencent.com>; dev<ma...@beam.apache.org>
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)
Thanks for the details.
I will take a look later tomorrow (I have another issue to investigate
on the Spark runner today for Beam 2.7.0 release).
Regards
JB
On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> Hi,
> I test 300MB data file.
> Use command like:
> ./spark-submit --master yarn --deploy-mode client --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
>
> I set only one executor, so tasks run in sequence. One task costs 10s.
> However, a Spark task costs only 0.4s.
>
>
>
> *From:* Jean-Baptiste Onofré <ma...@nanthrax.net>
> *Date:* 2018-09-19 12:22
> *To:* dev@beam.apache.org<ma...@beam.apache.org> <ma...@beam.apache.org>
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
>
> Hi,
>
> did you compare the stages in the Spark UI in order to identify which
> stage is taking time ?
>
> You use spark-submit in both cases for the bootstrapping ?
>
> I will do a test here as well.
>
> Regards
> JB
>
> On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > Hi,
> > Thanks for your reply.
> > Our team plans to use Beam instead of Spark, so I'm testing the
> > performance of the Beam API.
> > I'm writing some examples with both the Spark API and the Beam API,
> > like "WordCount", "Join", "OrderBy", "Union"...
> > I use the same resources and configuration to run these jobs.
> > Tim said I should remove "withNumShards(1)" and
> > set spark.default.parallelism=32. I did it and tried again, but the
> > Beam job still runs very slowly.
> > Here is My Beam code and Spark code:
> > Beam "WordCount":
> >
> > Spark "WordCount":
> >
> > I will try the other example later.
> >
> > Regards
> > devin
> >
> >
> > *From:* Jean-Baptiste Onofré <ma...@nanthrax.net>
> > *Date:* 2018-09-18 22:43
> > *To:* dev@beam.apache.org<ma...@beam.apache.org> <ma...@beam.apache.org>
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > The first huge difference is the fact that the spark runner
> still uses
> > RDD whereas directly using spark, you are using dataset. A
> bunch of
> > optimization in spark are related to dataset.
> >
> > I started a large refactoring of the spark runner to leverage
> Spark 2.x
> > (and dataset).
> > It's not yet ready as it includes other improvements (the
> portability
> > layer with Job API, a first check of state API, ...).
> >
> > Anyway, by Spark wordcount, you mean the one included in the spark
> > distribution ?
> >
> > Regards
> > JB
> >
> > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > Hi,
> > > I'm testing Beam on Spark.
> > > I use the Spark example WordCount to process a 1G data file; it
> > > takes 1 minute.
> > > However, the Beam example WordCount on the same file takes 30 minutes.
> > > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > > --num-executors 1 --driver-memory 1g
> > > My Spark version is 2.3.1, Beam version is 2.5
> > > Is there any optimization method?
> > > Thank you.
> > >
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org<ma...@apache.org>
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org<ma...@apache.org>
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
--
Jean-Baptiste Onofré
jbonofre@apache.org<ma...@apache.org>
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: Re: How to optimize the performance of Beam on Spark(Internet mail)
Posted by Tim Robertson <ti...@gmail.com>.
Thank you Devin
Can you also please try Beam with more spark executors if you are able?
On Wed, Sep 19, 2018 at 10:47 AM devinduan(段丁瑞) <de...@tencent.com>
wrote:
> Thanks for your help!
> I will test other examples of Beam on Spark in the future and then report
> back the results.
> Regards
> devin
>
>
> *From:* Jean-Baptiste Onofré <jb...@nanthrax.net>
> *Date:* 2018-09-19 16:32
> *To:* devinduan(段丁瑞) <de...@tencent.com>; dev <de...@beam.apache.org>
> *Subject:* Re: How to optimize the performance of Beam on Spark(Internet
> mail)
>
> Thanks for the details.
>
> I will take a look later tomorrow (I have another issue to investigate
> on the Spark runner today for Beam 2.7.0 release).
>
> Regards
> JB
>
> On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> > Hi,
> > I test 300MB data file.
> > Use command like:
> > ./spark-submit --master yarn --deploy-mode client --class
> > com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory
> 1g
> >
> > I set only one executor, so tasks run in sequence. One task costs 10s.
> > However, a Spark task costs only 0.4s.
> >
> >
> >
> > *From:* Jean-Baptiste Onofré <mailto:jb@nanthrax.net
> <jb...@nanthrax.net>>
> > *Date:* 2018-09-19 12:22
> > *To:* dev@beam.apache.org <mailto:dev@beam.apache.org
> <de...@beam.apache.org>>
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > did you compare the stages in the Spark UI in order to identify which
> > stage is taking time ?
> >
> > You use spark-submit in both cases for the bootstrapping ?
> >
> > I will do a test here as well.
> >
> > Regards
> > JB
> >
> > On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > > Hi,
> > > Thanks for your reply.
> > > Our team plans to use Beam instead of Spark, so I'm testing the
> > > performance of the Beam API.
> > > I'm writing some examples with both the Spark API and the Beam API,
> > > like "WordCount", "Join", "OrderBy", "Union"...
> > > I use the same resources and configuration to run these jobs.
> > > Tim said I should remove "withNumShards(1)" and
> > > set spark.default.parallelism=32. I did it and tried again, but the
> > > Beam job still runs very slowly.
> > > Here is My Beam code and Spark code:
> > > Beam "WordCount":
> > >
> > > Spark "WordCount":
> > >
> > > I will try the other example later.
> > >
> > > Regards
> > > devin
> > >
> > >
> > > *From:* Jean-Baptiste Onofré <mailto:jb@nanthrax.net
> <jb...@nanthrax.net>>
> > > *Date:* 2018-09-18 22:43
> > > *To:* dev@beam.apache.org <mailto:dev@beam.apache.org
> <de...@beam.apache.org>>
> > > *Subject:* Re: How to optimize the performance of Beam on
> > > Spark(Internet mail)
> > >
> > > Hi,
> > >
> > > The first huge difference is the fact that the spark runner
> > still uses
> > > RDD whereas directly using spark, you are using dataset. A
> > bunch of
> > > optimization in spark are related to dataset.
> > >
> > > I started a large refactoring of the spark runner to leverage
> > Spark 2.x
> > > (and dataset).
> > > It's not yet ready as it includes other improvements (the
> > portability
> > > layer with Job API, a first check of state API, ...).
> > >
> > > Anyway, by Spark wordcount, you mean the one included in the
> spark
> > > distribution ?
> > >
> > > Regards
> > > JB
> > >
> > > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > > Hi,
> > > > I'm testing Beam on Spark.
> > > > I use the Spark example WordCount to process a 1G data file; it
> > > > takes 1 minute.
> > > > However, the Beam example WordCount on the same file takes 30 minutes.
> > > > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > > > --num-executors 1 --driver-memory 1g
> > > > My Spark version is 2.3.1, Beam version is 2.5
> > > > Is there any optimization method?
> > > > Thank you.
> > > >
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbonofre@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
Re: Re: How to optimize the performance of Beam on Spark(Internet
mail)
Posted by "devinduan (段丁瑞)" <de...@tencent.com>.
Thanks for your help!
I will test other examples of Beam on Spark in the future and then report back the results.
Regards
devin
From: Jean-Baptiste Onofré<ma...@nanthrax.net>
Date: 2018-09-19 16:32
To: devinduan(段丁瑞)<ma...@tencent.com>; dev<ma...@beam.apache.org>
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)
Thanks for the details.
I will take a look later tomorrow (I have another issue to investigate
on the Spark runner today for Beam 2.7.0 release).
Regards
JB
On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> Hi,
> I test 300MB data file.
> Use command like:
> ./spark-submit --master yarn --deploy-mode client --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
>
> I set only one executor, so tasks run in sequence. One task costs 10s.
> However, a Spark task costs only 0.4s.
>
>
>
> *From:* Jean-Baptiste Onofré <ma...@nanthrax.net>
> *Date:* 2018-09-19 12:22
> *To:* dev@beam.apache.org <ma...@beam.apache.org>
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
>
> Hi,
>
> did you compare the stages in the Spark UI in order to identify which
> stage is taking time ?
>
> You use spark-submit in both cases for the bootstrapping ?
>
> I will do a test here as well.
>
> Regards
> JB
>
> On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > Hi,
> > Thanks for your reply.
> > Our team plans to use Beam instead of Spark, so I'm testing the
> > performance of the Beam API.
> > I'm writing some examples with both the Spark API and the Beam API,
> > like "WordCount", "Join", "OrderBy", "Union"...
> > I use the same resources and configuration to run these jobs.
> > Tim said I should remove "withNumShards(1)" and
> > set spark.default.parallelism=32. I did it and tried again, but the
> > Beam job still runs very slowly.
> > Here is My Beam code and Spark code:
> > Beam "WordCount":
> >
> > Spark "WordCount":
> >
> > I will try the other example later.
> >
> > Regards
> > devin
> >
> >
> > *From:* Jean-Baptiste Onofré <ma...@nanthrax.net>
> > *Date:* 2018-09-18 22:43
> > *To:* dev@beam.apache.org <ma...@beam.apache.org>
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > The first huge difference is the fact that the spark runner
> still uses
> > RDD whereas directly using spark, you are using dataset. A
> bunch of
> > optimization in spark are related to dataset.
> >
> > I started a large refactoring of the spark runner to leverage
> Spark 2.x
> > (and dataset).
> > It's not yet ready as it includes other improvements (the
> portability
> > layer with Job API, a first check of state API, ...).
> >
> > Anyway, by Spark wordcount, you mean the one included in the spark
> > distribution ?
> >
> > Regards
> > JB
> >
> > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > Hi,
> > > I'm testing Beam on Spark.
> > > I use the Spark example WordCount to process a 1G data file; it
> > > takes 1 minute.
> > > However, the Beam example WordCount on the same file takes 30 minutes.
> > > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > > --num-executors 1 --driver-memory 1g
> > > My Spark version is 2.3.1, Beam version is 2.5
> > > Is there any optimization method?
> > > Thank you.
> > >
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
--
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: How to optimize the performance of Beam on Spark(Internet mail)
Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Thanks for the details.
I will take a look later tomorrow (I have another issue to investigate
on the Spark runner today for Beam 2.7.0 release).
Regards
JB
On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> Hi,
> I test 300MB data file.
> Use command like:
> ./spark-submit --master yarn --deploy-mode client --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
>
> I set only one executor, so tasks run in sequence. One task costs 10s.
> However, a Spark task costs only 0.4s.
>
>
>
> *From:* Jean-Baptiste Onofré <ma...@nanthrax.net>
> *Date:* 2018-09-19 12:22
> *To:* dev@beam.apache.org <ma...@beam.apache.org>
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
>
> Hi,
>
> did you compare the stages in the Spark UI in order to identify which
> stage is taking time ?
>
> You use spark-submit in both cases for the bootstrapping ?
>
> I will do a test here as well.
>
> Regards
> JB
>
> On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > Hi,
> > Thanks for your reply.
> > Our team plans to use Beam instead of Spark, so I'm testing the
> > performance of the Beam API.
> > I'm writing some examples with both the Spark API and the Beam API,
> > like "WordCount", "Join", "OrderBy", "Union"...
> > I use the same resources and configuration to run these jobs.
> > Tim said I should remove "withNumShards(1)" and
> > set spark.default.parallelism=32. I did it and tried again, but the
> > Beam job still runs very slowly.
> > Here is My Beam code and Spark code:
> > Beam "WordCount":
> >
> > Spark "WordCount":
> >
> > I will try the other example later.
> >
> > Regards
> > devin
> >
> >
> > *From:* Jean-Baptiste Onofré <ma...@nanthrax.net>
> > *Date:* 2018-09-18 22:43
> > *To:* dev@beam.apache.org <ma...@beam.apache.org>
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > The first huge difference is the fact that the spark runner
> still uses
> > RDD whereas directly using spark, you are using dataset. A
> bunch of
> > optimization in spark are related to dataset.
> >
> > I started a large refactoring of the spark runner to leverage
> Spark 2.x
> > (and dataset).
> > It's not yet ready as it includes other improvements (the
> portability
> > layer with Job API, a first check of state API, ...).
> >
> > Anyway, by Spark wordcount, you mean the one included in the spark
> > distribution ?
> >
> > Regards
> > JB
> >
> > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > Hi,
> > > I'm testing Beam on Spark.
> > > I use the Spark example WordCount to process a 1G data file; it
> > > takes 1 minute.
> > > However, the Beam example WordCount on the same file takes 30 minutes.
> > > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > > --num-executors 1 --driver-memory 1g
> > > My Spark version is 2.3.1, Beam version is 2.5
> > > Is there any optimization method?
> > > Thank you.
> > >
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
--
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: Re: How to optimize the performance of Beam on Spark(Internet
mail)
Posted by "devinduan (段丁瑞)" <de...@tencent.com>.
Hi,
I tested a 300MB data file.
I used a command like:
./spark-submit --master yarn --deploy-mode client --class com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
[inline Spark UI screenshots; not preserved in the plain-text archive]
I set only one executor, so tasks run in sequence. One task costs 10s.
However, a Spark task costs only 0.4s.
From: Jean-Baptiste Onofré<ma...@nanthrax.net>
Date: 2018-09-19 12:22
To: dev@beam.apache.org<ma...@beam.apache.org>
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)
Hi,
did you compare the stages in the Spark UI in order to identify which
stage is taking time ?
You use spark-submit in both cases for the bootstrapping ?
I will do a test here as well.
Regards
JB
On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> Hi,
> Thanks for your reply.
> Our team plans to use Beam instead of Spark, so I'm testing the
> performance of the Beam API.
> I'm writing some examples with both the Spark API and the Beam API,
> like "WordCount", "Join", "OrderBy", "Union"...
> I use the same resources and configuration to run these jobs.
> Tim said I should remove "withNumShards(1)" and
> set spark.default.parallelism=32. I did it and tried again, but the Beam
> job still runs very slowly.
> Here is My Beam code and Spark code:
> Beam "WordCount":
>
> Spark "WordCount":
>
> I will try the other example later.
>
> Regards
> devin
>
>
> *From:* Jean-Baptiste Onofré <ma...@nanthrax.net>
> *Date:* 2018-09-18 22:43
> *To:* dev@beam.apache.org <ma...@beam.apache.org>
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
>
> Hi,
>
> The first huge difference is the fact that the spark runner still uses
> RDD whereas directly using spark, you are using dataset. A bunch of
> optimization in spark are related to dataset.
>
> I started a large refactoring of the spark runner to leverage Spark 2.x
> (and dataset).
> It's not yet ready as it includes other improvements (the portability
> layer with Job API, a first check of state API, ...).
>
> Anyway, by Spark wordcount, you mean the one included in the spark
> distribution ?
>
> Regards
> JB
>
> On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > Hi,
> > I'm testing Beam on Spark.
> > I use the Spark example WordCount to process a 1G data file; it takes
> > 1 minute.
> > However, the Beam example WordCount on the same file takes 30 minutes.
> > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > --num-executors 1 --driver-memory 1g
> > My Spark version is 2.3.1, Beam version is 2.5
> > Is there any optimization method?
> > Thank you.
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
--
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: How to optimize the performance of Beam on Spark(Internet mail)
Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi,
did you compare the stages in the Spark UI in order to identify which
stage is taking time ?
You use spark-submit in both cases for the bootstrapping ?
I will do a test here as well.
Regards
JB
On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> Hi,
> Thanks for your reply.
> Our team plans to use Beam instead of Spark, so I'm testing the
> performance of the Beam API.
> I'm writing some examples with both the Spark API and the Beam API,
> like "WordCount", "Join", "OrderBy", "Union"...
> I use the same resources and configuration to run these jobs.
> Tim said I should remove "withNumShards(1)" and
> set spark.default.parallelism=32. I did it and tried again, but the Beam
> job still runs very slowly.
> Here is My Beam code and Spark code:
> Beam "WordCount":
>
> Spark "WordCount":
>
> I will try the other example later.
>
> Regards
> devin
>
>
> *From:* Jean-Baptiste Onofré <ma...@nanthrax.net>
> *Date:* 2018-09-18 22:43
> *To:* dev@beam.apache.org <ma...@beam.apache.org>
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
>
> Hi,
>
> The first huge difference is the fact that the spark runner still uses
> RDD whereas directly using spark, you are using dataset. A bunch of
> optimization in spark are related to dataset.
>
> I started a large refactoring of the spark runner to leverage Spark 2.x
> (and dataset).
> It's not yet ready as it includes other improvements (the portability
> layer with Job API, a first check of state API, ...).
>
> Anyway, by Spark wordcount, you mean the one included in the spark
> distribution ?
>
> Regards
> JB
>
> On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > Hi,
> > I'm testing Beam on Spark.
> > I use the Spark example WordCount to process a 1G data file; it takes
> > 1 minute.
> > However, the Beam example WordCount on the same file takes 30 minutes.
> > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > --num-executors 1 --driver-memory 1g
> > My Spark version is 2.3.1, Beam version is 2.5
> > Is there any optimization method?
> > Thank you.
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
--
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: Re: How to optimize the performance of Beam on Spark(Internet
mail)
Posted by "devinduan (段丁瑞)" <de...@tencent.com>.
Hi,
Thanks for your reply.
Our team plans to use Beam instead of Spark, so I'm testing the performance of the Beam API.
I'm writing some examples with both the Spark API and the Beam API, like "WordCount", "Join", "OrderBy", "Union"...
I use the same resources and configuration to run these jobs.
Tim said I should remove "withNumShards(1)" and set spark.default.parallelism=32. I did it and tried again, but the Beam job still runs very slowly.
Here is my Beam code and Spark code:
Beam "WordCount": [inline code screenshot; not preserved in the plain-text archive]
Spark "WordCount": [inline code screenshot; not preserved in the plain-text archive]
I will try the other example later.
Regards
devin
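The code screenshots did not survive the plain-text archive. As a stand-in, the tokenize-and-count-per-key logic that both the Beam and Spark WordCount versions implement can be sketched in plain Python (a minimal illustration, not either pipeline's actual code):

```python
import re
from collections import Counter

def word_count(lines):
    """Count word occurrences across an iterable of text lines --
    the same tokenize / count-per-key step a WordCount pipeline performs
    (in Beam: a ParDo to split lines, then Count.perElement)."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[A-Za-z']+", line))
    return dict(counts)

print(word_count(["to be or not to be", "to be"]))
```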
From: Jean-Baptiste Onofré<ma...@nanthrax.net>
Date: 2018-09-18 22:43
To: dev@beam.apache.org<ma...@beam.apache.org>
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)
Hi,
The first huge difference is that the Spark runner still uses
RDDs, whereas when using Spark directly you are using Datasets. A bunch of
optimizations in Spark are related to Datasets.
I started a large refactoring of the spark runner to leverage Spark 2.x
(and dataset).
It's not yet ready as it includes other improvements (the portability
layer with Job API, a first check of state API, ...).
Anyway, by Spark wordcount, you mean the one included in the Spark
distribution?
Regards
JB
On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> Hi,
> I'm testing Beam on Spark.
> I use the Spark example code WordCount to process a 1G data file; it costs
> 1 minute.
> However, I use the Beam example code WordCount to process the same file,
> and it costs 30 minutes.
> My Spark parameter is : --deploy-mode client --executor-memory 1g
> --num-executors 1 --driver-memory 1g
> My Spark version is 2.3.1, Beam version is 2.5
> Is there any optimization method?
> Thank you.
>
>
--
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: How to optimize the performance of Beam on Spark
Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi,
The first huge difference is that the Spark runner still uses
RDDs, whereas when using Spark directly you are using Datasets. A bunch of
optimizations in Spark are related to Datasets.
I started a large refactoring of the spark runner to leverage Spark 2.x
(and dataset).
It's not yet ready as it includes other improvements (the portability
layer with Job API, a first check of state API, ...).
Anyway, by Spark wordcount, you mean the one included in the Spark
distribution?
Regards
JB
On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> Hi,
> I'm testing Beam on Spark.
> I use the Spark example code WordCount to process a 1G data file; it costs
> 1 minute.
> However, I use the Beam example code WordCount to process the same file,
> and it costs 30 minutes.
> My Spark parameter is : --deploy-mode client --executor-memory 1g
> --num-executors 1 --driver-memory 1g
> My Spark version is 2.3.1, Beam version is 2.5
> Is there any optimization method?
> Thank you.
>
>
--
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: Re: How to optimize the performance of Beam on Spark(Internet mail)
Posted by Tim Robertson <ti...@gmail.com>.
Thanks for sharing all of that.
Can you please remove the "withNumShards(1)" and verify whether the
wordcount then performs the same?
It adds an additional groupByKey, which you can read about in the JavaDoc [1],
and it also means the write runs without parallelism.
You are running with a single executor which seems unusual for a spark
cluster. I would recommend running with several, and with the
spark.default.parallelism set.
The issue that Robert linked does apply in your scenario, and I believe
having only a single output shard will make that issue worse (it copies all
data through a single "worker").
Thanks,
Tim
[1]
https://beam.apache.org/documentation/sdks/javadoc/2.3.0/org/apache/beam/sdk/io/WriteFiles.html
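A toy illustration of the effect Tim describes (a hypothetical hash partitioner, not Beam's actual shard-assignment code): forcing a single shard funnels every record through one writer, while runner-chosen sharding spreads the write across workers.

```python
def shard_for(record, num_shards):
    # stand-in partitioner for illustration only; Beam's real
    # shard assignment after the extra GroupByKey differs
    return hash(record) % num_shards

records = [f"word{i}" for i in range(1000)]

# withNumShards(1): every record lands in shard 0 -> one writer, no
# write parallelism, and all data is shuffled to that single worker
assert {shard_for(r, 1) for r in records} == {0}

# runner-chosen sharding (say 45, matching the cluster's task slots):
# the write is spread over many shards and hence many workers
assert len({shard_for(r, 45) for r in records}) > 1
```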
On Tue, Sep 18, 2018 at 11:21 AM devinduan(段丁瑞) <de...@tencent.com>
wrote:
> Hi Tim
> Thanks for your reply.
> I run beam example like :
> ./spark-submit --master yarn --deploy-mode client --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
> --jars /data/mapleleaf/tmp/beam.jar /data/mapleleaf/tmp/beam.jar
> The BeamTest code is copied from WordCount; I removed some unused code
> (metric updates).
>
> I run spark example:
> ./spark-submit --master yarn --deploy-mode client --class
> org.apache.spark.examples.JavaWordCount --executor-memory 1g
> --num-executors 1 --driver-memory 1g
> /data/mapleleaf/spark/examples/jars/spark-examples_2.11-2.3.0.jar
> hdfs://hdfsTest/test/test.txt
>
> You can see the result...
>
>
> I'm trying to figure out the reason...
> The Spark job's task number is 32, and each task's duration is only 0.4s.
> The Beam job's task number is also 32, but each task's duration is nearly 1 min.
> You can compare two jobs DAG:
> Spark:
>
> Beam :
>
> *From:* Tim Robertson <ti...@gmail.com>
> *Date:* 2018-09-18 16:55
> *To:* dev@beam.apache.org
> *Subject:* Re: How to optimize the performance of Beam on Spark(Internet
> mail)
> Hi devinduan
>
> The known issues Robert links there are actually HDFS related and not
> specific to Spark. The improvement we're seeking is that the final copy of
> the output file can be optimised by using a "move" instead of a "copy", and I
> expect to have it fixed for Beam 2.8.0. On a small dataset like this
> though, I don't think it will impact performance too much.
>
> Can you please elaborate on your deployment? It looks like you are using
> a cluster (i.e. deploy-mode client) but are you using HDFS?
>
> I have access to a Cloudera CDH 5.12 Hadoop cluster and just ran an
> example word count as follows - I'll explain the parameters to tune below:
>
> 1) I generate some random data (using common Hadoop tools)
> hadoop jar
> /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar \
> teragen \
> -Dmapred.map.tasks=100 \
> -Dmapred.map.tasks.speculative.execution=false \
> 10000000 \
> /tmp/tera
>
> This puts 100 files totalling just under 1GB on which I will run the word
> count. They are stored in the HDFS filesystem.
>
> 2) Run the word count using Spark (2.3.x) and Beam 2.5.0
>
> In my cluster I have YARN to allocate resources, and an HDFS filesystem.
> This will be different if you run Spark as standalone, or on a cloud
> environment.
>
> spark2-submit \
> --conf spark.default.parallelism=45 \
> --class org.apache.beam.runners.spark.examples.WordCount \
> --master yarn \
> --executor-memory 2G \
> --executor-cores 5 \
> --num-executors 9 \
> --jars
> beam-sdks-java-core-2.5.0.jar,beam-runners-core-construction-java-2.5.0.jar,beam-runners-core-java-2.5.0.jar,beam-sdks-java-io-hadoop-file-system-2.5.0.jar
> \
> beam-runners-spark-2.5.0.jar \
> --runner=SparkRunner \
> --inputFile=hdfs:///tmp/tera/* \
> --output=hdfs:///tmp/wordcount
>
> The jars I provide here are the minimum needed for running on HDFS with
> Spark and normally you'd build those into your project as an über jar.
>
> The important bits for tuning for performance are the following - these
> will be applicable for any Spark deployment (unless embedded):
>
> spark.default.parallelism - controls the parallelism of the beam
> pipeline. In this case, how many workers are tokenizing the input data.
> executor-memory, executor-cores, num-executors - controls the resources
> spark will use
>
> Note, that the parallelism of 45 means that the 5 cores in the 9 executors
> can all run concurrently (i.e. 5x9 = 45). When you get to very large
> datasets, you will likely have parallelism much higher.
>
> In this test I see around 20 seconds initial startup of Spark (copying
> jars, requesting resources from YARN, establishing the Spark context) but
> once up the job completes in a few seconds writing the output into 45 files
> (because of the parallelism). The files are named
> /tmp/wordcount-000*-of-00045.
>
> I hope this helps provide a few pointers, but if you elaborate on your
> environment we might be able to assist more.
>
> Best wishes,
> Tim
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Tue, Sep 18, 2018 at 9:29 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> There are known performance issues with Beam on Spark that are being
>> worked on, e.g. https://issues.apache.org/jira/browse/BEAM-5036 . It's
>> possible you're hitting something different, but would be worth
>> investigating. See also
>> https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Performance%20of%20write
>>
>> On Tue, Sep 18, 2018 at 8:39 AM devinduan(段丁瑞) <de...@tencent.com>
>> wrote:
>>
>>> Hi,
>>> I'm testing Beam on Spark.
>>> I use the Spark example code WordCount to process a 1G data file; it costs
>>> 1 minute.
>>> However, I use the Beam example code WordCount to process the same file,
>>> and it costs 30 minutes.
>>> My Spark parameter is : --deploy-mode client --executor-memory 1g
>>> --num-executors 1 --driver-memory 1g
>>> My Spark version is 2.3.1, Beam version is 2.5
>>> Is there any optimization method?
>>> Thank you.
>>>
>>>
>>>
>>
Re: Re: How to optimize the performance of Beam on Spark(Internet
mail)
Posted by "devinduan (段丁瑞)" <de...@tencent.com>.
Hi Tim
Thanks for your reply.
I run beam example like :
./spark-submit --master yarn --deploy-mode client --class com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g --jars /data/mapleleaf/tmp/beam.jar /data/mapleleaf/tmp/beam.jar
The BeamTest code is copied from WordCount; I removed some unused code (metric updates).
[inline screenshot of the BeamTest code omitted]
I run spark example:
./spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.JavaWordCount --executor-memory 1g --num-executors 1 --driver-memory 1g /data/mapleleaf/spark/examples/jars/spark-examples_2.11-2.3.0.jar hdfs://hdfsTest/test/test.txt
You can see the result...
[inline screenshot of the job results omitted]
I'm trying to figure out the reason...
The Spark job's task number is 32, and each task's duration is only 0.4s.
[inline screenshot of the Spark task durations omitted]
The Beam job's task number is also 32, but each task's duration is nearly 1 min.
[inline screenshot of the Beam task durations omitted]
You can compare two jobs DAG:
Spark:
[inline screenshot of the Spark job DAG omitted]
Beam :
[inline screenshot of the Beam job DAG omitted]
From: Tim Robertson<ma...@gmail.com>
Date: 2018-09-18 16:55
To: dev@beam.apache.org<ma...@beam.apache.org>
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)
Hi devinduan
The known issues Robert links there are actually HDFS related and not specific to Spark. The improvement we're seeking is that the final copy of the output file can be optimised by using a "move" instead of a "copy", and I expect to have it fixed for Beam 2.8.0. On a small dataset like this though, I don't think it will impact performance too much.
Can you please elaborate on your deployment? It looks like you are using a cluster (i.e. deploy-mode client) but are you using HDFS?
I have access to a Cloudera CDH 5.12 Hadoop cluster and just ran an example word count as follows - I'll explain the parameters to tune below:
1) I generate some random data (using common Hadoop tools)
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar \
teragen \
-Dmapred.map.tasks=100 \
-Dmapred.map.tasks.speculative.execution=false \
10000000 \
/tmp/tera
This puts 100 files totalling just under 1GB on which I will run the word count. They are stored in the HDFS filesystem.
2) Run the word count using Spark (2.3.x) and Beam 2.5.0
In my cluster I have YARN to allocate resources, and an HDFS filesystem. This will be different if you run Spark as standalone, or on a cloud environment.
spark2-submit \
--conf spark.default.parallelism=45 \
--class org.apache.beam.runners.spark.examples.WordCount \
--master yarn \
--executor-memory 2G \
--executor-cores 5 \
--num-executors 9 \
--jars beam-sdks-java-core-2.5.0.jar,beam-runners-core-construction-java-2.5.0.jar,beam-runners-core-java-2.5.0.jar,beam-sdks-java-io-hadoop-file-system-2.5.0.jar \
beam-runners-spark-2.5.0.jar \
--runner=SparkRunner \
--inputFile=hdfs:///tmp/tera/* \
--output=hdfs:///tmp/wordcount
The jars I provide here are the minimum needed for running on HDFS with Spark and normally you'd build those into your project as an über jar.
The important bits for tuning for performance are the following - these will be applicable for any Spark deployment (unless embedded):
spark.default.parallelism - controls the parallelism of the beam pipeline. In this case, how many workers are tokenizing the input data.
executor-memory, executor-cores, num-executors - controls the resources spark will use
Note, that the parallelism of 45 means that the 5 cores in the 9 executors can all run concurrently (i.e. 5x9 = 45). When you get to very large datasets, you will likely have parallelism much higher.
In this test I see around 20 seconds initial startup of Spark (copying jars, requesting resources from YARN, establishing the Spark context) but once up the job completes in a few seconds writing the output into 45 files (because of the parallelism). The files are named /tmp/wordcount-000*-of-00045.
I hope this helps provide a few pointers, but if you elaborate on your environment we might be able to assist more.
Best wishes,
Tim
On Tue, Sep 18, 2018 at 9:29 AM Robert Bradshaw <ro...@google.com> wrote:
There are known performance issues with Beam on Spark that are being worked on, e.g. https://issues.apache.org/jira/browse/BEAM-5036 . It's possible you're hitting something different, but would be worth investigating. See also https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Performance%20of%20write
On Tue, Sep 18, 2018 at 8:39 AM devinduan(段丁瑞) <de...@tencent.com> wrote:
Hi,
I'm testing Beam on Spark.
I use the Spark example code WordCount to process a 1G data file; it costs 1 minute.
However, I use the Beam example code WordCount to process the same file, and it costs 30 minutes.
My Spark parameter is : --deploy-mode client --executor-memory 1g --num-executors 1 --driver-memory 1g
My Spark version is 2.3.1, Beam version is 2.5
Is there any optimization method?
Thank you.
Re: How to optimize the performance of Beam on Spark
Posted by Tim Robertson <ti...@gmail.com>.
Hi devinduan
The known issues Robert links there are actually HDFS related and not
specific to Spark. The improvement we're seeking is that the final copy of
the output file can be optimised by using a "move" instead of a "copy", and I
expect to have it fixed for Beam 2.8.0. On a small dataset like this
though, I don't think it will impact performance too much.
Can you please elaborate on your deployment? It looks like you are using a
cluster (i.e. deploy-mode client) but are you using HDFS?
I have access to a Cloudera CDH 5.12 Hadoop cluster and just ran an example
word count as follows - I'll explain the parameters to tune below:
1) I generate some random data (using common Hadoop tools)
hadoop jar
/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar \
teragen \
-Dmapred.map.tasks=100 \
-Dmapred.map.tasks.speculative.execution=false \
10000000 \
/tmp/tera
This puts 100 files totalling just under 1GB on which I will run the word
count. They are stored in the HDFS filesystem.
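Teragen writes fixed 100-byte records (the standard TeraSort record format), which is where "just under 1GB" comes from:

```python
rows = 10_000_000      # the row count passed to teragen above
bytes_per_row = 100    # teragen/terasort records are a fixed 100 bytes
total_bytes = rows * bytes_per_row

assert total_bytes == 1_000_000_000   # exactly 1 GB (decimal)
assert total_bytes / 2**30 < 1        # i.e. just under 1 GiB on disk
```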
2) Run the word count using Spark (2.3.x) and Beam 2.5.0
In my cluster I have YARN to allocate resources, and an HDFS filesystem.
This will be different if you run Spark as standalone, or on a cloud
environment.
spark2-submit \
--conf spark.default.parallelism=45 \
--class org.apache.beam.runners.spark.examples.WordCount \
--master yarn \
--executor-memory 2G \
--executor-cores 5 \
--num-executors 9 \
--jars
beam-sdks-java-core-2.5.0.jar,beam-runners-core-construction-java-2.5.0.jar,beam-runners-core-java-2.5.0.jar,beam-sdks-java-io-hadoop-file-system-2.5.0.jar
\
beam-runners-spark-2.5.0.jar \
--runner=SparkRunner \
--inputFile=hdfs:///tmp/tera/* \
--output=hdfs:///tmp/wordcount
The jars I provide here are the minimum needed for running on HDFS with
Spark and normally you'd build those into your project as an über jar.
The important bits for tuning for performance are the following - these
will be applicable for any Spark deployment (unless embedded):
spark.default.parallelism - controls the parallelism of the beam
pipeline. In this case, how many workers are tokenizing the input data.
executor-memory, executor-cores, num-executors - controls the resources
spark will use
Note, that the parallelism of 45 means that the 5 cores in the 9 executors
can all run concurrently (i.e. 5x9 = 45). When you get to very large
datasets, you will likely have parallelism much higher.
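As a quick check of that arithmetic:

```python
executor_cores = 5   # --executor-cores
num_executors = 9    # --num-executors

# one concurrent task slot per core across all executors
parallelism = executor_cores * num_executors
assert parallelism == 45   # matches --conf spark.default.parallelism=45
```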
In this test I see around 20 seconds initial startup of Spark (copying
jars, requesting resources from YARN, establishing the Spark context) but
once up the job completes in a few seconds writing the output into 45 files
(because of the parallelism). The files are named
/tmp/wordcount-000*-of-00045.
I hope this helps provide a few pointers, but if you elaborate on your
environment we might be able to assist more.
Best wishes,
Tim
On Tue, Sep 18, 2018 at 9:29 AM Robert Bradshaw <ro...@google.com> wrote:
> There are known performance issues with Beam on Spark that are being
> worked on, e.g. https://issues.apache.org/jira/browse/BEAM-5036 . It's
> possible you're hitting something different, but would be worth
> investigating. See also
> https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Performance%20of%20write
>
> On Tue, Sep 18, 2018 at 8:39 AM devinduan(段丁瑞) <de...@tencent.com>
> wrote:
>
>> Hi,
>> I'm testing Beam on Spark.
>> I use the Spark example code WordCount to process a 1G data file; it costs
>> 1 minute.
>> However, I use the Beam example code WordCount to process the same file,
>> and it costs 30 minutes.
>> My Spark parameter is : --deploy-mode client --executor-memory 1g
>> --num-executors 1 --driver-memory 1g
>> My Spark version is 2.3.1, Beam version is 2.5
>> Is there any optimization method?
>> Thank you.
>>
>>
>>
>
Re: How to optimize the performance of Beam on Spark
Posted by Robert Bradshaw <ro...@google.com>.
There are known performance issues with Beam on Spark that are being worked
on, e.g. https://issues.apache.org/jira/browse/BEAM-5036 . It's possible
you're hitting something different, but would be worth investigating. See
also
https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Performance%20of%20write
On Tue, Sep 18, 2018 at 8:39 AM devinduan(段丁瑞) <de...@tencent.com>
wrote:
> Hi,
> I'm testing Beam on Spark.
> I use the Spark example code WordCount to process a 1G data file; it costs
> 1 minute.
> However, I use the Beam example code WordCount to process the same file,
> and it costs 30 minutes.
> My Spark parameter is : --deploy-mode client --executor-memory 1g
> --num-executors 1 --driver-memory 1g
> My Spark version is 2.3.1, Beam version is 2.5
> Is there any optimization method?
> Thank you.
>
>
>