Posted to user@beam.apache.org by vivek chaurasiya <vi...@gmail.com> on 2020/02/08 07:52:11 UTC
Problem with updating beam SDK
Hi team,
We had Beam SDK 2.5 running on AWS EMR Spark distribution 5.17.
Essentially, our Beam code just reads a bunch of files from GCS and
pushes them to Elasticsearch in AWS using Beam's ElasticsearchIO class (
https://beam.apache.org/releases/javadoc/2.0.0/index.html?org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.html).
So there is just a Map step, with no reduce/group-by/etc. in the Beam code.
Basically my code is doing:
PCollection<String> coll = // read from GCS
coll.apply(ElasticsearchIO.write())
We submit the job using 'spark-submit':
spark-submit --deploy-mode cluster \
  --conf spark.executor.extraJavaOptions=-DCLOUD_PLATFORM=AWS \
  --conf spark.driver.extraJavaOptions=-DCLOUD_PLATFORM=AWS \
  --conf spark.yarn.am.waitTime=300s \
  --conf spark.executor.extraClassPath=__app__.jar \
  --driver-memory 8G \
  --num-executors 5 --executor-memory 20G --executor-cores 8 \
  --jars s3://snap-search-spark/cloud-dataflow-1.0.jar \
  --class com.snapchat.beam.common.pipeline.EMRSparkStartPipeline \
  s3://snap-search-spark/cloud-dataflow-1.0.jar \
  --job=fgi-export --isSolr=false --dateTime=2020-01-31T00:00:00 \
  --isDev=true --incrementalExport=false
The dump to ES used to finish in at most 1 hour.
This week we upgraded to Beam SDK 2.18, still running on AWS EMR Spark
distribution 5.17. Since then the export has become really slow, taking
about 9 hours. The GCS input is ~50 GB (500 files of 100 MB each).
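To put the slowdown in numbers, here is a rough back-of-the-envelope sketch in Java. The class and method names are made up for illustration. It assumes exactly 500 files of 100 MB each, run times of exactly 1 hour and 9 hours, and that all 5 executors x 8 cores were available for the whole run, which may not hold given the executor failures.

```java
// Hypothetical helper, not part of Beam or Spark: plain arithmetic on the
// figures quoted in this post.
final class ExportThroughput {

    // Total input size in MB for fileCount files of megabytesPerFile each.
    public static double totalMegabytes(int fileCount, double megabytesPerFile) {
        return fileCount * megabytesPerFile;
    }

    // Average throughput in MB/s for a run that took the given number of hours.
    public static double megabytesPerSecond(double totalMegabytes, double hours) {
        return totalMegabytes / (hours * 3600.0);
    }

    public static void main(String[] args) {
        double totalMb = totalMegabytes(500, 100.0);
        int cores = 5 * 8; // --num-executors 5 and --executor-cores 8

        double before = megabytesPerSecond(totalMb, 1.0); // Beam 2.5 run (assumed 1 h)
        double after = megabytesPerSecond(totalMb, 9.0);  // Beam 2.18 run (assumed 9 h)

        System.out.printf("total input: %.0f MB%n", totalMb);
        System.out.printf("before: %.2f MB/s (%.3f MB/s per core)%n", before, before / cores);
        System.out.printf("after:  %.2f MB/s (%.3f MB/s per core)%n", after, after / cores);
        System.out.printf("slowdown factor: %.1f%n", before / after);
    }
}
```

Under those assumptions the cluster went from roughly 13.9 MB/s to roughly 1.5 MB/s overall, which is low for 40 cores either way and may hint that the bottleneck is on the write side rather than in reading from GCS.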
I am new to the Spark UI and AWS EMR, but I still tried to find out why
this slowdown is happening. A few observations:
1) Some executors died after receiving SIGTERM. I then tried this:
https://dev.sobeslavsky.net/apache-spark-sigterm-mystery-with-dynamic-allocation/
No luck.
2) I will try upgrading to AWS EMR Spark distribution 5.29, but will have
to test it.
Has anyone seen similar issues in the past? Any suggestions would be highly
appreciated.
Thanks
Vivek
Re: Problem with updating beam SDK
Posted by vivek chaurasiya <vi...@gmail.com>.
Can someone comment here?