Posted to user@spark.apache.org by Kalin Stoyanov <kg...@gmail.com> on 2020/01/15 17:53:14 UTC

Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Hi all,

First of all, let me say that I am pretty new to Spark, so this could be
entirely my fault somehow...
I noticed this when I was running a job on an Amazon EMR cluster with Spark
2.4.4, and it finished slower than when I had run it locally (on Spark
2.4.1). I checked out the event logs, and the one from the newer version
had more stages.
Then I decided to do a comparison in the same environment, so I created
two versions of the same cluster with the only difference being the EMR
release, and hence the Spark version(?) - the first one was emr-5.24.1 with
Spark 2.4.2, and the second one emr-5.28.0 with Spark 2.4.4. Sure enough,
the same thing happened: the newer version had more stages and took
almost twice as long to finish.
So I am pretty much at a loss here - could it be that it is not because of
Spark itself, but because of some difference introduced in the EMR
releases? At the moment I can't think of any other alternative besides it
being a bug...
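
Since the difference shows up as extra stages, one idea I had for narrowing
it down is to diff the query plans between the two versions - just a sketch
of what I mean, with a stand-in DataFrame (my real ones are built in main.py):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)  # confirm which Spark actually runs on each cluster

# stand-in for one of the DataFrames the job builds; the real ones come from main.py
df = spark.range(1000).selectExpr("id % 10 AS k", "id AS v").groupBy("k").count()

# extended=True prints the parsed, analyzed, optimized, and physical plans;
# diffing this output between 2.4.2 and 2.4.4 should show where the extra stages come from
df.explain(True)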

Here are the two event logs:
https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
and my code is here:
https://github.com/kgskgs/stars-spark3d

I ran it like so on the clusters (after putting the scripts on S3):
spark-submit --deploy-mode cluster --py-files
s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py
--name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100
--outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/

So yeah, I was considering submitting a bug report, but the contribution
guide said it's better to ask here first - so, any ideas on what's going on?
Maybe I am missing something?

Regards,
Kalin

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Xiao,

that is the right attitude, thanks a ton :)

Hi Kalin,
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-5x.html#emr-5281-relnotes
The latest EMR version should be available right out of the box; perhaps you
can raise a quick AWS ticket and find out whether its release is getting
delayed in your region. The release notes do mention that it fixes
a few Spark compatibility issues. Also, getting the latest version of
Spark working takes less than 10 seconds once you have downloaded and
unzipped the release from Apache Spark. Besides that, I am almost always
sure that starting the Spark session in EMR using the following statement is
going to give the same performance and predictability. As Xiao mentions, it
might be better to first isolate the cause and replicate it before raising issues.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
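
If you want to rule out configuration drift between the two EMR releases,
you can also dump the effective SQL settings on each cluster and diff the
output - a minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# SET -v lists every SQL conf with its effective value; print as "name = value"
# so the output from the two clusters can be diffed line by line
for row in spark.sql("SET -v").collect():
    print(row["key"], "=", row["value"])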

Thanks and Regards,
Gourav Sengupta

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Posted by Kalin Stoyanov <kg...@gmail.com>.
Hi all,

@Enrico, I've added just the SQL query pages (+ JS dependencies etc.) to
the Google Drive -
https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
That is what you had in mind, right? They are different indeed. (For some
reason, after I saved them off the history server the graphs get drawn
twice, but that shouldn't matter.)

@Gourav Thanks, but EMR 5.28.1 is not appearing for me when creating a
cluster, so I can't check that for now; also, I am using just s3://

@Xiao Yes, I will try to run this locally as well, but installing new
versions of Spark won't be very fast and easy for me, so I won't be doing
it right away.

Regards,
Kalin


Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Posted by Xiao Li <ga...@gmail.com>.
If you can confirm that this is caused by Apache Spark, feel free to open a
JIRA. Between releases, I do not expect your queries to hit such a major
performance regression. Also, please try the 3.0 preview releases.

Thanks,

Xiao


Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Posted by Kalin Stoyanov <kg...@gmail.com>.
Hi Xiao,

Thanks, I didn't know that. This
https://aws.amazon.com/about-aws/whats-new/2019/11/announcing-emr-runtime-for-apache-spark/
implies that their fork is not used in EMR 5.27. I tried that, and it has
the same issue. But then again, in their article they were comparing EMR
5.27 vs 5.16, so I can't be sure... Maybe I'll try getting the latest
version of Spark locally and make the comparison that way.

Regards,
Kalin


Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Posted by Xiao Li <ga...@gmail.com>.
EMR has its own fork of Spark, called the EMR runtime; it is not
Apache Spark. You might need to talk with them instead of posting questions
in the Apache Spark community.

Cheers,

Xiao


Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

I am pretty sure that AWS released 5.28.1 with some bug fixes the day
before yesterday.

Also, please ensure that you are using s3:// instead of s3a:// or anything
like that.
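
For example, reading input through EMRFS looks like this - just a sketch,
and the file name and format here are placeholders for whatever your job
actually reads:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# on EMR, s3:// goes through EMRFS; s3a:// is the open-source Hadoop connector
df = spark.read.csv("s3://kgs-s3/input/particles.csv", header=True)  # hypothetical file
df.show(5)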

On another note, Xiao is not entirely right in saying that issues on EMR
should not be posted here: a large group of users run Spark in Databricks,
GCP, Azure, native installations, and of course in EMR and Glue. I have
always found that the Apache Spark community takes care of each other and
answers questions for the largest possible user base, just as I did now. I
think that only Matei Zaharia can make such a sweeping call on what this
entire community is about.


Thanks and Regards,
Gourav Sengupta
